F-CPU design team, (C) Yann Guidon, Nov. 11, 2000

 

  Draft/Presentation/RFC for the G-chip  

Introduction

This document presents the early stages and architecture of a companion chip to the F-CPU (FC0/F0) and similar designs. It discusses several aspects of the design, but it is not a firm specification; it should be commented on and enhanced. Stay tuned.

 

Design Goals

I was seeking a painless way to interconnect 4 F-CPUs with each other. More precisely, I wanted to solve the recurring problem of the "chipset" (a well-known syndrome in the PC world). The way I see the F0/F1, the external chips should not be a bottleneck for the core.

Let's go back to the F-CPU chip that "should" be implemented (if another external architecture is not used). It was designed to reduce the memory bottleneck, the "plague of the computer industry". The "F-chip" (the F-CPU chip that we consider) has two main external data ports:

  • one port is a dedicated, private, cacheable and optimized SDRAM interface, with up to two 128-bit-wide slots. This means that we can access up to eight internal banks (4 per slot) and use the "stream hint bits" to their fullest. This interface is used most of the time when we need fast memory access (outside the F-chip). All user data should be stored there.

  • one port is a 32-bit multiplexed I/O port. It can also be used for debugging the chip, or for accessing internal registers or private memory. It is an asynchronous port, meaning that it adapts its bandwidth to the external chip. In a multi-CPU system, it serves as the single port for communicating with other chips, because accesses through this port are not cacheable. More precisely, the local F-chip doesn't cache the remote I/O accesses, while a remote chip can access any local data without bothering about its location in the cache hierarchy. The port behaves just like another Load/Store unit, and the cache hierarchy is therefore transparent to remote accesses.

    Note that in the near future, as technology and prices allow, both sizes will be doubled at the same time: the SDRAM bus to 256 bits and the I/O bus (which we'll soon call the "F-bus") to 64 bits, breaking the 2^(32+5) = 128 GB barrier that comes from 32 address bits combined with the 32-byte transfer granularity. Even wider buses should be considered in the far future, so the bus widths written here are just orders of magnitude relative to each other, not absolute widths. Here, the I/O bus is one quarter of the SDRAM bus width, and the bandwidth scales the same way.

    This organisation, split into two main functions, is very convenient: external I/O is slow and not cacheable, while private access is fast and cacheable. In an F-CPU system, we only need SDRAM chips, some TTL parts and a few EPROMs to make it run at full steam. The F-bus is then used to communicate with the I/O devices (HDD, PCI bridge, whatever) and with other F-CPUs, without having to deal with MESI or bandwidth sharing.

    That's where the problems arise. In a multi-F-chip configuration, the I/O bottleneck is a major pain. Since this study deals with a 4×F-chip system, we see that a "bus" configuration is not a good idea for an I/O-intensive workload.

    Only one F-chip can speak at a time, so the effective bandwidth seen by each chip is 1/4 of the I/O bandwidth. That is BAD. You'll laugh, but that's what you get in a multi-Intel PC. This explains why such machines are only good at extremely CPU-intensive tasks that fit in L1/L2; they fail when they have to access main memory. In our case, we may want to add a huge main memory, and then we see that there is something like "a little problem": the bandwidth is very limited everywhere.

    Common sense says that each chip is then left with the equivalent of 8 bits of actual bandwidth (a 32-bit bus shared by 4 CPUs). We want the overall bandwidth to scale with the main memory bandwidth and with the number of CPUs; that is, when we add a CPU, we add memory and bandwidth. This is also the reason why the F-bus can only be a point-to-point bus, keeping the bandwidth proportional to the bus width alone.
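    To make the scaling argument concrete, here is a back-of-the-envelope comparison in C. The 32-bit width is the I/O bus width discussed above; the 100 MHz clock is an arbitrary figure made up for this illustration, not part of the draft.

        #include <stdio.h>

        /* Rough comparison: one shared 32-bit I/O bus for 4 CPUs versus one
         * point-to-point F-bus per CPU.  The 100 MHz clock is an assumption
         * used only to put numbers on the ratio. */
        int main(void)
        {
            const int cpus     = 4;
            const int bus_bits = 32;                       /* I/O bus width */
            const int mhz      = 100;                      /* assumed clock */
            const double one_bus = (double)bus_bits / 8.0 * mhz;   /* MB/s  */

            printf("shared bus     : %6.1f MB/s total, %6.1f MB/s per CPU\n",
                   one_bus, one_bus / cpus);
            printf("point-to-point : %6.1f MB/s per CPU, %6.1f MB/s aggregate\n",
                   one_bus, one_bus * cpus);
            return 0;
        }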

     

    A First Solution

    In the case of a system that scales from 1 to 4 CPUs, we add bandwidth, memory capacity and CPU power with each new module, which consists of the CPU, its private memory, a communication and memory controller, and a large SDRAM bank.

    The number of necessary links is 4*(4-1)/2 = 6 in total, that is 3 per board, so each F-chip can talk directly to every other one. The boards can be stacked in a mezzanine fashion and the bandwidth is spread evenly across the system. On top of that, we can add a SCSI (or IDE, if you're lazy) interface so the system is not stalled by swapping (one of the problems of the ASCI architecture). A single board would look like this:

    The real layout will be determined later, because the 2D placement is less simple than that: the local SDRAM banks must be split into 64-bit parts that surround the F-chip (more about this later). Here, we assume that we have the technology for (roughly) 400-pin BGA chips and around 5 or 6 PCB layers. The boards are stacked like this:

     

    Routing:

    Now, how are the boards connected to each other? Each board must have a dedicated path to every other board, yet all the boards must have the same layout. It's not an easy task, because the boards differ at least in the placement of the mezzanine connector. A more elegant solution can probably be found, but we won't bother with it yet. Here's what the signals should do:

    Now, the mapping between the ports (1, 2, 3) of a board and the columns (1 to 5) of the mezzanine connector is another problem, one that depends a lot on the available connector type and PCB technology. I don't know whether there is an optimal solution at all.

    It appears, though, that the G-chip should be able to re-route the data internally, depending on its board number. The port used to reach board 1, for example, differs depending on whether we are on board 2, 3 or 4. Internal dynamic routing therefore appears as a feature that allows far better flexibility in the board/system design.
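    As an illustration of this board-number-dependent routing, here is a tiny C sketch. The particular rule (port = (destination - local board) mod 4) is only one hypothetical wiring convention, not a decided mapping; the real one depends on the connector and PCB layout discussed below.

        #include <stdio.h>

        /* Hypothetical port selection for 4 boards numbered 1..4, each with
         * three intercom ports 1..3.  Port 0 stands for "stay local" (the
         * F-chip port).  This is one possible convention among many. */
        static int intercom_port(int local_board, int dest_board)
        {
            if (local_board == dest_board)
                return 0;                                  /* no hop needed */
            return ((dest_board - local_board) % 4 + 4) % 4;   /* 1, 2 or 3 */
        }

        int main(void)
        {
            for (int local = 1; local <= 4; local++)
                for (int dest = 1; dest <= 4; dest++)
                    printf("board %d -> board %d : port %d\n",
                           local, dest, intercom_port(local, dest));
            return 0;
        }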

    Another solution to the routing problem is the use of a backplane connector with suitable tracks. It is uncertain whether the density allowed by this type of connector is adapted to the traffic. A single CompactPCI connector (up to 125 pins) can route 3*32 = 96 data pins, but this adds a new kind of device to design. A backplane connector can bring new, unnecessary trouble and increase the overall price. OTOH, mezzanine connectors, even though they are not easy to configure, spread the signals and the bandwidth across the board and reduce the trace lengths.

    One problem with bidirectional buses is the "turnaround" cycles: it takes "some time" to switch the buffers from output to input and vice versa. Unfortunately, the topologies allowed by read-only and write-only ports are less interesting. One example with 3 ports is a total interconnection of 3 chips: each has two read ports and one write port.

    Routing is really straightforward: make all ports read-only except the port whose number equals the board number. Unfortunately, this means that for 4 CPUs we need 4 ports, and the fanout is increased.
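    For what it's worth, the rule can be written down in a few lines of C; this is only a restatement of the paragraph above, using the 3-chip example.

        #include <stdbool.h>
        #include <stdio.h>

        /* Unidirectional-port scheme: board N drives (writes) only port N
         * and listens (reads) on every other port.  Illustration only. */
        static bool port_is_write(int board, int port)
        {
            return port == board;
        }

        int main(void)
        {
            const int boards = 3;                /* 3 fully connected chips */
            for (int b = 1; b <= boards; b++)
                for (int p = 1; p <= boards; p++)
                    printf("board %d, port %d: %s\n", b, p,
                           port_is_write(b, p) ? "write" : "read");
            return 0;
        }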

     

    Here's where it's getting interesting...

    So we stick to the first solution: 3 intercom buses, 1 bus to the F-chip, plus all the additional interfaces (SDRAM, SCSI...). Now the striking idea is that the bus to the F-chip and the intercom buses can be the same: an F-bus. This means that it makes no difference whether the G-chip is connected to an F-chip or to another G-chip, whatever the port.

    And this makes a big difference, because the allowed topologies become much more flexible! Depending on the system and the constraints, we can plug 4 F-chips into a single G-chip, or build much more complex, scalable and routed topologies. The G-chip can act as a switch or crossbar in a multidimensional network, a switch that also controls memory and the HDD, so it acts both as a multiported memory controller and as a router.

    Note that the above configurations are "good" because they preserve the overall memory bandwidth across the whole system. All the CPUs can talk at the same time without being slowed down (too much) by conflicts.

    The G-chip can implement the routing algorithm described above inside its crossbar. The Flash EEPROM is used to boot the F-chip. This additional feature, as well as the SCSI interface, could also be reconfigured to serve as a control port, a bus interface (multicast with SCSI signaling), or whatever else we can think of. The G-chip becomes a building brick for the whole system, without sacrificing performance.

     

    F-bus definition

    Now we can think more about the F-bus protocol and format. It should be as simple and straightforward as possible, and use as few control signals as possible.

  • It is a point-to-point, bidirectional bus. The turnaround cycles are not really harmful, because the bursts make up for the latency. It requires some bits to indicate the direction of the transfer.

  • It is an asynchronous bus, because it must adapt its flow to the remote chip. It should be able to talk with TTL chips as well as with devices that are faster than itself. It requires a bidirectional acknowledge bit and some internal FIFOs to speed up bursts.

  • It is a burst bus: the address is sent first (with a 32-byte granularity) and the data follow (in the appropriate direction). There is an A/D (address/data) pin. The words of a burst can be sent out of order, so we need 5 bits (could this be reduced?) to indicate which word is being sent. The transfer can be aborted by a change of the A/D line, so we can fetch a single word when we only need one (e.g. configuration registers, I/O devices...).

  • It could be a "transactional" bus: that means it can send an address in the middle of a data block, in order to prefetch some data. 2 or 3 bits are necessary to tag the transaction. This is an optional feature.

  • Addresses can be sent in either direction, so we can debug a chip or access its internal registers from another chip.

  • The bus grant follows a round-robin protocol: one end owns the bus while it drives the control signals and writes data on the bus. When it has to read from outside, it loses control, and the rest of the transaction is controlled by the remote chip until that chip has nothing more to write. A maximum burst length should be respected (16 cycles?) so each side can keep in touch with the other.

  • Control signals should be bidirectional whenever possible, to reduce their number. The bus should also be interfaceable with dumb TTL circuits, so the protocol should be as simple as possible.

  • SEC/ECC is required on the data bus but can be omitted in some cases (slow or old devices that don't support or need it, e.g. EPROMs or I/O devices). When F-chips communicate with each other, they are strongly encouraged to protect the data against transmission errors. So we need 5 or 6 bidirectional ECC bits; a small encoding sketch follows below.
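    For the record, 6 check bits are enough for single-error correction of a 32-bit word, since 2^6 >= 32 + 6 + 1. Here is a minimal Hamming SEC encoder sketch in C, assuming a plain Hamming bit layout; it only illustrates the bit budget and is not the actual G-chip ECC logic.

        #include <stdint.h>
        #include <stdio.h>

        /* Compute the 6 Hamming check bits protecting a 32-bit data word.
         * Data bits are scattered over the non-power-of-two positions 3..38;
         * check bit k covers every position whose index has bit k set. */
        static uint8_t hamming_check_bits(uint32_t data)
        {
            uint64_t code = 0;                  /* positions 1..38, 0 unused */
            int pos = 1, bit = 0;

            while (bit < 32) {
                if (pos & (pos - 1)) {          /* not a power of two        */
                    if ((data >> bit) & 1)
                        code |= 1ULL << pos;
                    bit++;
                }
                pos++;
            }

            uint8_t check = 0;
            for (int k = 0; k < 6; k++) {
                int parity = 0;
                for (int p = 1; p <= 38; p++)
                    if ((p >> k) & 1)
                        parity ^= (int)((code >> p) & 1);
                if (parity)
                    check |= (uint8_t)(1 << k);
            }
            return check;                       /* sent alongside the word   */
        }

        int main(void)
        {
            printf("check bits for 0xDEADBEEF: 0x%02x\n",
                   hamming_check_bits(0xDEADBEEFu));
            return 0;
        }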

    That's all that comes to my mind on this subject at the moment. You get the principle: when you're the bus master, you send the address, wait for the acknowledge, then sample the data on each rising edge of the ack signal. Double clocking (on both falling and rising edges) is not planned yet.
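    To fix the ideas, here is a rough C sketch of a bus-master read burst following the protocol above. Every pin-level helper (drive_ad_line, sample_ack and so on) is a made-up stand-in for pad logic, and the eight-word line and 16-cycle limit are assumptions; this is an illustration of the handshake, not a reference implementation.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define BURST_WORDS 8                   /* 8 x 32 bits = one 32-byte line */
        #define MAX_BURST   16                  /* assumed maximum burst length   */

        /* --- purely illustrative pad-level stubs --------------------------- */
        static void drive_ad_line(bool address_phase) { (void)address_phase; }
        static void drive_bus(uint32_t value)         { (void)value; }
        static bool sample_ack(void)                  { return true; }
        static uint32_t sample_bus(void)              { static uint32_t x; return x++; }
        static uint8_t sample_word_index(void)
        {
            static uint8_t i;
            return (uint8_t)(i++ % BURST_WORDS);
        }

        /* Read one 32-byte line: send the address, then sample one word per
         * rising ack; words may arrive out of order, tagged by a word index. */
        static void fbus_read_line(uint32_t addr, uint32_t line[BURST_WORDS])
        {
            drive_ad_line(true);                /* address phase                 */
            drive_bus(addr & ~31u);             /* 32-byte granularity           */

            drive_ad_line(false);               /* data phase, remote end drives */
            int received = 0, cycles = 0;
            while (received < BURST_WORDS && cycles++ < MAX_BURST) {
                if (!sample_ack())              /* asynchronous handshake        */
                    continue;
                uint8_t idx = sample_word_index();
                line[idx] = sample_bus();       /* out-of-order completion       */
                received++;
            }
        }

        int main(void)
        {
            uint32_t line[BURST_WORDS] = {0};
            fbus_read_line(0x1000u, line);
            for (int i = 0; i < BURST_WORDS; i++)
                printf("word %d = 0x%08x\n", i, (unsigned)line[i]);
            return 0;
        }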

     

    Conclusions

    All this is preliminary. This case study is based on certain assumptions, such as the availability of a working F-bus, of a certain minimal technology and of the corresponding budget, and the target system is not the usual PC that sits on your desk. It is oriented towards workstations or servers that scale from 1 to 8 CPUs.

    We notice anyway that the availability of the G-chip also enables the design of lower-cost systems that range from 1 to 4 CPUs with less parallelism (1 HDD and 1 global RAM), simply by adding an F-CPU module with its private memory. The "low-end" market will (as expected one year ago) comprise single-F-chip boards (each with its private memory) and G-chip boards, offering modularity and scalability.

    The design of the G-chip is not a technological challenge, but it is an important factor in the success of the architecture. This means that it must be designed as carefully as the F-CPU itself. We can expect that almost as many G-chips will be used as F-chips; they can therefore be produced on the same wafer and use more or less the same process and packaging, thus reducing the production costs.

    Since this kind of system is not going to exist in the coming months or years, the control of the system (cluster or node) has not been investigated yet. Fault-tolerant computing is not considered either. The internal routing capabilities of the G-chip can help remap one F-chip's address to another (as in some other parallel computers), but no hot-swap is imagined yet. More seriously, the interconnection between CPU clusters has not been worked out yet: as noted before, the overall bandwidth is a critical factor for computation-intensive tasks as well as for more "trivial" operations. Bad preparation or a wrong orientation would virtually "isolate" a cluster. Fortunately, the F-bus helps homogenise the architecture, so any kind of chip can talk to any other.

    We can also imagine that, when the technology allows enough pads on the chips, the F- and G-chips could be merged. In the far future, this would make an F-CPU look like a sort of "transputer"; the other way is to enlarge the data buses and keep the organisation proposed in this draft, simply doubling each bus width.

     

    That's all for today, folks.