Draft/Presentation/RFC for the G-chip
This document presents the early stages and architecture of a companion chip to the F-CPU (FC0/F0) and similar designs. It discusses several aspects of the design but is not a firm specification; it should be commented on and enhanced. Stay tuned.
I was looking for a painless way to interconnect four F-CPUs with each other. More precisely, I wanted to solve the recurring problem of the "chipset" (a well-known syndrome in the PC world). The way I see the F0/F1, the external chips should not become a bottleneck for the core.
Let's go back to the F-CPU chip that "should" be implemented (if no other external architecture is used). It was designed to reduce the memory bottleneck, the "plague of the computer industry". The "F-chip" (the F-CPU chip that we consider here) has two main external data ports: a wide bus to its private SDRAM (128 bits in the configuration discussed here) and a narrower 32-bit I/O bus.
Note that in the near future, as technology and prices allow, both sizes will be doubled at the same time: the SDRAM bus to 256 bits and the I/O bus (which we'll soon name the "F-bus") to 64 bits (breaking the 2^(32+5) = 128 GB barrier). Even wider buses should be considered in the more distant future, so the bus widths given here are only orders of magnitude relative to each other, not absolute widths. Here, the I/O bus is one quarter of the SDRAM bus width, and the bandwidth scales the same way.
This organisation, split into two main functions, is very convenient: external I/O is slow and not cacheable while private access is fast and cacheable. In an F-CPU system, we only need SDRAM chips, some TTL parts and a few EPROMs to make it run full steam. The F-bus is then used to communicate with the I/O devices (HDD, PCI bridge, whatever) and other F-CPUs, without having to deal with MESI or bandwidth sharing.
That's where the problems arise. In a multi-F-chip configuration, the I/O bottleneck is a major pain. Since this study deals with a 4xF-chip system, we see that a "bus" configuration is not a good idea for an I/O-intensive workload.
Only one F-chip can speak at a time, so the effective bandwidth on the local system is one quarter of the I/O bandwidth. That is BAD. You'll laugh, but that's what you get in a multi-Intel PC. This explains why they are only good at extremely CPU-intensive tasks that fit in L1/L2; where they fail is when they have to access the main memory. In our case, it may be desirable to add a huge main memory, and then we see that there is something like "a little problem": the bandwidth is very limited everywhere.
Common sense says that each chip then gets only 8 bits' worth of effective bandwidth (a 32-bit bus shared by four chips). What we want is an overall bandwidth that scales with the main memory bandwidth and the number of CPUs. That is: when we add a CPU, we add memory and bandwidth. This is also the reason why the F-bus can only be a point-to-point bus, thus keeping the bandwidth proportional to the bus width alone.
In the case of a system that scales from 1 to 4 CPUs, we add bandwidth, memory capacity and CPU power with a new module that consists of the CPU, its private memory, a communication and memory controller, and a large SDRAM bank.
Each board needs 4-1 = 3 links, for a total of (4*(4-1))/2 = 6 links in the system, so every F-chip can talk to every other one. The boards can be stacked in a mezzanine fashion and the bandwidth is spread evenly across the system. On top of that, we can add a SCSI (or IDE, if you're lazy) interface so the system is never stalled by swapping (one of the problems of the ASCI architecture). A single board would look like this:
The real layout will be determined later because the 2D placement is less simple than that: the local SDRAM banks must be split into 64-bit slices that surround the F-chip (more about this later). Here, we assume that the technology for (+/-) 400-pin BGA chips and around 5 or 6 PCB layers is available. The boards are stacked like this:
Now, how are the boards connected to each other? Each board must have its own dedicated path to every other board, yet all boards must share the same layout. This is not an easy task because the boards differ at least in the placement of the mezzanine connector. A more elegant solution may exist but we won't bother with it yet. Here's what the signals should do:
Now, the mapping between the ports (1, 2, 3) of the board and the columns (1 to 5) of the mezzanine connector is another problem that depends a lot on the available type of connector and PCB technology. I don't know at all whether there is an optimal solution.
It appears, though, that the G-chip should be able to re-route the data internally, depending on its board number. For example, the port used to reach board 1 differs depending on whether we are on board 2, 3 or 4. Internal dynamic routing thus appears as a feature that allows far better flexibility in the board/system design.
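
To make this more concrete, here is a minimal sketch of such a board-number-dependent port selection, assuming a fixed 4-board stack and a purely hypothetical rotational wiring scheme; the board and port numbering are illustrative, not part of any specification:

    /* Hypothetical port selection for a 4-board stack.  Assume a
     * rotational wiring scheme: port p of board b is wired to board
     * ((b - 1 + p) mod 4) + 1.  The port needed to reach a given
     * destination board is then simply the "distance" between the two
     * board numbers, modulo 4. */
    #include <stdio.h>

    static int port_for(int my_board, int dest_board)   /* boards 1..4 */
    {
        if (my_board == dest_board)
            return 0;                          /* 0 = local, no intercom port */
        return ((dest_board - my_board) + 4) % 4;         /* port 1, 2 or 3 */
    }

    int main(void)
    {
        for (int b = 1; b <= 4; b++)
            for (int d = 1; d <= 4; d++)
                if (b != d)
                    printf("board %d -> board %d : port %d\n", b, d, port_for(b, d));
        return 0;
    }

With a convention of this kind, the same G-chip can be used on every board: only the board number (read, for example, from a few strap pins at reset, a hypothetical mechanism) changes the routing, which is exactly the flexibility argued for above.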
Another solution to the routing problem is the use of a backplane connector with adapted tracks. It is uncertain whether the density allowed by this kind of connector is suited to the traffic. A single CompactPCI connector (up to 125 pins) can route 3*32 = 96 data pins, but this adds a new kind of device to design. A backplane connector can add new, unnecessary troubles and increase the overall price. OTOH, mezzanine connectors, even though not easy to configure, spread the signals and the bandwidth across the board and reduce the trace lengths.
One problem with bidirectional buses is the "turnaround" cycles: it takes "some time" to switch the buffers from output to input and vice versa. Unfortunately, the topologies allowed by read-only and write-only ports are less interesting. One example with 3 ports is a total interconnection of 3 chips: each has two read ports and one write port.
Routing is then really straightforward: make all ports read-only except the port whose number equals the board number. Unfortunately, this means that for 4 CPUs we need 4 ports, and the fanout increases.
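
As an illustration of this unidirectional scheme (with purely hypothetical numbering), the port direction assignment would look like this:

    /* Illustration of the unidirectional-port scheme discussed above:
     * port i of every chip carries the data written by chip i, so chip i
     * drives only port i (write) and reads all the others.  With n chips
     * this needs n ports per chip, which is why the scheme scales poorly. */
    #include <stdio.h>

    #define N_CHIPS 4

    int main(void)
    {
        for (int chip = 1; chip <= N_CHIPS; chip++) {
            printf("chip %d: ", chip);
            for (int port = 1; port <= N_CHIPS; port++)
                printf("port %d=%s ", port, port == chip ? "write" : "read");
            printf("\n");
        }
        return 0;   /* 4 chips -> 4 ports each, fanout of 3 on every write port */
    }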
So we stick to the first solution: 3 intercom buses, 1 bus to the F-chip, plus all the additional interfaces (SDRAM, SCSI...). Now the striking idea is that the bus to the F-chip and the intercom buses can be one and the same: an F-bus. This means that it makes no difference whether a given port of the G-chip is connected to an F-chip or to another G-chip.
It makes a big difference because the allowed topologies become much more flexible! Depending on the system and the constraints, we can plug 4 F-chips into a single G-chip, or build much more complex, scalable, routed topologies. The G-chip can act as a switch or crossbar in a multidimensional network, a switch that also controls memory and the HDD, so it behaves like a multiported memory controller combined with a router.
Note that the above configurations are "good" because they preserve the overall memory bandwidth across the whole system. All the CPUs can talk at the same time without being slowed down (too much) by conflicts.
The G-chip can implement the routing algorithm described above inside its crossbar. The Flash EEPROM is used to boot the F-chip. This additional feature, as well as the SCSI interface, could also be reconfigured to serve as control ports, bus interfaces (multicast with SCSI signaling), or whatever else we can think of. The G-chip becomes a building brick for the whole system, without sacrificing performance.
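
To give an idea of what "multiported memory controller and router" could mean in practice, here is a rough sketch of one possible destination decode inside the G-chip's crossbar. The address windows, port names and port count are only assumptions for the example, not part of any specification:

    /* Hypothetical destination decode: each incoming F-bus request carries
     * an address, and the crossbar picks the output port from an address
     * map (which could be loaded at boot time, e.g. from the Flash EEPROM). */
    #include <stdio.h>
    #include <stdint.h>

    enum g_port {
        PORT_LOCAL_SDRAM,   /* the large SDRAM bank behind this G-chip  */
        PORT_BOOT_FLASH,    /* Flash EEPROM used to boot the F-chip     */
        PORT_SCSI,          /* local SCSI (or IDE) interface            */
        PORT_FBUS_0,        /* F-bus ports, towards F-chips or G-chips  */
        PORT_FBUS_1,
        PORT_FBUS_2,
        PORT_FBUS_3
    };

    struct route {          /* one address window mapped to one port    */
        uint64_t    base, size;
        enum g_port port;
    };

    static const struct route route_table[] = {
        { 0x000000000ull, 1ull << 30, PORT_LOCAL_SDRAM },
        { 0x040000000ull, 1ull << 20, PORT_BOOT_FLASH  },
        { 0x041000000ull, 1ull << 20, PORT_SCSI        },
        { 0x100000000ull, 1ull << 32, PORT_FBUS_1      },  /* another board */
        { 0x200000000ull, 1ull << 32, PORT_FBUS_2      },
        { 0x300000000ull, 1ull << 32, PORT_FBUS_3      },
    };

    static enum g_port g_decode(uint64_t addr)
    {
        for (unsigned i = 0; i < sizeof route_table / sizeof route_table[0]; i++)
            if (addr - route_table[i].base < route_table[i].size)
                return route_table[i].port;
        return PORT_FBUS_0;     /* default: forward to the "upstream" port */
    }

    int main(void)
    {
        printf("0x000001000 -> port %d\n", g_decode(0x000001000ull));
        printf("0x140000000 -> port %d\n", g_decode(0x140000000ull));
        return 0;
    }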
Now we can think more about the F-bus protocol and format. It should be as simple and straightforward as possible, and use as few control signals as possible.
That's all that comes to my mind on this subject at the moment. You get the principle: when you're the bus master, you send the address, wait for the acknowledge, then sample the data on each rising edge of the ack signal. Double clocking (on both the falling and rising edges) is not yet planned.
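
For illustration only, here is a sketch of the master side of such a read transaction. The signal names, the burst length and the toy "slave" model are hypothetical; only the address / acknowledge / sample-on-rising-ack sequence comes from the description above:

    /* Master side of an F-bus read: drive the address, then sample one
     * data word on each rising edge of the acknowledge signal. */
    #include <stdio.h>
    #include <stdint.h>

    #define BURST_LEN 4               /* assumed number of words per transaction */

    /* --- toy slave model, just so the example runs --------------------- */
    static uint64_t cur_addr;
    static int      tick;

    static void     fbus_drive_addr(uint64_t a) { cur_addr = a; tick = 0; }
    static int      fbus_ack(void)  { return (tick++ & 1); }  /* ack toggles   */
    static uint32_t fbus_data(void) { return (uint32_t)cur_addr + tick / 2; }

    /* --- master side ---------------------------------------------------- */
    static void wait_ack_rising(void)
    {
        while (fbus_ack())  ;         /* wait for ack to be low...           */
        while (!fbus_ack()) ;         /* ...then high: that's a rising edge  */
    }

    static void fbus_master_read(uint64_t addr, uint32_t buf[BURST_LEN])
    {
        fbus_drive_addr(addr);                 /* 1. master sends the address */
        for (int i = 0; i < BURST_LEN; i++) {  /* 2. one word per rising ack  */
            wait_ack_rising();
            buf[i] = fbus_data();
        }
    }

    int main(void)
    {
        uint32_t buf[BURST_LEN];
        fbus_master_read(0x1000, buf);
        for (int i = 0; i < BURST_LEN; i++)
            printf("word %d = 0x%08x\n", i, buf[i]);
        return 0;
    }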
All this is preliminary. This case study is based on certain assumptions, such as the availability of a working F-bus, of a certain minimal technology and of the corresponding budget, and the target system is not the usual PC that sits on your desk. It is oriented towards workstations or servers that scale from 1 to 8 CPUs.
We note anyway that the availability of the G-chip also enables the design of lower-cost systems that range from 1 to 4 CPUs with less parallelism (1 HDD and 1 global RAM), simply by adding an F-CPU module with its private memory. The "low-end" market will, as expected one year ago, comprise single-F-chip boards (each with its private memory) and G-chip boards, offering modularity and scalability.
The design of the G-chip is not a technological challenge, but it is an important factor for the success of the architecture. This means that it must be designed as carefully as the F-CPU itself. We can expect that almost as many G-chips as F-chips will be used; they can therefore be fabricated on the same wafer and use more or less the same process and packaging, thus reducing production costs.
Since this kind of system is not going to exist in the coming months or years, the control of the system (cluster or node) has not yet been investigated. Fault-tolerant computing is not considered either. The internal routing capabilities of the G-chip can help remap one F-chip's address to another (as in some other parallel computers), but no hot-swap mechanism is imagined yet. More seriously, interconnection between CPU clusters is not yet planned: as noted before, the overall bandwidth is a critical factor for computation-intensive tasks as well as for more "trivial" operations, and a bad preparation or a wrong orientation would virtually "isolate" a cluster. Fortunately, the F-bus helps homogenise the architecture, so any kind of chip can talk to any other.
We can also imagine that, when technology allows enough pads on the chips, the F- and G-chips could be merged. In the far future, this would make an F-CPU look like a sort of "transputer". The other way is to enlarge the data buses and keep the organisation proposed in this draft, simply doubling each bus width.