Part 1 :
1.1 Description of the F-CPU project
1.2 Frequently Asked Questions
1.2.1 Philosophy
1.2.2 Tools
1.2.3 Architecture
1.2.4 Performance
1.2.5 Compatibility
1.2.6 Cost/Price/Purchasing
1.3 The genesis of the F-CPU Project
1 History
2 The Freedom GNU/GPL'ed architecture
3 Developing the Freedom architecture : issues and challenges
4 Tools
5 Conclusion
6 Appendix A Ideas for a GPL'ed 64-bit high performance processor design
7 Appendix B Freedom-F1 die area / cost / packaging physical characteristics
8 Appendix C Legal issues / financial issues
1.4 A bit of F-CPU history
1.4.1 M2M
1.4.2 TTA
1.4.3 Traditional RISC
1.5 The design constraints
The F-CPU group is one of the many projects that try to follow the example set
by the Linux project, which proved that a non-commercial product can surpass
expensive and proprietary products. The F-CPU group tries to apply this "recipe"
to the hardware and computer design world, starting with the "holy grail" of
any computer architect : the microprocessor.
This utopian project was only a dream at the beginning but, after two group splits and much effort, we have reached rather stable ground for a truly scalable and clean architecture that does not sacrifice performance. Let's hope that the third attempt is the right one and that a prototype will be created soon.
The F-CPU project can be split into several (approximate and non-exhaustive)
parts or layers that provide compatibility and interoperability
during the life of the project (from hardware to software) :
- F-CPU Peripherals and Interfaces
(bus, chipset, bridges...)
- F-CPU Core Implementations
(individual chips, or revisions) [for example, F1, F2, F3...]
- F-CPU Cores (generations, or families)
[for example, FC0, FC1, etc]
- F-CPU Instruction Set and User-visible resources
- F-CPU Application Binary Interface
- Operating System (aimed at Linux-likes)
- Drivers
- Applications
Each layer depends directly or indirectly on the others. The most important part is the Instruction Set Architecture, because it cannot be changed at will : it is not a material part that can evolve when the technology/cost ratio changes. On the other hand, the hardware must provide binary compatibility but its constraints are less strict. That is why the instructions should run on a wide range of processor microarchitectures, or "CPU cores", that can be changed or swapped when the budget changes.
All core families will be binary compatible with each other : they will execute the same applications, run under the same operating systems and deliver the same results, with different instruction scheduling rules, special registers, prices and performance. Each core family can be implemented in several "flavours", such as a different number of instructions executed per cycle, different memory sizes or different word sizes, and the software should directly benefit from these features without (many) changes.
This document is a study and working basis for the definition of the F-CPU architecture, aimed at prototyping and the first commercial chip generation (codenamed "F1"). It explains the architectural and technical background that led to the current state of the "FC0" core, so as to reduce the amount of basic discussion on the mailing list and introduce newcomers (or those who come back from vacation) to the most recent concepts that have been discussed.
This manual describes the F-CPU family through its first implementation and core. The FC0 core is not exclusive to the F-CPU project, which can and will use other cores as the project grows and mutates. Conversely, the FC0 core can be used for almost any similar RISC architecture with some adaptations.
The document will (hopefully) evolve rapidly and incorporate more and more advanced discussions and techniques. This is not a definitive manual : it is open to any modification that the mailing list agrees to make. It is not exhaustive either, and may lag behind as the contributors' free time fluctuates. You are strongly encouraged to contribute to the discussion, because nobody will do it for you.
Some development rules :
Last modified : 31/05/99
modified by Whygee, 9/11/1999
Q1 : What does the F in F-CPU stand for ?
A : It stands for Freedom, which is the original name of the architecture, or Free, in the GNU/GPL sense.
The F does not stand for free in a monetary sense. You
will have to pay for the F1 chip, just as you have to pay
nowadays for a copy of a GNU/Linux distribution on CD-ROMs. Of
course, you're free to take the design and masks to your
favorite fab and have a few batches manufactured for your own
use.
Q2 : Why not call it an O-CPU (where O stands for Open) ?
A : There are some fundamental philosophical differences between the Open Source movement and the original Free Software movement. We abide by the latter, hence the F.
The fact that a piece of code is labeled Open Source doesn't mean that your freedom to use it, understand it and improve upon it is guaranteed. Further discussion of these matters can be found here.
A licence similar to the GPL (the GNU General Public License from the
Free Software Foundation) is being drafted. In the absence of a definitive
licence adapted to "hardware intellectual property", you can
read the GPL and replace the word "software" with the words "intellectual property".
Specifically, there are at least three levels of freedom that must be preserved at
any cost:
- Freedom to use the intellectual property : no restriction must exist on using the IP
of the F-CPU project. This means no fee to access the data, and ALL the information
necessary to recreate a chip must be available.
- Freedom to reverse-engineer, understand and modify the Intellectual Property at will.
- Freedom to redistribute the IP.
This is NOT public domain. The F-CPU group owns the IP that it produces. It chooses to
make it freely available to anybody by any means.
Q1 : Which EDA tools will you use ?
A : There has been a lot of debate on this subject. It's mainly a war between Verilog and VHDL. We'll probably use a combination of both.
We will first begin with software architecture simulators written in C(++). We could also use some new "free" EDA tools that are appearing. We'll have to use commercial products at one point or another because the chip makers use proprietary software.
Q1 : What's that memory-to-memory architecture I heard about ? Or this TTA engine ? Why not a register-to-register architecture like all other RISC processors ?
A : M2M was an idea that was discussed for the F-CPU
at its beginning. It had several supposed advantages over
register-to-register architectures, such as very low context-switching
latency (no registers to save and restore).
Or so it was thought : the SRB mechanism solves this problem
for a classical RISC architecture.
TTA is another architecture that was explored before the current design started.
Q2 : You're thinking about an external FPU ?
A : Maybe. More likely no, because of bandwidth and pin count problems.
Q3 : Why don't you support SMP ?
A : SMP usually refers to Intel's proprietary implementation of Symmetric Multi-Processing.
We'll probably try. If not in the F1, then in the F2 :).
The "F1" will be a "proof of concept" chip. It will not even support IEEE floating-point numbers, so we can't support a classical SMP system from the beginning. Anyway, memory coherency will be enforced on the F1 with an OS-based paging mechanism where only one chip at a time in a system can cache a given page : this avoids bus snooping and wasted bandwidth.
Q1 : What can we expect in terms of performance from the F1 CPU ?
A : A Merced-killer :-). No, seriously, we hope to get some serious performance.
We think we can achieve good performance because we start from scratch (the x86 is slower because it has to stay compatible with older models). We intend to have gcc/egcs as the main compiler for the F-CPU, and to port Linux too.
Linux and GCC are not in themselves guarantees of performance. For example, GCC doesn't handle SIMD data. We will certainly create a compiler that is better adapted to the F-CPU, and GCC will be used as a "bootstrap" at the beginning.
The FC0 core family aims to achieve the best MOPS/MIPS ratio possible, around 1 (and maybe a bit more). The superpipeline guarantees that the best clock frequency is reached for any silicon technology. The memory bandwidth can be virtually increased with different hint strategies. So we can predict that a 100MHz chip decoding one instruction per cycle can easily achieve 100 million operations per second. This is not bad at all, because it can be achieved with an "old" (cheap) silicon technology that couldn't reach 100 MOPS with an x86 architecture. Add to that the unconstrained SIMD data width (a 64-bit register processing eight 8-bit values per instruction, for example, multiplies the peak accordingly), and you get a picture of the peak MOPS it can reach.
Q1 : Will the F-CPU be compatible with x86 ?
A : No.
There will be NO binary compatibility between the F-CPU and x86 processors.
It should, however, run Windows emulators that include x86 CPU emulators, such as Twin, as well as Windows itself under whole-PC emulators such as Bochs. In either case you will need to run another operating system, such as GNU/Linux, and emulation will likely be fairly slow.
And what would be the point of using Windblows when you
can run Linux/FreeBSD instead ?
Q2 : Will I be able to plug the F-CPU in a standard Socket 7, Super 7, Slot 1, Slot 2, Slot A motherboard ?
A : It's an ongoing debate.
Chances are that no early version of the F-CPU will
be available for Socket 7 or x86 motherboards.
Reason 1 : the BIOS would have to be rewritten, the chipsets would have to be
analysed, and there are way too many chipsets/motherboards around.
Reason 2 : socket/pins/bandwidth : the x86 chips are really
"memory bound", the bandwidth is too low, some pins are not useful
for a non-x86 chip, and supporting all the functions of the x86 interface
would make the chip (its design and debugging) too complex.
Reason 3 : we don't want to pay the fees for the use of proprietary slots.
ALPHA- or MIPS-like slots will probably be supported, and we might include
an EV-4 interface in the F-CPU standard.
Q3 : What OS kernels will the F-CPU support?
A : Linux will be ported first. Other ports may follow. The Linux port will be developed simultaneously with the F-CPU itself.
But before that, we need working software development tools
that simulate the architecture and create binaries, so we
must first define the F-CPU...
Q4 : What programs will I be able to run on the F-CPU ?
A : We will port gcc/egcs to the F architecture. Basically, the F-CPU will run all the software available for a standard GNU/Linux distribution.
GCC is not perfectly adapted to fifth-generation CPUs. We will probably adapt it for the F-CPU, but writing a GCC backend will be enough to compile Linux and other software.
Q1 : Will I be able to buy a F-CPU someday ?
A : We hope so.
That's the whole point of the project, but be patient and take part
in the discussions !
Q2 : How much will the F-CPU cost ?
A : We don't know. It depends on how many are made.
There was an early slightly optimistic estimate that an F-CPU would cost approximately US$100, if 10000 were made.
This also depends on a lot of factors such as the desired performance, the size of the cache memory, the number of pins and, most of all, the possibility of combining all these factors in the available technology.
A lot of things have happened since the following document was written. The motivation
has not changed, though, and the method is still the same. The original authors are unreachable
now, but we have kept working more and more seriously on the project. At the time of writing,
several questions asked in the following text have been answered, but now that the group
is structuring itself, the other questions become more important because we really
have to face them : it is not a utopia anymore, the fiction is slowly becoming reality.
The first generation was a "memory to memory"
(M2M) architecture that disappeared with the original F-CPU team members. It
was believed that context switches consumed much time, so memory regions were mapped
to the register set, so as to switch register sets by simply changing a base register. I have
not tracked down the reasons why this was abandoned, since I joined the group later.
Anyway, they launched the F-CPU project, with the goals that we now know, and the dream
of creating a "Merced Killer". Actually, I believe that we should compete with the ALPHA
directly ;-)
The second generation was a "Transfer Triggered
Architecture" (TTA) where the computations are triggered by transfers between
the different execution units. The instructions mainly consist of the source and
destination "register" numbers, which can also be the input or output ports of the
execution units. As soon as the needed input ports are written to, the operation is performed
and the result is readable on the output port. This architecture was promoted by
the anonymous AlphaRISC, now known as AlphaGhost. He did a lot of work on it, but
he left the list and the group lost track of this approach without him.
Brian Fuhs explained TTA on the mailing list this way :
TTA stands for Transfer-Triggered Architecture. The basic idea is that you don't tell the CPU what to do with your data, you tell it where to put it. Then, by putting your data in the right places, you magically end up with new data in other places that consists of some operation performed on your old data. Whereas in a traditional OTA (operation-triggered architecture) machine, you might say ADD R3, R1, R2, in a TTA you would say MOV R1, add; MOV R2, add; MOV add, R3. The focus of the instruction set (if you can call it that, since a TTA would only have one instruction: MOV) is on the data itself, as opposed to the operations you are performing on that data. You specify only addresses, then map addresses to functions like ADD or DIV.
That's the basic idea. I should start by specifying that I'm focusing on general processing here, and temporarily ignoring things like interrupts. It is possible to handle real-world cases like that, since people have already done so; for now, I'm more interested in the theory. Any CPU pipeline can be broken down into three basic stages: fetch and decode, execute, and store. Garbage in, garbage processing, garbage out. :). With OTAs this is all done in hardware. You say ADD R3, R1, R2, and the hardware does the rest. It handles internal communication devices to get data from R1 and R2 to the input of the adder, lets the adder do its thing, then gets the data from the output of the adder back into the register file, in R3. In most modern architectures, it checks for hazards, forwards data so the rest of the pipeline can use it earlier, and might even do more complicated things like reordering instructions. The software only knows 32 bits; the hardware does everything else.
The IF/ID stage of a TTA is very different. All of the burden is placed on software. The instruction is not specified as ADD (something), but as a series of SRC,DEST address pairs. All the hardware needs to do is control internal busses to get the data where it is supposed to go. All verification of hazards, optimal instruction order, etc. should be done by the compiler. The key here is that a TTA, to achieve IPC measures comparable to an OTA, must be VLIW: you MUST be able to specify multiple moves in a single cycle, so that you can move all of your source data to the appropriate places, and still move the results back to your register file (or wherever you want them to go). In summary, to do an ADD R3, R1, R2, the hardware will do the following:
  TTA                                        OTA
  ---------------------------------------------------------------------
  MOV R1, add                                ADD R3, R1, R2
    Move R1->adder                             Check for hazards
  MOV R2, add                                  Check for available adder
    Move R2->adder                             Select internal busses and move data
          (adder now does its thing in both cases)
  MOV add, R3                                  Check for hazards
    Move adder->R3                             Schedule instruction for retire
                                               Select internal busses and move data
                                               Retire instruction
The compiler, of course, becomes much more complicated, because it has to do all of the scheduling work, at compile time. But the hardware in a TTA doesn't need to worry about much of anything... About all it does in the simple cases is fetch instructions and generate control signals for all of the busses.
Execution is identical between TTA and OTA. Crunch the bits. Period.
Instruction completion is again simplified in a TTA. If you want correct behavior, make sure your compiler will generate the right sequence of moves. This is compared to an OTA, where you at least have to figure out what write ports to use, etc.
Basically, a TTA and an OTA are functionally identical. The main differences are that a TTA pretty much has to be VLIW, and requires more of the compiler. However, if the "smart compiler and dumb machine" philosophy is really the way to go, TTA should rule. It exposes more of the pipeline to software, reducing the hardware needed and giving the compiler more room to optimize. Of course, there are issues, like code bloat and constant generation, but these can be covered later. The basic ideas have been covered here (albeit in a somewhat rambling fashion... I had this email all composed in my head, and had some very clear explanations, right up until I sat down and started typing). For more information see http://www.cs.uregina.ca/~bayko/design/design.html and http://cardit.et.tudelft.nl/MOVE . These two have a lot more information on the details of TTA; I'm still hopeful that we can pull one of these off, and I think it would be good for performance, generality, cost, and simplicity. Plus, it's revolutionary enough that it might turn some heads - and that might get us more of a user (and developer) base, and make the project much more stable.
Send me questions, I know there will be plenty...
Brian
To understand the TTA concept further : the difference is in the philosophy, it's as if you had instructions to code a dataflow machine on the fly. Notice also that fewer registers are needed : registers are required to store the temporary results of operations between instructions of a code sequence. Here, the results are stored directly by the units, so less "temporary storage" is needed and register pressure is lower.
To picture this difference, think about a data dependency graph : in OTA, an instruction is a node, while in TTA the mov instruction is a branch (an edge of the graph). Once this is understood, there's not much work to do on an existing (yet simple) compiler to make it generate TTA instructions.
Let's examine S = (a+b) * (c-d), for example, where a, b, c and d are known "ports" (registers or TTA addresses).
  a   b     c   d
  1\ /2     3\ /4
    +         -
     5\      /6
       \    /
        \  /
         \/
         *
         |7
         S
  ADD r5,a,b
  SUB r6,c,d
  MUL r7,r5,r6
(there are other nasty ways to code this).
In TTA there is one "port" in each unit for each incoming branch. This means that ADD, having two operands, has two ports. There is one result port, which uses the address of one of the ports, but it is read rather than written. Another detail is that this read port can be static : it holds the result until another operation is triggered. We can code :
  mv ADD1,a
  mv SUB1,c
  mv ADD2,b    (this triggers the a+b operation)
  mv SUB2,d    (this triggers the c-d operation)
  mv MUL1,ADD
  mv MUL2,SUB  (this triggers the * operation)
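To make the trigger-on-write behaviour more concrete, here is a minimal C sketch of how such a unit could be simulated in software, in the spirit of the C(++) architecture simulators mentioned earlier. It is only an illustration : the unit_t structure, the mv() helper and the "trigger when both input ports are written" rule are assumptions of this sketch, not a specification of any F-CPU or TTA implementation.

    /* Minimal C sketch of TTA-style "transfer triggered" execution.
     * The names (unit_t, mv, port indices) are invented for this
     * illustration only; a real simulator would decode mv instructions
     * and map port addresses to units. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        int64_t in[2];       /* input ports of the unit                */
        int     written[2];  /* which input ports have been written    */
        int64_t out;         /* result port: keeps its value ("static")
                                until the unit is triggered again      */
        int64_t (*op)(int64_t, int64_t);
    } unit_t;

    static int64_t do_add(int64_t a, int64_t b) { return a + b; }
    static int64_t do_sub(int64_t a, int64_t b) { return a - b; }
    static int64_t do_mul(int64_t a, int64_t b) { return a * b; }

    /* Write to an input port; once both ports have been fed,
     * the operation is triggered and the result is latched on 'out'. */
    static void mv(unit_t *u, int port, int64_t value)
    {
        u->in[port] = value;
        u->written[port] = 1;
        if (u->written[0] && u->written[1]) {
            u->out = u->op(u->in[0], u->in[1]);
            u->written[0] = u->written[1] = 0;
        }
    }

    int main(void)
    {
        unit_t add = { .op = do_add };
        unit_t sub = { .op = do_sub };
        unit_t mul = { .op = do_mul };
        int64_t a = 3, b = 4, c = 10, d = 6;

        mv(&add, 0, a);        /* mv ADD1,a                  */
        mv(&sub, 0, c);        /* mv SUB1,c                  */
        mv(&add, 1, b);        /* mv ADD2,b  -> triggers a+b */
        mv(&sub, 1, d);        /* mv SUB2,d  -> triggers c-d */
        mv(&mul, 0, add.out);  /* mv MUL1,ADD                */
        mv(&mul, 1, sub.out);  /* mv MUL2,SUB -> triggers *  */

        printf("S = %lld\n", (long long)mul.out);  /* (3+4)*(10-6) = 28 */
        return 0;
    }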
TTA is not "better", it's not "worse", it's just completely different, while the problem remains the same. If the instructions are 16 bits wide, the sequence above takes 96 bits (six 16-bit moves), just as the OTA example does (three 32-bit instructions). In some cases it can do better, as was shown long ago on the list. TTA has some interesting properties but, unfortunately, in the very near future it is not likely that a TTA will enter a big computer the way RISC or CISC do. A TTA core can be as efficient as the ARM core, for example, and it suits this scale of die size well, but too few studies have been made, compared to the existing studies on OTA. Because the solutions for scaling it up are not (yet) known, this led to the discussions that shook the mailing list around December 1998 : the problem of where to map the registers, how the ports would be remapped on the fly, and so on. When additional instructions are needed, the whole balance of the CPU is jeopardized, and evolution is more constrained than for RISC or OTA in general.
The physical problem of the busses has also been raised : if we have, say, 8 buses of 64 bits, this makes 512 wires, which take around one millimetre of width with a 0.5 µm process (roughly 2 µm of routing pitch per wire). Of course, we can use a crossbar instead.
As discussed a few times long ago, because of its scalability problems (the assignment of the ports and its flexibility), TTA is not the perfect choice for a long-lasting CPU family, even though its performance/complexity ratio is good. So it is possible that the F-CPU team will one day put a RISC-to-TTA translator in front of a TTA core, which would avoid most of the scalability problems. This would be called the "FC1" (FC0 being the RISC core). Of course, time will show how the TTA ghosts of the F-CPU group evolve.
But TTA's problem is probably that it is too specialized, whereas OTA can change its core and still use the same binaries. This is one of the points that "killed" the previous F-CPU attempt : each TTA implementation could not be completely compatible with another, because of the instruction format, the assignment of the "ports" and other similar details. The notion of "instruction" is bound to the notion of "register".
I am not trying to prove the advantage of one technique over another; I am trying to show the difference in points of view that ultimately address the same problem. Scalability, which is necessary for such a project, is more important than we thought, and the group finally showed interest in a more classical technology.
The third generation arose from the mailing list
members, who naturally studied a basic RISC architecture, like the first-generation
MIPS processors, the DLX described by Patterson & Hennessy, the MMIX, the MISC CPUs,
and other similar, simple projects. From a simple RISC project, the design grew
in complexity and gained independence from other existing architectures, mainly because
of the lessons learnt from their history and the specific needs of the group, which
led to adapted choices and particular characteristics. This is what we will discuss
in the next parts of this document.
The F-CPU group is rather heterogeneous but each member has the same hope that
the project will come true, because we are convinced that it is not impossible
and therefore feasible. Let's remember the Freedom CPU Project goal :
"To develop and make freely available an architecture, and all other
intellectual property necessary to fabricate one or more
implementations of that architecture, with the following priorities,
in decreasing order of importance:
1. Versatility and usefulness in as wide a range of applications as
possible
2. Performance, emphasizing user-level parallelism and derived
through intelligent architecture rather than advanced silicon process
3. Architecture lifespan and forward compatibility
4. Cost, including monetary and thermal considerations"
We could add as goal #5 : be successful !
This text sums up a lot of aspects of the project : it is "free intellectual property", meaning that anybody can make money with it without worrying, as long as the product complies with the general rules and standards and all the characteristics remain freely available (similarly to the GNU General Public License). Just like the Linux project, the team members hope that the free availability of this intellectual property will benefit everybody by reducing the cost of the products (since most of the intellectual work is already done) and by providing an open and flexible standard that anyone can influence at will without signing an NDA. It is also a testbench for new techniques and the "first CPU" of a lot of hobbyists, who can build it easily at home. Of course, the other expected result is that the F-CPU will be used in everybody's home computer as well as in more specialized markets (embedded/real time, portable/wearable computers, parallel machines for scientific number crunching...).
In this situation, it is clear that one chip does not fit all needs. Economic constraints also influence the technological decisions, and not everybody can access the most advanced silicon fabrication units. The reality of the F-CPU "for and by everybody" lies more in the realm of reconfigurable FPGAs, low-cost sea-of-gates and ASICs fabricated in low volumes. Even though the final goal is to use full-custom technologies, there is a strong limitation for prototyping and low-volume production. The complexity is limited for the early generations : for FC0, the estimated transistor count of the first chips would be about one million, including some cache memory. This is rather tight compared to current CPUs, but it is huge if one remembers the ARM core or the early RISC CPUs.
The "Intellectual Property" will be available as VHDL or VERILOG files that anyone can read, compile and modify. A schematic view is also often needed to understand the function of a circuit at the first sight. The processor will therefore exist more in the form of a software description than a hardware circuit. This will help the processor families to evolve faster and better than other commercial ones, and this polymorphism will garantee that anyone finds the core needed in any case. And since the development software will be common to all the chips, freely available through the GPL, porting any software to any platform will be eased to the maximum.
The interoperability of the software on any member of the family is a very strong constraint, and probably the most important design rule of the project : "NO RESOURCE MUST BE BOUND". This led to the creation of a CPU with an "undetermined" data width : an F-CPU chip can implement a characteristic data width of any size above 32 bits. Portable software will respect some simple rules so that it runs as fast as the chip allows, independently of algorithmic considerations. In fact, the speed of a given CPU is determined by the economic constraints, and the designer will build a CPU as wide as the budget and the technology allow. This way, there is no other "roadmap" than the users' needs, since they are their own funders. The project is not bound by technology and is flexible enough to last... as long as we want.
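As a purely generic illustration of such a rule (this is not an F-CPU-specific interface, and the function name and coding style are assumptions of this sketch), portable C source can avoid hard-coding the data width and instead derive the word size from the compilation target, so the same program widens automatically on a wider implementation :

    /* Width-agnostic memory clear: a generic sketch of the "no bound
     * resource" coding style, not an F-CPU-specific interface.
     * Assumes 'p' is aligned on a machine-word boundary. */
    #include <stddef.h>
    #include <stdint.h>

    void clear_words(void *p, size_t n)
    {
        uintptr_t *w = p;                  /* natural word of the target */
        size_t nwords = n / sizeof *w;     /* whole words to clear       */
        size_t i;

        for (i = 0; i < nwords; i++)       /* main loop scales with the  */
            w[i] = 0;                      /* implemented word width     */

        for (i = nwords * sizeof *w; i < n; i++)
            ((unsigned char *)p)[i] = 0;   /* remaining tail bytes       */
    }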