F-CPU Design Team
FCPU MANUAL REV. 0.1

 

       Part 1 :

 

 

 

The F-CPU Project,
description and philosophy

 

 

 

 

 


 
Summary


       1.1 Description of the F-CPU project
       1.2 Frequently Asked Questions
             1.2.1 Philosophy
             1.2.2 Tools
             1.2.3 Architecture
             1.2.4 Performance
             1.2.5 Compatibility
             1.2.6 Cost/Price/Purchasing
       1.3 The genesis of the F-CPU Project
             1 History
             2 The Freedom GNU/GPL'ed architecture
             3 Developing the Freedom architecture : issues and challenges
             4 Tools
             5 Conclusion
             6 Appendix A Ideas for a GPL'ed 64-bit high performance processor design
             7 Appendix B Freedom-F1 die area / cost / packaging physical characteristics
             8 Appendix C Legal issues / financial issues
       1.4 A bit of F-CPU history
             1.4.1 M2M
             1.4.2 TTA
             1.4.3 Traditional RISC
       1.5 The design constraints

 


 

1.1 Description of the F-CPU project :

 
       The F-CPU group is one of the many projects that try to follow the example set by the Linux project, which proved that a non-commercial product can surpass expensive and proprietary products. The F-CPU group tries to apply this "recipe" to the hardware and computer design world, starting with the "holy grail" of any computer architect : the microprocessor.

       This utopian project was only a dream at the beginning, but after two group splits and much effort, we have reached rather stable ground for a really scalable and clean architecture that does not sacrifice performance. Let's hope that the third attempt is the right one and that a prototype will be created soon.

 

       The F-CPU project can be split into several (approximate, non-exhaustive) parts or layers that provide compatibility and interoperability during the life of the project (from hardware to software) :
      - F-CPU Peripherals and Interfaces (bus, chipset, bridges...)
      - F-CPU Core Implementations (individual chips, or revisions) [for example, F1, F2, F3...]
      - F-CPU Cores (generations, or families) [for example, FC0, FC1, etc]
      - F-CPU Instruction Set and User-visible resources
      - F-CPU Application Binary Interface
      - Operating System (aimed at Linux-likes)
      - Drivers
      - Applications

       Every layer depends directly or indirectly on the others. The most important part is the Instruction Set Architecture, because it can't be changed at will : it is not a material part that can evolve when the technology/cost ratio changes. The hardware, on the other hand, must provide binary compatibility, but its constraints are less strict. That is why the instructions should run on a wide range of processor microarchitectures, or "CPU cores", that can be changed or swapped when the budget changes.

       All core families will be binary compatible with one another : they will execute the same applications, run under the same operating systems and deliver the same results, with different instruction scheduling rules, special registers, prices and performance. Each core family can be implemented in several "flavours", with for example a different number of instructions executed per cycle, different memory sizes or different word sizes, and the software should directly benefit from these features without (many) changes.

       This document is a study and working basis for the definition of the F-CPU architecture, aimed at the prototyping and first commercial chip generation (codenamed "F1"). It explains the architectural and technical backgrounds that led to the current state of the "FC0" core, so as to reduce the amount of basic discussions on the mailing list and introduce the newcomers (or those who come back from vacations) to the most recent concepts that have been discussed.

       This manual describes the F-CPU family through its first implementation and core. The FC0 core is not exclusive to the F-CPU project, which can and will use other cores as the project grows and mutates. The FC0 core can be used for almost any similar RISC architecture with some adaptations.

       The document will (hopefully) evolve rapidly and incorporate more and more advanced discussions and techniques. This is not a definitive manual : it is open to any modification that the mailing list agrees to make. It is not exhaustive either, and may lag behind as the contributors' free time fluctuates. You are strongly encouraged to contribute to the discussion, because nobody will do it for you.

 
Some development rules :

 


 

1.2 Frequently Asked Questions :

 
Last modified : 31/05/99
modified by Whygee, 9/11/1999

1.2.1 Philosophy :

Q1 : What does the F in F-CPU stand for ?

       A : It stands for Freedom, which is the original name of the architecture, or Free, in the GNU/GPL sense.

       The F does not stand for free in a monetary sense. You will have to pay for the F1 chip, just as you have to pay nowadays for a copy of a GNU/Linux distribution on CD-ROMs. Of course, you're free to take the design and masks to your favorite fab and have a few batches manufactured for your own use.
 

Q2 : Why not call it an O-CPU (where O stands for Open) ?

       A : There are some fundamental philosophical differences between the Open Source movement and the original Free Software movement. We abide by the latter, hence the F.

       The fact that a piece of code is labeled Open Source doesn't mean that your freedom to use it, understand it and improve upon it is guaranteed. Further discussion of these matters can be found at www.gnu.org.

       A licence similar to the GPL (the GNU General Public License, from the Free Software Foundation) is being drafted. Yet, in the absence of a definitive licence adapted to "hardware Intellectual Property", you can read the GPL and replace the word "software" with the words "Intellectual Property". Specifically, there are at least three levels of freedom that must be preserved at any cost :
       - Freedom to use the Intellectual Property : no restriction must exist on the use of the F-CPU project's IP. This means no fee to access the data, and ALL the information necessary to recreate a chip must be available.
       - Freedom to reverse-engineer, understand and modify the Intellectual Property at will.
       - Freedom to redistribute the IP.
       This is NOT public domain. The F-CPU group owns the IP that it produces. It chooses to make it freely available to anybody by any means.

 
1.2.2 Tools :

Q1 : Which EDA tools will you use ?

       A : There has been a lot of debate on this subject. It's mainly a war between Verilog and VHDL. We'll probably use a combination of both.

       We will first begin with software architecture simulators written in C(++). We could also use some new "free" EDA tools that are appearing. We'll have to use commercial products at one point or another because the chip makers use proprietary software.

 
1.2.3 Architecture :

Q1 : What's that memory-to-memory architecture I heard about ? Or this TTA engine ? Why not a register-to-register architecture like all other RISC processors ?

       A : M2M was an idea that was discussed for the F-CPU at its beginning. It was believed to have several advantages over register-to-register architectures, like very low context-switching latency (no registers to save and restore).
       That's what they thought, at least : the SRB mechanism solves this problem for a classical RISC architecture.

       TTA is another architecture that was explored before the current design started.
 

Q2 : You're thinking about an external FPU ?

       A : Maybe.
       Actually, no : bandwidth and pin count problems rule it out.
 

Q3 : Why don't you support SMP ?

       A : "SMP" usually refers to Intel's proprietary implementation of Symmetric Multi-Processing.

       We'll probably try. If not in F1, in F2 :).

       The "F1" will be like a "proof of concept" chip. It will not even support IEEE floating point numbers, so we can't support a classical SMT system from the beginning. Anyway, memory coherency will be enforced on the F1 with an OS-based paging mechanism where only one chip at a time in a system can cache a page : this avoids the bus snoops and the waste of bandwidth.

 
1.2.4 Performance :

Q1 : What can we expect in terms of performance from the F1 CPU ?

       A : A Merced-killer. :-) No, seriously, we hope to get some serious performance.

       We think we can achieve good performance because we start from scratch (the x86 is slower because it has to stay compatible with older models). We intend to have gcc/egcs as the main compiler for the F-CPU, and to port Linux too.

       Linux and GCC are not in themselves the best guarantees of performance. For example, GCC doesn't handle SIMD data. We will certainly create a compiler that is better adapted to the F-CPU, and GCC will be used as a "bootstrap" at the beginning.

       The FC0 core family aims to achieve the best MOPS/MIPS ratio possible, around 1 (and maybe a bit more). The superpipeline guarantees that the best clock frequency is reached for any silicon technology, and the memory bandwidth can be virtually increased with different hint strategies. So we can predict that a 100MHz chip that decodes one instruction per cycle can easily achieve 100 million operations per second. That is not bad at all, because it can be done with an "old" (cheap) silicon technology that could not reach 100 MOPS with an x86 architecture. Add to that the unconstrained SIMD data width, and you get a picture of the peak MOPS it can reach.
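
       As a back-of-the-envelope illustration, here is the arithmetic in C, using only the figures quoted above ; the SIMD lane count is a free parameter (8 lanes is just an example), since the F-CPU does not fix the data width :

#include <stdio.h>

int main(void)
{
    double clock_hz   = 100e6;  /* 100MHz : "old", cheap silicon      */
    double insn_cycle = 1.0;    /* one instruction decoded per cycle  */
    int    simd_lanes = 8;      /* e.g. 8 x 8-bit operations per word */

    double peak_mips = clock_hz * insn_cycle / 1e6;      /* 100 MIPS  */
    double peak_mops = peak_mips * simd_lanes;           /* with SIMD */

    printf("peak : %.0f MIPS, up to %.0f MOPS with SIMD\n",
           peak_mips, peak_mops);
    return 0;
}

       This prints "peak : 100 MIPS, up to 800 MOPS with SIMD" ; the sustained figure depends on the MOPS/MIPS ratio actually reached, which is why FC0 aims to keep it near 1.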

 
1.2.5 Compatibility :

Q1 : Will the F-CPU be compatible with x86 ?

       A : No.

       There will be NO binary compatibility between the F-CPU and x86 processors.

       It should, however, run Windows emulators that include x86 CPU emulators, such as Twin, as well as Windows itself under whole-PC emulators such as Bochs. In either case you will need to run another operating system, such as GNU/Linux, and emulation will likely be fairly slow.

       And what would be the point of using Windblows when you can run Linux/FreeBSD instead ?
 

Q2 : Will I be able to plug the F-CPU in a standard Socket 7, Super 7, Slot 1, Slot 2, Slot A motherboard ?

       A : It's an ongoing debate.

       Most likely, no early version of the F-CPU will be available for Socket 7 or x86 motherboards.
       Reason 1 : the BIOS would have to be rewritten, the chipsets would have to be analysed, and there are way too many chipsets/motherboards around.
       Reason 2 : socket/pins/bandwidth : the x86 chips are really "memory bound", the bandwidth is too low, some pins are not useful for a non-x86 chip, and supporting all the functions of the x86 interface would make the chip (its design and its debugging) too complex.
       Reason 3 : we don't want to pay the fees for the use of proprietary slots.

       ALPHA- or MIPS-like slots will probably be supported, and we might include an EV-4 interface in the F-CPU standard.
 

Q3 : What OS kernels will the F-CPU support?

       A : Linux will be ported first. Other ports may follow. The port of Linux will be developed simultaneously with the F-CPU development.

       But first we must have working software development tools that simulate the architecture and create binaries, so we must first define the F-CPU...
 

Q4 : What programs will I be able to run on the F-CPU ?

       A : We will port gcc/egcs to the F architecture. Basically the F-CPU will run all the software available for a standard GNU/Linux distribution.

       GCC is not perfectly adapted to fifth-generation CPUs. We will probably adapt it for the F-CPU, but making a GCC backend will be enough to compile Linux and other software at first.

 
1.2.6 Cost/Price/Purchasing :

Q1 : Will I be able to buy a F-CPU someday ?

       A : We hope so.
       That's the whole point of the project, but be patient and take part in the discussions !
 

Q2 : How much will the F-CPU cost ?

       A : We don't know. It depends on how many are made.

       There was an early, slightly optimistic estimate that an F-CPU would cost approximately US$100 if 10000 were made.

       This also depends on a lot of factors, like the desired performance, the size of the cache memory, the number of pins and, most of all, the possibility of combining all these factors in the available technology.

 


 

1.3 The genesis of the F-CPU Project :

 
       A lot of things have happened since the following document was written. The motivation has not changed, though, and the method is still the same. The original authors are unreachable now, but we have kept on working more and more seriously on the project. At the time of writing, several questions asked in the following text have been answered ; now that the group is structuring itself, the remaining questions become more important because we really have to face them : it's not utopia anymore, the fiction is slowly becoming reality.


 
The Freedom CPU Architecture

Andrew D. Balsa w/ many contributions from Rafael Reilova and Richard Gooch.
5 August 1998


A GNU/GPL'ed high-performance 64-bit microprocessor developed in an open, Web-wide collaborative environment.
 

1 History

 
       The idea of a GNU/GPL'ed CPU design sprang up in the middle of some email exchanges between three long-time GNU/Linux users (also Linux kernel developers in their spare time) with diverse backgrounds*.

       We were questioning monopolies and how the dominance of an operating system (including the kernel, the Graphical User Interface and the availability of "killer-applications" as well as documentation) was intimately related to the world-wide dominance of a specific, outdated, awkward and inefficient CPU architecture. I guess we all know what I am referring to.

       We also expressed our faith that GNU/Linux is well on its way to providing the basic foundation for a totally Free software environment (in the GNU/GPL sense; please get a copy of the GNU GPL license if you are reading this, or check www.gnu.org). However, this Freedom is limited, or rather bound, by the proprietary hardware on which it feels most at home : the traditional x86-based PC.

       Finally, we were worried that Intel's attitude of not releasing advance information to the Free Software community about its forthcoming Merced architecture would delay the development of a compatible gcc compiler, of a custom version of the Linux kernel, and finally of the vast universe of Free Software tools. It is vaguely rumoured that Linus Torvalds may have received advance information on Merced by signing an Intel NDA, but this would be an individual exception and wouldn't really fit with the spirit of Free Software. On the whole, even though Merced will certainly be more modern than the x86 architecture, it will be a step backwards in terms of Freedom, since unlike the x86, there will most likely never be a Merced clone chip.

       In the previous days, we had been discussing the various models for Free Software development, their advantages and disadvantages. Putting these two discussions together, I quickly drafted an idea and posted it to Rafael and Richard, warning them that this would be good reading while they were compiling XFree86 or a similarly large package... and then they liked it! Here is this crazy, utopic idea, merged with comments, criticism and further ideas from Rafael and Richard:

 

2 The Freedom GNU/GPL'ed architecture

 
       We started with some questions:

  • Why don't we develop a 64-bit CPU and put the design under the GNU General Public License?
  • Why don't we make the development process of this new CPU completely open and transparent, so that the best brains worldwide can contribute with the best ideas (somehow using the same communication mechanisms traditionally used by the Free Software community)?
  • How can we make the CPU development process entirely democratic and truly open, whereas it is usually surrounded by paranoia and secrets?
  • How can we design something that will improve on *technical grounds* over what will be available in 2000 from the most advanced CPU architecture team ever put together by any corporation (the Merced)?
       There are really two distinct, incredible challenges here:
a) the performance and feasibility of the resulting architecture, and
b) the open development process under a GNU/GPL license, and the intellectual property rights questions raised by this process.

       Tackling a) first (performance and feasibility), we think the Freedom architecture could be more efficient under GNU/Linux than other architectures by making it:

       1) More compatible with the gcc compiler. We have the source code to gcc, but most importantly, we have the gcc developers available to help us figure out what features they would like to see in a CPU architecture. Why gcc? Because it is the cornerstone of the entire body of Free Software. Basically, an architecture that is efficient for gcc will see an increase in efficiency across-the-board, on *all* Free Software programs.

       2) Faster in the Linux kernel. Right now, if we take for example the PC architecture, we notice that the Linux kernel has to "work around" (and some would say "work against") various idiosyncrasies of the x86/PC specifications and hardware. We also have to maintain compatibility with outdated x86 chips. And obviously, there is no possibility of implementing some of the often used Linux kernel functions in silicon. A new design, custom fitted to the Linux kernel code, would vastly improve the performance of any kernel-bound applications.

       Further ideas for a possible architecture and implementation can be found in the appendices (as well as the "economics" of the project). Note that we are calling the architecture "Freedom" (for obvious reasons), and its first implementation "F1". The projected end-user cost of an F1 CPU is around $100. Everything is very utopian, we know. :-)

       However, it also seems to us that at this stage, the real challenges for our project are entirely within b): the development process and the intellectual property issues.

 

3 Developing the Freedom architecture : issues and challenges

 
       The Dilbert cartoon says it all, in fact: our project *is* a whole new paradigm! What we are basically proposing is to bring together the competences and creative powers of thousands of individuals over the Web into the design process of an advanced, Free, GNU/GPL'ed 64-bit CPU architecture. And we don't even know if it's possible!

       We know two things for sure:

  1. In the past and present, corporations like Intel, IBM and Motorola are known for having broken down design teams, so that no closed group could be formed that would be able to recreate the entire design (and eventually quit and form their own companies). Recently, Andy Grove has given a new meaning to the word "paranoia" as a management tool. Our proposed Free, open, transparent, collaborative environment counters this trend.
           It is also in large part related to some new trends in Human Resources management and Organizational theory. In fact, it is very akin to the concept of Virtual Corporations, except that in this case we are rather dealing with a Virtual Non-Profit Organization. In this respect, the Freedom project is also an experiment in Organizational theory, but it's not a gratuitous experiment. Many studies indicate that keeping people in small closed groups, bound by strict NDAs and other legal constraints to public silence, and putting a relatively high amount of pressure on these groups, is _not_ the best method to unleash creative powers. It also sometimes leads to buggy designs...
  2. The development of the Linux kernel by a group of highly talented programmers/system developers is an example that an open, collaborative environment aiming for a GNU/GPL'ed piece of software with a particularly high intellectual/technological value is possible. Moreover, it can be shown that in some areas, the Linux kernel performs _better_ than its commercial counterparts.
       However, this list of certainties is rather short compared to the list of questions generated by our proposal:
  • How will new ideas be selected or discarded for inclusion in the design, amid the inevitable "noise" of Bad Ideas (tm)? Who will be the judge of what's Good and Bad?
  • Also inevitably, mutually exclusive options/features will appear during the course of development. Again, who will decide on the direction to be chosen?
  • Who will own the final design intellectual property rights? Is the "copyleft" applicable in the case of a CPU design? What about the masks for the first silicon?
  • Will the GPL be sufficient as a legal instrument to protect the design? What changes, if any, will have to be made to the GNU/GPL to adapt it to a chip design?
  • If the design process uses commercial EDA and other tools, in what measure do these proprietary items "taint" our GNU/GPL'ed design? Is it possible to separate the GPL part from the commercial/proprietary parts?
  • What about existing patents? Will the project need any? Will it be able to "buy" any, or pay royalties?
  • Contrary to a piece of software, partial implementations of the Freedom design will not be possible. The first implementation that goes to silicon *must* be functional and complete. All "holes" in the design must be plugged before the first mask gets drawn. How do we make volunteers accept such a rigid schedule?

       There are also some questions raised as a consequence of the possible success of the Freedom implementation :

  • There are vast possibilities for a GNU/GPL'ed CPU design in the industrial, medical, aeronautical, automotive and other domains. In fact, a Free, stable, high-performance design offers possibilities never before envisioned by hardware designers in various domains. Is this the beginning of a small revolution in e.g. embedded hardware?
  • Will the design sustain itself over the years as the ideal GNU/Linux processor?
  • Can this experiment in open development have other consequences on the electronics industry? Are we really proposing a new paradigm for CPU development? Can this paradigm be applied to other VLSI designs?

 

4 Tools

 
       We all know the saying: "If the only tool one has is a hammer...". We'll need "groupware" tools for the Freedom project, but the word "groupware" has a bad reputation nowadays. We prefer to use "collaborative work tools". Some of them have only come into existence and widespread use in the last decade; I am obviously talking about the Web itself, and its assortment of communication technologies: email, newsgroups, mailing lists, Web sites, SGML/PDF/HTML documentation and editing/translation software. Much of this infrastructure is/has been used to develop GNU/Linux, and is nowadays based on GNU/Linux, BTW.

       But we'll also need new tools that perhaps don't even exist yet. It is worth mentioning that perhaps one of the greatest steps in this direction is the WELD project, developed at Berkeley. It could well become the cornerstone of the Freedom project, or conversely, the Freedom project can perhaps be thought of as _the_ ideal, perfect test case for the WELD project.

 

5 Conclusion

 
       The conclusion is simple and obvious:

  • if you are a CPU architect/VLSI engineer, or
  • if you have a good idea on CPU design that you have been toying with for some time and would like to test, or
  • if you just like challenging intellectual propositions _and_ brainstorming interaction:

Please join and help us turn this idea into a reality!

--

*: Richard is an Australian astrophysicist preparing his Ph.D. on astronomical visualization; Rafael is a researcher on EDA tools at the University of Cincinnati. I am an ex-Ph.D. student in Management and an ex-firmware engineer, with a special interest in Ethical problems in multi-cultural environments (I was born in Brazil and am presently living in France). None of us has any formal education in CPU architecture. Rafael comes closest, since he is in VLSI design and EDA tools development, and also developed some new code for CPU recognition in the Linux kernel. Richard developed the Pentium Pro MTRR support in the Linux 2.1.x kernels (as well as other novel kernel routines), and is also a hardware developer. I have the honour of having diagnosed the Cyrix 6x86 "Coma" bug and proposed a workaround for it under GNU/Linux (both were at first rejected by Cyrix Corp.). I am also a long-time hardware and firmware developer, and have contributed in various ways to GNU/Linux development (e.g. the Linux Benchmarking HOWTO).

Richard E. Gooch <Richard.Gooch@atnf.csiro.au>

Rafael R. Reilova <rreilova@ececs.uc.edu>

Andrew D. Balsa <andrebalsa@altern.org>

 

6 Appendix A

 
Ideas for a GPL'ed 64-bit high performance processor design

24 July 1998

       This is just a dream, a utopian idea of a free processor design. It's also a list of things I would like to see in a future processor.

  1. This project will need a sponsor if it ever wants to become a reality. Getting first silicon is not going to be free, nor easy.
  2. Choice of a 64-bit datapath, address space: obvious nowadays. Simplifies just about everything.
  3. Huffman-encoded instruction set: improves cache/memory -> CPU bandwidth, which is one of the main bottlenecks nowadays. Should be quite simple to add a Huffman encoder to a compiler back-end. All instruction lengths are multiples of a byte.
  4. RISC vs. CISC vs. dataflow debate: it's over! Get the advantages of each, disadvantages of none as much as feasible.
  5. 1, 2 or 4 internal 7-stage pipelines.
  6. Speculative execution: 4 branches, 8 instructions deep each.
  7. 64-byte instruction prefetch queue.
  8. 32-byte write buffers.
  9. Microprogram partly in RAM. Must be able to emulate x86 instruction set (assembler source level).
  10. 64-bit TSC w/ multiple interrupt capabilities.
  11. Power saving features.
  12. MMX and 3DNow! emulation.
  13. Fully-static design (clock-stoppable).
  14. F1 implementation: 128 bits external data path, 40 bits external addressing capabilities.
  15. Performance monitoring registers "a la" Pentium.
  16. External FPU, memory mapped (have no idea what it should look like). FPUs can be added to work in parallel (up to 4?). Separate bus. Same bus can handle a graphics coprocessor with its dual-ported memory.
  17. 8KB 4-ported L1 unified cache, with independent line-locking/line-flushing capabilities. Can be thought of as a 1 KB register set.
  18. Separate 64KB each L2 instruction and data caches, running at CPU speed.
  19. Integrated intelligent DMA controller, 32 channels.
  20. Integrated interrupt controller: 30 maskable interrupts, 1 System Management interrupt, 1 non-maskable interrupt.
  21. 0 internal registers! Yep, this is a memory-memory machine. Instruction set recognizes 32 pseudo-registers at any moment.
  22. Interrupts cause automatic register set switch to vectored register set => 0 (zero) context switch latency!
  23. No penalty for instructions that access byte, word, dword data.
  24. Operation in little or big-endian mode "a la" MIPS.
  25. Paging "a la" Intel, with 4k pages + 4M extension.
  26. Also VSPM "a la" Cyrix 6x86, with 1K definable pages.
  27. ARR registers "a la" Cyrix 6x86 (similar to MTRR on Intel PPro): allows defining non-cacheable regions (useful for NUMA, see below).
  28. Internal PLL with software programmable multiplier; can switch from 1x to 2x to 3x to nx in 0.5 increments, on-the-fly.
  29. The MMU should also support object protection "a la" Apple Newton.
  30. Single-bit ECC throughout.
  31. Direct support of 4 1MB dual ported memory regions for NUMA-style multiprocessing (also on FPU bus).
  32. CPU architecture project name: "Freedom". Could also be called "Merced-killer", or "Anti-Merced", or "!Merced", but in fact we are not anti-anything with this project. We are just pro-Freedom and open; what we dislike about the Intel Merced is its proprietary design and restrictive development environment.
       I guess the challenge here is to determine whether a GPL'ed CPU design is feasible. Is open, collaborative development possible WRT CPU design? How does one get the funding to actually put the design on silicon, once it is ready? How can revisions be handled? Are there patents that would inherently block such a development process?

       The idea also is to use gcc as the ideal development compiler for this CPU (unlike Merced). And to be able to port the Linux kernel with a minimal effort on this new processor.

 

7 Appendix B

 
Freedom-F1 die area / cost / packaging physical characteristics / external bus

August 5, 1998

       Just as a reminder, the F1 CPU does _not_ include an FPU or 3DNow! unit (but SIMD integer instructions will be included).

       Recommended maximum size: 122 mm2. This gives us 200 dies per 8-inch wafer (see an example of such a wafer in Hennessy and Patterson, page 11).

       Roughly, die yield = 0.5 for our 122 mm2 5-layer 0.25 micron CPU (H&P, page 13, updated to reflect better fabs). This allows more or less 10-11 million transistors, divided as follows: 6-7 million for the caches, 4-5 million for the rest.

       Assume wafer yield = 95%, final test yield = 95%. Testing costs of $500/hour, 20 seconds/CPU.

       Packaging costs = $25-50 (see below)

       Roughly, following H&P, this gives us a unit cost of $75-100 per good CPU, tested, boxed in anti-static packaging and shipped to the US, if the Taiwan foundries can keep the wafer processing cost around $3,500.
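
       The arithmetic behind this figure, following the H&P cost model with the assumptions quoted above, can be spelled out in a few lines of C (the packaging cost is taken in the middle of the $25-50 range) :

#include <stdio.h>

int main(void)
{
    double wafer_cost     = 3500.0;             /* $/processed wafer   */
    double dies_per_wafer = 200.0;              /* 122 mm2, 8-inch     */
    double die_yield      = 0.5;
    double wafer_yield    = 0.95;
    double test_yield     = 0.95;
    double test_cost      = 500.0 / 3600.0 * 20.0;  /* $500/h, 20s/CPU */
    double packaging      = 37.5;               /* middle of $25-50    */

    double die_cost  = wafer_cost / (dies_per_wafer * die_yield);
    double unit_cost = (die_cost + test_cost + packaging)
                       / (wafer_yield * test_yield);

    printf("cost per good, tested, packaged CPU : $%.0f\n", unit_cost);
    return 0;
}

       This yields about $83, comfortably inside the $75-100 bracket ; a $50 package pushes it toward the upper bound.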

       Packaging: I am going to propose something surprising, but I think we should use the same packaging as the Celeron CPU, in terms of physical dimensions and CPU placement. That way, we can also use the Celeron heatsinks/fans already on the market, and the Celeron mounting hardware.

       PCI set: again I am going to propose a heresy, but I think we could use 100MHz Slot 1 motherboards. First, Intel is not alone anymore in manufacturing Slot 1 chipsets: VIA has just released a Slot 1 chipset with excellent performance and the latest goodies in terms of technology (we can get timing info from the VIA chipset datasheets). Second, we don't have to worry about the motherboard/PCI set issue anymore. Third, it's almost impossible to go beyond 100MHz on a standard motherboard because of RFI issues; so basically 100-112MHz is as good as it gets. Fourth, there will be many people out there with Slot 1 motherboards, willing to upgrade their PII/Celeron CPUs (especially the Celeron). Fifth, these motherboards are nowadays quite cheap, and we get all the benefits of high-volume production. Sixth, this allows easy upgrades of the Freedom CPU to higher speed grades, larger cache versions, versions with an FPU, etc.

       Now, if we accept the above, we have to put on the Celeron-style Freedom printed circuit board a small EEPROM that will contain the Freedom BIOS, the L2 cache and a socket for the FPU. This increases the cost of the CPU but decreases overall costs, so I still think it's a good move.

       Please check a photograph of the Celeron and tell me if I am just dreaming.

 

8 Appendix C

 
Legal issues / financial issues

August 5, 1998

       We would like to have support from the Free Software Foundation for the Freedom project.

       We are _not_ proposing that the Free Software Foundation build a fab. What we are saying is : if we go to a foundry in the US or Taiwan, give them a mask, and ask them to run a batch of 0.25 micron, 5-layer, 8-inch wafers for us, they'll quote approx. $3K-5K (or even less) per wafer as their price (our cost) for our batch (in the year 2000).

       An approximate cost for a batch of F1 CPUs would theoretically be somewhere between $500K and $1000K, for 5000-10000 good CPUs.

       Not exactly pocket money, but we could sell those CPUs on a subscription basis. Like this : people who subscribed would get the Merced-killer for around $100 (compare that to the projected cost of $5000/unit for the Merced), on a first-come/first-served basis, and any left-over CPUs, once the cost of the batch was covered, could be sold for a slightly higher price to pay for the next batch and further mask development.

       We suggest putting some quotas in the system. Demand is likely to be higher than supply. ;-)

       The Free Software Foundation could coordinate all the legal/financial/logistic aspects of the project (and would be adequately compensated for this work). This, of course, would depend on getting support from Mr. Stallman for this initiative.
 



 

1.4 A bit of F-CPU history :
(And a reflection on the evolution of the F-CPU through a description of the different proposed architectures)

 
1.4.1 M2M :

 
       The first generation was a "memory to memory" (M2M) architecture that disappeared with the original F-CPU team members. It was believed that context switches consumed much time, so memory regions were mapped to the register set, so that the registers could be switched by changing the base register. I have not tracked down the reasons why this was abandoned ; I joined the group later. Anyway, they launched the F-CPU project, with the goals that we now know, and the dream to create a "Merced Killer". Actually, I believe that we should compete with the ALPHA directly ;-)
 

1.4.2 TTA :

 
       The second generation was a "Transfer Triggered Architecture" (TTA), where the computations are triggered by transfers between the different execution units. The instructions mainly consist of the source and destination "register" numbers, which can also be the input or output ports of the execution units. As soon as the needed input ports are written to, the operation is performed and the result is readable on the output port. This architecture was promoted by the anonymous AlphaRISC, now known as AlphaGhost. He did a lot of work on it, but he left the list and the group lost track of the project without him.

       Brian Fuhs explained TTA on the mailing list this way :

       TTA stands for Transfer-Triggered Architecture. The basic idea is that you don't tell the CPU what to do with your data, you tell it where to put it. Then, by putting your data in the right places, you magically end up with new data in other places that consists of some operation performed on your old data. Whereas in a traditional OTA (operation-triggered architecture) machine, you might say ADD R3, R1, R2, in a TTA you would say MOV R1, add; MOV R2, add; MOV add, R3. The focus of the instruction set (if you can call it that, since a TTA would only have one instruction: MOV) is on the data itself, as opposed to the operations you are performing on that data. You specify only addresses, then map addresses to functions like ADD or DIV.

       That's the basic idea. I should start by specifying that I'm focusing on general processing here, and temporarily ignoring things like interrupts. It is possible to handle real-world cases like that, since people have already done so; for now, I'm more interested in the theory. Any CPU pipeline can be broken down into three basic stages: fetch and decode, execute, and store. Garbage in, garbage processing, garbage out. :-) With OTAs this is all done in hardware. You say ADD R3, R1, R2, and the hardware does the rest. It handles internal communication devices to get data from R1 and R2 to the input of the adder, lets the adder do its thing, then gets the data from the output of the adder back into the register file, in R3. In most modern architectures, it checks for hazards, forwards data so the rest of the pipeline can use it earlier, and might even do more complicated things like reordering instructions. The software only knows 32 bits; the hardware does everything else.

       The IF/ID stage of a TTA is very different. All of the burden is placed on software. The instruction is not specified as ADD (something), but as a series of SRC,DEST address pairs. All the hardware needs to do is control internal busses to get the data where it is supposed to go. All verification of hazards, optimal instruction order, etc should be done by the compiler. The key here is that a TTA, to achieve IPC measures comparable to an OTA, must be VLIW: you MUST be able to specify multiple moves in a single cycle, so that you can move all of your source data to the appropriate places, and still move the results back to your register file (or wherever you want them to go). In summary, to do an ADD R3, R1, R2, the hardware will do the following:

TTA                                     OTA
---------------------------------------------------------------------
MOV R1, add                             ADD R3, R1, R2
  Move R1->adder                          Check for hazards
MOV R2, add                               Check for available adder
  Move R2->adder                          Select internal busses and move data
                (adder now does its thing in both cases)
MOV add, R3                               Check for hazards
  Move adder->R3                          Schedule instruction for retire
                                          Select internal busses and move data
                                          Retire instruction

The compiler, of course, becomes much more complicated, because it has to do all of the scheduling work, at compile time. But the hardware in a TTA doesn't need to worry about much of anything... About all it does in the simple cases is fetch instructions and generate control signals for all of the busses.

       Execution is identical between TTA and OTA. Crunch the bits. Period.

       Instruction completion is again simplified in a TTA. If you want correct behavior, make sure your compiler will generate the right sequence of moves. This is compared to an OTA, where you at least have to figure out which write ports to use, etc.

       Basically, a TTA and an OTA are functionally identical. The main differences are that a TTA pretty much has to be VLIW, and requires more of the compiler. However, if the "smart compiler and dumb machine" philosophy is really the way to go, TTA should rule. It exposes more of the pipeline to software, reducing the hardware needed and giving the compiler more room to optimize. Of course, there are issues, like code bloat and constant generation, but these can be covered later. The basic ideas have been covered here (albeit in a somewhat rambling fashion... I had this email all composed in my head, and had some very clear explanations, right up until I sat down and started typing). For more information see http://www.cs.uregina.ca/~bayko/design/design.html and http://cardit.et.tudelft.nl/MOVE . These two have a lot more information on the details of TTA; I'm still hopeful that we can pull one of these off, and I think it would be good for performance, generality, cost, and simplicity. Plus, it's revolutionary enough that it might turn some heads - and that might get us more of a user (and developer) base, and make the project much more stable.

       Send me questions, I know there will be plenty...

Brian

       To understand the TTA concept further : the difference is in the philosophy ; it's as if you had instructions to code a dataflow machine on-the-fly. Notice also that fewer registers are needed : registers are required to store the temporary results of operations between the instructions of a code sequence. Here, the results are directly stored by the units, so less "temporary storage" is needed : less register pressure.

       To envision this difference, think about a data dependency graph : in OTA, an instruction is a node, while in TTA the mov instruction is the branch. Once this is understood, there is not much work to do on an existing (yet simple) compiler to make it generate TTA instructions.

       Let's examine : S = (a+b) * (c-d) for example. a,b,c,d are known "ports", registers or TTA addresses.

    a   b c   d
    1\ /2 3\ /4
      +     -
      5\   /6
        \ /
         *
         |7
         S

       In OTA, with 3-operand instructions, there is in this case one instruction per "node" (+, -, *). Two temporary registers are needed to store the results of the addition and the subtraction (branches 5 and 6). Let's assume that the tree-flattening must preserve superscalar execution (well, instructions have latencies), so we code
ADD r5,a,b
SUB r6,c,d
MUL r7,r5,r6
(there are other nasty ways to code this).

       In TTA there is one "port" in each unit for each incoming branch. This means that ADD, having two operands, has two ports. There is also one result port, which has a port address of its own but is read, not written. Another detail is that this read port can be static : it holds the result until another operation is triggered. We can code

mv ADD1,a
mv SUB1,c
mv ADD2,b (this triggers the a+b operation)
mv SUB2,d (this triggers the c-d operation)
mv MUL1,ADD
mv MUL2,SUB (this triggers the * operation)
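
       To make the trigger mechanism concrete, here is a toy model of the ADD unit in C, with the static read port behaviour described above (the port names are illustrative only, not an F-CPU specification) :

#include <stdio.h>

typedef struct {
    long in1, in2;  /* write-only operand ports                */
    long out;       /* read port : holds the result statically */
} tta_add_unit;

/* mv ADD1,x : just latch the first operand */
void mv_add1(tta_add_unit *u, long x) { u->in1 = x; }

/* mv ADD2,x : the second write triggers the operation itself */
void mv_add2(tta_add_unit *u, long x)
{
    u->in2 = x;
    u->out = u->in1 + u->in2;   /* result stays readable on the port */
}

int main(void)
{
    tta_add_unit add = {0, 0, 0};
    mv_add1(&add, 3);                    /* mv ADD1,a            */
    mv_add2(&add, 4);                    /* mv ADD2,b : triggers */
    printf("ADD port : %ld\n", add.out); /* mv MUL1,ADD would read this */
    return 0;
}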

       TTA is not "better", it's not "worse", it's just completely different while the problem will always be the same. If the instructions are 16 bit wide, it takes 96 bits, just as the OTA example would do. In some cases, it can be better as it was shown long ago on the list. TTA has some interesting properties, but unfortunately, in the very near future, it's not probable that a TTA will enter inside a big computer as RISC or CISC do. A TTA core can be as efficient as the ARM core, for example, it suits well to this scale of die size, but too few studies have been made, compared to the existing studies on OTA. Because the solution of its scaling up are not (yet) known, this leads to the discussions that shaked the mailing list near december 1998: the problem of where to map the registers, how would the ports be remapped on the fly, etc. When additional instructions are needed, this jeopardizes the whole balance of the CPU and evolutivity is more constraining than for RISC or OTA in general.

       The physical problem of the busses has also been raised : if we have, say, 8 buses of 64 bits, that makes 512 wires, which take around one millimeter of width in a .5u process. Of course, we can use a crossbar instead.

       As discussed a few times long ago, because of its scalability problems (the assignment of the ports and its flexibility), TTA is not the perfect choice for a long-lasting CPU family, even though its performance/complexity ratio is good. So it is possible that the F-CPU team will one day put a RISC->TTA translator in front of a TTA core, which would avoid most of the scalability problems. This would be called the "FC1" (FC0 is the RISC core). Of course, time will show how the TTA ghosts of the F-CPU group will change.

       But TTA's problem is probably that it is too specialized, whereas OTA can change its core and still use the same binaries. This is one of the points that "killed" the previous F-CPU attempt : each TTA implementation could not be completely compatible with the others, because of the instruction format, the assignment of the "ports" and other similar details ; the notion of "instruction" is bound to the notion of "register".

       I am not trying to prove the advantage of one technique over another ; I am trying to show the difference between two points of view that finally address the same problem. Scalability, which is necessary for such a project, turned out to be more important than we thought, and the group finally showed interest in a more classical technology.

 

1.4.3 Traditional RISC :

 
       The third generation arose from the mailing list members, who naturally studied a basic RISC architecture, like the first-generation MIPS processors, the DLX described by Patterson & Hennessy, the MMIX, the MISC CPUs, and other similar, simple projects. From a simple RISC project, the design grew in complexity and gained independence from other existing architectures, mainly because of the lessons learnt from their history and the specific needs of the group, which led to adapted choices and particular characteristics. This is what we will discuss in the next parts of this document.

 


 

1.5 The design constraints :

 
       The F-CPU group is rather heterogeneous, but each member shares the same hope that the project will come true, because we are convinced that it is not impossible, and therefore feasible. Let's remember the Freedom CPU Project's goal :

       "To develop and make freely available an architecture, and all other intellectual property necessary to fabricate one or more implementations of that architecture, with the following priorities, in decreasing order of importance:
       1. Versatility and usefulness in as wide a range of applications as possible
       2. Performance, emphasizing user-level parallelism and derived through intelligent architecture rather than advanced silicon process
       3. Architecture lifespan and forward compatibility
       4. Cost, including monetary and thermal considerations"

We could add as goal #5 : be successful !

       This text sums up a lot of aspects of the project : this is "free intellectual property", meaning that anybody can make money with it without worrying, as long as the product complies with the general rules and standards and all its characteristics remain freely available (similarly to the GNU General Public Licence). Just like the Linux project, the team members hope that the free availability of this Intellectual Property will benefit everybody, by reducing the cost of the products (since most of the intellectual work is already done) and by providing an open and flexible standard that anyone can influence at will without signing an NDA. It is also a testbench for new techniques and the "first CPU" of a lot of "hobbyists" who can build it easily at home. Of course, the other expected result is that the F-CPU will be used in everybody's home computer as well as in the specialized markets (embedded/real-time, portable/wearable computers, parallel machines for scientific number crunching...).

       In this situation, it is clear that one chip cannot fit all needs. There are economic constraints that also influence the technological decisions, and not everybody can access the most advanced silicon fabrication units. The reality of the F-CPU "for and by everybody" is more in the realm of reconfigurable FPGAs, low-cost sea-of-gates and ASICs fabricated in low volumes. Even though the final goal is to use full-custom technologies, there is a strong limitation on prototyping and low-volume production. The complexity is limited for the early generations, and for FC0 the estimated transistor count of the first chips is around 1 million, including some cache memory. This is rather tight compared to current CPUs, but it is huge if one remembers the ARM core or the early RISC CPUs.

       The "Intellectual Property" will be available as VHDL or VERILOG files that anyone can read, compile and modify. A schematic view is also often needed to understand the function of a circuit at the first sight. The processor will therefore exist more in the form of a software description than a hardware circuit. This will help the processor families to evolve faster and better than other commercial ones, and this polymorphism will garantee that anyone finds the core needed in any case. And since the development software will be common to all the chips, freely available through the GPL, porting any software to any platform will be eased to the maximum.

       The interoperability of the software on any member of the family is a very strong constraint, and probably the most important design rule of the project : "NO RESOURCE MUST BE BOUND". This led to a CPU with an "undetermined" data width : an F-CPU chip can implement a characteristic data width of any size above 32 bits. Portable software will respect some simple rules so that it runs as fast as the chip allows, independently of algorithmic considerations. In fact, the speed of a given CPU is determined by the economic constraints, and the designer will build a CPU as wide as the budget and the technology allow. This way, there is no other "roadmap" than the users' needs, since the user is his own funder. The project is not bound by technology and is flexible enough to last... as long as we want.
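
       As an illustration of this rule, here is a sketch in C of width-independent software. The function fcpu_word_bytes() is a hypothetical stand-in for whatever mechanism the final ISA will provide to query the implemented width ; nothing else is assumed :

#include <stdio.h>
#include <stddef.h>

/* Hypothetical : pretend we queried a 64-bit implementation.  The same
   source must work unchanged when this returns 4, 16, 32...           */
static size_t fcpu_word_bytes(void)
{
    return 8;
}

/* XOR two buffers in chunks of the chip's native width.  The inner
   byte loop stands in for what would be a single register-wide
   operation on real hardware : a wider chip simply takes fewer steps. */
static void xor_buffers(unsigned char *dst, const unsigned char *src,
                        size_t n)
{
    size_t w = fcpu_word_bytes();
    size_t i = 0;
    for (; i + w <= n; i += w)
        for (size_t j = 0; j < w; j++)
            dst[i + j] ^= src[i + j];
    for (; i < n; i++)          /* tail bytes, one at a time */
        dst[i] ^= src[i];
}

int main(void)
{
    unsigned char a[] = "ABCDEFGHI";
    unsigned char b[] = {1, 1, 1, 1, 1, 1, 1, 1, 1};
    xor_buffers(a, b, 9);
    printf("%s\n", a);          /* prints "@CBEDGFIH" */
    return 0;
}

       The source never hard-codes a register width : it queries it once and loops by chunks, so a wider implementation of the family runs the same program with proportionally fewer iterations.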

 


part1.html nov.16 by Whygee