Freedom CPU Project
F-CPU Design Team
Request For Comment
July 8, 1999 by Whygee
F1 CORE DRAFT PROPOSAL REV. 1
What, why :
This document is the first study and working basis for the third generation of the F1 architecture. It explains the architectural and technical background that led to the current state of the F1 core, so as to reduce the amount of basic discussion on the mailing list and to introduce newcomers to the most recent concepts that have been discussed. The draft will (hopefully) evolve rapidly and incorporate more advanced discussions and techniques. This is not a definitive draft; it is open to any modification that the mailing list agrees to make.
This document is independent from the F-CPU Architecture Guide, which describes the instruction set and the programming rules from the high-level point of view. This draft describes the way the instructions are executed, rather independently of the instruction formats, fields or conventions, because only the instruction decoder is concerned by them.
A bit of F-CPU history :
The first generation was a "memory to memory" (M2M) architecture that disappeared with the original F-CPU team members. It was believed that context switches consumed much time, so they mapped memory regions to the register set, so as to switch the whole register set by changing a single base register. I have not tracked down the reasons why this was abandoned; I joined the group later.
The second generation was a "Transfer Triggered Architecture" (TTA) where computations are triggered by transfers between the different execution units. The instructions mainly consist of the source and destination "register" numbers, which can also be the input or output ports of the execution units. As soon as the needed input ports are written to, the operation is performed and the result is readable on the output port. This architecture was promoted by the anonymous AlphaRISC, now known as AlphaGhost. He did a lot of work on it, but he left the list and the group lost track of the project without him.
The third generation rose from the mailing list members who naturally studied a basic RISC architecture, like the first generation MIPS processors or the DLX described by Patterson & Hennessy, the MMIX, the MISC CPUs, and other similar, simple projects. From a simple RISC project, the design grew in complexity and won independence from other existing architectures, mainly because of the lessons learnt from their history and the specific needs of the group, which led to specific choices and particular characteristics. This is what we will discuss here.
The main characteristics :
The instructions are 32 bits wide. This is
a heritage of the traditional RISC processors, and the benefits of fixed-size instructions
are not discussed anymore, except for certain niche applications. Even the microcontroller
market is being invaded by RISC cores with fixed-size instructions.
The instruction size can be discussed a bit more anyway.
It is clear that a 16-bit word does not contain enough space to code 3-operand instructions
involving tens of registers and operation codes. There are some 24- and 48-bit
instruction processors, but they are limited to niche markets (like DSPs) and their
instructions don't pack evenly into cache lines; fetching them on a byte basis would make
the instruction fetch too complex.
Because the F-CPU is mainly a 64-bit processor, 64-bit instruction words have been proposed,
where two instructions are packed together, but this is similar to two 32-bit instructions,
except that each 32-bit instruction can be atomic while a 64-bit pair can't be split.
There is also the Merced (IA64), which has 128-bit instruction words, each containing
3 opcodes and register dependency information. Since we use a simple scoreboard, and
because IA64-like (VLIW) compilers are very tricky to write, we let the CPU core decide
whether to block the pipeline or not when needed, thus allowing a wide range of
implementations to execute the same simple instructions.
Since the F1 microarchitecture was not clearly defined at the beginning of the project,
the instructions had to execute on a wide range of processor types (pipelined, superscalar,
out-of-order, VLIW, whatever the future will create). A fixed-size, 32-bit instruction set
seems to be the best choice for simplicity and scalability in the future.
Register 0 is "read as zero / unmodifiable". This is
another classical "RISC" feature that is meant to ease coding and reduce the opcode count.
This was valuable for earlier processors, but current technologies need specific hints
about what the instruction does. It makes little sense today to code "SUB R1,R1,R1" to clear R1,
because it needs to fetch R1, perform a 64-bit subtraction and write the result,
while all we wanted to do is simply clear R1. This latency was hidden on the early MIPS
processors, but current technologies suffer from this kind of coding technique, because
every step contributing to the operation is costly. If we want to speed up these
instructions, the instruction decoder gets more complex.
So, while the R0=0 convention is kept, there is more emphasis on specific instructions.
For example, "SUB R3,R1,R2" which compares R1 and R2, generaly to know if greater or equal,
can be replaced in F1 by "CMP R3,R1,R2" because CMP does use a special comparison unit which
has less latency than a substraction. "MOV R1,R0" clears R1 with no latency because the
value of R0 is already known.
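As an illustration of how a decoder can exploit these properties, here is a minimal C sketch.
It is not actual F-CPU source; the instruction fields and opcode values are assumptions made
for the example. The point is that when a source operand is R0, its value is known at decode
time, so "MOV Rd,R0" can bypass the execution units entirely:

#include <stdint.h>
#include <stdbool.h>

enum { OP_MOV = 1, OP_SUB = 2 };      /* assumed opcode values */

typedef struct {
    uint8_t opcode;
    uint8_t rd, rs1, rs2;             /* destination and source registers */
} insn_t;

typedef struct {
    bool     bypass_alu;              /* result already known at decode time */
    uint64_t known_result;            /* valid when bypass_alu is set */
} decode_hint_t;

/* R0 always reads as zero, so no register-file read port is needed for it,
 * and a "MOV Rd,R0" becomes a direct zero-write with no ALU latency. */
decode_hint_t decode(insn_t i)
{
    decode_hint_t h = { false, 0 };
    if (i.opcode == OP_MOV && i.rs1 == 0) {
        h.bypass_alu   = true;
        h.known_result = 0;
    }
    return h;
}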
The F-CPU has 64 registers, while RISC processors traditionally have 32 registers. More than a religious war, this subject proves that the design choices are deeply influenced by a lot of parameters (this looks like a thread on comp.arch). Let's look at them:
- "It has been proved that 8 registers are plain enough for most algorithms." is a
deadbrain argument that appears sometimes. Let's see why and how this conclusion has
been made :
- it is an OLD study,
- it has been based on schoolbook algorithm examples,
- memory was less constrianing than today (even though magnetic cores was
slow) and memory to memory instructions were common,
- chips had less room than today (tens of thousands vs. tens of million),
- we ALWAYS use algorithms that are "special" because each program is a modification
and an adaptation of common cases to special cases, (we live in a real world, didn't
you know ?)
- who has ever programmed x86 processors in assembly langage knows how painfulit is...
The real reason for having a lot of registers is to reduce the need to store and load from
memory. We all know that even with several cache memory levels, classical architectures
are memory-starved, so keeping more variables close to the execution units reduces the
execution latency.
- "IF there are too much registers there is no room for coding instructions" :that is where the design of processors is an art of balance and common sense. And we are artists, aren't we ?
- "The more there are registers, the longer it takes to switch between tasks or acknowlege
interrupts" is another reason that is discussed a lot. Then, i wonder why Intel has put
128 registers in IA64 ???
It is clear anyway that *FAST* context switching is an issue for a lot of well-known reasons.
Several techniques exist and are well known, like register windows (a la SPARC), register bank
switching (like in DSPs) or memory-to-memory architectures (less well known), but
none of them can be used in a simple design and a first prototype, where transistor count
and complexity are an issue.
In the discussions on the mailing list, it appeared that:
- most of the time is actually spent in the scheduler's
code (if we're discussing OS speed), so the register backup issue is like the tree that
hides the forest,
- the memory bursts caused by a context switch
or an interrupt waste most of the time when the memory bandwidth is limited (common sense
and performance measurements on a P2 will do the rest if you're not convinced),
- a smart programmer will interleave register backup code
with IRQ handler code, because an instruction usually needs one destination and two sources,
so if the CPU executes one instruction per cycle there is NO need to switch the whole register
set in one cycle. In short, there is no need for register banks.
These facts led to the design of the "Smooth Register Backup", a hardware technique which replaces
the software interleaving of the backup code with the computation code.
A code like this:
IRQ_HANDLER:
clear R1 ; cycle 1
load R2,[imm] ; cycle 2
load R3,[imm] ; cycle 3
OP R1,R2,R3 ; cycle 4
OP R2,R3,R0 ; cycle 5
store R2,[R3] ; cycle 6
....
could be typical code at the beginning of an IRQ handler.
Whatever the register number is, we only have to save R1 before cycle 1, R2 before cycle 2
and R3 before cycle 3. This would take 3 instructions that would be interleaved like this:
IRQ_HANDLER:
store R1,[imm] ; save the old R1 before it is cleared
clear R1 ; cycle 1
store R2,[imm] ; save the old R2 before it is overwritten
load R2,[imm] ; cycle 2
store R3,[imm] ; save the old R3 before it is overwritten
load R3,[imm] ; cycle 3
OP R1,R2,R3 ; cycle 4
OP R2,R3,R0 ; cycle 5
store R2,[R3] ; cycle 6
....
The "Smooth Register Backup" is a simple hardware mechanism that automaticallysaves
registers from the previous thread so no backup code need being interleaved.
It is based on a simple scoreboard technique, a "find first" algorithm and needs a flag
per register (set when the register has been saved, reset if not). It is compeltely
transparent to the user and the application programer, so it can be changed infuture
processor generations with few changes on the OS. This technique will be described
deeply later.
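To make the principle more concrete, here is a minimal software model in C. It is only an
illustration of the mechanism described above (one "saved" flag per register, a "find first"
scan, at most one register spilled per cycle), not the actual hardware:

#include <stdint.h>

#define NREGS 64

typedef struct {
    uint64_t regs[NREGS];    /* register file */
    uint64_t saved;          /* bit n set => Rn of the old thread is backed up */
    uint64_t backup[NREGS];  /* backup area of the interrupted thread */
} srb_t;

/* Called when the new thread is about to overwrite register 'dest' :
 * if the old value has not been saved yet, spill it first. */
static void srb_write(srb_t *s, int dest, uint64_t value)
{
    if (!((s->saved >> dest) & 1)) {
        s->backup[dest] = s->regs[dest];   /* save the old thread's value */
        s->saved |= (uint64_t)1 << dest;   /* mark it as done */
    }
    s->regs[dest] = value;
}

/* In every otherwise idle cycle, save the first register that is still
 * unsaved (the "find first" step), one register per cycle at most. */
static void srb_idle_cycle(srb_t *s)
{
    for (int n = 1; n < NREGS; n++) {      /* R0 never needs saving */
        if (!((s->saved >> n) & 1)) {
            s->backup[n] = s->regs[n];
            s->saved |= (uint64_t)1 << n;
            return;
        }
    }
}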
The conclusion of these discussions is that 64 registers are not too much.
The other problem is : is 64 enough ?
Since the IA64 has 128 registers, and superscalar processors need more register ports,
having more registers keeps the number of register ports from increasing. As a rule of thumb,
a processor needs (instructions per cycle) x (pipeline depth) x 3 registers to avoid
register stalls on a code sequence without register dependencies. And since
the pipeline depth and the instructions per cycle both increase to get more performance,
the size of the register set increases as well. With 64 registers, a 4-issue superscalar CPU
can have 5 pipeline stages (4 x 5 x 3 = 60 registers), which seems complex enough.
Later implementations will probably use register renaming and out-of-order techniques
to get more performance out of common code, but 64 registers are enough for now.
The F-CPU is a variable-size processor.
This is a controversial side of the project that was finally accepted only recently.
There are mainly two reasons behind this choice :
- As processors and families evolve, the data width becomes too tight. Adapting
the data width on a case-by-case basis led to the complexities of the x86 or the VAX,
which are considered good examples of how awful an architecture can become.
- We often need to process data of different sizes at the same time, such as pointers,
characters, floating-point and integer numbers (for example in a floating-point to ASCII
function). Treating every piece of data at the same large size is not an optimal solution,
because we can save registers if several characters or integers are packed into one register,
which is then rotated to access each subpart.
We need *from the beginning* a good way to adapt the size of the data we handle on the fly.
And we know that the width of the data to process will increase a lot in the future,
because it's almost the only way to increase performance. We can't count on the regular
performance increase provided by new silicon processes, because they are expensive
and we don't know whether it will continue. The best example of this data parallelism is SIMD
programming, as in the MMX, KNI, AlphaPC or SPARC instruction sets, where one instruction
performs several operations. From 64 bits, it evolves to 128 and 256 bits per instruction, and
nothing keeps this width from increasing, while the increase gives more performance.
Of course, we are not building an RSA-breaking CPU, and 512-bit integers are almost never
needed. The performance lies in the parallelism, not in the width. For example, it would
be very useful to compare characters in parallel, for example during a substring search :
the performance of such a program would be directly proportional to the width of
the data that the CPU can handle.
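As an illustration of this kind of data parallelism, here is a small portable C sketch (not
F-CPU code) that tests 8 characters at once inside a 64-bit word, using a classic bit trick.
With 128- or 256-bit registers, the same idea scans 16 or 32 characters per operation, which
is where the proportionality to the data width comes from :

#include <stdint.h>

/* Returns nonzero if any byte of x is zero (classic bit trick). */
static int has_zero_byte(uint64_t x)
{
    return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}

/* Does the 64-bit word w contain the character c in any of its 8 bytes ?
 * XORing with a broadcast of c turns matching bytes into zero bytes. */
static int word_contains_char(uint64_t w, unsigned char c)
{
    uint64_t broadcast = 0x0101010101010101ULL * c;  /* c replicated 8 times */
    return has_zero_byte(w ^ broadcast);
}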
The next question is : how wide ?
Because fixed sizes give rise to problems at one time or another, deciding on an arbitrarily
big size is not a good solution. And, as seen in the example of the substring search, the wider
the better, so the solution is : not deciding the width of the data we process before
execution.
The idea is that software should run as fast as possible on every machine, whatever the
family or generation. The chip maker decides on the width it can afford, and this choice
is independent of the programming model, because it can also take into account the price,
the technology, the needs, the performance...
So, in a few words : we don't know a priori the size of the registers. We have to run
the application, which will recognize the computer configuration with special instructions,
and then calibrate the loop counts or modify the pointer updates. This is almost the same
process as loading a dynamic library...
Once the program has recognized the characteristic width of the data the computer can manage,
it can run as fast as the computer allows. Of course, if the application uses
a size wider than what is possible, this generates a trap that the OS can handle as a fault or
as a feature to emulate.
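The resulting programming model can be sketched in C as follows. The function max_chunk_bytes()
is purely hypothetical : it stands for the special instruction(s) that report the widest data
size the chip implements, and the loop stride is calibrated from it at run time instead of
being hard-wired :

#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper around the width-discovery instruction(s);
 * it would return e.g. 8, 16 or 32 bytes depending on the chip. */
extern size_t max_chunk_bytes(void);

void clear_buffer(uint8_t *buf, size_t len)
{
    size_t chunk = max_chunk_bytes();   /* discovered at run time */
    size_t i = 0;

    /* Main loop : one maximum-width operation per iteration (written
     * byte by byte here for portability; on the real machine this
     * would be a single widest-size store). */
    for (; i + chunk <= len; i += chunk)
        for (size_t j = 0; j < chunk; j++)
            buf[i + j] = 0;

    for (; i < len; i++)                /* tail : leftover bytes */
        buf[i] = 0;
}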
Then the question is : how ?
We have to consider the whole process of programming, coding,
making processors and enhancing them.
The easiest solution is to use a lookup table, which interprets the 2 bits of the size flag
defined in the F-CPU Architecture Guide. The flags are by default interpreted like
this:
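A plausible default mapping, consistent with the four default sizes mentioned below (the
authoritative table being the one in the F-CPU Architecture Guide), would be : flag 00 = 8 bits,
flag 01 = 16 bits, flag 10 = 32 bits, flag 11 = 64 bits, each entry being re-programmable through
one of the four special registers on wider implementations. In C, such a decode-time lookup
could look like this sketch :

#include <stdint.h>

/* Assumed default mapping of the 2-bit size flag (plausible defaults only;
 * re-programmable via the four special registers on wider implementations). */
static uint16_t size_table_bits[4] = { 8, 16, 32, 64 };

static uint16_t operand_width_bits(unsigned size_flag)
{
    return size_table_bits[size_flag & 3];   /* 2-bit index */
}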
The software, and particularly the compiler, will be a bit more complex because of these mechanisms. The algorithms will be modified (loop counts will be changed, for example) and the four special registers must be saved and restored during each task switch or interrupt. A simple compiler could simply use the four default sizes, but more sophisticated compilers will be needed to benefit from the performance of later chips. At least, the scalability problem is known and solved from the beginning, and the coding techniques won't change between processor generations. This guarantees the stable future of the F-CPU, and the old "RISC" principle of letting the software solve the problems is used once again.