Part 2 :
2.1 The main characteristics
2.2 The instructions are 32-bit wide
2.3 Register 0 is "read-as-zero/unmodifiable"
2.4 The F-CPU has 64 registers
2.5 The F-CPU is a variable-size processor
2.6 The F-CPU is SIMD-oriented
2.7 The F-CPU has generalized registers
2.8 The F-CPU has special registers
2.9 The F-CPU has no stack pointer
2.10 The F-CPU has no condition code register
2.11 The F-CPU is endianless
2.12 The F-CPU uses paged memory
2.13 The F-CPU stores the state of a task in Context Memory Blocks
2.14 The F-CPU can use the CMBs to single-step tasks
2.15 The F-CPU uses a simple protection mechanism
2.1 The main characteristics :
The CPU described here can be thought of as a crossover
between an R2000 chip (or an early ALPHA) and a CDC6600 computer. Some constraints are similar :
the F-CPU must be as simple and as fast as possible. From the R2000, it inherits
the main RISC characteristics, such as fixed-size instructions, the register set and a chip
size bounded by the current technology. In the CDC6600, FC0 finds the execution
scheme, the scoreboard, the multiple parallel execution units and, most of all, the inspiration
for smart techniques that ease both design and programming.
The following text is a step-by-step description of the F-CPU as currently developed. The features are described in increasing depth and become interdependent, so it is recommended to read them from the beginning :-) We will begin with the most basic F-CPU characteristics before discussing more critical and hardware-dependent subjects.
2.2 The instructions are 32-bit wide. This is
a heritage of the traditional RISC processors, and the benefits of fixed-size instructions
are no longer debated, except for certain niche applications. Even the microcontroller
market has been invaded by RISC cores with fixed-size instructions.
The instruction size still deserves some discussion.
It is clear that a 16-bit word can't contain enough space to encode 3-operand instructions
involving tens of registers and operation codes. There are some processors with 24- and
48-bit instructions, but they are limited to niche markets (like DSP) and their instructions
don't fit evenly into cache lines; accessing memory on a byte basis to compensate becomes too complex.
Because the F-CPU is mainly a 64-bit processor, 64-bit instruction words have been proposed,
packing two operations each, but this is equivalent to two 32-bit instructions, except that
the 32-bit instructions can be atomic while a 64-bit pair can't be split. There is also
the Merced (IA64), with 128-bit instruction words, each containing 3 opcodes and register
dependency information.
Since we use a simple scoreboard, and because IA64-like (VLIW) compilers are very tricky
to program, we let the CPU core decide whether to block the pipeline when needed,
thus allowing a wide range of CPU core types to execute the same simple instructions
and programs.
Since the F-CPU microarchitecture was not defined at the beginning of the project, the instructions had to execute on a wide range of processor types (pipelined, superscalar, out-of-order, VLIW, whatever the future will create). A fixed-size, 32-bit instruction set seems to be the best choice for simplicity and scalability in the future. Core-dependent optimisations can be made on the binaries by applying specific scheduling rules, but the application will still run on other family members that have a completely different core.
2.3 Register 0 is "read-as-zero/unmodifiable". This is
another classical "RISC" feature, meant to ease coding and reduce the opcode count.
It was valuable for earlier processors, but current technologies need specific hints
about what an instruction does. It is wasteful today to code "SUB R1,R1,R1" to clear R1,
because it requires fetching R1, performing a 64-bit subtraction and writing the result back,
when all we wanted was to clear R1. This latency was hidden on the early MIPS
processors, but newer implementations suffer from this kind of coding technique, because
every step contributing to the operation is costly. And if we want to speed up these
idioms, the instruction decoder gets more complex.
So, while the R0=0 convention is kept, there is more emphasis on specific instructions.
For example,
"SUB R3,R1,R2", which compares R1 and R2, generally to know which is greater or whether
they are equal,
can be replaced in F1 by
"CMP R3,R1,R2", because CMP uses a special comparison unit which
has less latency than a subtraction (after all, we don't care about the numerical result,
we simply want its properties).
"MOV R1,R0" clears R1 with no latency because the value of R0 is already known
(hardwired to zero).
2.4 The F-CPU has 64 registers, while RISC processors
traditionally have 32. More than a religious war, this subject proves that
design choices are deeply influenced by a lot of parameters (the discussion looks like
a comp.arch thread). Let's look at them:
A) "It has been proved that 8 registers are plainly enough for most algorithms." is a
brain-dead argument that surfaces from time to time. Let's see why and how this conclusion
was reached :
- it is an OLD study,
- it was based on schoolbook algorithm examples,
- memory was less constraining than today (even though magnetic core memory was
slow) and memory-to-memory instructions were common,
- chips had less room than today (tens of thousands of transistors vs. tens of millions),
- we ALWAYS use algorithms that are "special", because each program is a modification
and an adaptation of common cases to special cases (we live in a real world, didn't
you know ?),
- anyone who has ever programmed x86 processors in assembly language knows how painful it is...
The real reason for having a lot of registers is to reduce the need to store and load from
memory. We all know that even with several cache memory levels, classical architectures
are memory-starved, so keeping more variables close to the execution units reduces the
overall execution latency.
B) "If there are too many registers, there is no room left to encode instructions" : that is where the design of processors is an art of balance and common sense. And we are artists, aren't we ? Through register renaming, the number of architectural registers can be virtually extended to any physical limit.
C) "The more registers there are, the longer it takes to switch between tasks or acknowledge
interrupts" is another much-debated argument. Then I wonder why Intel has put
128 registers in IA64...
It is clear anyway that *FAST* context switching is an issue for a lot of obvious reasons.
Several well-known techniques exist, like register windows (a la SPARC), register bank
switching (as in DSPs) or memory-to-memory architectures (less well known), but
none of them can be used in a simple design and a first prototype, where transistor count
and complexity are an issue.
In the discussions on the mailing lists, it appeared that:
- most of the time is actually spent in the scheduler's
code (if we're discussing OS speed), so the register backup issue is the tree that
hides the forest,
- the memory bursts caused by a context switch
or an interrupt waste most of the time when memory bandwidth is limited (common sense
and performance measurements on a P2 will do the rest if you're not convinced),
- a smart programmer will interleave register backup code
with IRQ handler code : since an instruction usually needs one destination and two sources,
a CPU that executes one instruction per cycle has NO need to switch the whole register
set in one cycle. In short, there is no need for register banks.
These facts led to the design of the "Smooth Register Backup" (SRB), a hardware technique
that does automatically what the programmer would do by hand : interleave the backup code
with the computation code.
A code like this:

IRQ_HANDLER:
    clear R1           ; cycle 1
    load  R2,[imm]     ; cycle 2
    load  R3,[imm]     ; cycle 3
    OP    R1,R2,R3     ; cycle 4
    OP    R2,R3,R0     ; cycle 5
    store R2,[R3]      ; cycle 6
    ...

can be a common code that would be the beginning of an IRQ handler. With the Smooth
Register Backup, the effect is as if the backup stores had been interleaved by hand:

IRQ_HANDLER:
    store R1,[imm]   clear R1           ; cycle 1
    store R2,[imm]   load  R2,[imm]     ; cycle 2
    store R3,[imm]   load  R3,[imm]     ; cycle 3
                     OP    R1,R2,R3     ; cycle 4
                     OP    R2,R3,R0     ; cycle 5
                     store R2,[R3]      ; cycle 6
    ...
The conclusion of these discussions is that 64 registers are not too much.
The other problem is : is 64 enough ?
Since the IA64 has 128 registers, and since superscalar processors need more register ports,
having more registers keeps the register port count from increasing. As a rule of thumb,
a processor needs about (instructions per cycle) x (pipeline depth) x 3 registers to avoid
register stalls on a code sequence without register dependencies. And since
the pipeline depth and the instructions per cycle both increase to get more performance,
the register set's size increases too. 64 registers allow a 4-issue superscalar CPU
to have 5 pipeline stages, which looks complex enough. Later implementations
will probably use register renaming and out-of-order techniques to get more performance
out of common code, but 64 registers are enough for a prototype.
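As a quick sanity check of this rule of thumb, here is the computation for the 4-issue, 5-stage design point mentioned above (a back-of-the-envelope sketch, not a statement about any actual implementation):

```python
# Rule of thumb from the text: a core needs roughly
# (instructions per cycle) x (pipeline depth) x 3 registers
# to avoid stalls on a dependency-free code sequence.
def registers_needed(issue_width, pipeline_depth):
    return issue_width * pipeline_depth * 3

print(registers_needed(4, 5))        # 60
print(registers_needed(4, 5) <= 64)  # True: fits in the 64-register set
```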
To increase the number of instructions executed during each cycle, future F-CPUs
will need explicit register renaming. This will allow an F-CPU computer to have
tens of execution units without changing the instruction format.
2.5 The F-CPU is a variable-size processor.
This is a controversial side of the project, finally accepted because it serves the
F-CPU goal of forward compatibility.
There are mainly two reasons behind this choice :
- As processors and families evolve, the data width becomes too tight. Adapting
the data width on a case-by-case basis led to the complexities of the x86 or the VAX,
which are considered good examples of how awful an architecture can become.
- We often need to process data of different sizes at the same time, such as pointers,
characters, floating-point and integer numbers (for example in a floating-point-to-ASCII
function). Treating every datum at the same large size is not optimal : we can save
registers if several characters or integers are packed into one register,
which is then rotated to access each subpart.
We need from the beginning a good way to adapt on the fly the size of the data we
handle. And we know that the width of the data to process will increase a lot in the future,
because it's almost the only way to increase performance. We can't count on the regular
performance increase provided by the new silicon processes because they are expensive
and we don't know if it will continue. The best example of this data parallelism is SIMD
programming, as in the recent MMX, SSE, AlphaPC, PPC or SPARC instruction sets, where one
instruction performs several operations. From 64 bits, it has evolved to 128 and 256 bits per
instruction, and nothing keeps this width from increasing, while the increase gives
more performance. Of course, we are not building a PGP-breaker CPU, and 512-bit integers
are almost never needed : the performance lies in the parallelism, not in the width. For example,
it would be very useful to compare characters in parallel, as during a substring search :
the performance of such a program would be directly proportional to the width of
the data that the CPU can handle.
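The parallel character comparison mentioned above can be sketched in plain software with wide integer arithmetic (a technique often called SWAR); the function names here are illustrative and are not part of the F-CPU instruction set:

```python
# SWAR sketch: search for a character in 8 byte lanes at once.
MASK64 = (1 << 64) - 1
LOW    = 0x0101010101010101   # the low bit of each byte lane
HIGH   = 0x8080808080808080   # the high bit of each byte lane

def has_zero_byte(word):
    """True if any of the 8 byte lanes of `word` is zero."""
    return ((word - LOW) & ~word & HIGH & MASK64) != 0

def find_char(chunk, ch):
    """True if the byte `ch` appears in the 8-byte integer `chunk`."""
    # XOR turns matching lanes into zero, then zero-detect in parallel.
    return has_zero_byte(chunk ^ (ch * LOW))

data = int.from_bytes(b"hello wo", "little")
print(find_char(data, ord("w")))   # True
print(find_char(data, ord("z")))   # False
```

A wider machine would simply process more lanes per step, which is exactly why the program's speed scales with the register width.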
The next question is : how wide ?
Because fixed sizes give rise to problems sooner or later, deciding on an arbitrarily
large size is not a good solution. And, as seen in the substring-search example, the wider
the better, so the solution is : do not decide the width of the data we process before
execution.
The idea is that software should run as fast as possible on every machine, whatever its
family or generation. The chip maker decides on the width it can afford, and this choice
is independent from the programming model, because it can also take into account the price,
the technology, the need, the performance...
So, in a few words : we don't know the size of the registers a priori. We have to run
the application, which will recognize the computer configuration with special instructions,
and then calibrate its loop counts or modify its pointer updates. This is almost the same
process as loading a dynamic library...
Once the program has recognized the characteristic width of the data the computer can manage,
the program can run as fast as the computer allows. Of course, if the application uses
a size wider than possible, this generates a trap that the OS can handle as a fault or
a feature to emulate.
Then the question is : how ?
We have to consider the whole process of programming, coding, making processors and enhancing them. The easiest solution is to use a lookup table that interprets the 2-bit size flag of the instructions, as defined in the F-CPU Architecture Guide, which also specifies the default interpretation of the flags.
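As an illustration only, such a lookup table could be modelled as below. The 8/16/32/64-bit default mapping is an assumption of this sketch; the Architecture Guide holds the authoritative values:

```python
# Hypothetical model of the 2-bit size-flag lookup table.
# The default widths (8/16/32/64 bits) are ASSUMED for illustration.
DEFAULT_SIZE_TABLE = {0b00: 8, 0b01: 16, 0b10: 32, 0b11: 64}

def operand_width(size_flag, table=DEFAULT_SIZE_TABLE):
    # A later, wider F-CPU would simply install a different table
    # instead of changing the instruction encoding.
    return table[size_flag & 0b11]

print(operand_width(0b10))   # 32
```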
At least, the scalability problem has been known and addressed since the beginning, and the coding techniques won't change between processor generations. This guarantees the stable future of the F-CPU, and the old "RISC" principle of letting the software solve the problems is applied once again. I hope that this side of the project will soon be included in the Architecture Guide, along with coding examples, but we can consider that the prototype F1 will be hardwired to the default values, and that attempting to modify them will trigger a fault. But later, 4096-bit F-CPUs will be able to run programs designed on 128-bit F-CPUs and vice versa.
2.6 The F-CPU is SIMD-oriented because it is one easy way to increase the number of operations performed each cycle without increasing the control logic. The variable-size registers allow endless scalability and thus endless performance increase, but each instruction performing operations on data must have a SIMD flag to differentiate the type of operation.
2.7 The F-CPU has generalized registers,
meaning that integer numbers are mixed with pointers and floating-point numbers.
The most common objection comes from the hardware side, because a first effect is that it
increases the number of read/write ports on the register set (this is almost similar to
having twice as many registers).
The first argument from the F-CPU side is that software gets simpler, and that there are
hardware solutions to that problem. The software problem comes from the algorithms themselves:
some are purely integer-based, while others need a lot of floating-point values. A register
set split between integer and floating-point numbers would handicap both kinds of algorithms,
because the specialized registers would sit unused (the FP set would be idle during
programs like a mailer or a bitmap graphics editor, while a lot of FP is needed during
ray-tracing or simulations). And a lot of FP registers are needed when it does happen.
Another software aspect is about compilation, where register allocation algorithms
are critical for performance. Having a simple (single) register "pool" eases the decisions.
The second answer to the hardware problem is in the hardware. The first F-CPU chip, the F1,
will be a single-issue pipelined processor, where only three register read ports are
needed, thus there is no register set problem at the beginning.
Later chips, with more instructions issued per cycle, will probably use a technique
dear to the team : each register has attribute (or "property") bits that indicate
whether the register is used as a pointer, a floating-point number, etc., so the registers
can be mapped to different physical register sets while still being unified from the
programming point of view. The attributes are regenerated automatically and don't need
to be saved or restored during context switches.
2.8 The F-CPU has special registers that store the
context of the processor, manage the vital functions and ensure protection.
These special registers can be accessed only through a few special instructions and can
trigger a trap if the register does not exist or is not allowed for access in the current
running context. Since almost everything is managed through these special registers,
they are the key for protection in a multi-user, multi-task modern operating system.
These special registers are very important to recognize the CPU's configuration
and the "map" will evolve a lot in the future, adding more features without touching the
instruction set.
2.9 The F-CPU has no stack pointer. Or, more
exactly, it has no dedicated stack pointer. In fact it has no stack at all, because
every register can be used to access memory. A single hardwired stack pointer would
cause the problems found in CISC processors and require special tricks to handle
them. For example, several push & pop instructions in flight cause multiple uses of the
same register in a single cycle on a superscalar processor, which requires special hardware.
In the RISC world, conventions (the ABI) decide how applications communicate
and how the registers are initialized at program entry,
and provided you save the registers between two calls, nothing keeps you from having
60 stacks at once if your algorithm requires it.
Accessing the stack is performed with the single load/store instruction which has
post-increment (only) capability. Considering an expand-down stack pointed to by R3,
we will code:
pop:
    load.64  R2,[R3]+8

push:
    store.64 R2,[R3]-8

Since the addition and the memory access are performed at the same time, the updated pointer is available after the instruction.
2.10 The F-CPU has no condition code register. It is not
that we don't like them, but they cause trouble when the processor scales up in
frequency and instructions per cycle : managing a few bits becomes as complex as the
stack described above.
The solution to this problem is the classical RISC fashion : a register is either zero or not.
A branch or a conditional operation is executed if a register is zero (or not). Therefore,
several conditions can be set up at once, without the need to manage a fixed set of bits
(for example during context switches).
But, as explained later, reading a register is rather "slow" in the FC0 and the latency may
slow down a large number of usual instructions. The solution is not to read the register, but a
"cached" copy of the needed attribute. As described above for the "attribute" or
"property" bits used for the floating-point issue, each register has an
attribute bit that is regenerated each time the register is written. While the register
is being accessed, the value present on the write bus is checked for zero, and one bit
out of 63, corresponding to the register being written, is set or reset depending on the result.
This set of "transparent latch" gates sits close to the instruction decoder in order to
reduce the latency of conditional instructions. Since the bits are regenerated at each write,
there is no need to save or restore them during context switches, and there are no
coherency issues.
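A minimal software model of this zero-attribute regeneration might look like the following sketch (the class and field names are invented for illustration, not taken from the specification):

```python
# Sketch: the "is zero" attribute bit is snooped from the write bus
# on every register write, so it never needs to be saved or restored.
class RegisterFile:
    def __init__(self, n=64):
        self.regs = [0] * n
        self.is_zero = [True] * n   # cached attribute bits, one per register

    def write(self, idx, value):
        if idx == 0:
            return                  # R0 is hardwired to zero, never written
        self.regs[idx] = value
        # regenerate the attribute from the value on the write bus:
        self.is_zero[idx] = (value == 0)

rf = RegisterFile()
rf.write(5, 123)
print(rf.is_zero[5])   # False: a "branch if R5 is zero" would not be taken
rf.write(5, 0)
print(rf.is_zero[5])   # True
```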
There is no carry flag either. Addition with carry is performed through a special form of the instruction that writes the carry to a general-purpose register next to the result register. This avoids any coherency trouble across context switches and allows the use of a carry with SIMD instructions : it is completely scalable and secure.
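The carry-to-register semantics can be modelled like this (a sketch only: the actual instruction form and destination-register numbering are defined by the ISA, not here):

```python
# Model of "add with carry out to a general-purpose register":
# the carry is an ordinary data value, not a hidden flag, so it is
# saved and restored with the rest of the register set for free.
def addc(a, b, width=64):
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask)
    result = total & mask
    carry  = total >> width       # 0 or 1, written to the next register
    return result, carry

r, c = addc(0xFFFFFFFFFFFFFFFF, 1)
print(hex(r), c)   # 0x0 1
```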
2.11 The F-CPU is "endianless", because neither big endian alone nor little endian alone satisfies everybody. To solve this problem, there is an endian bit in the load/store instructions. The processor itself is not much biased towards one endianness, and the instructions themselves are not subject to this debate. The choice is up to the end user. For further information, read the discussions about the Endian flag.
2.12 The F-CPU uses paged memory to provide every executing task with a large, private,
linear virtual memory. Page-based protection is also a simple, software-driven
way to protect the tasks' memory spaces from each other. No special definition or mechanism
has been settled yet, but we assume the following characteristics :
- The pages will have several sizes, for example 4KB, 32KB, 256KB and 2048KB, in order to
reduce the number of page descriptors (pressure on the malloc routines !)
- The pages could be compressed on the fly when flushed to hard disks
(especially for the huge pages)
- One could reserve some space in the cache memory hierarchy to hold the
most important pages.
- The cachability flags and the read/write flags of the pages will be used for the early
implementations to ensure cache coherency in multi-CPU systems with the OS functions and
traps, instead of using dedicated hardware. So, not only paged memory is used to protect
the tasks and provide more visible memory but also serves as a "software" replacement
of the MESI protocol.
- The internal TLBs are software-controlled through a set of Special Registers. No microcode
or hardware mechanism is foreseen to help search a page table entry in memory.
An OS exception is triggered whenever a task issues an instruction that accesses a memory
location that is not covered by the internal page table (TLB). Since there will probably be only
four or eight entries of 4KB, 32KB, 256KB or 2048KB each (16 or 32 descriptors for data,
probably fewer for instructions), the OS PTE-miss trap handler must be very carefully coded.
In order to keep good overall performance, the project counts on an efficient OS, and the Linux-like systems are likely to be the best suited because they benefit from the most recent research and advances in kernel technology, smart task schedulers and efficient page-replacement algorithms. The choice of a software page-replacement strategy not only keeps the HW complexity low, but also lets the system benefit from future algorithmic advances.
2.13 The F-CPU stores the state of a task in Context Memory Blocks (CMB). These are very important structures for the OS because the SRB mechanism keeps the handlers from seeing the interrupted tasks for coherency reasons. The OS will deal with these blocks in order to set or modify the properties and access rights of a task, read its registers, or interpret a system call. A context memory block must store all the data that are private to a task in order to fully store and restore it. The endianness of the CMB is not defined.
A Context Memory Block is divided into a variable number of "slots" that are as wide
as the CPU can support (ie, 64 bits for a 64-bit CPU). Each slot contains an individual
global or special register.
The first 64 slots hold the contents of the normal "general" registers. They are
stored and restored by the Smooth Register Backup mechanism. Since R0 is hardwired
to 0, the corresponding slot (the first one) is left to the OS in order to manage
a linked list or any chosen management structure.
The CMB also contains the instruction pointer because it is not directly accessible
by the user in the space of the normal registers.
The CMB holds the access rights and the most important protection flags. The OS modifies
the access rights of a task in the CMB because it can't do it directly in the special registers
(which at this time store the OS's properties...)
The CMB holds the pointer to the task's page table (when paging is enabled).
This page table can be stored at the end of the CMB if the OS decides to do so.
Two last slots are used for multitasking and debugging, in conjunction with the
SRB mechanism : the "next" and "time slice" slots. The "next" slot is a pointer to
another CMB ; the task stored in the CMB can switch automatically to a new task,
whose CMB is pointed to by the "next" field. The "time slice" stores the number of
clock cycles that the task can execute before automatically switching to the "next" task.
This description is not exhaustive and the number of CMB slots will increase in the future, as the needs and the architectures evolve. A certain number of Special Registers are dedicated to the CMB management.
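As an illustration, a CMB could be modelled as below. Only the fields named in the text are included; their exact layout, order and names are assumptions of this sketch, not part of the specification:

```python
# Illustrative model of a Context Memory Block (CMB).
class CMB:
    SLOTS = 64   # one slot per general register; slot 0 is free for the OS

    def __init__(self):
        self.regs = [0] * self.SLOTS   # slot 0: OS-managed (R0 is always 0)
        self.instruction_pointer = 0   # not user-visible in the register set
        self.access_rights = 0         # protection flags, modified by the OS
        self.page_table_ptr = None     # may point to a table at the CMB's end
        self.next = None               # CMB of the task to switch to
        self.time_slice = 0            # cycles before switching to "next"

task = CMB()
task.time_slice = 1000
print(task.SLOTS)   # 64
```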
2.14 The F-CPU can use the CMBs to single-step tasks.
To single-step a task using its CMB, no special device is required (except a brain):
1) Set up the task's CMB with the following parameters : "next" points to the debugger's own
CMB, and "time slice" is set to 1 (or any desired number for multiple stepping).
2) Set the "next" special register to the task's CMB.
3) Execute a RFE instruction (return from exception).
When RFE is executed, the processor automatically switches to the task whose CMB is pointed to by the "next" special register. The processor then loads the CMB's "next" slot into the "next" special register, executes instructions, and switches (back) to the debugger when the time slice has expired. The debugger can then analyze the contents of the task's CMB, its registers and special fields.
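The three steps above can be simulated with a toy model (the field and function names are invented for this sketch, and RFE is modelled as a plain function rather than an instruction):

```python
# Toy simulation of CMB-based single-stepping.
class CMB:
    def __init__(self, name):
        self.name = name
        self.next = None        # CMB to switch to when the slice expires
        self.time_slice = 0     # number of steps to execute

def rfe(next_special_register):
    """Model of RFE: run the task designated by the "next" special
    register for its time slice, then return its "next" CMB."""
    task = next_special_register
    for _ in range(task.time_slice):
        pass                    # one instruction of the debugged task runs
    return task.next            # control comes back to this CMB

debugger = CMB("debugger")
task = CMB("task")
task.next = debugger            # step 1: the task returns to the debugger
task.time_slice = 1             # step 1: execute a single instruction
resumed = rfe(task)             # steps 2-3: load "next", execute RFE
print(resumed.name)   # debugger
```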
A flag in the MSR is also dedicated to single-stepping tasks. The CPU generates a trap after executing any instruction when this flag is set.
Other than single-stepping, the F-CPU will provide the user with traps on special conditions and events, as the implementations allow (this is more implementation-dependent).
2.15 The F-CPU uses a simple protection mechanism until a more sophisticated one is developed. A simple user/supervisor scheme is a good way to start a CPU, but a more refined resource-based protection will enable users to create a more flexible OS.