Part 2 :
2.1 The main characteristics
2.2 The instructions are 32-bit wide
2.3 Register 0 is "read-as-zero/unmodifiable"
2.4 The F-CPU has 64 registers
2.5 The F-CPU is a variable-size processor
2.6 The F-CPU is SIMD-oriented
2.7 The F-CPU has generalized registers
2.8 The F-CPU has special registers
2.9 The F-CPU has no stack pointer
2.10 The F-CPU has no condition code register
2.11 The F-CPU is endianless
2.12 The F-CPU uses paged memory
2.13 The F-CPU stores the state of a task in Context Memory Blocks
2.14 The F-CPU can use the CMBs to single-step tasks
2.15 The F-CPU uses a simple protection mechanism
2.1 The main characteristics : The CPU described here can be thought of as a crossover between an R2000 chip (or an early ALPHA) and a CDC6600 computer. Some constraints are similar : the F-CPU must be as simple and as efficient as possible. From the R2000, it inherits the main RISC characteristics : fixed-size instructions, the register set, and a chip size bound by the current technology. In the CDC6600, FC0 finds its execution scheme, the scoreboard, the multiple parallel execution units and, most of all, the inspiration for smart techniques that ease both design and programming.
The following text is a step-by-step description of the currently developed F-CPU. The features are described in increasing depth and become interdependent, so it is recommended to read them from the beginning :-) We will begin with the most basic F-CPU characteristics before discussing more critical and hardware-dependent subjects.
2.2 The instructions are 32-bit wide. This is a heritage of the traditional RISC processors, and the benefits of fixed-size instructions are not discussed anymore, except for certain niche applications. Even the microcontroller market is being invaded by RISC cores with fixed-size instructions.
The instruction size deserves a bit more discussion anyway. It is clear that a 16-bit word can't contain enough space to encode 3-operand instructions involving tens of registers and operation codes. There are some processors with 24- and 48-bit instructions, but they are limited to niche markets (like DSP) and they don't fit in power-of-two-sized cache lines ; accessing memory on a byte basis to fetch them becomes too complex. Because the F-CPU is mainly a 64-bit processor, 64-bit instruction words packing two instructions have been proposed, but two independent 32-bit instructions can each be atomic, while a 64-bit pair can't be split. There is also the Merced (IA64), which has 128-bit instruction words, each containing 3 opcodes and register dependency information. Since we use a simple scoreboard, and because IA64-like (VLIW) compilers are very tricky to program, we let the CPU core decide whether to block the pipeline when needed, thus allowing a wide range of CPU core types to execute the same simple instructions and programs.
Since the F-CPU microarchitecture was not defined at the beginning of the project, the instructions had to execute on a wide range of processor types (pipelined, superscalar, out-of-order, VLIW, whatever the future will create). A fixed-size, 32-bit instruction set seems to be the best choice for simplicity and scalability in the future. Core-dependent optimisations can be made on the binaries by applying specific scheduling rules, but the application will still run on other family members that have a completely different core.
2.3 Register 0 is "read-as-zero/unmodifiable". This is another classical "RISC" feature that is meant to ease coding and reduce the opcode count. This was valuable for earlier processors, but current technologies need specific hints about what an instruction does. It is dumb today to code "SUB R1,R1,R1" to clear R1, because it needs to fetch R1, perform a 64-bit subtraction and write the result, while all we wanted was simply to clear R1. This latency was hidden on the early MIPS processors, but current technologies suffer from this kind of coding technique, because every step contributing to the operation is costly. And if we want to speed up these instructions, the instruction decoder gets more complex.
So, while the R0=0 convention is kept, there is more emphasis on specific instructions. For example :
- "SUB R3,R1,R2", which compares R1 and R2 (generally to know whether one is greater than or equal to the other), can be replaced in F1 by "CMP R3,R1,R2", because CMP uses a special comparison unit with less latency than a subtraction (after all, we don't care about the numerical result, we simply want its properties).
- "MOV R1,R0" clears R1 with no latency, because the value of R0 is already known (hardwired to zero).
2.4 The F-CPU has 64 registers, while RISC processors traditionally have 32. More than a religious war, this subject proves that design choices are deeply influenced by a lot of parameters (this looks like a comp.arch thread). Let's look at them :
A) "It has been proved that 8 registers are plain enough for most algorithms." is a deadbrain argument that appears sometimes. Let's see why and how this conclusion has been made :
- it is an OLD study,
- it has been based on schoolbook algorithm examples,
- memory was less constraining than today (even though magnetic core memory was slow) and memory-to-memory instructions were common,
- chips had less room than today (tens of thousands of transistors vs. tens of millions),
- we ALWAYS use algorithms that are "special", because each program is a modification and an adaptation of common cases to special cases (we live in a real world, didn't you know ?),
- anyone who has ever programmed x86 processors in assembly language knows how painful it is...
The real reason for having a lot of registers is to reduce the need to store and load from memory. We all know that even with several cache memory levels, classical architectures are memory-starved, so keeping more variables close to the execution units reduces the overall execution latency.
B) "If there are too much registers there is no room for coding instructions" : that is where the design of processors is an art of balance and common sense. And we are artists, aren't we ? Through register renaming, the number of physical register can be virtually extended to any physical limit.
C) "The more there are registers, the longer it takes to switch between tasks or acknowlege interrupts" is another reason that is discussed a lot. Then, i wonder why Intel has put 128*2 registers in IA64 ???
It is clear anyway that *FAST* context switching is an issue, for a lot of obvious reasons. Several well-known techniques exist, like register windows (a la SPARC), register bank switching (as in DSPs) or memory-to-memory architectures (not widely known), but none of them can be used in a simple design and a first prototype, where transistor count and complexity are an issue.
In the discussions on the mailing list, it appeared that :
- most of the time is actually spent in the scheduler's code (if we're discussing OS speed), so the register backup issue is the tree that hides the forest,
- the memory bursts caused by a context switch or an interrupt waste most of the time when memory bandwidth is limited (common sense and performance measurements on a P2 will do the rest if you're not convinced),
- a smart programmer will interleave register backup code with IRQ handler code : an instruction usually needs one destination and two sources, so if the CPU executes one instruction per cycle, there is NO need to save the whole register set in one cycle. In fewer words, there is no need for register banks.
These facts led to the design of the "Smooth Register Backup" (SRB), a hardware technique that does automatically what the software does when it interleaves the backup code with the computation code.
A code like this :
IRQ_HANDLER:
    clear R1          ; cycle 1
    load  R2,[imm]    ; cycle 2
    load  R3,[imm]    ; cycle 3
    OP    R1,R2,R3    ; cycle 4
    OP    R2,R3,R0    ; cycle 5
    store R2,[R3]     ; cycle 6
    ....

can be a common beginning of an IRQ handler. With the backup of the interrupted task's registers interleaved, as the SRB does in hardware, it becomes :

IRQ_HANDLER:
    store R1,[imm]    clear R1          ; cycle 1
    store R2,[imm]    load  R2,[imm]    ; cycle 2
    store R3,[imm]    load  R3,[imm]    ; cycle 3
                      OP    R1,R2,R3    ; cycle 4
                      OP    R2,R3,R0    ; cycle 5
                      store R2,[R3]     ; cycle 6
    ....
The conclusion of these discussions is that 64 registers are not too many.
The other problem is : is 64 enough ?
Since the IA64 has 128 registers, and since superscalar processors need more register ports, having more registers keeps the pressure on each register port from increasing. As a rule of thumb, a processor needs (instructions per cycle) x (pipeline depth) x 3 registers to avoid register stalls on a code sequence without register dependencies. And since the pipeline depth and the number of instructions per cycle both increase to get more performance, the register set's size increases too. 64 registers allow a 4-issue superscalar CPU to have 5 pipeline stages, which looks complex enough. Later implementations will probably use register renaming and out-of-order techniques to get more performance out of common code, but 64 registers are enough for a prototype.
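Applying the rule of thumb to these figures :

    registers needed = (instructions per cycle) x (pipeline depth) x 3
                     = 4 x 5 x 3 = 60 <= 64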
To increase the number of instructions executed during each cycle, future F-CPUs will need explicit register renaming. This will allow an F-CPU computer to have tens of execution units without changing the instruction format.
2.5 The F-CPU is a variable-size processor. This is a controversial side of the project that was finally accepted because it serves the F-CPU goal of forward compatibility. There are mainly two reasons behind this choice :
- As processors and families evolve, the data width becomes too tight. Adapting the data width on a case-by-case basis led to the complexities of the x86 or the VAX, which are considered good examples of how awful an architecture can become.
- We often need to process data of different sizes at the same time, such as pointers, characters, floating-point and integer numbers (for example in a floating-point-to-ASCII function). Treating every datum with the same large size is not optimal : we can spare registers if several characters or integers are packed into one register, which is then rotated to access each subpart.
We need from the beginning a good way to adapt the size of the handled data on the fly. And we know that the width of the data to process will increase a lot in the future, because it is almost the only remaining way to increase performance : we can't count on the regular performance increase provided by new silicon processes, because they are expensive and we don't know whether they will continue. The best example of this data parallelism is SIMD programming, as in the recent MMX, SSE, Alpha, PPC or SPARC instruction set extensions, where one instruction performs several operations. From 64 bits, they evolve to 128 and 256 bits per instruction, and nothing keeps this width from increasing, while the increase gives more performance. Of course, we are not building a PGP-breaker CPU, and 512-bit integers are almost never needed : the performance lies in the parallelism, not in the width. For example, it is very useful to compare characters in parallel, as during a substring search : the performance of such a program is directly proportional to the width of the data the CPU can handle.
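As an illustration of this kind of parallelism, here is a minimal sketch in portable C that searches for a character 8 bytes at a time, using a classical SWAR zero-byte test on 64-bit words (the function names are ours, not part of any F-CPU API) :

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Nonzero if the 64-bit word v contains at least one zero byte. */
    static int has_zero_byte(uint64_t v)
    {
        return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
    }

    /* Find the first occurrence of c in s, testing 8 characters per step. */
    const char *find_char(const char *s, size_t n, char c)
    {
        uint64_t pattern = 0x0101010101010101ULL * (uint8_t)c;
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            uint64_t w;
            memcpy(&w, s + i, 8);            /* unaligned-safe 64-bit load */
            if (has_zero_byte(w ^ pattern))  /* one of the 8 bytes matches c */
                break;                       /* locate it byte by byte below */
        }
        for (; i < n; i++)
            if (s[i] == c)
                return s + i;
        return NULL;
    }

With 256-bit registers, the same code pattern would test 32 characters per iteration : the speed follows the width, which is exactly the point made above.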
The next question is : how wide ?
Because fixed sizes give rise to problems at one time or another, settling on an arbitrarily large size is not a good solution. And, as seen in the substring search example, the wider the better. So the solution is : not deciding the width of the data we process before execution.
The idea is that software should run as fast as possible on every machine, whatever the family or generation. The chip maker decides on the width it can afford, but this choice is independent from the programming model, because it can also take into account the price, the technology, the need, the performance...
So, in a few words : we don't know a priori the size of the registers. The application, at run time, recognizes the computer configuration with special instructions, and then calibrates the loop counts or modifies the pointer updates. This is almost the same process as loading a dynamic library...
Once the program has recognized the characteristic width of the data the computer can manage, it can run as fast as the computer allows, as sketched below. Of course, if the application uses a size wider than the hardware supports, this generates a trap that the OS can handle as a fault or as a feature to emulate.
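A sketch of such width-independent code, in C : fcpu_register_width() is an invented name standing for whatever special instruction reports the register size, and the inner loop stands for one register-wide SIMD operation on the real machine.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical primitive : reports the register width in bytes. */
    extern size_t fcpu_register_width(void);

    /* Add two arrays of 64-bit words ; n_words is assumed to be a
       multiple of the machine's register width. */
    void add_arrays(uint64_t *a, const uint64_t *b, size_t n_words)
    {
        size_t step = fcpu_register_width() / sizeof(uint64_t);
        for (size_t i = 0; i < n_words; i += step)
            for (size_t j = 0; j < step; j++)   /* one SIMD operation */
                a[i + j] += b[i + j];
    }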
Then the question is : how ?
We have to consider the whole process of programming, coding, making processors and enhancing them. The easiest solution is to use a lookup table that interprets the 2 bits of the size flag in the instructions, as defined in Part 5 : The F-CPU Instruction Set Architecture. The flags are by default interpreted this way :

    flag 00 -> 8 bits (byte)
    flag 01 -> 16 bits
    flag 10 -> 32 bits
    flag 11 -> 64 bits (the maximum width of the F1)
At least, the scalability problem is known and solved from the beginning, and the coding techniques won't change between processor generations. This guarantees the stable future of the F-CPU, and the old "RISC" principle of letting the software solve the problems is applied once again. We can consider that the prototype F1s will be hardwired to the default values, and that attempting to modify them will trigger a fault. But later, 4096-bit F-CPUs will be able to run programs designed on 128-bit F-CPUs and vice versa.
2.6 The F-CPU is SIMD-oriented because it is an easy way to increase the number of operations performed each cycle without increasing the control logic. The variable-size registers allow endless scalability, and thus endless performance increases, but each instruction performing operations on data must carry a SIMD flag, so as to differentiate a scalar operation from a packed (SIMD) one.
2.7 The F-CPU has generalized registers, meaning that integer numbers are mixed with pointers and floating-point numbers in a single register set. The most common objection comes from the hardware side, because a first effect is to increase the number of read/write ports of the register set (this is almost similar to having twice as many registers).
The first argument on the F-CPU side is that software gets simpler, and that there are hardware solutions to that problem. The problem starts with the algorithms themselves : some are purely integer-based, while others need a lot of floating-point values. Having a split register set for integer and floating-point numbers would handicap both kinds of algorithm, because the specialized registers would go unused (the FP set would sit idle during programs like a mailer or a bitmap graphics editor, while a lot of FP registers are needed during ray-tracing or simulations).
Another software aspect is compilation, where register allocation algorithms are critical for performance. Having a single register "pool" eases the decisions.
The second answer to the hardware problem is in the hardware itself. The first F-CPU chip, the F1, will be a single-issue pipelined processor, where only three register read ports are needed : there is no register set problem at the beginning.
Later chips, issuing more instructions per cycle, will probably use a technique dear to the team : each register has attribute (or "property") bits that indicate whether the register is used as a pointer, a floating-point number, etc., so the registers can be mapped to different physical register sets while still being unified from the programming point of view. The attributes are regenerated automatically and don't need to be saved or restored during context switches.
2.8 The F-CPU has special registers that store the context of the processor, manage the vital functions and ensure protection.
These special registers can be accessed only through a few special instructions, which trigger a trap if the register does not exist or may not be accessed from the current running context. Since almost everything is managed through these special registers, they are the key to protection in a modern multi-user, multi-tasking operating system. They are also essential to recognize the CPU's configuration, and the "map" will evolve a lot in the future, adding more features without touching the instruction set.
2.9 The F-CPU has no stack pointer. Or, more exactly, it has no dedicated stack pointer. In fact, it has no stack at all, because each register can be used to access memory. A single hardwired stack pointer would cause the problems found in CISC processors and require special tricks to handle them. For example, several push & pop instructions cause multiple uses of the same register in a single cycle on a superscalar processor, which requires special hardware.
In the RISC world, conventions (the ABI) are used to decide how applications communicate or how the registers are initialized at startup, and provided you save the registers between two calls, nothing keeps you from having 60 stacks at once if your algorithm requires it.
Accessing a stack is performed with the single load/store instruction, which has post-increment (only) capability. Considering an expand-down stack pointed to by R3, we will code :

pop :
    load.64 R2,[R3]+8      or (updated syntax)   load 8,r3,r2

push :
    store.64 R2,[R3]-8     or (updated syntax)   store -8,r3,r2

Since the addition and the memory fetch are performed at the same time, the updated pointer is available right after the instruction accesses memory.
2.10 The F-CPU has no condition code register. It is not that we don't like them, but they cause trouble when the processor scales up in frequency and in instructions per cycle : managing a few global bits becomes as complex as the stack pointer described above.
The solution to this problem is the classical RISC fashion : a register is either zero or not, and a branch or a conditional operation is executed if a source register is zero (or not). Several conditions can therefore be set up at once, without the need to manage a fixed set of flag bits (for example during context switches). We don't use predication bits as found on some other architectures.
But, as explained later, reading a register is rather "slow" in the FC0, and this latency could slow down a large number of usual instructions. The solution is not to read the register itself, but a "cached" copy of the needed attribute. As described above for the "attribute" or "property" bits, each register has a zero-attribute bit which is regenerated each time the register is written : while the register is being written, the value present on the write bus is checked for zero, and one bit out of 63, corresponding to the written register, is set or cleared depending on the result. This set of "transparent latch" gates is situated close to the instruction decoder, in order to reduce the latency of conditional instructions. Since the bits are regenerated at each write, there is no need to save or restore them during context switches, and there are no coherency issues.
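A behavioural sketch of this mechanism, in C (this models the logic, not the actual gates) :

    #include <stdint.h>

    static uint64_t regs[64];      /* the architectural registers */
    static uint64_t zero_attr;     /* bit r set <=> register r holds zero */

    /* Every write to the register file regenerates the zero attribute :
       the write-bus value is tested for 0 and the bit of the written
       register is updated, so conditional instructions never need to
       read the (slower) register file. */
    static void write_register(int r, uint64_t value)
    {
        if (r == 0)
            return;                        /* R0 stays hardwired to zero */
        regs[r] = value;
        if (value == 0)
            zero_attr |= 1ULL << r;
        else
            zero_attr &= ~(1ULL << r);
    }

    /* A conditional branch or operation only consults the cached bit. */
    static int is_zero(int r)
    {
        return (r == 0) || ((zero_attr >> r) & 1);
    }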
There is no carry flag either. Addition with carry is performed through a special form of the instruction that writes the carry into the general-purpose register next to the result register. This avoids any coherency trouble with context switches and allows a carry to be used with SIMD instructions : it is completely scalable and safe.
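The principle, sketched in C for one 64-bit lane (the real instruction name and encoding are defined in Part 5 ; this only models the behaviour) :

    #include <stdint.h>

    /* Add-with-carry : the carry out of the addition is written to the
       register next to the destination register. */
    static void add_with_carry(uint64_t regs[], int d, int s1, int s2)
    {
        uint64_t a = regs[s1], b = regs[s2];
        uint64_t sum = a + b;
        regs[d]     = sum;
        regs[d + 1] = (sum < a);   /* unsigned overflow = carry out */
    }

With a SIMD form, each lane's carry simply lands in the corresponding lane of the next register, so the scheme scales with the register width.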
2.11 The F-CPU is "endianless", because neither big endian alone nor little endian alone satisfies everybody. To solve this problem, there is an endian bit in the load/store instructions. The processor itself is not much biased towards one endianness (well, due to the SIMD nature of the CPU, little endian is preferred) and the instructions themselves are not subject to this debate : the choice is up to the end user. For further information, read the discussions about the Endian flag.
2.12 The F-CPU uses paged memory to provide all executing tasks with a large, private, linear, virtual memory. Page-based protection is also a simple, software way to protect the tasks' memory spaces from each other. No special definition or mechanism has been settled yet, but we assume the following characteristics :
- The pages will have several sizes, for example 4KB, 32KB, 256KB and 2048KB, in order to reduce the number of page descriptors (and the pressure on the malloc routines !). A few page descriptors for arbitrarily sized blocks (in powers of two) would also be necessary to manage regions larger than 2MB : if you have 128MB of RAM in your computer, you would need 64 descriptors of 2MB, more than the CPU can hold on-chip. The proposed granularity for these large blocks is 128KB (base address and size, in a "fence" system), and the CPU could store two such descriptors on-chip.
- The pages could be compressed on the fly when flushed to the hard disk (especially the huge pages). This is an optional feature, though, because it doesn't decrease the latency of the hard disk ; but it can reduce the main bus' contention.
- Some space in the cache memory hierarchy could be reserved to hold the most important pages. Yet the OS is in the best position to do it...
- In the early implementations, the cachability flags and the read/write flags of the pages will be used, along with OS functions and traps, to ensure cache coherency in multi-CPU systems, instead of dedicated hardware. So paged memory not only protects the tasks and provides more visible memory, it also serves as a "software" replacement for the MESI protocol.
- The internal TLBs are software-controlled through a set of Special Registers. No microcode or hardware mechanism is foreseen to search for a page table entry in memory : an OS exception is triggered whenever a task issues an instruction that accesses a memory location that is not in the internal page table (TLB). Since there will probably be only four or eight entries of 4KB, 32KB, 256KB or 2048KB each (32 descriptors shared between data and instructions in the first implementations), the OS PTE-miss trap handler must be very carefully coded ; a sketch follows this list. [Remember my motto ? "Coding carefully has always paid !"]
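An illustrative sketch of such a handler, in C. The names sr_read(), sr_write(), rfe() and the SR_* indices are invented stand-ins for the special-register access instructions, and os_lookup_pte() stands for the OS's own page table walk ; none of them is a defined F-CPU interface.

    #include <stdint.h>

    struct pte { uint64_t vpn, ppn_flags; };

    extern uint64_t sr_read(int sr);                  /* hypothetical */
    extern void     sr_write(int sr, uint64_t value); /* hypothetical */
    extern void     rfe(void);                        /* hypothetical */
    extern struct pte *os_lookup_pte(uint64_t addr);  /* OS page walk */
    extern void     os_page_fault(uint64_t addr);     /* does not return */

    enum { SR_FAULT_ADDR, SR_TLB_INDEX, SR_TLB_VPN, SR_TLB_PPN,
           TLB_ENTRIES = 32 };

    void pte_miss_handler(void)
    {
        uint64_t addr = sr_read(SR_FAULT_ADDR);
        struct pte *e = os_lookup_pte(addr);
        if (!e)
            os_page_fault(addr);          /* not mapped at all : real fault */
        static unsigned victim;           /* simple round-robin replacement */
        victim = (victim + 1) % TLB_ENTRIES;
        sr_write(SR_TLB_INDEX, victim);   /* install the entry on-chip */
        sr_write(SR_TLB_VPN, e->vpn);
        sr_write(SR_TLB_PPN, e->ppn_flags);
        rfe();                            /* retry the faulting access */
    }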
In order to keep good overall performance, the project counts on an efficient OS : the LINUX-likes are likely to be the best-suited systems, because they benefit from the most recent research and advances in kernel technology, smart task schedulers and efficient page replacement algorithms. The choice of a software page replacement strategy not only keeps the hardware complexity low, but also lets the system benefit from future algorithmic advances. And if the features are not used, there is no dangling hardware...
2.13 The F-CPU stores the state of a task in Context Memory Blocks (CMB). These are very important structures for the OS, because the SRB mechanism keeps the handlers from directly seeing the interrupted tasks, for coherency reasons. The OS deals with these blocks in order to set or modify the properties and access rights of a task, read its registers, or interpret a system call. A Context Memory Block must store all the data that is private to a task, in order to fully save and restore it. The endianness of the CMB is not defined.
A Context Memory Block is divided into a variable number of "slots" that are as wide as the CPU can support (i.e., 64 bits for a 64-bit CPU). Each slot contains an individual general or special register.
The first 64 slots hold the contents of the normal "general" registers. They are stored and restored by the Smooth Register Backup mechanism. Since R0 is hardwired to 0, the corresponding slot (the first one) is left to the OS in order to manage a linked list or any chosen management structure.
The CMB also contains the instruction pointer, because it is not directly accessible by the user in the space of the normal registers. The CMB also holds the access rights and the most important protection flags : the OS modifies the access rights of a task in its CMB because it can't do it directly in the special registers (which, at that time, hold the OS's own properties...).
The CMB holds the pointer to the task's page table (when paging is enabled). This page table can be stored at the end of the CMB if the OS decides to do so.
The two last slots are used for multitasking and debugging, in conjunction with the SRB mechanism : the "next" and "time slice" slots. The "next" slot is a pointer to another CMB : the task stored in the CMB can switch automatically to a new task, whose CMB is pointed to by the "next" field. The "time slice" slot stores the number of clock cycles that the task can execute before automatically switching to the "next" task.
This description is not exhaustive and the number of CMB slots will increase in the future, as the needs and the architectures evolve. A certain number of Special Registers are dedicated to the CMB management.
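As a summary, a sketch of the CMB layout in C. The exact slot order and the complete slot list are NOT final ; this only names the fields mentioned in this section.

    #include <stdint.h>

    typedef uint64_t slot_t;              /* one slot = maximum register width */

    struct cmb {
        slot_t      regs[64];             /* slot 0 (R0) is free for the OS,
                                             e.g. as a linked-list pointer */
        slot_t      instruction_pointer;  /* not visible as a normal register */
        slot_t      access_rights;        /* protection flags of the task */
        slot_t      page_table;           /* pointer to the task's page table */
        struct cmb *next;                 /* CMB of the task to switch to */
        slot_t      time_slice;           /* cycles to run before switching */
    };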
2.14 The F-CPU can use the CMBs to single-step tasks. No special device is required (except a brain) :
1) Set up the task's CMB with the following parameters : "next" points to the debugger's own CMB, and "time slice" is set to 1 (or any desired number for multiple stepping).
2) Set the "next" special register to the task's CMB.
3) Execute a RFE instruction (return from exception).
When RFE is executed, the processor automatically switches to the task whose CMB is pointed to by the "next" special register. The processor then loads the CMB's "next" slot into the "next" special register, executes the instructions, and switches (back) to the debugger when the cycle count has expired. The debugger can then analyze the contents of the task's CMB, its registers and special fields.
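The three steps above, written as a C sketch reusing the invented sr_write()/rfe() helpers and the struct cmb from the sketch in section 2.13 (SR_NEXT_CMB is an invented index for the "next" special register) :

    #include <stdint.h>

    extern void sr_write(int sr, uint64_t value);  /* hypothetical */
    extern void rfe(void);                         /* hypothetical */

    enum { SR_NEXT_CMB = 100 };                    /* invented index */

    void single_step(struct cmb *task, struct cmb *debugger)
    {
        task->next       = debugger;      /* 1) come back to the debugger... */
        task->time_slice = 1;             /*    ...after one instruction */
        sr_write(SR_NEXT_CMB, (uint64_t) task);  /* 2) point "next" at the task */
        rfe();                            /* 3) switch to the task ; when its
                                             time slice expires, the CPU
                                             switches back to the debugger */
    }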
A flag in the MSR is also dedicated to single-stepping tasks. The CPU generates a trap after executing any instruction when this flag is set.
Beyond single-stepping, the F-CPU will provide the user with traps on special conditions and events, as far as the implementations allow (this is rather implementation-dependent).
2.15 The F-CPU uses a simple protection mechanism, until a more sophisticated one is developed. A simple user/supervisor scheme is a good way to start a CPU, but a more refined resource-based protection will enable users to create a more flexible OS.
It is not "a good thing" to use protection level rings because some pieces of software, for example in a microkernel OS, are dedicated to a certain task and the rings don't isolate their function properly. OTOH, a task that is dedicated to handle page table entry (PTE) misses only needs to access the associated Special Registers and the hard disk drive : if it fails, there is no consequence on other tasks that are dedicated to communications or memory management, even though they are "trusted" : they are normal tasks but their property flags allow them to access a certain hardware.