F-CPU Design Team
FCPU MANUAL REV. 0.2

            Part 4 :

Advanced topics

Summary

Previous part (3)

Next part (5)

            4.1 Foreword
            4.2 The exceptions
            4.3 The Smooth Register backup mechanism

4.1 Foreword :

            A superpipelined CPU core does not only imply the use of variable-length pipelines. Some characteristics of the FC0, and of the F-CPU in general, are discussed here. They are not only "features" but design philosophies that follow from the choices discussed in the first part of this document.

4.2 The exceptions :

            A processor of any kind (CISC, RISC or any other architecture) generates a lot of exceptions, interrupts, traps and system calls (context switches are not the point here). Each pipeline stage can generate several errors that the OS must handle, after which the application must either "restart" the trapped instruction or continue after the trap. This implies that the whole context must be saved, but which one ?
            Control can be transferred to the OS, an interrupt handler or a trap handler at any time, at any stage of the pipeline. For example, a classic RISC pipeline comprises (and generates) :
            - IF (Instruction Fetch) : page fault
            - ID (Instruction Decode) : invalid instruction, trap instruction, privileged instruction.
            - EX (EXecute) : divide by zero, overflow, any IEEE FP math error...
            - MEM (MEMory access) : page fault, protection error
            Not only must the processor trigger the correct handler (several errors can occur in the same cycle) but it must also preserve or flush the correct stages of the pipeline. And since the FC0 completes operations out of order, this is too complex to do without a lot of buffers everywhere as well as sophisticated bookkeeping, which we can't afford for obvious reasons. We need to keep precise exceptions anyway, and the ability to stop the pipeline at any time without losing data that would require some code to be reexecuted. We need a simple, predictable yet efficient pipeline whose architecture is not influenced by faults.

            The simplest alternative to this problem is dictated by good sense : make all the exceptions occur at one place, before the potentially faulting instructions enter the pipeline and require additional hardware. This means : NO INSTRUCTION IS ISSUED IF IT CAN TRIGGER AN EXCEPTION or, in other words, ALL EXCEPTIONS MUST BE CHECKED AT DECODE TIME SO AS TO PREVENT THEM FROM OCCURRING IN THE EXECUTION PIPELINE. Remember this clearly and meditate on it, since it also influences how the instruction set is designed.
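
            As a pure software illustration, here is a minimal sketch of such a decode-time gate in C (all names and fields are hypothetical; the real FC0 decoder is hardware, and the exact set of checks is discussed cause by cause below) :

    #include <stdbool.h>

    enum issue { ISSUE_OK, ISSUE_TRAP, ISSUE_STALL };

    struct insn {
        bool fetch_fault;   /* tagged by the fetcher (page fault at IF) */
        bool valid_opcode;  /* recognized by the decoder ? */
        bool privileged;    /* needs supervisor mode ? */
        bool is_divide;
        int  divisor_reg;   /* register number of the divisor */
        bool is_memory;
        bool addr_checked;  /* speculative page check already done ? */
    };

    /* Everything that could fault is checked here, at the ID stage,
     * so nothing downstream of the decoder can ever trap. */
    enum issue decode_gate(const struct insn *i,
                           const bool zero_flag[64], bool supervisor)
    {
        if (i->fetch_fault)               return ISSUE_TRAP;
        if (!i->valid_opcode)             return ISSUE_TRAP;
        if (i->privileged && !supervisor) return ISSUE_TRAP;
        if (i->is_divide && zero_flag[i->divisor_reg])
            return ISSUE_TRAP;            /* divide by zero, caught here */
        if (i->is_memory && !i->addr_checked)
            return ISSUE_STALL;           /* wait for the page table lookup */
        return ISSUE_OK;                  /* can not fault downstream anymore */
    }
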
            The good side of this choice is that there is no "trap source" register as in the MIPS CPUs. All exceptions are caught at the same place and are disambiguated and ordered implicitly. Another important consequence is that there are no temporary buffers or "rename registers" (as they are called in the PowerPC). The previously described OOOC pipeline is not changed at all and the critical datapath does not suffer from additional buffers. There is no register allocation bookkeeping, nor added control logic.

            The other side, which is about the constraints, is discussed here. The most obvious limitations have simple workarounds. The first problem is : can we detect all the exceptions at decode time, and how ?

            First cause : page fault at instruction fetch time.
First, we are not absolutely sure that we will even decode the next incoming instruction, since the last instruction of a page could be a jump or any similar instruction. So why trigger the trap now ? The easy workaround to this problem is to "tag" the instruction as faulting or, better, to replace it with a trap instruction (which requires less hardware). So, if the instruction is executed, it will trap. Simple, isn't it ?
            Of course, if a page fault is triggered by the instruction prefetch unit, it is good practice to directly fetch the missing page before it is needed. Just as a precaution.
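
            For illustration, the tag-or-replace trick could be modelled like this in C (the trap opcode value and the toy memory model are made up for the example) :

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS 12                   /* toy 4 KB pages */
    #define NPAGES    16                   /* toy address space */
    #define TRAP_IFETCH_FAULT 0xDEAD0001u  /* made-up trap instruction */

    static bool     page_present[NPAGES];
    static uint32_t memory[NPAGES << (PAGE_BITS - 2)];

    /* Fetch one instruction word. On a page fault, do NOT trap now :
     * return a trap instruction instead, so the trap only fires if
     * this word is really executed (it may hide behind a taken jump). */
    uint32_t fetch_word(uint32_t pc)
    {
        uint32_t page = pc >> PAGE_BITS;
        if (page >= NPAGES || !page_present[page])
            return TRAP_IFETCH_FAULT;
        return memory[pc >> 2];
    }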

            Second cause : invalid instruction, privileged instruction...
Why bother ? We will trap without worry. Depending on the type of trap, we advance the instruction pointer or not, fetch the code needed to handle it, and begin to back up the registers with the SRB mechanism. The preceding instructions don't need to be flushed from the pipeline, because the SRB communicates with the scoreboard to back up the registers in the correct order. When the pipeline has "naturally" drained the old application's instructions, the registers are saved and the faulting application can restart later without any loss or reexecution.

            Third cause : math fault.
The saturation (or overflow) exception (a la MIPS) is not implemented. The IEEE floating-point instructions have a "compliance" flag that stops instruction issue until the result is known to be "safe"; otherwise the result saturates and does not trigger any trap. The "division by zero" condition is easily detected at the decode stage with the ZERO property bit of the divisor register. At the same time, we can detect whether the result will be zero and issue a "clear" operation instead of the divide operation.
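
            A minimal sketch of this decode-time decision, assuming the ZERO property bits are readable as a simple array (hypothetical names) :

    #include <stdbool.h>

    enum action { DO_DIVIDE, DO_CLEAR_DEST, DO_TRAP };

    /* Decide at ID time what a division becomes, using the ZERO
     * property bit that the register set keeps for every register. */
    enum action decode_divide(int dividend_reg, int divisor_reg,
                              const bool zero_flag[64])
    {
        if (zero_flag[divisor_reg])
            return DO_TRAP;        /* division by zero : trap before issue */
        if (zero_flag[dividend_reg])
            return DO_CLEAR_DEST;  /* 0/x == 0 : a cheap "clear" instead */
        return DO_DIVIDE;          /* the real multi-cycle division */
    }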

            Fourth cause : page fault, invalid address fault.
            We can consider that memory is protected at page granularity, so a page fault will trigger protection-checking code before the page is loaded. Detecting a page fault is very simple : we check the address against the values contained in a page table. If the address does not correspond to an available page, it is a page fault : we trap.
            Now, the problem is to have this status (page present or not ?) at decode time. Let's be smart, because memory accesses represent almost half of the executed instructions !
The alternative is to use a mechanism similar to the ZERO "property" bit of each register. This means that when a value is written to the register set through the Xbar, some ports of the Xbar communicate the value to the page table. In one or two cycles, the status is ready for the ID stage; this is a speculative check that is transparent to the instruction set architecture. During this page check, we can also check the address range, verify whether the value is in the L1 cache and, if so, indicate in which bank it is, prefetch the cache line, etc.
            An obvious problem, though, is that we can't seriously check all the values flowing through the Xbar to the register set. Not only is this not always useful, but it also consumes power. The simplest way (for the prototype) is to check the results of pointer updates, since they are the most likely to be reused soon as pointers.
            For more sophisticated architectures, another "transparent tag", saying that a register is used as a pointer, can be very useful. We can allow, for example, only a few registers to hold this tag, something like 16 (64/4 sounds reasonable), and the flag would be set each time a memory access is performed with the register. The flags would be allocated with an LRU mechanism using a 4-bit down counter. This way, when the ID stage recognizes a memory read/write instruction, it checks the pointer flag and, if it is set, sends the associated information to the L/S unit (information like : which L1 bank the data is in, or which buffer, etc.), or it traps if the page table lookup returned a negative result. If the pointer flag is not set, the ID stage pauses for a page table lookup and sets the pointer flag. Of course, like all transparent flags, their value is not saved during context switches and is regenerated automatically as soon as they are used. In the absence of explicit flags in the instructions, this is a rather simple way to reduce the table lookup overhead, and the address can be checked BEFORE it is needed. The L/S unit is only in charge of buffering the data that flows to/from memory and the caches. This last detail invalidates the drawing of figure 3, where the page table was stored in the L/S unit.
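
            A possible software model of this pointer-flag table follows; the entry layout, the stubbed page table walk and the return convention are assumptions for the example, not the actual design :

    #include <stdbool.h>
    #include <stdint.h>

    #define NENTRIES 16   /* 64/4, as suggested above */

    struct ptr_entry {
        bool    valid;
        uint8_t reg;      /* which of the 64 registers is tagged */
        uint8_t lru;      /* 4-bit down counter */
        bool    mapped;   /* cached result of the page table lookup */
        /* could also cache : L1 bank, buffer number, etc. */
    };

    static struct ptr_entry ptab[NENTRIES];

    static bool page_table_walk(uint8_t reg)  /* stub for the example */
    {
        (void)reg;
        return true;
    }

    /* Called by ID for every memory read/write instruction. Returns
     * whether the address is mapped (if not : trap). On a miss, the
     * decoder pauses for a page table walk and an entry is allocated,
     * evicting the entry with the lowest down counter (the LRU one). */
    bool pointer_flag_lookup(uint8_t reg)
    {
        int i, hit = -1, victim = 0;

        for (i = 0; i < NENTRIES; i++) {
            if (ptab[i].valid && ptab[i].reg == reg)
                hit = i;
            else if (ptab[i].lru > 0)
                ptab[i].lru--;                /* the other entries age */
            if (ptab[i].lru < ptab[victim].lru)
                victim = i;
        }
        if (hit >= 0) {
            ptab[hit].lru = 15;               /* refresh on use */
            return ptab[hit].mapped;
        }
        ptab[victim].valid  = true;           /* ID pauses, then : */
        ptab[victim].reg    = reg;
        ptab[victim].lru    = 15;
        ptab[victim].mapped = page_table_walk(reg);
        return ptab[victim].mapped;
    }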

            There, almost all exception causes are covered and their workarounds have been explained. There is no visible impact on the ISA, but the coding rules get tighter, as in a superscalar processor. New exceptions will probably reuse the same idea as the existing ones : a dynamic flag. This way, programming the FC0 looks almost like programming a normal RISC CPU with some additional coding rules.

4.3 The Smooth Register backup mechanism :

            As described in the first section of the document, in the "64 registers" discussion, one alternative to register windowing, banked register sets or memory-to-memory architectures is to implement a "Smooth Register Backup" (SRB for short) for automatic register saving. It is not a usual feature in a microprocessor : it is characterized by communication with the scoreboard and the use of a "find first flag" algorithm. The whole mechanism is rather simple, as we will describe here (even though I seem to rant too much, thus : read slowly, then reread more slowly).

Note : Depending on its actual use and usefulness, the SRB mechanism may be removed from the F-CPU with minimal impact on the overall architecture, instruction set and application software. Some drivers and kernels may need additional, manual register backup code. Other similar techniques can also be used instead.

            How and when is the SRB used ? Well, it is used for what it does : flushing the register set to memory and/or loading a new context. It can be used at any time, since it does not interfere with any hardware except the L/S unit. It is mainly used for context switches (the SRB can be triggered by an interrupt and the rest is done automatically) : to save a context when an interrupt is triggered, and to restore the registers after the IRQ routine has completed. In these cases (the flushing or loading of registers to/from memory by a store-many or load-many instruction is another case), there are two threads : the "old" thread and the "new" thread. The "new" thread is defined to start as soon as the SRB signal is triggered, and the SRB must save the old thread's registers before the new thread uses them, so as to ensure data coherency.

            Not only does the SRB remove the need to manually save and restore registers, but it does so faster than software (while the application still runs) and it adapts itself to the circumstances by reordering the backup sequence on the fly. It requires little additional hardware : data from the scoreboard (the "register's value is being computed" flag), unused clock cycles stolen from the memory L/S unit to load and store the registers, a few flags, some pointer registers and some logic. To know how to use all this, let's define some behaviour rules :
- (1) We can't save a register as long as its value is being computed. The scoreboard tells us which registers not to back up (yet). This status changes every cycle, so knowing the state of the scoreboard quickly is very important.
- (2) There's no need to save a register that has not been modified since the last backup. There's a "dirty" flag for this purpose, which is set whenever the register is written to.
- (3) We have a special "not yet saved" flag that says that the physical register must be saved before it is ready for use by the new thread. At the same time, this flag blocks the scoreboard so that it can issue an "express" request. This flag is loaded from the "dirty" bit when the SRB signal is detected, and the dirty bit is cleared for the new thread.
- (4) When the new thread needs to use (read or write) a register that has not been saved yet, it instructs the SRB sequencer to modify the order and waits for the register to be free. The scoreboard, which is queried by the instruction decoding unit, "blocks" the instruction until the value is ready, and the "save in priority" flag is set until the data is ready.
- (5) The SRB sequence is atomic : it can't be stopped unless there's a memory fault. A new SRB signal must wait for the previously issued SRB to be completely processed. Turning off the IRQs while the SRB is running avoids lost cycles (waiting for the previous SRB to complete while the previous handler is being executed). If an exception occurs during the SRB sequence, good news : we have already begun to save the registers :-) We only need to wait for the (old) sequence to complete before triggering a "new" SRB and executing the handler.
            Of course, this high number of flag bits can be condensed by using a Finite State Machine (this implementation detail is left to the designer). But the following algorithm doesn't need one : "at each cycle, write to memory the first register that 1) asks for an express backup, 2) is not yet saved (in decreasing priority), starting from register #1". The algorithm stops when no more registers need to be saved. When a context switch occurs, there are two memory accesses per register : one to save the old value and one to fetch the new thread's value. If one half of the thread's operations are loads or stores, this would use about one hundred cycles to save a context. With a single-issue pipeline and not much bandwidth, it can take about 200 cycles to perform a full context switch. The SRB is bandwidth-hungry, but a software backup would be too. At least, the SRB uses all the available hardware.
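
            Here is a minimal C model of this per-cycle decision (hypothetical names; the hardware makes the same choice with a find-first binary tree, described below, instead of loops) :

    #include <stdbool.h>

    #define NREGS 64

    struct srb_state {
        bool busy[NREGS];           /* scoreboard : value still in flight (1) */
        bool not_yet_saved[NREGS];  /* loaded from the dirty bits on SRB (3) */
        bool express[NREGS];        /* the new thread is blocked on it (4) */
    };

    /* Pick the register to save this cycle, or 0 when the sequence
     * is finished (register #0 is never saved : it always reads 0). */
    int srb_pick(const struct srb_state *s)
    {
        int r;
        for (r = 1; r < NREGS; r++)        /* express requests first */
            if (s->express[r] && !s->busy[r])
                return r;
        for (r = 1; r < NREGS; r++)        /* then the plain backups */
            if (s->not_yet_saved[r] && !s->busy[r])
                return r;
        return 0;                          /* nothing (left) to save */
    }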

            The SRB algorithm is a reorderable sequence of up to 63 register loads and/or stores, starting from register #1, so we need a way to extract this sequence from a line of flags. A "find first" unit, similar to (but simpler than) the binary tree used in the increment unit, can do this easily. At its input, it selects the "express" requests if there are any, or else the normal requests from the registers that need to be saved. The express flags are set by the scoreboard, and the "not yet saved" flags belong to the SRB mechanism. The output of the binary tree directly selects one register (out of 63) for reading and/or writing and resets that register's flags. Maybe a simple drawing speaks better :-)


figure 10 : Detail of one bit of the SRB flags and decision mechanism.
The (numbers) refer to the text.

            When no "express" request is made, the application has priority over the SRB for accessing the L/S unit. Otherwise, the "express" flag means that the instruction is blocked at ID stage and that no memory access is performed (unless a cache miss had just been resolved...). So, in any case, the SRB has never priority (which simplifies things).

            The SRB principle can be extended to multiple-issue processors without modification. A four-way superscalar F-CPU should, a priori, be able to reuse this part with no worries.

            As noted before, the CPU only considers two tasks : the "old" task and the "new" task. No assumption is made about where the data is transferred or about the addresses of the context buffers, therefore there is no limit on the number of tasks. The caches will perform their role of keeping data with temporal and spatial locality close to the core, so multithreaded programs will run normally. But the user can't always specify the addresses of these context buffers. The SRB mechanism has two pointers, SRB_old and SRB_new, that are used during SRB operation. After a context switch completes, the SRB_new pointer is copied into SRB_old so that, during the next context switch, only the new task needs to be provided to the CPU. It's up to the user to set up the desired task list. This "new" pointer could also be stored in the "old" task's context buffer, so as to perform round-robin operation automatically at each task-switch IRQ.
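
            For illustration only, the pointer handling around a switch could be summarized as follows (trigger_srb() stands for the whole hardware sequence and is a stub here) :

    #include <stdint.h>

    static uint64_t SRB_old, SRB_new;   /* special pointer registers (model) */

    static void trigger_srb(void)
    {
        /* hardware : save the old context to *SRB_old while
         * loading the new one from *SRB_new, as described above */
    }

    /* Only the next task's buffer must be provided; the "old"
     * pointer is maintained automatically for the following switch. */
    void context_switch(uint64_t next_context_buffer)
    {
        SRB_new = next_context_buffer;
        trigger_srb();
        SRB_old = SRB_new;
    }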

            Cache control hardware will probably allow "mapping" a certain memory area directly to the cache, so that no LRU operation can flush the data from the cache. This is where important tasks, such as the kernel and real-time tasks, should store their context buffers for better performance. Furthermore, if automatic context switching is implemented, it would first prefetch the whole context buffers (old and new) into the L1 cache before triggering the SRB mechanism. This prefetch would occur in the background so as not to penalize the foreground tasks.

[more to come...]

Tue Apr 25 04:14:55 CEST 2000 by Whygee
Copyright (c) 1999-2000 The F-CPU Group Design Team.