\part{Advanced topics}

\setcounter{chapter}{0}

 A superpipelined CPU core implies more than the use of variable-length pipelines.
Some characteristics of the FC0 and of the F-CPU in general are discussed here. They 
are not only "features" but design philosophies that follow from the choices
discussed in the first part of the document.

\chapter{The exceptions}

 A processor of any kind (CISC, RISC or any other architecture) generates a lot of 
exceptions, interrupts, traps and system calls (context switches are not the point here). 
Each pipeline stage can generate several errors that the OS must handle, after which 
the application must either "restart" the trapped instruction or continue after the trap. This implies 
that the whole context must be saved, but which context ?

~

 Control can be transferred to the OS, an interrupt handler or a trap handler at any time, at any 
stage of the pipeline. A classic RISC pipeline comprises, for example, the following stages (each with the faults it can generate) :

\begin{itemize}
 \item  IF (Instruction Fetch) : page fault
 \item  ID (Instruction Decode) : invalid instruction, trap instruction, privileged instruction.
 \item  EX (EXecute) : divide by zero, overflow, any IEEE FP math error...
 \item  MEM (MEMory access) : page fault, protection error
\end{itemize}

~

 Not only should the processor trigger the correct handler (because several errors can occur in 
the same cycle) but it must also preserve or flush the correct stages of the pipeline. And since 
the FC0 completes operations out of order (OOO), this would be too complex to do without a lot of buffers everywhere as
well as sophisticated bookkeeping, which we can't afford for obvious reasons. We need to keep 
precise exceptions anyway, along with the ability to stop the pipeline at any time without losing data 
that would require some code to be reexecuted. We need a simple and predictable yet efficient
pipeline whose architecture is not influenced by faults.

~

 The simplest answer to this problem is dictated by common sense : 
make all the exceptions occur in one place, before the potentially faulting instructions enter
the pipeline and require additional hardware. This means : \textit{NO INSTRUCTION IS ISSUED IF IT CAN 
TRIGGER AN EXCEPTION} or, in other words, \textit{ALL EXCEPTIONS MUST BE CHECKED AT DECODE TIME SO AS TO 
PREVENT THEM FROM OCCURRING IN THE EXECUTION PIPELINE}. Remember this clearly and meditate on it,
since it also influences how the instruction set is designed.

~

 The good side of this choice is that there is no "trap source" register as in the MIPS CPUs. 
All exceptions are caught in the same place and are disambiguated and ordered implicitly. 
Another important benefit is that there are no temporary buffers or "renamed registers", 
as they are called in the PowerPC. The previously described OOOC pipeline is not changed at all and the 
critical datapath does not suffer from additional buffers. There is no register allocation 
bookkeeping, nor any added control logic.

~

 The other side, the constraints, is discussed here. 
The most obvious limitations have simple workarounds. The first problem is : can we detect all the 
exceptions at decode time, and how ?

~

 \underline{First cause :} page fault at instruction fetch time.
First, we are not absolutely sure that we will even decode the next instruction, since the
last instruction of a page could be a jump or a similar instruction. So why trigger the trap 
now ? The easy workaround is to "tag" the instruction as faulting or, better, to replace
it with a trap instruction (which requires less hardware). So, if the instruction is 
executed, it will trap. Simple, isn't it ? Of course, if a page fault is triggered by the instruction 
prefetch unit, it is good practice to prefetch the necessary code directly, before it is needed,
just as a precaution.
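
The substitution described above can be sketched in a few lines of Python. This is only an illustrative model, not the FC0 fetch logic : the name \texttt{fetch}, the trap encoding, the 4 KiB page size and the dictionary standing in for memory are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only

# Hypothetical encoding of the substituted trap instruction.
TRAP_PAGE_FAULT = ("trap", "page_fault")


def fetch(address, present_pages, memory):
    """Return the instruction at `address`, or a trap instruction in its place.

    Nothing is raised at fetch time: the fault only takes effect if the
    substituted instruction is actually decoded and executed later.
    """
    page = address // PAGE_SIZE
    if page not in present_pages:
        return TRAP_PAGE_FAULT
    return memory[address]
```

If the last instruction of the present page is a taken jump, the substituted trap in the next page is simply never executed, so no fault is reported.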

~

 \underline{Second cause :} invalid instruction, privileged instruction...
No need to worry here : we simply trap. Depending on the type of trap, we advance the 
instruction pointer or not, fetch the code needed to handle it, and begin to back up the 
registers with the SRB mechanism. The preceding instructions don't need to be flushed from the 
pipeline, because the SRB communicates with the scoreboard to back up the registers in the 
correct order. Once the old application's instructions have drained "naturally" from the pipeline,
the registers will have been saved and the faulting application can restart later without any loss or 
reexecution.

~

 \underline{Third cause :} math fault.
The saturation (or overflow) exception (a la MIPS) is not implemented. The IEEE Floating Point 
instructions have a "compliance" flag that stalls instruction issue until the result is known to be "safe" ;
when the flag is not set, the result saturates and does not trigger any trap. The "division by zero" condition is 
easily detected at the decode stage with the ZERO property bit of the divisor register. At the same time,
we can detect whether the result will be zero and issue a "clear" operation instead of the divide operation.
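
As a rough illustration of the decode-time division check, here is a small Python model. The class and function names are hypothetical, and the rule used for "the result will be zero" (the dividend's ZERO bit is set) is an assumption; the text does not spell out how that case is detected.

```python
class RegisterFile:
    """Register set with one transparent ZERO property bit per register."""

    def __init__(self, count=64):
        self.values = [0] * count
        self.zero = [True] * count  # property bits, regenerated on every write

    def write(self, index, value):
        self.values[index] = value
        self.zero[index] = (value == 0)


def decode_divide(regs, dividend, divisor):
    """Decide at decode time what to issue for a divide instruction."""
    if regs.zero[divisor]:
        return "trap"    # division by zero is caught before issue
    if regs.zero[dividend]:
        return "clear"   # assumed rule: zero dividend gives a zero result
    return "divide"
```

Only the property bits are consulted at decode time; the register values themselves are never read, which is what keeps the check cheap.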

~

 \underline{Fourth cause :} page fault, invalid address fault.
We can consider that memory is protected at page granularity, so a page fault will 
trigger protection-checking code before the page is loaded. Detecting a page fault is very 
simple : we check the address against the values contained in a page table. If the address 
does not correspond to one of the available pages, it is a page fault : we trap.

~

 Now, the problem is to have the status (page present or not ?) at decode time. Let's be smart, 
because memory accesses are almost half of the executed instructions !

One solution is to use a mechanism similar to the ZERO "property" bit of each register. 
This means that when a value is written to the register set through the Xbar, some ports of the 
Xbar communicate the value to the page table. In one or two cycles, the data is ready for the ID 
stage. This is a speculative check that is transparent to the instruction set architecture. During 
this page check, we can also check the address range, verify whether the value is in the L1 cache
and, if so, indicate which bank it is in, prefetch the cache line, etc.
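
The idea of checking an address when the register is written, rather than when it is used, can be modelled as follows. This is a sketch under stated assumptions : the class name, the 4 KiB page size and the set of present pages standing in for the page table are all hypothetical, and the lookup latency of one or two cycles is not modelled.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class CheckedRegisters:
    """Registers carrying a transparent 'page present' bit next to the value."""

    def __init__(self, present_pages, count=64):
        self.present_pages = present_pages          # stands in for the page table
        self.values = [0] * count
        self.page_ok = [0 in present_pages] * count  # status of address 0

    def write(self, index, value):
        """Write a register and speculatively look its value up as an address."""
        self.values[index] = value
        self.page_ok[index] = (value // PAGE_SIZE) in self.present_pages

    def decode_load(self, pointer):
        """At decode time the status is already known: issue or trap."""
        return "issue" if self.page_ok[pointer] else "trap"
```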

~

 An obvious problem though is that we can't seriously check all the values flowing through the 
Xbar to the register set. Not only is this not always useful, but it also consumes power. The simplest way
(for the prototype) is to check the results of pointer updates, since they are the most likely to be 
reused soon as pointers.

~

 For more sophisticated architectures, another "transparent tag", saying that the register is used 
as a pointer, can be very useful. We can for example allow only a few registers to hold this tag, 
something like 16 (64/4 sounds reasonable), and this flag would be set each time a memory access is 
performed with this register. The flags would be allocated with an LRU mechanism using a 4-bit down 
counter. This way, when the ID stage recognizes a memory read/write instruction, it checks the pointer 
flag and, if it is set, sends the associated information to the L/S unit (information such as : in which 
L1 bank the data is, or in which buffer, etc.), or it traps if the page table lookup returned a 
negative result. If the pointer flag is not set, the ID stage pauses for a page table lookup and sets 
the pointer flag. Of course, like all transparent flags, these are not saved during context 
switches and are regenerated automatically as soon as the registers are used. In the absence of explicit 
flags in the instructions, this is a rather simple way to reduce the table lookup overhead, and 
the address can be checked BEFORE it is needed. The L/S unit is only in charge of buffering 
the data that flows to/from memory and caches. This last detail invalidates the drawing in 
figure 3, where the page table was stored in the L/S Unit.
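
The pointer-flag scheme can be sketched as a small cache of lookup results. The model below is only an illustration : the LRU order is kept with Python's \texttt{OrderedDict} instead of the 4-bit down counters mentioned above, and the \texttt{invalidate} method (dropping a flag when the register is overwritten) is an assumption that the text does not state explicitly.

```python
from collections import OrderedDict

MAX_POINTER_FLAGS = 16  # "64/4" in the text
PAGE_SIZE = 4096        # assumed page size, for illustration only


class PointerFlags:
    def __init__(self, present_pages):
        self.present_pages = present_pages  # stands in for the page table
        self.flags = OrderedDict()          # register -> cached lookup result
        self.lookups = 0                    # times the ID stage had to pause

    def invalidate(self, reg):
        """Assumed behaviour: a register write makes the cached result stale."""
        self.flags.pop(reg, None)

    def decode_memory_access(self, reg, address):
        if reg in self.flags:
            self.flags.move_to_end(reg)         # mark as most recently used
        else:
            self.lookups += 1                   # ID pauses for a table lookup
            if len(self.flags) >= MAX_POINTER_FLAGS:
                self.flags.popitem(last=False)  # evict the least recently used
            self.flags[reg] = (address // PAGE_SIZE) in self.present_pages
        return "issue" if self.flags[reg] else "trap"
```

A loop that walks memory through one or two pointer registers pays for the lookup once and then hits the flag on every later access.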

~

 With this, almost all exception causes are covered, and the workarounds for the problems caused by
the "exception-less" execution pipeline of the FC0 have been explained. There is
no visible impact on the ISA, but coding rules get tighter, as in a superscalar processor.
Other new exceptions will probably use the same idea as the existing 
ones : a dynamic flag. This way, programming the FC0 looks almost like programming a 
normal RISC CPU with some additional coding rules.
 
\chapter{The Smooth Register Backup mechanism}

 As described in the first section of the document, in the "64 registers" discussion, one alternative 
to register windowing, banked register sets or memory-to-memory architectures is to implement a 
"Smooth Register Backup" (SRB for short) for automatic register saving. It is not a usual feature in 
a microprocessor : it is characterized by its communication with the scoreboard and the 
use of a "find first flag" algorithm. The whole mechanism is rather simple, as described 
here (even though I seem to rant too much, so : read slowly, then reread more slowly).

Note : Depending on its actual use and usefulness, the SRB mechanism may be removed from the F-CPU 
with minimal impact on the overall architecture, instruction set and application software. Some drivers 
and kernels may then need additional, manual register backup code. Other similar techniques can also be 
used instead.

~

 How and when is the SRB used ? Well, it is used for what it does : 

Flush the register set to memory and/or load a new context.

It could be used at any time, since 
it does not interfere with other hardware except the L/S unit. 

It is mainly used for context switches (the SRB can be triggered by an interrupt and the rest is
done automatically), to save a context when an interrupt is triggered, and to restore the registers
after the IRQ routine has completed. In these cases (flushing or reading registers to/from 
memory with a load-many or store-many instruction is another case), there are two threads : the "old" 
thread and the "new" thread. The "new" thread is defined to start as soon as the 
SRB signal is triggered, and the SRB must save each register before the new thread uses it, so as
to ensure data coherency.
 Not only does the SRB remove the need to manually save and restore registers, but it does so faster
than software (while the application is still running) and adapts itself to the circumstances by reordering 
the backup sequence on the fly. It needs only a little additional hardware : it uses data from the scoreboard 
(the "register's value is being computed" flag), it steals unused clock cycles from the memory L/S unit
to load and store the registers, and it has a few flags, some pointer registers and some logic. To see how 
this is used, let's define some behaviour rules :

\begin{itemize}
\item  We can't save a register as long as its value is being computed. The scoreboard tells us 
       which registers not to back up (yet). This status changes at every cycle, so knowing the state 
       of the scoreboard quickly is very important.
\item  There's no need to save a register that has not been modified since the last backup. There
       is a "dirty" flag for this purpose, which is set whenever the register is written to.
\item  We have a special "not yet saved" flag that says that the physical register must be saved
       before it is ready for use by the new thread. At the same time, this flag blocks the scoreboard
       so that it can issue an "express" request. 
       This flag is loaded from the "dirty" bit when the SRB signal is detected, and the dirty bit is 
       cleared for the new thread.
\item  When the new thread needs to use (read or write) a register that has not been saved yet,
       it instructs the SRB sequencer to modify the order and waits for the register to be free.
       The scoreboard, which is queried by the instruction decoding unit, "blocks" the instruction
       until the value is ready, and the "save in priority" flag stays set until then.
\item  The SRB sequence is atomic : it can't be stopped unless there's a memory fault. A new SRB 
       signal must wait until the previous SRB signal has been completely processed. Turning 
       off the IRQs while the SRB is running avoids lost cycles (waiting for the previous SRB to 
       complete, while the previous handler is being executed). 
       If an exception occurs during the SRB sequence, good news : we have already begun to save
       the registers :-) We need to wait for the (old) sequence to complete before triggering 
       a "new" SRB and executing the handler.
\end{itemize}
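
The flag rules above can be summarized with a small Python model. It is a sketch only : the class and method names are hypothetical, the scoreboard's "being computed" flag is left out, and the "express" list stands for what the text also calls the "save in priority" flag.

```python
class SrbFlags:
    """Per-register flags used by the SRB rules (illustrative model)."""

    def __init__(self, count=64):
        self.dirty = [False] * count
        self.not_yet_saved = [False] * count
        self.express = [False] * count

    def write_register(self, reg):
        self.dirty[reg] = True  # set whenever the register is written to

    def srb_signal(self):
        # "not yet saved" is loaded from "dirty", which is then cleared
        # for the new thread.
        self.not_yet_saved = list(self.dirty)
        self.dirty = [False] * len(self.dirty)

    def new_thread_access(self, reg):
        """Return True if the instruction may issue, False if it is blocked."""
        if self.not_yet_saved[reg]:
            self.express[reg] = True  # ask the sequencer to save it first
            return False
        return True

    def saved(self, reg):
        """Called once the SRB has written the register to memory."""
        self.not_yet_saved[reg] = False
        self.express[reg] = False
```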

 Of course, this high number of flag bits can be condensed by using a Finite State Machine (this
implementation detail is left to the designer). But the following algorithm doesn't need one : 
"at each cycle, write to memory the first register that 1) asks for express backup or 2) is not yet 
saved (in decreasing priority), starting from register \#1". The algorithm stops when no more 
registers need to be saved. When a context switch occurs, there are two memory accesses per register : one for
saving the old register value and one for fetching the new thread's value. If one half of the 
thread's operations are loads or stores, saving a context would take about one hundred cycles.
With a single-issue pipeline and not much bandwidth, it can take about 200 cycles to perform a full 
context switch. The SRB is bandwidth-hungry, but a software backup would be too. At least the SRB makes
full use of the available hardware.
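
The quoted algorithm can be written out as a short loop. This Python sketch makes several simplifying assumptions : one register is saved per cycle, the scoreboard's "busy" flags do not change while the loop runs, and competition with the application for the L/S unit is ignored. The function names are hypothetical.

```python
def srb_select(express, not_yet_saved, busy):
    """Pick the register to save this cycle, or None if nothing can be saved.

    Express requests come first, then ordinary "not yet saved" registers,
    lowest register number first, starting from register 1. Registers whose
    value is still being computed (busy) are skipped.
    """
    for candidates in (express, not_yet_saved):
        for reg in range(1, len(candidates)):
            if candidates[reg] and not busy[reg]:
                return reg
    return None


def srb_run(express, not_yet_saved, busy):
    """Return the order in which registers are saved (flags are updated in place)."""
    order = []
    while True:
        reg = srb_select(express, not_yet_saved, busy)
        if reg is None:
            return order
        order.append(reg)
        express[reg] = False
        not_yet_saved[reg] = False
```

A register that is still being computed keeps its flag set, so it is picked up on a later pass once the scoreboard releases it.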

~

 The SRB algorithm is a reorderable sequence of up to 63 register loads and/or stores, 
starting from register \#1, so we need a way to extract this sequence from a line of flags. A "find 
first" unit, similar to (but simpler than) the binary tree used in the increment unit, can do this 
easily. At the input, it selects the "express" requests if there are any, or else the normal requests from the 
registers that need to be saved. The express flag is set by the scoreboard, and the "not yet saved"
flags belong to the SRB mechanism. The output of the binary tree directly selects one register 
(out of 63) for reading and/or writing and resets that register's flags. 
Maybe a simple drawing says more :-)

\begin{figure}[H]
  \begin{center}
    \includegraphics[width=15cm,draft=false]{srb1.eps}
    \caption{Detail of one bit of the SRB flags and decision mechanism}
  \end{center}
\end{figure}
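
The "find first" selection can also be expressed as a recursive binary tree, which is closer to how the hardware would be structured. The Python below is a behavioural sketch only; the function names are hypothetical and it says nothing about gate-level timing.

```python
def find_first(flags):
    """Index of the first set flag, or None, using a binary tree reduction."""
    if not flags:
        return None
    if len(flags) == 1:
        return 0 if flags[0] else None
    half = len(flags) // 2
    left = find_first(flags[:half])   # the lower half has priority
    if left is not None:
        return left
    right = find_first(flags[half:])
    return None if right is None else half + right


def select_register(express, not_yet_saved):
    """Use the express requests if there are any, else the normal requests."""
    requests = express if any(express) else not_yet_saved
    return find_first(requests)
```

Each level of the tree only needs to know whether its lower half contains a set flag, which is why the depth grows with the logarithm of the number of registers.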

 When no "express" request is made, the application has priority over the SRB for accessing the 
L/S unit. Otherwise, the "express" flag means that the instruction is blocked at the ID stage and 
that no memory access is being performed (unless a cache miss has just been resolved...). So, in either 
case, the SRB never has priority (which simplifies things).
The SRB principle can be extended to multiple-issue processors without modification. A four-way
superscalar F-CPU should a priori be able to reuse this part without trouble.
As noted before, the CPU considers only two tasks : 
the "old" and the "new" task. No assumption is made about where the data is transferred or about 
the address of the context buffers, so there is no limit on the number of tasks. The 
caches will play their role of keeping data with temporal and spatial locality close to the core,
so multithreaded programs will run normally. But the user can't always specify the address of 
these context buffers. The SRB mechanism has two pointers, SRB\_old and SRB\_new, that are used 
during SRB operation. After a context switch completes, the SRB\_new pointer is copied into 
SRB\_old so that for the next context switch, only the new task needs to be provided to the CPU.
It's up to the user to set up the desired list. This "new" pointer could also be stored in the 
"old" task's context buffer, so as to perform round-robin operation automatically at each task switch IRQ.
 Cache control hardware will probably make it possible to "map" a certain memory area directly to the cache,
so that no LRU operation will flush the data from the cache. This is where important tasks such as the 
kernel and real-time tasks should store their context buffers for better performance. Furthermore,
if automatic context switching is implemented, it would first prefetch both context buffers 
(old and new) into the L1 cache before triggering the SRB mechanism. 
This prefetch would occur in the background so as not to penalize the foreground tasks.
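
The handling of the two pointers, including the round-robin variant, can be modelled as follows. This is only a sketch : the \texttt{ContextBuffer} class, its \texttt{next} field and the \texttt{switch} method are hypothetical names, and the actual saving and loading of registers is left out.

```python
class ContextBuffer:
    def __init__(self, name):
        self.name = name
        self.registers = [0] * 64
        self.next = None  # hypothetical "new" pointer stored in the buffer


class SrbPointers:
    def __init__(self, current):
        self.srb_old = current  # context buffer of the running task
        self.srb_new = None

    def switch(self, new=None):
        """Perform a context switch and return the buffer that is now running.

        If `new` is None, the pointer stored in the old buffer is followed,
        which gives round-robin operation.
        """
        self.srb_new = new if new is not None else self.srb_old.next
        # ... the SRB would save to srb_old and load from srb_new here ...
        self.srb_old = self.srb_new  # copied once the switch has completed
        return self.srb_old
```

Linking three buffers in a ring and calling \texttt{switch()} repeatedly visits the tasks in turn, while passing an explicit buffer overrides the stored order.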
 
 
\chapter{The scheduler}

Managing several superpipelined units which can deliver their results at the same time looks
tricky at first. The following behavioural rules will help to understand what to do and when :

\begin{itemize}
 \item The Xbar "gates" of the 2 write ports must be controlled during every cycle,
 so the 2 read ports of the register set have the correct data coming from the correct unit.
 \item An instruction cannot be issued if it would cause more than two write ports to be used during
the cycle when it completes.
 \item If the instruction can be issued, it must use a "free" write port.
\end{itemize}

Let's remember too that the scoreboard rules apply. More specifically, it is not possible
to issue an instruction if the operands are not ready, in the register set or on the Xbar
(during a register bypass cycle, for back-to-back dependent instruction pairs). The scheduler
must also recognize this situation.
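
These rules amount to reserving a write port for the cycle in which an instruction will complete. The following Python sketch shows one way to model this; the class name, the reservation table indexed by completion cycle and the \texttt{operands\_ready} argument (standing in for the scoreboard check) are all assumptions made for the example.

```python
WRITE_PORTS = 2  # number of register set write ports, as in the rules above


class WritePortScheduler:
    def __init__(self):
        self.reserved = {}  # completion cycle -> number of write ports in use

    def try_issue(self, now, latency, operands_ready=True):
        """Reserve a write port for cycle now + latency.

        Returns the index of the reserved port, or None if the instruction
        cannot be issued this cycle.
        """
        if not operands_ready:   # scoreboard rule
            return None
        done = now + latency
        used = self.reserved.get(done, 0)
        if used >= WRITE_PORTS:  # both ports are taken in that cycle
            return None
        self.reserved[done] = used + 1
        return used              # index of the "free" port
```

An instruction that is refused simply retries on the next cycle, at which point its completion cycle has moved by one as well.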

~