\chapter{The Smooth Register Backup mechanism}

 As described in the first section of the document, in the "64 registers" discussion, one alternative 
to register windowing, banked register sets or memory-to-memory architectures is to implement a 
"Smooth Register Backup" (SRB for short) for automatic register saving. It is not a usual feature in 
a microprocessor, because it relies on communication with the scoreboard and on 
a "find first flag" algorithm. The whole mechanism is rather simple, as we will describe 
here (even though I seem to rant too much, thus: read slowly, then reread more slowly).
Note: Depending on its actual use and usefulness, the SRB mechanism may be removed from the F-CPU 
with minimal impact on the overall architecture, instruction set and application software. Some drivers 
and kernels would then need additional, manual register backup code. Other similar techniques can also be 
used instead.

~

 How and when is the SRB used? Well, it is used for what it does: 

Flush the register set to memory and/or load a new context.

It could be used at any time, since 
it does not interfere with other hardware except the L/S unit. 

It is mainly used for context switches (the SRB can be triggered by an interrupt and the rest is
done automatically), to save a context when an interrupt is triggered, and to restore the registers
after the IRQ routine has completed. In these cases, there are two threads: the "old" 
thread and the "new" thread. Flushing or reloading registers to/from 
memory with a load-many or store-many instruction is an exception, though.

 The "new" thread is defined to start as soon as the 
SRB signal is triggered, and the SRB must save the old thread's registers before the new thread uses them, so as
to ensure data coherency.

 Not only does the SRB remove the need to manually save and restore registers, but it does it faster
than software (while the application still runs) and adapts itself to the circumstances by reordering 
the backup sequence on the fly. It needs only a little additional hardware: it uses data from the scoreboard 
(the "register's value is being computed" flag), it steals unused clock cycles from the memory L/S unit
to load and store the registers, and it has a few flags, some pointer registers and some logic. To see how 
this works, let's define some behaviour rules:

\begin{itemize}
\item  We can't save a register as long as its value is being computed. The scoreboard tells us 
       which registers not to back up (yet). This status changes at every cycle, so knowing the state 
       of the scoreboard quickly is very important.
\item  There's no need to save a register that has not been modified since the last backup. There
       is a "dirty" flag for this purpose, which is set whenever the register is written to.
\item  We have a special "not yet saved" flag that says that the physical register must be saved
       before it is ready for use by the new thread. At the same time, this flag blocks the register
       in the scoreboard, so that the scoreboard can issue an "express" request. 
       This flag is loaded from the "dirty" flag when the SRB signal is detected, and the "dirty" flag is 
       cleared for the new thread.
\item  When the new thread needs to use (read or write) a register that has not been saved yet,
       it instructs the SRB sequencer to modify the order and waits for the register to be free.
       The scoreboard, which is queried by the instruction decoding unit, "blocks" the instruction
       until the value is ready, and the "save in priority" flag stays set until the data is ready.
\item  The SRB sequence is atomic: it can't be stopped unless there's a memory fault. A new SRB 
       signal must wait until the previously issued SRB signal has been completely processed. Turning 
       off the IRQs while the SRB is running avoids lost cycles (waiting for the previous SRB to 
       complete, while the previous handler is being executed). 
       If an exception occurs during the SRB sequence, good news: we have already begun to save
       the registers :-) We need to wait for the (old) sequence to complete before triggering 
       a "new" SRB and executing the handler.
\end{itemize}
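
 The rules above can be sketched as a small behavioural model. This is only an illustration, not the actual hardware: the class \texttt{SrbFlags}, its method names and the choice of 64 registers with register 0 excluded from the backup are assumptions made for this example.

```python
NUM_REGS = 64  # assumption: 64 registers, register 0 is never backed up


class SrbFlags:
    """Hypothetical model of the per-register SRB flag bits."""

    def __init__(self):
        self.dirty = [False] * NUM_REGS          # written since the last backup
        self.not_yet_saved = [False] * NUM_REGS  # still holds old-thread data
        self.express = [False] * NUM_REGS        # the new thread waits for it

    def write(self, reg):
        """A register write sets the dirty flag."""
        self.dirty[reg] = True

    def trigger(self):
        """SRB signal: load not_yet_saved from dirty, then clear dirty."""
        if any(self.not_yet_saved):
            # A new SRB must wait for the previous sequence to complete.
            raise RuntimeError("previous SRB sequence still running")
        for reg in range(1, NUM_REGS):
            self.not_yet_saved[reg] = self.dirty[reg]
            self.dirty[reg] = False

    def may_issue(self, reg):
        """Decode-time check when the new thread wants to use reg."""
        if self.not_yet_saved[reg]:
            self.express[reg] = True  # ask the sequencer to save it first
            return False              # the instruction stays blocked
        return True

    def mark_saved(self, reg):
        """The sequencer has written reg to memory."""
        self.not_yet_saved[reg] = False
        self.express[reg] = False
```

 In this sketch, an instruction of the new thread that touches an unsaved register gets \texttt{False} from \texttt{may\_issue} and raises the express flag as a side effect; it can proceed once \texttt{mark\_saved} has been called for that register.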

 Of course, this large number of flag bits can be condensed by using a Finite State Machine (this
implementation detail is left to the designer). But the following algorithm doesn't need one: 
"at each cycle, write to memory the first register that 1) asks for express backup or 2) is not yet 
saved (in decreasing priority), starting from register \#1". The algorithm stops when no more 
registers need to be saved. When a context switch occurs, there are two memory accesses per register, one for
saving the old register value and one for fetching the new thread's value. If one half of the 
thread's operations are loads or stores, it would take about one hundred cycles to save a context.
With a single-issue pipeline and not much bandwidth, it can take about 200 cycles to perform a full 
context switch. The SRB is bandwidth-hungry, but software backup would be too. At least the SRB uses
all the available hardware, while a software solution requires yet more bandwidth (because of the
explicit backup code).
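
 As a rough behavioural sketch of the selection rule quoted above (not a description of the real logic), the function below picks the register to save in a given cycle. The names \texttt{select\_register} and \texttt{drain} are made up for this example, and the sketch assumes that the L/S unit is free at every cycle and that the scoreboard "busy" flags do not change while draining.

```python
def select_register(express, not_yet_saved, busy):
    """Pick the register to save this cycle, or None if none can be saved.

    The three arguments are lists of booleans indexed by register number.
    Register 0 is skipped, since the sequence starts from register #1.
    """
    candidates = [r for r in range(1, len(not_yet_saved))
                  if not_yet_saved[r] and not busy[r]]
    for r in candidates:       # 1) express requests come first
        if express[r]:
            return r
    # 2) otherwise the lowest-numbered register that is not yet saved
    return candidates[0] if candidates else None


def drain(express, not_yet_saved, busy):
    """Save one register per cycle until none is left; return the order."""
    order = []
    while any(not_yet_saved[1:]):
        reg = select_register(express, not_yet_saved, busy)
        if reg is None:
            # Simplification: busy flags are frozen in this sketch.
            raise RuntimeError("remaining registers are still being computed")
        not_yet_saved[reg] = False
        express[reg] = False
        order.append(reg)
    return order
```

 With registers 1, 2, 4 and 6 pending and an express request on register 4, this sketch saves register 4 first and then falls back to the ascending order 1, 2, 6.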

~

 The SRB algorithm is a reorderable sequence of 63 register loads and/or stores, 
starting from register \#1: we need a way to extract this sequence from a line of flags. A "find 
first" unit, similar to (but simpler than) the binary tree used in the increment unit, can do this 
easily. At the input, it selects the "express requests" if any, or the normal requests from the 
registers that need to be saved. The express flag is set by the scoreboard, and the "not yet saved"
flags belong to the SRB mechanism. The output of the binary tree directly selects one register 
(out of 63) for reading and/or writing and resets the register's flags. 
Maybe a simple drawing says more :-)

\begin{figure}[H]
  \begin{center}
    \includegraphics[width=15cm,draft=false]{srb.eps}
    \caption{Detail of one bit of the SRB flags and decision mechanism}
  \end{center}
\end{figure}
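
 For readers who prefer code to drawings, here is a minimal software sketch of a "find first" binary tree with the input selection in front of it. It only mimics the structure described above; the function names and the use of 64-entry flag lines (with entry 0 left unused) are assumptions of this example.

```python
def find_first(flags):
    """Return (found, index) of the lowest set flag, using a binary tree.

    Each level combines the results of the lower and upper halves:
    the lower half wins whenever it contains a set flag.
    The flag line must not be empty.
    """
    n = len(flags)
    if n == 1:
        return bool(flags[0]), 0
    half = n // 2
    lo_found, lo_idx = find_first(flags[:half])
    hi_found, hi_idx = find_first(flags[half:])
    if lo_found:
        return True, lo_idx
    return hi_found, half + hi_idx


def srb_select(express, not_yet_saved):
    """Input selection: express requests if any, else normal requests."""
    line = express if any(express) else not_yet_saved
    found, idx = find_first(line)
    return idx if found else None
```

 The recursion depth grows with the logarithm of the number of flags, which is what makes a tree attractive compared to a linear scan over the whole flag line.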

 When no "express" request is made, the application has priority over the SRB for accessing the 
L/S unit. Otherwise, the "express" flag means that the instruction is blocked at the ID stage and 
that no memory access is performed (unless a cache miss has just been resolved...). So, in any 
case, the SRB never has priority (which simplifies things).
The SRB principle can be extended to multiple-issue processors without modification. A four-way
superscalar F-CPU should a priori be able to reuse this part as is.

As noted before, there are only two tasks that the CPU considers: 
the "old" and the "new" task. No assumption is made about where the data is transferred or about 
the address of the context buffers, therefore there is no limitation on the number of tasks. The 
caches will perform their role of keeping data with temporal and spatial locality close to the core,
so multithreaded programs will run normally. But the user can't always specify the address of 
these context buffers. The SRB mechanism has two pointers, SRB\_old and SRB\_new, that are used 
during SRB operation. After the context switch completes, the SRB\_new pointer is copied into 
SRB\_old so that during the next context switch, only the new task needs to be provided to the CPU.
It's up to the user to set up the desired list. This "new" pointer could also be stored in the 
"old" task's context buffer, so as to perform round-robin operation automatically at each task switch IRQ.

 Cache control hardware will probably allow a certain memory area to be "mapped" directly to the cache,
so that no LRU operation will flush the data from the cache. This is where important tasks such as the 
kernel and real-time tasks should store their context buffers for better performance. Furthermore,
if automatic context switching is implemented, it would first prefetch the whole context buffers 
(old and new) into the L1 cache before triggering the SRB mechanism. 
This prefetch would occur in the background so as not to penalize the foreground tasks.
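
 The handling of the two pointers, including the round-robin variant, can be sketched as follows. This is a hypothetical model: the class \texttt{SrbPointers}, the function \texttt{round\_robin\_irq} and the dictionary standing in for the link stored in each context buffer are assumptions made for the example, and the register transfer itself is left out.

```python
class SrbPointers:
    """Hypothetical model of the SRB_old and SRB_new pointer registers."""

    def __init__(self, srb_old, srb_new=None):
        self.srb_old = srb_old  # context buffer of the running task
        self.srb_new = srb_new  # context buffer of the task to switch to

    def finish_switch(self):
        """After the switch completes, SRB_new is copied into SRB_old."""
        self.srb_old = self.srb_new


def round_robin_irq(ptrs, next_link):
    """Task switch IRQ using the link stored in the old context buffer.

    next_link maps a context buffer address to the address of the buffer
    of the task that should run next (an assumption of this sketch).
    """
    ptrs.srb_new = next_link[ptrs.srb_old]
    # ... the SRB would save to srb_old and load from srb_new here ...
    ptrs.finish_switch()
    return ptrs.srb_old
```

 With three context buffers linked in a ring, successive calls to \texttt{round\_robin\_irq} visit each task in turn without the software having to reload SRB\_new by hand.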

~

Note that in the case where the FC0 core has two independent private memory ports,
the register bank switch process can be eased if the banks are mapped to different
memory regions. For example, if the register bank is flushed to one port, the new
register bank benefits from being read from the second port. There will be far fewer
bus turnaround cycles and the switch latency will decrease.