Part 3 :
3.1 About the FC0 core
3.1.1 The FC0 is superpipelined
3.1.2 The FC0 core implements an out of order completion pipeline
3.1.3 The FC0 uses a scoreboard
3.1.4 The FC0 uses a crossbar
3.2 Evolution of the FC0
3.3 The FC0 Execution Units
3.3.1 The "ROP2" unit
3.3.2 The "bit scrambling" unit
3.3.3 The "increment" unit
3.3.4 The add/sub unit
3.3.5 The integer multiply unit
3.3.6 The integer divide unit
3.3.7 The Load/Store unit
3.3.8 Other units
3.1 About the FC0 core
Here, we speak about characteristics that are specific to the FC0 ("F-CPU Core #0") ; even though they influence the general definition of the F-CPU, they may be abandoned in the future. This is where the hardware engineer gets more involved.
3.1.1 The FC0 is superpipelined.
When designing a microprocessor, one of the first questions is : "what is the granularity of
the pipeline ?". This is not a critical issue for "toy processors" or designs that are adapted
from existing processors, but the F1 is not a toy and it must perform very well from the
first prototype on... In the F1 case, where the first prototype will probably be an FPGA or
an ASIC but not a full-custom chip, performance matters even more because the process will not
be able to compete with existing chips. Performance always matters anyway, but in our
case there is a strong technological handicap : we need a technique that reaches the same
"speed" with a slower technology.
So the equation is roughly : cycle time = delay of one transistor x number of transistors
in the critical datapath. With slow transistors, the only way to run fast is therefore
to shorten the critical datapath (as an approximate estimation, because other parameters
influence this). So now, what is the minimal operation we can perform without overloading
the chip with flip-flops ?
A depth of around ten transistors is a compromise between functionality and atomicity :
within it, we can create circuits of about six logic gates of depth, or add eight-bit numbers.
Care is taken to keep the "building blocks" simple and fast ; the good side is that
six logic gates are not enough to make complex things, while longer datapaths usually give birth
to complex problems. With this "limitation" in mind, we also limit complexity by allowing only
neighbour-to-neighbour connections between units. Furthermore, as soon as
a unit becomes too complex, it is either "parallelized" (a large lookup table can be used,
for example) or "serialized" (in other words, pipelined), so there is no need to
slow down the processor or use asynchronous technology.
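To make the speed equation above concrete, here is a tiny sketch (the numbers are purely
illustrative, not F1 specifications) :

    #include <stdio.h>

    int main(void)
    {
        /* illustrative numbers only : not F1 specifications */
        double transistor_delay_ns = 0.1; /* delay of one transistor     */
        int critical_path_depth    = 10;  /* transistors per stage (FC0) */

        double cycle_ns = transistor_delay_ns * critical_path_depth;
        printf("cycle time : %.2f ns -> %.0f MHz\n",
               cycle_ns, 1000.0 / cycle_ns);
        /* halving the depth doubles the frequency, at constant process */
        return 0;
    }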
The net effect of this bias toward extremely fine-grained logic and pipeline stages is that even an addition becomes "slow" because it needs more cycles than usual. This apparent slowness is compensated by higher performance through overlapping of the operations (pipelining) but requires the use of coding techniques usually found on superscalar processors (pointer duplication, loop unrolling and interleaving, etc.). Because the stages are shorter, there are more pipeline stages than usual ; that is why the FC0 can be considered superpipelined. But it is only one aspect of the project and, today, several other processors are superpipelined as well.
3.1.2 The FC0 core implements an
out of order completion pipeline to get more performance from a single-issue
pipeline. This is NOT a superscalar or out-of-order execution (or OOO
instruction issue) scheme but the "adaptation" of a simple pipelined CPU where
instructions are issued in order.
The fundamental reason behind this choice is that not all instructions really take the
same time to complete. This fact becomes more important in the F-CPU because it is
superpipelined, and one short instruction would be penalized by longer instructions that
lengthen the pipeline. For example, if we calibrate the pipeline length on
a 64-bit addition, then longer operations like division, multiplication or a memory access
with a cache miss will freeze the whole pipeline ; on the other hand, simple register-to-register
moves or simply writing an immediate value to a register will be much slower than actually
needed. This could be done on an early MIPS processor but not on a superpipelined processor.
Let's look at the instructions that need to be completed, after the decoding stage :

approximate cycles : 1                   2                3              4
------------------------------------------------------------------------------
write imm to reg :   write dest
load from memory :   read address        <access data : undetermined>    write dest
write to memory :    read address & data <access data>
logic operation :    read operands       operation        write result
arithmetic op. :     read operands       operation1       operation2     write result
move reg to reg :    read source         write dest.
We can also notice that successive instructions may be independent, not needing the
results of the preceding instructions. A last remark is that they don't all need the same
hardware. We can come to some conclusions : not all instructions need to read and write
registers or compute something, not all instructions complete at the same speed, and
some instructions may be much longer than others (for example, reading a memory location
with a cache miss, compared to a simple logic operation). We need a variable-sized
pipeline that allows several instructions to be performed and to finish at the
same time. One way to envision this is to consider the pipeline as "folded", or "forked"
like in a superscalar processor. But it all consists of three successive and optional
steps : reading the operands, processing them and writing the result.
- Reading the operands is not a problem since at most three registers may need to be read
in one cycle ; this is limited by the instructions themselves.
- Computing is fully pipelined and independent because specialized units process the data.
- Writing the results is a bit more complex because several operations can complete
at the same time. A one-cycle operation (a logic operation, for example) will complete
at the same time as a two-cycle (arithmetic) operation that was issued during the
preceding cycle.
For this last reason, the register set has at least two write buses. In case more than
two values must be written at the same time, the "oldest" instruction (earliest issued)
has priority.
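This "oldest first" arbitration can be sketched as follows (a minimal C model with
illustrative names and two write buses ; the real mechanism is wired logic, not software) :

    #include <stdint.h>
    #include <stddef.h>

    #define WRITE_BUSES 2

    typedef struct {
        int      valid;     /* a result is ready this cycle  */
        int      issue_age; /* lower value = issued earlier  */
        int      dest_reg;  /* destination register number   */
        uint64_t value;     /* result to be written          */
    } pending_write_t;

    /* Grant at most WRITE_BUSES writes per cycle, oldest first.
       The losers stay pending and retry on the next cycle. */
    static void arbitrate(pending_write_t *p, size_t n, uint64_t regs[64])
    {
        for (int bus = 0; bus < WRITE_BUSES; bus++) {
            pending_write_t *oldest = NULL;
            for (size_t i = 0; i < n; i++)
                if (p[i].valid && (!oldest || p[i].issue_age < oldest->issue_age))
                    oldest = &p[i];
            if (!oldest)
                break;                              /* nothing left to write */
            regs[oldest->dest_reg] = oldest->value; /* one write bus used    */
            oldest->valid = 0;
        }
    }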
This kind of processor core has the advantage that long operations don't slow down or block
the whole program as long as the result data are not needed before the operation is finished. For
example, a memory read can cause cache-miss delays, but this won't keep the other execution
units from doing their job and writing their results to the register set.
Of course, this puts some pressure on the compiler, but not more than other existing processors do, and careful coding has always paid off anyway.
The difference between OOO completion and OOO execution is that OOO execution CPUs can issue the operations out of order and need a last unit, called the "completion unit" or "retire unit", that validates the operations in program order. This also requires "renamed" registers that hold the temporary results before they are validated for good by the completion unit. All these "features" can be avoided by the techniques described in this document and, unlike in OOO execution processors (such as the PowerPC and P6 cores), the peak performance is not limited by the size of the completion unit's FIFO (the "ReOrdering Buffer", or ROB) but by the number of register ports.
3.1.3 The FC0 uses a scoreboard because it is
the simplest way to handle the out-of-order nature of the core. The way it works
is very simple : each register has a flag that is set while its result is being
computed, and an instruction is delayed until no flag is set for the registers it uses
for read or write. This way, strict coherency is ensured and no operation can conflict
with another at the execution stage : the verification of conflicts is done at only one
point.
These flags are not exactly like the "attribute" bits because they are not directly accessible
by the user but they have the same dynamic behaviour and are not saved or restored.
Because they don't occur often and are not critical for performance, write-after-write
situations are not checked by the scoreboard. The simple rule of blocking an instruction at the
decode stage if at least one of the registers it uses (for read or write) is not ready
is strictly enforced. Of course, register 0, which is hardwired to 0, is the only exception
and never blocks anything.
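A minimal C model of this decode-stage rule (the names and the three-register instruction
format are illustrative assumptions ; the real scoreboard is a hardware structure) :

    #include <stdbool.h>

    #define NREGS 64

    static bool busy[NREGS]; /* one "result pending" flag per register */

    /* Decode-stage check : block the instruction if any register it
       reads or writes is still being computed. Register 0 is hardwired
       to zero and never blocks. */
    static bool can_issue(int src1, int src2, int dest)
    {
        int regs[3] = { src1, src2, dest };
        for (int i = 0; i < 3; i++)
            if (regs[i] != 0 && busy[regs[i]])
                return false;   /* stall at the decode stage */
        return true;
    }

    /* On issue, flag the destination ; the execution unit clears the
       flag when it writes the result back through the Xbar. */
    static void issue(int dest)
    {
        if (dest != 0)
            busy[dest] = true;
    }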
The scoreboard interacts with the "Smooth Register Backup" mechanism to ensure coherency
when tasks are switched.
3.1.4 The FC0 uses a crossbar between the register
set and the execution units because :
- It is the easiest way to "fold" the pipeline.
- It provides a "one fits all" register bypass bus that shortens the latency between
dependent instructions.
- It reduces the number of register ports.
Because of its role, the crossbar (or Xbar for short) is a central part of the CPU.
The register set is only written or read through this device, which virtually provides it
with more than ten ports. The Xbar allows the execution units to communicate without the need
to write and read registers (in register bypass mode, when operations are dependent) ;
it provides the hardwired register 'zero' ; and the results are checked for zero through
two additional ports.
The Xbar extends the register set's read and write ports, making four "vertical" buses (see figure 2), and each of these vertical buses is connected to one of the input or output ports of each execution unit through "horizontal" buses. The Xbar also performs some width formatting (byte, word, etc.).
Because of its relatively high number of ports, the crossbar uses a lot of surface and transistors. It requires a cycle of its own to let the data flow through its whole length, and the budget of ten equivalent transistors is likely to be reached fast, because of both the transistor count and the wire lengths. Therefore, accessing a register takes two cycles from the time the register number has been decoded : one cycle for the register set and another for the Xbar. But when consecutive instructions are dependent, the result that is about to be written to a register is already present on the Xbar and can be used during the next cycle by the next operation ("register bypass").
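The benefit of the bypass can be summarized with a toy latency model (purely illustrative,
not part of the F-CPU specification) :

    #include <stdio.h>

    /* Toy model of operand-fetch latency in FC0 terms :
       - normal path : register set (1 cycle) + Xbar (1 cycle) = 2 cycles
       - bypass path : the producer's result is caught directly on the
         Xbar and is usable on the very next cycle.                     */
    static int operand_latency(int produced_on_previous_cycle)
    {
        return produced_on_previous_cycle ? 1 : 2;
    }

    int main(void)
    {
        printf("dependent, back-to-back : %d cycle(s)\n", operand_latency(1));
        printf("independent             : %d cycle(s)\n", operand_latency(0));
        return 0;
    }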
3.2 Evolution of the FC0
Discussion after discussion, the FC0 has taken a shape that makes it unique.
Because the change is gradual, and because there is not only one view of the processor
structure, there have been several drawings that show the internal organization
of the chip.
Figure 2 is the first drawing that shows the general shape of the FC0,
from the schematic, functional and implementation points of view. At that time,
the Xbar did not count for a full clock cycle in the pipeline. The memory
hierarchy was not designed and consisted of empty "units". The execution pipeline,
though, was almost determined and did not change much.
Figure 3 shows how the units that access the memory would be architected.
These are still at both extremities of the chip and require very long wires
to snoop for data/instruction access conflicts. The memory units are made explicit, though,
and consist of several cache-line buffers. A curious feature is that the address "fences"
(which store the base address and limit size of the blocks that a task is allowed to
access) are inside the memory units, while the TLBs are now outside of them.
The Xbar now takes a full clock cycle and is considered a full unit, and the execution
pipeline is refined. Due to the ongoing discussions, the register set had only
two read and two write ports ; the third read port was accepted later.
Figure 4 shows the current status of the FC0 as it is envisioned for the F1.
The memory units have been gathered so that the wires that drive the address and data
lines outside of the chip have a minimal length. They are symmetrically positioned
so that the tags of the cache-line buffers can all be compared in one simple unit
that decides and schedules the memory accesses. The data and instruction TLBs are separated
from the memory units because they are part of the pipeline, and should be placed
close to the decoding unit in order to signal an invalid pointer as soon as possible.
3.3 The FC0 Execution Units
For ease of development and scalability, to name a few reasons, the Execution Units
(EUs) are like LEGO bricks that add new computational capabilities to the processor.
Like the whole core, they are designed with a full-custom process in mind
but can be implemented with libraries (if they provide the corresponding functions),
in FPGA cells, or in whatever alien technology falls from the sky...
The minimal necessary EUs that have been considered so far are described here.
Several units can provide the same function (for example : shifting left by one is like
multiplying by two, or adding the number to itself), so the wisest habit is to check which
unit does what, in how many cycles and with which throughput, in order to pick the
best opcode for the desired operation in each context. Saving transistors
has not been a serious consideration ; more care has been taken to reduce the
critical datapath to the minimum possible.
Because of their different latencies, the EUs have not been packed into one
"one-fits-all" ALU. We can also pick one unit and think about it without caring about the
surrounding units. This way, we see that the hardware being designed provides
new, unexpected operations that can be used in the instruction set. Once the hardware is
in place, only a few additional logic gates provide useful operations that can spare
several instructions in application software.
3.3.1 The "ROP2" unit :
This is the classical "logic unit". Its purpose is to compute bit-to-bit operations.
Due to its simplicity, it has one cycle of latency and is among the fastest units.
Now, what operations will it execute ? With two inputs, there are 2^(2^2) = 16
possible operations, of which 8 are unique and useful :

CLEAR (set to 0) : equiv. to mov res, reg0
A AND B
A AND /B
A (do nothing)
/A AND B (similar to A AND /B above)
B (do nothing)
A XOR B
A OR B
A NOR B (NOT [A OR B])
NOT (A XOR B)
NOT B (do almost nothing)
A OR /B (NOT [/A AND B])
NOT A (do almost nothing)
/A OR B (similar to A OR /B)
A NAND B (NOT [A AND B])
SET to 1 (-1)
Some opcodes are duplicated (if we include operand commutativity), and others
are not "real" 2-operand operations (there are 1-operand and 0-operand operations).
We can include the 4 function bits directly in the opcode, but if we need room
and a cleaner architecture, we can save some opcodes by using "condensed" codes.
We select eight 2-operand operations, one 1-operand operation (NOT) and one 0-operand
operation (SET to all ones). The decoder can thus avoid reading unnecessary source
registers. The eight 2-operand operations are :

A AND B   A AND /B   A XOR B   A OR B   A NOR B   A XNOR B   A OR /B   A NAND B
So if we run out of opcodes, we can use this table to translate the opcode field
into the real computation code without really lengthening the critical datapath.
The necessary hardware for computing this function is rather small, maybe twenty (?)
transistors per bit (see figure 5).
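In software terms, the whole unit boils down to four function bits selecting, for each bit
pair, one row of the truth table. A C sketch (illustrative of the principle, not of the
actual gate netlist ; the function-bit encoding shown is an assumption) :

    #include <stdint.h>

    /* ROP2 as a per-bit truth-table lookup : the 4 function bits give
       the output for the input pairs (a,b) = (1,1),(1,0),(0,1),(0,0).
       e.g. f = 0x8 -> AND, 0x6 -> XOR, 0xE -> OR, 0x1 -> NOR, 0xF -> SET. */
    static uint64_t rop2(unsigned f, uint64_t a, uint64_t b)
    {
        uint64_t r = 0;
        if (f & 0x8) r |=  a &  b;
        if (f & 0x4) r |=  a & ~b;
        if (f & 0x2) r |= ~a &  b;
        if (f & 0x1) r |= ~a & ~b;
        return r;
    }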
There are probably a few other technical details left to discuss, but they are
too technology-dependent (the signal "tree" of the operation bus, for example).
This is the most straightforward element of the processor.
3.3.2 The "bit scrambling" unit :
The aim is to have a one-cycle shifting unit that
can do other things as well. As opposed to the ROP2 unit, it does
not change the value of the input data bits but it changes the position
of the bits. Therefore, shifting and rotating are only examples of
the intended purposes of this unit, sometimes called the "shuffling" unit : bit field extraction
and insertion, as well as bit and byte reversing and bit testing, are examples of what
this hardware is meant to perform.
There is a problem, though : the F-CPU will be a 64-bit processor
and a classical barrel shifter is an O(log2(n)) unit, which is
fairly close to the pipeline granularity. A shifting array
(a kind of transistor array) will be necessary to get to O(1),
at the price of more transistors and probably a higher transistor load,
but it is the only solution if we want to shift 128, 256 or 512
bits in one 10-transistor pipeline cycle.
During prototyping, we can use pre-synthesized hardware anyway.
This unit will also perform SIMD-specific operations like SIMD word expansion
and mixing.
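To illustrate why a classical barrel shifter is O(log2(n)), here is a C sketch of a 64-bit
rotation built from log2(64) = 6 conditional stages (a software picture of the structure ;
in hardware, each stage is one multiplexer level) :

    #include <stdint.h>

    /* 64-bit rotate-left built like a barrel shifter : six stages,
       each one conditionally rotating by a power of two, so the
       depth grows as log2(width). */
    static uint64_t rotl64(uint64_t x, unsigned amount)
    {
        for (unsigned stage = 0; stage < 6; stage++) {
            unsigned step = 1u << stage;        /* 1, 2, 4, 8, 16, 32 */
            if (amount & step)                  /* one mux level      */
                x = (x << step) | (x >> (64 - step));
        }
        return x;
    }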
3.3.3 The "increment" unit :
This is maybe the most curious unit, because it is not usually found
in normal CPUs. The reason for this dedicated unit is simple :
a lot of code adds or subtracts one, in loops for example.
This is unnecessary work for an adder if the second operand is one,
so let's hardwire it and run it faster. That was the first idea.
The method to increment a binary number is not complex to understand :
you scan the number starting from the LSB, inverting every bit until you
find a 0, then you turn this 0 into 1. The same scan is a "find the
first cleared LSB", so let's have that operation in the instruction set too.
In some cases it is very valuable, and there is no hardware overhead.
This makes two instructions.
So now that we can increment, we can also decrement : we just have to invert
each bit at the input and the output of the unit. This added hardware
also lets us find the "first set LSB". Four instructions. We can also
add a bit reverser at the input, so as to find the MSB versions too. Six instructions.
Let's go further : let's put a multiplexer at the end of the incrementer,
which is commanded by the sign bit of the input value. If the sign bit is
set, we set the output to -(n+1) (there is a bit of juggling to do with
the inverters but it's just a "technical detail"). With this unit, we can
compute the absolute value of a 2's-complement binary number. Seven instructions.
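A C sketch of the operations collected so far, using the "one incrementer, many inverters"
idea (software mimicry with hypothetical helper names, not the hardware itself) :

    #include <stdint.h>

    /* The core "incrementer" : in hardware, a carry chain that flips
       the trailing ones and stops at the first cleared bit. */
    static uint64_t inc(uint64_t n)     { return n + 1; }

    /* Decrement = inverters at the input and output of the same unit. */
    static uint64_t dec(uint64_t n)     { return ~inc(~n); }

    /* Isolate the first set LSB : feeding the inverted input makes the
       carry chain stop exactly where n has its first 1 (n & -n idiom). */
    static uint64_t lsb_set(uint64_t n) { return n & inc(~n); }

    /* Absolute value : the ending multiplexer selects n or -n,
       commanded by the sign bit. */
    static int64_t iabs(int64_t n)
    {
        return n < 0 ? (int64_t)inc(~(uint64_t)n) : n;
    }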
Now that we have these multiplexers at the input and the output of the 'incrementer',
we can do yet more things. Since the incrementer contains a "find first bit" binary tree,
we can use it to compare two numbers. The idea is simple : a (positive) number
is greater than another if at least one of its MSBs is set while the corresponding
bit of the other number is cleared :
1 > 0, 11 > 10...
So, just XOR the two input numbers, find the first MSB set, and AND the result with
one input number. If the result is cleared, then this number is lower than the other,
and vice versa. This makes eight instructions. Still better, we can use the ending
multiplexer to select one of the input values : we get the min and max
instructions, as well as derivatives like 'if reg1 > reg2 then reg1=reg2'
(for graphics, in coordinate clipping, or saturated arithmetic...). We can get more than
ten useful instructions out of this simple single-cycle unit ! Some are very useful because
they usually replace conditional branches (and pipeline stalls or branch mispredictions...).
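A C sketch of this comparison trick for positive (or unsigned) numbers (the find-first-MSB
step is written as a portable loop here instead of the hardware tree) :

    #include <stdint.h>

    /* Highest set bit of x, as a one-hot mask (0 if x == 0).
       In the FC0, this is the "find first bit" tree of the unit. */
    static uint64_t msb_set(uint64_t x)
    {
        for (int shift = 1; shift < 64; shift <<= 1)
            x |= x >> shift;    /* smear the top bit downwards */
        return x - (x >> 1);    /* keep only the highest one   */
    }

    /* a > b : XOR exposes the differing bits, the highest one decides ;
       AND it with a to see which operand owns that bit. */
    static int greater(uint64_t a, uint64_t b)
    {
        return (a & msb_set(a ^ b)) != 0;
    }

    /* The ending multiplexer then gives min and max for free. */
    static uint64_t max64(uint64_t a, uint64_t b) { return greater(a, b) ? a : b; }
    static uint64_t min64(uint64_t a, uint64_t b) { return greater(a, b) ? b : a; }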
From a purely abstract point of view, finding the first set bit
is done with a "binary tree", so the depth of the unit is O(log2(n))
with rather simple "nodes". This is almost a schoolbook case to design.
Anyway, as for the shifter array, I presume there will be some problems
fitting it into the pipeline's stage depth...
The problem of SIMD data has not yet been addressed in this unit.
3.3.4 The add/sub unit :
Using a carry-lookahead adder, it needs around two cycles to complete a 64-bit
addition or subtraction : it is an O(log2(n)) process with somewhat heavier
mechanisms than the incrementer, but it would compute an 8-bit add/sub
in one cycle. Therefore, SIMD with 8-bit data would be fast (1 cycle instead of
2). For these reasons, it would be difficult to use standard
pre-synthesized library elements, because of the variable-depth and SIMD
nature of this unit.
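The SIMD requirement means the carry chain must be cut at the lane boundaries. Here is the
idea in C for eight 8-bit lanes inside a 64-bit word (a software trick that mirrors what
the hardware does by gating the carries ; not the adder's actual structure) :

    #include <stdint.h>

    /* Eight independent 8-bit additions inside one 64-bit word.
       The MSB of each byte is masked off so no carry can cross a
       lane boundary ; the true MSB sums are patched back with XOR. */
    static uint64_t add8x8(uint64_t a, uint64_t b)
    {
        const uint64_t LO7 = 0x7f7f7f7f7f7f7f7fULL; /* bits 0..6 of each byte */
        const uint64_t MSB = 0x8080808080808080ULL; /* bit 7 of each byte     */
        uint64_t partial = (a & LO7) + (b & LO7);   /* carries stop at bit 7  */
        return partial ^ ((a ^ b) & MSB);           /* restore the MSB sums   */
    }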
3.3.5 The integer multiply unit :
Here, the same remarks as for the adder apply. There are SIMD constraints
and a variable-depth, fine-grained pipeline (depending on the width
of the input data). It will be difficult to find this kind of unit in
pre-synthesized libraries.
3.3.6 The integer divide unit :
Same as the multiplier. Notice, though, that a divide by zero can be caught at
decode time with the "zero" property flags. We can trigger a trap without issuing
the instruction.
3.3.7 The Load/Store unit :
This is a very special case because no actual computation is performed.
The latency is completely unknown at compile time, and there is the problem
of memory protection. If memory protection is ensured by other
mechanisms, the L/SU is simply a big cache buffer with a crossbar
that performs the word/endian selection. Notice that its structure is
similar to that of the instruction fetcher unit, but with a different granularity.
When there is no cache miss or buffer to flush, the data can be directly written
to or read from the buffer through the L/S crossbar, then sent to the main Xbar.
In the ideal case, there is no latency for memory writes and 1 cycle for memory
reads. The memory fetch logic tries to keep the buffers full when contiguous accesses
are performed.
The memory buffer can "cache" eight cache lines (the number of lines may vary).
It communicates with the external memory data bus, the data cache memory and the
main Xbar. This reduces the latency when recovering from cache misses, and simplifies the
cache memory organisation because the cache does not communicate directly with the memory :
the memory buffer (L/SU) is used to split the cache line into smaller chunks that
can be sent to the memory interface.
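The word/endian selection step can be sketched as follows (illustrative assumptions :
32-byte cache lines, aligned 64-bit loads, little-endian host ; none of this is fixed by
the F-CPU specification) :

    #include <stdint.h>
    #include <string.h>

    /* Select one aligned 64-bit word out of a cache-line buffer and
       optionally byte-reverse it, as the L/SU crossbar would do. */
    static uint64_t load64(const uint8_t line[32], unsigned offset, int big_endian)
    {
        uint64_t w;
        memcpy(&w, line + (offset & 24), 8); /* aligned word selection */
        if (big_endian) {                    /* endian conversion      */
            uint64_t r = 0;
            for (int i = 0; i < 8; i++)
                r = (r << 8) | ((w >> (8 * i)) & 0xff);
            w = r;
        }
        return w;
    }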
3.3.8 Other units :
Floating-point numbers have not been discussed yet, because we'd better have something
working in the integer domain first ; FP hardware and instructions will be added later.
Math exceptions will probably be managed with the same kind of mechanism
as the "zero" property flag, so no error will break the execution pipeline flow.
One "cheap" way to avoid the use of floating point numbers is by using the logarithmic
number base. Recent works succeeded in making a 32-bit logarithmic adder with descent
speed and die space use. Any other operation (SQRT, SQR, multiply, divide...) can be
performed by existing hardware (maybe slightly modified for the MSB). The conversion
between integers and log numbers will be a rather heavy software task, as long as no
hardware exists.
When FP hardware becomes available, only the add/sub and multiply units will be
implemented at first. Any other mathematical operation (including division) will
be computed with a Newton-Raphson approximation algorithm in software. A third
unit will provide the "seed" from hardwired ROM tables.
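For illustration, here is the Newton-Raphson reciprocal iteration in C ; the linear seed
stands in for the hardwired ROM table (24/17 - (8/17)a is a classical first guess for a
mantissa normalized in [1,2), an assumption of this sketch) :

    #include <stdio.h>

    /* Newton-Raphson reciprocal : x' = x * (2 - a*x).
       Each step roughly doubles the number of correct bits,
       and only needs multiplies and subtractions. */
    static double nr_reciprocal(double a)   /* a in [1,2) */
    {
        double x = 24.0 / 17.0 - (8.0 / 17.0) * a;  /* the "seed"    */
        for (int i = 0; i < 4; i++)                 /* 4 steps reach */
            x = x * (2.0 - a * x);                  /* double precision */
        return x;
    }

    int main(void)
    {
        printf("1/1.5 = %.17g\n", nr_reciprocal(1.5)); /* 0.6666... */
        return 0;
    }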
figure 2 : The first F-CPU chip proposal.
figure 3 : A more precise, first-attempt F-CPU description.
figure 4 : A third F-CPU description.
figure 5 : Detail of the ROP2 unit.
figure 6 : Overview of the Scrambling unit