Part 5 :
5.1 Designing an instruction set
5.2 Instruction formats
5.3 The ISA modularity
5.4 The 2r1w format and its extensions
5.5 Flags
5.5.1 Size flags
5.5.2 SIMD flag
5.5.3 IEEE flag
5.5.4 saturate/carry flag
5.5.5 other flags / reserved fields
Once the most fundamental features and characteristics of the CPU have been agreed upon, it is then necessary to define the instruction set.
For the F-CPU, this is not completely straightforward, even though the architecture is rather simple and does not include big innovations. The real problem lies in the iterative way things are decided and integrated in the CPU. The Instruction Set Architecture (ISA) faces many constraints, and the need to evolve is the greatest. The ISA determines many characteristics for the future because one can't swap it like a CPU on a socket. Since so many characteristics determine the lifespan of the whole architecture and project, all the information disclosed here must be considered temporary and may change without notice. In practice, the ISA will be defined slowly, after each simulation cycle from which one can draw conclusions about the usefulness and necessity of a particular opcode or flag.
So, the instruction set will change often and evolve a lot before it is completely defined by the group. Some changes may even take place after the first prototypes or chips are built. Therefore, the current ISA is not completely defined at the time of writing, and several tricks are used to ease its development.
First, the instruction word itself, which is 32 bits wide, must be used flexibly. The instructions that the F-CPU will execute require a variable number of operands and flags. They are gathered in the middle of the word so that the bit field allocation is easier. The opcode (an 8-bit field that defines which instruction it is) is situated at one end of the word (in the LSB) and the destination register is at the other end (the MSB). The immediate data width can be 8 or 16 bits, and we can include one or two other register operands. The remaining room is filled by the flags; when there is not enough room, the flags can be merged with the instruction's opcode or the immediate data field can be narrowed. When there is still some room, we can extend the immediate data field (even though the flags usually try to use as much space as possible).
We design the instruction set by taking a census of all the necessary instructions and the forms they use. The width of the immediate field is not defined yet; it is left to the final synthesis. When we have summed up all the necessary instruction forms, we will allocate the fields. They will be placed according to their functions, and all similar functions will be grouped. This is very simple for the register fields but less easy for the flags. The size of the immediate data fields will be determined once all the other fields have been allocated.
The second trick optimizes the opcode map. Of course, there will be a lot of room in it for future opcodes. But even once the opcode count is known, the opcode values can be redefined until the final prototype is made. This means that before F1 comes out, binary compatibility is uncertain, but the opcodes will be defined with include files in the simulators and the emulators. This leaves all the necessary room to "allocate" the opcode values at the last moment and optimize them to simplify the instruction decoding logic. At any time, however, compatibility is kept at the source level in the assembly language files. Only their encoding can change during the development.
This methodology allows the group to work with early implementations of the chip and synthesise the instruction set before it comes out. No arbitrary decision is made because every feature will be analyzed and discussed by the group.
The F-CPU is a RISC-like processor with 32-bit wide instructions. The opcode field is 8 bits wide, each register requires a 6-bit field, and the remaining space is used for immediate values and flags. The following (preliminary) tables show how the fields are organized.
Notice that the opcode field is in the Least Significant Bits but the most used register operand is in the Most Significant Bits. Therefore, by convention and for consistency, the assembly language syntax follows the instruction's structure and writes the operands in this order: first the opcode, optionally followed by the flags, the immediate values and the source operands, and finally the destination register. For example:
add.b r1,r2,r3 ; adds the bytes in the lower part of r1 and r2, result put in r3.
The instruction formats are:
size     : |   8    |   6   |   6   |   6   |   6   |
bits     : | 0    7 | 8  13 | 14 19 | 20 25 | 26 31 |
function : | Opcode | Flags | Reg 3 | Reg 2 | Reg 1 |

size     : |   8    |  12   |   6   |   6   |
bits     : | 0    7 | 8  19 | 20 25 | 26 31 |
function : | Opcode | Flags | Reg 2 | Reg 1 |

size     : |   8    |   4   |   8   |   6   |   6   |
bits     : | 0    7 | 8  11 | 12 19 | 20 25 | 26 31 |
function : | Opcode | Flags | Imm 8 | Reg 2 | Reg 1 |

size     : |   8    |   2   |   16   |   6   |
bits     : | 0    7 | 8   9 | 10  25 | 26 31 |
function : | Opcode | Flags | Imm 16 | Reg 1 |
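The four preliminary layouts can be exercised with a small packing and unpacking sketch. This is only an illustration, not part of the F-CPU tools: the format names (`3reg`, `2reg`, `imm8`, `imm16`), the helper functions and the opcode value used in the usage note are hypothetical, and the code assumes that bit 0 is the least significant bit of the word, as the text states for the opcode field.

```python
# Field layouts as (name, lowest bit, width), copied from the preliminary tables.
# Assumption: bit 0 is the least significant bit of the 32-bit instruction word.
# The format names below are hypothetical labels, not official F-CPU names.
FORMATS = {
    "3reg":  [("opcode", 0, 8), ("flags", 8, 6),  ("reg3", 14, 6),
              ("reg2", 20, 6), ("reg1", 26, 6)],
    "2reg":  [("opcode", 0, 8), ("flags", 8, 12), ("reg2", 20, 6),
              ("reg1", 26, 6)],
    "imm8":  [("opcode", 0, 8), ("flags", 8, 4),  ("imm", 12, 8),
              ("reg2", 20, 6), ("reg1", 26, 6)],
    "imm16": [("opcode", 0, 8), ("flags", 8, 2),  ("imm", 10, 16),
              ("reg1", 26, 6)],
}


def encode(fmt, **fields):
    """Pack the named fields into a 32-bit word; missing fields default to 0."""
    word = 0
    for name, low, width in FORMATS[fmt]:
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << low
    return word


def decode(fmt, word):
    """Split a 32-bit word back into its named fields."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, low, width in FORMATS[fmt]}
```

For example, `encode("3reg", opcode=0x12, reg3=3, reg2=2, reg1=1)` places a made-up opcode 0x12 in bits 0 to 7 and the three register numbers in their 6-bit fields, and `decode("3reg", ...)` recovers them. Every layout adds up to exactly 32 bits.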
It is very tempting to use a 2-bit opcode prefix to identify the instruction formats but this idea should be left for a later opcode compilation.
The F-CPU instruction set is modular: it contains a ``core'' and several ``optional'' instruction groups. An optional instruction performs an operation that would otherwise require several core instructions. The presence of these optional instructions can be detected at run time through the indications contained in a set of hardwired Special Registers.
It must be understood that the ``core'' instruction set is meant to provide minimal binary compatibility across different implementations. Any chip can hardwire one or more ``optional'' instructions independently of other considerations. This depends on the needed performance, the target application, the available technology and the algorithms.
What is core and what is optional? As a rule of thumb, the optional instructions cover "features" that usually require more hardware or more complex circuits. For example, the SIMD capability is recommended but not mandatory because a SIMD arithmetic unit is more complex than a scalar unit. The increment-based instructions, the floating-point instructions, the logarithmic instructions and the SRB management instructions are enabled when the corresponding Execution Unit or functionality is implemented. It is possible to implement a truly minimal F-CPU and extend it by adding the desired instructions and Execution Units, leaving opcodes unused when there are not enough transistors.
On the other hand, it is recommended that most of the integer instructions and the SIMD functions are implemented because they provide the most important features for the future.
The F-CPU increases the MOPS/MIPS ratio of its architecture by breaking the golden rule of the 2 register reads and 1 register write instruction limitation of the classic RISC architectures. Several instructions of the F-CPU need more than one register to be written back to the register set, some others need three register operands to be read. Those "non-RISC" instructions are marked as 3r1w or 2r2w in this document, as they might influence the coding rules of future F-CPU implementations. They probably require a special bit in the opcode to simplify decoding. Their support is optional (non-core) yet necessary for the load and store instructions with pointer update.
The instructions share a certain number of properties, which are put in ``flags'' outside of the opcode field. While their position can change in the future, their meaning will roughly remain the same throughout all the processor generations.
The flags do not change the syntax of the instructions much. Each flag adds one letter to the existing mnemonic, so one can always recognize the instruction. This avoids the proliferation of obscure mnemonics and the need to remember them all. On the other hand, the size of the mnemonics is variable and can range from two (or) to nine (sshiftrai) letters; the mnemonics will probably be reorganized later to shorten the longest ones. Usually, the flag letters are added in the order in which they appear in the instruction word.
In some opcodes the flags can contain a ``size'' parameter that defines the size of the operand on which the operation should take place. By default, this flag is decoded according to the following table:
Flags | Size (bytes) | Suffix | Name             |
00    | 1            | B      | Byte             |
01    | 2            | D      | Double-Byte      |
10    | 4            | Q      | Quad-Byte        |
11    | 8            | (none) | Octa-Byte (Word) |
In the F-CPU assembly language, the size flag is written as a postfix on the opcode: either ``.b'', ``.d'', ``.q'', or a plain number when the current settings don't provide the needed size. In the absence of a size postfix, the flag is set to ``11''. If the CPU is a 32-bit only version, the ``11'' code is mapped to ``10'' (32 bits), so ``11'' always selects the largest word supported by the machine.
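The default decoding of the size flag can be sketched as follows. The function names and the `max_bytes` parameter are hypothetical; the sketch covers only the default table and the ``.b'', ``.d'', ``.q'' postfixes, not the plain-number postfix or any reinterpretation of the flag through special registers.

```python
# Default size-flag table from the text: 2-bit flag -> operand size in bytes.
SIZE_BYTES = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}

# Assembly postfix -> flag value; no postfix means "11".
SUFFIX_TO_FLAG = {"b": 0b00, "d": 0b01, "q": 0b10, "": 0b11}


def operand_size(flag, max_bytes=8):
    """Operand size in bytes for a size flag.

    max_bytes=4 models a 32-bit only CPU, where "11" behaves like "10".
    """
    if flag not in SIZE_BYTES:
        raise ValueError(f"size flag must be 0..3, got {flag}")
    return min(SIZE_BYTES[flag], max_bytes)


def size_flag_from_mnemonic(mnemonic):
    """Extract the size flag from a mnemonic such as 'add.b' or 'add'."""
    _, _, suffix = mnemonic.partition(".")
    try:
        return SUFFIX_TO_FLAG[suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported size postfix: {suffix!r}") from None
```

With these helpers, `operand_size(size_flag_from_mnemonic("add.d"))` gives 2 bytes, while `operand_size(0b11, max_bytes=4)` gives 4 bytes on a 32-bit only machine.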
When the data width of the CPU increases, the processor can change the interpretation of this flag with a set of special registers. This allows the F-CPU platform to handle any data width that is a power of two, above 32 bits. The SIMD words and algorithms will scale up in a straightforward fashion to 128 bits, 256 bits, 512 bits, 1024 bits, etc.
The F-CPU is a SIMD-oriented processor. Most instructions operating on data can specify whether these data are treated as a whole or in individual chunks. The SIMD flag, along with the size flag, specifies how the data are treated.
When the SIMD flag is not set, the CPU behaves like a normal processor, treating each register depending on the size flag. The whole register, or only the lower part, is treated.
When the SIMD flag is set, the CPU treats the whole register in its full width and the size flag defines the size of the individual chunks inside this large word.
Syntactically, in the F-CPU assembly language, the SIMD flag is written as an ``s'' prefix on the opcode, in a similar fashion to the leading ``f'' for the floating-point operations.
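As an illustration of how the SIMD flag and the size flag combine, here is a minimal sketch of a wrap-around addition on a 64-bit register. The function name and parameters are hypothetical. The text does not say what happens to the upper part of the destination in the non-SIMD case; this sketch simply leaves it at zero, which is an assumption.

```python
def add(a, b, size_bytes, simd, reg_bytes=8):
    """Wrap-around addition of two register values.

    simd=False: only the lowest chunk of size_bytes is added
                (upper result bits left at 0: an assumption of this sketch).
    simd=True:  every chunk of size_bytes is added independently.
    """
    chunk_bits = 8 * size_bytes
    mask = (1 << chunk_bits) - 1
    chunks = reg_bytes // size_bytes if simd else 1
    result = 0
    for i in range(chunks):
        shift = i * chunk_bits
        total = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Carries never cross a chunk boundary: each sum is truncated.
        result |= (total & mask) << shift
    return result
```

For instance, `add(0x01FF, 0x0101, 1, simd=True)` adds the two lowest bytes separately and gives 0x0200, because the 0xFF + 0x01 chunk wraps to 0x00 without disturbing its neighbour, whereas `add(0x01FF, 0x0101, 2, simd=False)` treats the low 16 bits as a single value and gives 0x0300.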
For the floating-point instructions, the F-CPU defines an ``IEEE754 compliance flag''. This flag alters the IEEE standard behaviour of floating-point operations in two ways: when an error condition is detected, the processor does not trap, and the result values are saturated or biased. This flag is meant to ease the pipeline design of the FC0 core family, where no potentially faulting instruction may enter the pipeline. On other core families, this behaviour must be preserved. This flag is used when speed is more important than accuracy, so depending on the implementation it can also, for example, disable the use of IEEE denormal numbers.
This field is used by the integer addition, subtraction and multiply instructions, where the result does not completely fit in one register. There are three possibilities:
- ignore the high part (and ``wrap around''),
- saturate (``clip''), or
- write the high part to another register, whose number is destination+1 (next neighbour).
Triggering an exception on carry is out of the question because it would slow down the CPU in critical loops. Writing the carry to a special carry register would create some architectural problems, and writing the carry to one of the source operands would cancel the benefits of the three-operand instruction format.
Note that when carrying is performed with register #63 as destination, the carry does not get written anywhere because the "next" register is register #0 which is hardwired to 0.
This flag requires two bits: both are cleared by default (wrap around), or exactly one of them is set (either clip, or write to the neighbour of the result register). Depending on the kind of operation, the flag pair is called ``floor'' or ``saturate''.
The carry and saturation behaviours are written in assembly language with a ``c'' and an ``s'' postfix respectively. The default behaviour (wrap) is indicated by the absence of a postfix.
The forbidden combination (both c and s set) could be used later for a ``signed'' saturation where the floor and ceiling values are 0x8000 and 0x7FFF instead of 0x0000 and 0xFFFF.
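The three behaviours can be compared with a small sketch of an unsigned scalar addition. The register file model, the function names and the `mode` strings are hypothetical; the sketch assumes 64 registers with register #0 hardwired to 0, so that a carry aimed at the neighbour of register #63 is silently dropped, as described above.

```python
NUM_REGS = 64


def write_reg(regs, index, value):
    """Write a register; register 0 is hardwired to 0, so writes to it are lost."""
    if index != 0:
        regs[index] = value


def add_with_mode(regs, src1, src2, dest, mode="wrap", bits=8):
    """Unsigned addition of the low `bits` bits of two registers.

    mode "wrap":     keep the low part only (default, no postfix).
    mode "saturate": clip to the largest unsigned value (``s'' postfix).
    mode "carry":    also write the high part to register dest+1 (``c'' postfix).
    """
    mask = (1 << bits) - 1
    total = (regs[src1] & mask) + (regs[src2] & mask)
    low, high = total & mask, total >> bits
    if mode == "saturate":
        if high:
            low = mask
    elif mode == "carry":
        # The neighbour of register 63 is register 0 in this model.
        write_reg(regs, (dest + 1) % NUM_REGS, high)
    elif mode != "wrap":
        raise ValueError(f"unknown mode: {mode}")
    write_reg(regs, dest, low)
```

With register 1 holding 200 and register 2 holding 100, an 8-bit addition into register 3 gives 44 in wrap mode, 255 in saturate mode, and 44 plus a carry of 1 in register 4 in carry mode.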
In order to merge the result and the carry, the mixhi and mixlo instructions are provided. For example, the 16-bit SIMD values of an 8-bit subtraction can be generated in three instructions:
ssubb.b r1,r2,r3 ; r3=result, r4=borrow
mixlo.b r3,r4,r5 ; takes the two lower halves from r3 and r4 and mix them into r5
mixhi.b r3,r4,r6 ; takes the two higher halves from r3 and r4 and mix them into r6
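The effect of the two mix instructions can be pictured with the following sketch. The text only says that the lower (or higher) halves of the two sources are "mixed"; the exact chunk order used here (alternating chunks, first source in the lower position of each pair) is an assumption, as are the helper names.

```python
def split_chunks(value, size_bytes, reg_bytes=8):
    """Split a register value into chunks, lowest chunk first."""
    bits = 8 * size_bytes
    mask = (1 << bits) - 1
    return [(value >> (bits * i)) & mask for i in range(reg_bytes // size_bytes)]


def join_chunks(chunks, size_bytes):
    """Inverse of split_chunks."""
    result = 0
    for i, chunk in enumerate(chunks):
        result |= chunk << (8 * size_bytes * i)
    return result


def mix(a, b, size_bytes, high, reg_bytes=8):
    """Interleave the chunks of one half of `a` with the same half of `b`.

    high=False models mixlo (lower halves), high=True models mixhi.
    Assumed order: a's chunk, then b's chunk, repeated.
    """
    chunks_a = split_chunks(a, size_bytes, reg_bytes)
    chunks_b = split_chunks(b, size_bytes, reg_bytes)
    half = len(chunks_a) // 2
    start = half if high else 0
    mixed = []
    for i in range(start, start + half):
        mixed += [chunks_a[i], chunks_b[i]]
    return join_chunks(mixed, size_bytes)
```

Under this assumption, mixing byte results from r3 with byte borrows from r4 yields 16-bit chunks whose low byte is the result and whose high byte is the borrow: four of them from `mix(r3, r4, 1, high=False)` and the other four from `mix(r3, r4, 1, high=True)`.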
Note that the carry (or "borrow" [sub], or "high" [mul], or "modulo" [div]) flag might influence the instruction decoding rules in future F-CPU implementations. This is not a problem for FC0, but it should change with superscalar designs, due to register set size limitations.
The Load/Store instructions and the dedicated unit(s) can specify in which endianness the memory access operations are performed. This is optional for minimal and embedded systems because the necessary hardware may not be justified, in which case little endian is recommended. For general purpose applications, dual endianness support is recommended because the OS may be written for one endianness and the application for the other.
The Load/Store instructions can specify which of the seven "streams" the pointer belongs to. In the F-CPU, a "stream" is similar in meaning to a stream in the CRAY T3E, but with a different mechanism. This can be implemented as several L/S Units (the stream number references an individual LSU), as support for different user-visible DRAM banks, strides, channels or cache sets, or as any combination. As the name indicates, this should help the CPU separate independent data streams and avoid datapath congestion and cache thrashing, and so increase the effective bandwidth with no additional complex hardware.
This field can be silently ignored by the CPU if the implementation can't support this feature.
5.5.6 other flags / reserved fields :
At the moment, not all the bits have been allocated. There are bit fields that are not yet used and should be cleared (0), so as to preserve the forward compatibility of the architecture. This is valid for any field marked as reserved, ignored, unused or empty. These bits may be used for any purpose at any time without notice. The group may implement an F-CPU with support for a logarithmic and/or fractional number system, in which case bit #11, which is reserved in most instructions, will be very useful.