F-CPU Design Team
FCPU MANUAL REV. 0.1

 

       Part 5 :

 

 

 

The F-CPU Instruction Set Architecture

 

 

 

 

 


 
Summary



       5.1 Designing an instruction set
       5.2 Instruction formats
       5.3 The ISA modularity
       5.4 The 2r1w format and its extensions
       5.5 Flags
             5.5.1 Size flags
             5.5.2 SIMD flag
             5.5.3 IEEE flag
             5.5.4 saturate/carry flag
             5.5.5 Endian flag
             5.5.6 Stream Hint flag
             5.5.7 other flags / reserved fields
       5.6 Preliminary opcode map overview

 


 

5.1 Designing an instruction set :

       Once the most fundamental features and characteristics of the CPU have been agreed upon, it is then necessary to define the instruction set.

       For the F-CPU, this is not completely straightforward, even though the architecture is rather simple and does not include big innovations. The real problem lies in the iterative way things are decided and integrated into the CPU. The Instruction Set Architecture (ISA) faces many constraints, of which the ability to evolve is the greatest. The ISA determines many characteristics for the future because it cannot be swapped like a CPU on a socket. Since so many characteristics determine the lifespan of the whole architecture and project, all the information disclosed here must be considered temporary and will change without notice. In practice, the ISA will be defined slowly, after each simulation cycle from which one can draw conclusions on the usefulness and necessity of a particular opcode or flag.

       So, the instruction set will change often and evolve a lot before it is completely defined by the group. Some changes may even take place after the first prototypes or chips are built. Therefore, the current ISA is not completely defined at the time of writing, and several tricks are used to ease its development.

       First, the instruction word itself, which is 32 bits wide, must be used flexibly. The instructions that the F-CPU will execute require a variable number of operands and flags. These are gathered in the middle of the word so that bit field allocation is easier. The opcode (an 8-bit field that defines which instruction it is) sits at one end of the word (the LSBs) and the destination register at the other end (the MSBs). The immediate data can range from 6 to 16 bits, and one or two other register operands can be included. The remaining room is filled by the flags; when there is not enough room, the flags can be merged with the instruction's opcode or the immediate data field can be narrowed. When there is still some room left, the immediate data field can be extended (even though the flags usually try to use as much space as possible).


figure 8 : Preliminary overview of the instruction forms

       We design the instruction set from a census of all the necessary instructions and the forms they use. The width of the immediate field is not defined yet; it is left to the final synthesis. Once all the necessary instruction forms have been summed up, the fields will be allocated. They will be placed according to their functions, and similar functions will be grouped. This is very simple for the register fields but less easy for the flags. The size of the immediate data fields will be determined once all the other fields have been allocated.

       The second trick optimizes the opcode map. Of course, there will be a lot of room in it for future opcodes. But even once the opcode count is known, the opcode values can be redefined until the final prototype is made. This means that before F1 comes out, binary compatibility is uncertain, but the opcodes will be defined with include files in the simulators and the emulators. This leaves all the necessary room to "allocate" the opcode values at the last moment and optimize them to simplify the instruction decoding logic. At any time, however, compatibility is kept at the source level in the assembly language files. Only their encoding can change during the development.
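
       The idea of keeping opcode values symbolic can be sketched as follows. This is only an illustration in Python: the two opcode tables and their numeric values are invented for the example and do not come from the F-CPU include files.

```python
# Two hypothetical drafts of the opcode map. The numeric values are
# invented for illustration; the real values are not fixed yet.
OPCODES_DRAFT_A = {"add": 0x01, "sub": 0x05, "mul": 0x08}
OPCODES_DRAFT_B = {"add": 0x10, "sub": 0x11, "mul": 0x12}


def opcode_of(mnemonic, table):
    """Look up the 8-bit opcode value of a mnemonic in the given table."""
    value = table[mnemonic]
    if not 0 <= value <= 0xFF:
        raise ValueError(f"opcode for {mnemonic!r} does not fit in 8 bits")
    return value


# The same source program is valid with both drafts: only the encoding differs.
source = ["add", "sub", "mul"]
encoded_a = [opcode_of(m, OPCODES_DRAFT_A) for m in source]
encoded_b = [opcode_of(m, OPCODES_DRAFT_B) for m in source]
```

       Because the program refers to opcodes only by name, reallocating the values means regenerating the table, not rewriting the source.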

       This methodology allows the group to work with early implementations of the chip and synthesise the instruction set before it comes out. No arbitrary decision is made because every feature will be analyzed and discussed by the group.

 

5.2 Instruction formats :

       The F-CPU is a RISC-like processor with 32-bit wide instructions. The opcode field is 8 bits wide, each register requires a 6-bit field, and the remaining space is used for immediate values and flags. The following (preliminary) tables show how they can be organized.

       Notice that the opcode field is in the Least Significant Bits but the most used register operand is in the Most Significant Bits. Therefore, by convention, the assembly language writes the operands in this order : first the opcode, possibly followed by the flags, then the immediate values and the source operands, and finally the destination register. For example :

add.b r1,r2,r3 ; adds the bytes in the lower parts of r1 and r2, result put in r3.  

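       The most used instruction formats are :

| function | Opcode | Flags | Reg 3 | Reg 2 | Reg 1 |
|----------|--------|-------|-------|-------|-------|
| size     | 8      | 6     | 6     | 6     | 6     |
| bits     | 0-7    | 8-13  | 14-19 | 20-25 | 26-31 |

| function | Opcode | Flags | Imm 8 | Reg 2 | Reg 1 |
|----------|--------|-------|-------|-------|-------|
| size     | 8      | 4     | 8     | 6     | 6     |
| bits     | 0-7    | 8-11  | 12-19 | 20-25 | 26-31 |

| function | Opcode | Flags | Imm 16 | Reg 1 |
|----------|--------|-------|--------|-------|
| size     | 8      | 2     | 16     | 6     |
| bits     | 0-7    | 8-9   | 10-25  | 26-31 |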
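
       As an illustration, the first format listed above (8-bit opcode, 6-bit flags, three 6-bit register fields) can be packed and unpacked with a few shifts and masks. This is a minimal sketch in Python that takes bit 0 as the least significant bit. The opcode value is hypothetical, and the way the assembly operands map onto the Reg 2 and Reg 3 fields is an assumption; only the placement of the destination at the MSB end (Reg 1) is stated in the text.

```python
OPCODE_BITS, FLAG_BITS, REG_BITS = 8, 6, 6


def encode_3r(opcode, flags, reg3, reg2, reg1):
    """Pack the fields of the three-register form into a 32-bit word."""
    fields = (("opcode", opcode, OPCODE_BITS), ("flags", flags, FLAG_BITS),
              ("reg3", reg3, REG_BITS), ("reg2", reg2, REG_BITS),
              ("reg1", reg1, REG_BITS))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    # bits 0-7: opcode, 8-13: flags, 14-19: reg3, 20-25: reg2, 26-31: reg1
    return opcode | (flags << 8) | (reg3 << 14) | (reg2 << 20) | (reg1 << 26)


def decode_3r(word):
    """Split a 32-bit instruction word back into its fields."""
    return {
        "opcode": word & 0xFF,
        "flags": (word >> 8) & 0x3F,
        "reg3": (word >> 14) & 0x3F,
        "reg2": (word >> 20) & 0x3F,
        "reg1": (word >> 26) & 0x3F,
    }


OP_ADD_EXAMPLE = 0x01  # hypothetical value, the real opcode map is not fixed

# Assumed mapping: destination r3 goes into the Reg 1 field (MSB end).
word = encode_3r(OP_ADD_EXAMPLE, flags=0, reg3=1, reg2=2, reg1=3)
fields = decode_3r(word)
```

       The reserved or unused flag bits are left at zero here, in line with the recommendation for reserved fields.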

 

[to be changed soon]

 

5.3 The ISA modularity :

       The F-CPU instruction set is modular: it contains a ``core'' and several ``optional'' instruction groups, whose operations would otherwise take several core instructions to complete. The presence of these optional instructions can be detected at run time through the indications contained in a set of hardwired Special Registers.

       It must be understood that the ``core'' instruction set is meant to provide minimal binary compatibility across different implementations. Any chip can hardwire one or more ``optional'' instructions independently of other considerations. This depends on the performance needed, the target application, the available technology and the algorithms.

       What is core and what is optional ? As a rule of thumb, the optional instructions include "features" that usually require more hardware or more complex circuits. For example, the SIMD capability is recommended but not obligatory because a SIMD arithmetic unit is more complex than a scalar unit. The increment-based instructions, the floating-point instructions, the logarithmic instructions and the SRB management instructions are enabled when the corresponding Execution Unit or functionality is implemented. It is possible to implement a truly minimal F-CPU and extend it by adding the desired instructions and Execution Units, leaving opcodes unused when there are not enough transistors.

 

5.4 The 2r1w format and its extensions :

       The F-CPU increases the MOPS/MIPS ratio of its architecture by breaking the golden rule that limits an instruction to 2 register reads and 1 register write. Several instructions of the F-CPU need more than one register to be written back to the register set, and others need three register operands to be read. Those non-RISC instructions are marked as 3r1w or 2r2w in this document, as they might influence the coding rules of future F-CPU implementations. Their support is optional (non-core).

 

5.5 Flags :

       The instructions share a certain number of properties, which are put in ``flags'' outside of the opcode field. While their position can change in the future, their meaning will roughly remain the same throughout all the processor generations.
       The flags do not alter the syntax of the instructions much. They add one letter per flag to the existing mnemonic, so one can always recognize the instruction. This avoids the proliferation of obscure mnemonics and the need to remember them all. On the other hand, the size of the mnemonics is variable and can range from two ( or ) to nine ( sshiftrai ) letters, and the mnemonics will probably be reorganized later to shorten the longest ones. Usually, the flag letters are added in the order in which they appear in the instruction word.

 

5.5.1 Size flags :

       In some opcodes the flags can contain a ``size'' parameter that defines the size of the operands on which the operation takes place. By default, this flag is decoded according to the following table:

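| Flags | Size (bytes) | Suffix | Name             |
|-------|--------------|--------|------------------|
| 00    | 1            | B      | Byte             |
| 01    | 2            | D      | Double-Byte      |
| 10    | 4            | Q      | Quad-Byte        |
| 11    | 8            | (none) | Octa-Byte (Word) |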

       In the F-CPU assembly language, the size flag is noted by a postfix on the opcode, either ``.b'', ``.d'', ``.q'' or a plain number when the current settings don't provide the needed size. In the absence of a size postfix, the flag is set to ``11''. If the CPU is a 32-bit-only version, the ``11'' code is mapped to ``10'' (32 bits), so that this code always selects the largest word supported by the machine.
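
       The default decoding described above can be sketched in a few lines of Python. The function and table names are illustrative only, and the ``plain number'' postfix is left out because its encoding is not specified here.

```python
# Size flag values for each assembly postfix, following the default table.
SIZE_FLAG_BY_POSTFIX = {".b": 0b00, ".d": 0b01, ".q": 0b10, "": 0b11}


def size_flag(postfix, cpu_is_32bit=False):
    """Return the 2-bit size flag for a postfix ('' means no postfix)."""
    flag = SIZE_FLAG_BY_POSTFIX[postfix]
    if cpu_is_32bit and flag == 0b11:
        # On a 32-bit-only CPU, "11" is mapped to "10" (largest word).
        flag = 0b10
    return flag


def size_in_bytes(flag):
    """Default decoding of the size flag: 1, 2, 4 or 8 bytes."""
    return 1 << flag
```

       For example, size_in_bytes(size_flag(".d")) gives 2, while an instruction without a postfix selects 8 bytes on a 64-bit CPU and 4 bytes on a 32-bit-only one.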

       When the data width of the CPU increases, the processor can change the interpretation of this flag with a set of special registers. This allows the F-CPU platform to handle any data width that is a power of two above 32 bits. The SIMD words and algorithms will scale up in a straightforward fashion to 128-bit, 256-bit, 512-bit, 1024-bit etc.

 

5.5.2 SIMD flag :

       The F-CPU is a SIMD-oriented processor. Most instructions operating on data can specify whether these data are treated as a whole or in individual chunks. The SIMD flag, along with the size flag, specifies how the data are treated.

       When the SIMD flag is not set, the CPU behaves like a normal processor, treating each register depending on the size flag. The whole register, or only the lower part, is treated.

       When the SIMD flag is set, the CPU treats the whole register in its full width and the size flag defines the size of the individual chunks inside this large word.

       Syntactically, in the F-CPU assembly language, the SIMD flag is noted by an ``s'' prefix on the opcode, in a similar fashion to the leading ``f'' for the floating-point operations.
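
       The interaction between the SIMD flag and the size flag can be illustrated with a wrap-around addition on a 64-bit register, modelled here as a Python integer. This is a sketch under assumptions: the helper names are invented, and in the non-SIMD case the function simply returns the low part of the result, since the fate of the upper bits is not specified in this section.

```python
def split_chunks(value, chunk_bytes, width_bytes=8):
    """Split a register value into chunks, least significant chunk first."""
    chunk_bits = 8 * chunk_bytes
    mask = (1 << chunk_bits) - 1
    count = width_bytes // chunk_bytes
    return [(value >> (chunk_bits * i)) & mask for i in range(count)]


def join_chunks(chunks, chunk_bytes):
    """Reassemble chunks (least significant first) into one value."""
    chunk_bits = 8 * chunk_bytes
    value = 0
    for i, chunk in enumerate(chunks):
        value |= chunk << (chunk_bits * i)
    return value


def add(a, b, chunk_bytes, simd, width_bytes=8):
    """Wrap-around addition, scalar (low part only) or SIMD (every chunk)."""
    mask = (1 << (8 * chunk_bytes)) - 1
    if not simd:
        # Assumption: only the low part is computed and returned.
        return (a + b) & mask
    pairs = zip(split_chunks(a, chunk_bytes, width_bytes),
                split_chunks(b, chunk_bytes, width_bytes))
    return join_chunks([(x + y) & mask for x, y in pairs], chunk_bytes)


a = 0x00FF0102030405FF
b = 0x0001010101010101
simd_bytes = add(a, b, chunk_bytes=1, simd=True)    # eight independent bytes
scalar_byte = add(a, b, chunk_bytes=1, simd=False)  # low byte only
```

       With the SIMD flag set, a carry out of one byte never reaches its neighbour, which is what distinguishes the result from a plain 64-bit addition.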

 

5.5.3 IEEE flag :

       For the floating-point instructions, the F-CPU defines an ``IEEE754 compliance flag''. This flag alters the IEEE standard behaviour for floating point operations in two ways : when an error condition is detected, the processor does not trap, and the result values are saturated or biased. This flag is meant to ease the pipeline design of the FC0 core family, where no potentially faulting instruction may enter the pipeline. On other core families, this behaviour must be preserved. This flag is used when speed is more important than accuracy, so it can also, depending on the implementation, disable the use of IEEE denormal numbers for example.

 

5.5.4 saturate/carry flag :

       This field is used by the integer addition, subtraction and multiply instructions when the result does not completely fit in one register. There are three possibilities :
            - ignore the high part (and ``wrap around''),
            - saturate (``clip''), or
            - write the high part to another register, whose number is destination+1 (next neighbour).
       Triggering an exception on carry is out of question because it would slow down the CPU in critical loops. Writing the carry to a special carry register would create some architectural problems and writing the carry to one of the source operands would cancel the benefits of the three-operands instruction format.

       Note that when carrying is performed with register #63 as destination, the carry does not get written anywhere because the "next" register is register #0 which is hardwired to 0.

       This flag requires two bits, which can be zeroed (default : wrap around), or one of them is set (either clip or write to the neighbour of the result register). Depending on the kind of operation, the flag pair is called ``floor'' or ``saturate''.

       The carry or saturation behaviours are written in assembly language with a ``c'' and ``s'' postfix respectively. The default behaviour (wrap) is noted by the absence of postfix.

       The forbidden combination (both c and s set) could be used later for a ``signed'' saturation where the floor and ceiling values are 0x8000 and 0x7FFF instead of 0x0000 and 0xFFFF.
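
       The three behaviours can be compared on an unsigned addition. The following Python sketch is illustrative only: the function names and the mode strings are invented, and the register file is modelled as a plain list of 64 integers with register #0 hardwired to zero.

```python
def add_with_mode(a, b, bits, mode):
    """Unsigned addition on `bits` bits. Returns (low, high).

    `high` is None unless the carry mode is selected.
    """
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)
    if mode == "wrap":        # default: ignore the high part
        return total & mask, None
    if mode == "saturate":    # clip to the largest representable value
        return min(total, mask), None
    if mode == "carry":       # high part goes to the neighbour register
        return total & mask, total >> bits
    raise ValueError(f"unknown mode {mode!r}")


def write_back(regs, dest, low, high):
    """Write the result, and the high part to register dest+1 if present."""
    if dest != 0:                      # register #0 is hardwired to 0
        regs[dest] = low
    if high is not None:
        neighbour = (dest + 1) % 64    # register #63 wraps to register #0
        if neighbour != 0:
            regs[neighbour] = high


regs = [0] * 64
low, high = add_with_mode(0xFFFF, 2, bits=16, mode="carry")
write_back(regs, 3, low, high)    # r3 = low part, r4 = carry
write_back(regs, 63, low, high)   # r63 = low part, the carry is dropped
```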

       In order to merge the result and the carry, the mixhi and mixlo instructions are provided. For example, the 16-bit SIMD values of an 8-bit subtraction can be generated in three instructions :
            ssubb.b r1,r2,r3 ; r3=result, r4=borrow
            mixlo.b r3,r4,r5 ; takes the two lower halves from r3 and r4 and mix them into r5
            mixhi.b r3,r4,r6 ; takes the two higher halves from r3 and r4 and mix them into r6
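
       The exact semantics of mixlo and mixhi are not spelled out in this section. One plausible reading, assumed in the Python sketch below, is that they interleave the chunks taken from the lower (or higher) halves of the two source registers, with the chunk of the first source in the less significant position. The function name and the 64-bit register width are also assumptions.

```python
def mix(a, b, chunk_bytes, high, width_bytes=8):
    """Interleave the chunks of one half of `a` and `b` (assumed semantics)."""
    chunk_bits = 8 * chunk_bytes
    mask = (1 << chunk_bits) - 1
    count = width_bytes // chunk_bytes
    # Select the chunk indices of the lower or higher half of the register.
    half = range(count // 2, count) if high else range(count // 2)
    result = 0
    position = 0
    for i in half:
        chunk_a = (a >> (chunk_bits * i)) & mask
        chunk_b = (b >> (chunk_bits * i)) & mask
        result |= chunk_a << (chunk_bits * position)
        result |= chunk_b << (chunk_bits * (position + 1))
        position += 2
    return result


r3 = 0x0807060504030201   # stands for the eight 8-bit results
r4 = 0x1817161514131211   # stands for the eight 8-bit high parts
r5 = mix(r3, r4, chunk_bytes=1, high=False)   # like mixlo.b r3,r4,r5
r6 = mix(r3, r4, chunk_bytes=1, high=True)    # like mixhi.b r3,r4,r6
```

       Under this reading, each 16-bit chunk of r5 and r6 pairs one byte of the result with the matching byte of the high part, which is how the two registers would hold the eight widened values.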

       Note that the carry (or "borrow" [sub], or "high" [mul], or "modulo" [div] flag) might influence the instruction decoding rules in future F-CPU implementations. This is not a problem for FC0 but it should change with superscalar designs, due to register set size limitations.

 

5.5.5 Endian flag :

       The Load/Store instructions and the dedicated unit(s) can specify in which endianness the memory access operations are performed. This is optional for minimal and embedded systems because the necessary hardware may not be justified, in which case little-endian is recommended. For general purpose applications, dual endianness support is recommended because the OS may be written for one endianness and the application for the other.
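
       As a reminder of what the flag selects, the following Python sketch loads the same four bytes of memory with both byte orders. The load function and its big_endian parameter are illustrative names, not part of the F-CPU definition.

```python
def load(memory, address, size, big_endian):
    """Read `size` bytes at `address` and interpret them as an integer."""
    raw = memory[address:address + size]
    return int.from_bytes(raw, "big" if big_endian else "little")


value = 0x11223344
memory = value.to_bytes(4, "little")   # stored least significant byte first

as_little = load(memory, 0, 4, big_endian=False)
as_big = load(memory, 0, 4, big_endian=True)
```

       Reading the data back with the byte order it was written in recovers the value, while the other setting yields the byte-swapped 0x44332211.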

 

5.5.6 Stream Hint flag :

       The Load/Store instructions can specify which of the seven "streams" the pointer belongs to. In the F-CPU, a "stream" is similar in meaning to a stream in the CRAY T3E, but with a different mechanism. This can be implemented as several L/S Units (the stream number references an individual LSU), as support for different user-visible DRAM banks, strides, channels or cache sets, or as any combination of these. As the name indicates, this should help the CPU separate independent data streams and avoid datapath congestion and cache thrashing, and so increase the effective bandwidth without additional complex hardware.
       This field can be silently ignored by the CPU if the implementation can't support this feature.

 

5.5.7 other flags / reserved fields :

       At the moment, not all the bits have been allocated. Some bit fields are not yet used and should be cleared (0) to preserve the forward compatibility of the architecture. This applies to any field marked as reserved, ignored, unused or empty. These bits may be used for any purpose at any time without notice. The group may implement an F-CPU with support for logarithmic and/or fractional number systems, in which case bit #11, which is reserved in most instructions, will be very useful.

 

5.6 Preliminary opcode map overview :

       This table is a first census of the available instructions. It is currently used to determine the efficiency of the instruction word format.

| # | opcode | names | Core | Form | SIMD | EU | remarks |
|---|--------|-------|------|------|------|----|---------|
| 1 | OP_ADD | add, adds | * | (t0)2r1w | * | ASU | 1 bit left for LNS or fractional (Q) format |
| 2 | individual opcode ? | addc | | (t0)2r2w | * | ASU | (idem, LNS/Q) |
| 3 | OP_ADDI | addi | | i81r1w | * | ASU | (idem, LNS/Q) |
| 4 | OP_ADDSUB | addsub, addsubs | | (t0)2r2w | * | ASU(x2) | (idem, LNS/Q) |
| 5 | OP_SUB | sub, subb | * | (t0)2r1w | * | ASU | (idem, LNS/Q) |
| 6 | individual opcode ? | subb | | (t0)2r2w | * | ASU | (idem, LNS/Q) |
| 7 | OP_SUBI | subi | | i81r1w | * | ASU | (idem, LNS/Q) |
| 8 | OP_MUL | mul, muls | * | (t0)2r1w | * | IMU | (idem, LNS/Q) |
| 9 | individual opcode ? | mulh, mulsh | | (t0)2r2w | * | IMU | (idem, LNS/Q) |
| 10 | OP_MULI | muli | | i8r1w | * | IMU | (idem, LNS/Q) |
| 11 | OP_MAC | mac, macs | | (t0)3r1w | * | IMU+ASU | (idem, LNS/Q) |
| 12 | individual opcode ? | mach, macsh | | (t0)3r1w | * | IMU+ASU | (idem, LNS/Q) |
| 13 | OP_DIV | div, divs | * | (t0->T)2r1w | * | IDU | (idem, LNS/Q) |
| 14 | individual opcode ? | divm, divms | | (t0->T)2r2w | * | IDU | (idem, LNS/Q) |
| 15 | OP_DIVI | divi | | i81r1w | * | IDU | (idem, LNS/Q) |
| 16 | OP_MOD | mod, mods | | (t0->T)2r1w | * | IDU | (idem, LNS/Q) |
| 17 | OP_MODI | modi | | i81r1w | * | IDU | (idem, LNS/Q) |
| 18 | OP_POPCOUNT | popcount | | (t0)1r1w | * | ? | |
| 19 | OP_INC | inc | | 1r1w | * | INC | |
| 20 | OP_DEC | dec | | 1r1w | * | INC | |
| 21 | OP_NEG | neg | | 1r1w | * | INC | |
| 22 | OP_ABS | abs | | (tMSB)1r1w | * | INC | |
| 23 | OP_SCAN | scan, lsb1, lsb0, msb1, msb0 | | (t?)1r1w | * | INC | |
| 24 | OP_CMPL | cmpl | | 2r1w | * | INC | Problem with signed operations |
| 25 | OP_CMPLI | cmpli | | i81r1w | * | INC | (idem, sign) |
| 26 | OP_CMPLE | cmple | | 1r1w | * | INC | (idem, sign) |
| 27 | OP_CMPLEI | cmplei | | i81r1w | * | INC | (idem, sign) |
| 28 | OP_MAX | max | | 2r1w | * | INC | (idem, sign) |
| 29 | OP_MAXI | maxi | | i81r1w | * | INC | (idem, sign) |
| 30 | OP_MIN | min | | 2r1w | * | INC | (idem, sign) |
| 31 | OP_MINI | mini | | i81r1w | * | INC | (idem, sign) |
| 32 | OP_SORT | sort | | 2r2w | * | INC | (idem, sign) |
| 33 | OP_LADD | ladd | | 2r1w | * | LASU | |
| 34 | OP_LSUB | lsub | | 2r1w | * | LASU | |
| 35 | OP_L2INT | l2int | | (tMSB)1r1w | * | LCONV | |
| 36 | OP_INT2L | int2l | | (t0)1r1w | * | LCONV | |
| 37 | OP_SHIFTL | shiftl | * | 2r1w | * | SHUFFLER | 1 bit left for LNS ? |
| 38 | OP_SHIFTLI | shiftli | | i61r1w | * | SHUFFLER | (idem, LNS) |
| 39 | OP_SHIFTR | shiftr | * | 2r1w | * | SHUFFLER | (idem, LNS) |
| 40 | OP_SHIFTRI | shiftri | | i61r1w | * | SHUFFLER | (idem, LNS) |
| 41 | OP_SHIFTRA | shiftra | * | 2r1w | * | SHUFFLER | (idem, LNS) |
| 42 | OP_SHIFTRAI | shiftrai | | i61r1w | * | SHUFFLER | (idem, LNS) |
| 43 | OP_ROTL | rotl | * | 2r1w | * | SHUFFLER | |
| 44 | OP_ROTLI | rotli | | i61r1w | * | SHUFFLER | |
| 45 | OP_ROTR | rotr | * | 2r1w | * | SHUFFLER | |
| 46 | OP_ROTRI | rotri | | i61r1w | * | SHUFFLER | |
| 47 | OP_BITOP | bitop, bchg, bset, bclr, btst | | 2r1w | * | SHUFFLER | |
| 48 | OP_BITOPI | bitopi, bchgi, bseti, bclri, btsti | | i61r1w | * | SHUFFLER | |
| 49 | OP_BITREV | bitrev | | 2r1w | | SHUFFLER | |
| 50 | individual opcode ? | bitrevo | | 3r1w | | SHUFFLER | |
| 51 | OP_BITREVI | bitrevi | | i81r1w | | SHUFFLER | |
| 52 | individual opcode ? | bitrevio | | i82r1w | | SHUFFLER | |
| 53 | OP_BYTEREV | byterev | | (t0)1r1w | * | SHUFFLER | |
| 54 | OP_MIX | mixl, mixh | | (t0)1r1w | ! | SHUFFLER | |
| 55 | OP_EXPAND | expandl, expandh | | (t0)1r1w | ! | SHUFFLER | |
| 56 | OP_SDUP | sdup | | 1r1w | ! | SHUFFLER | |
| 57 | OP_LOGIC | logic, or, orn, and, andn, xor, nxor, not, nor, nand | * | 2r1w | ! | ROP2 | |
| 58 | OP_LOGICI | logici, andi, andni, ori, xori | | i81r1w | | ROP2 | not enough room here. |
| 59 | OP_FADD | fadd, faddx | f1 | 2r1w | * | FASU | |
| 60 | OP_FSUB | fsub, fsubx | f1 | 2r1w | * | FASU | |
| 61 | OP_FMUL | fmul, fmulx | f1 | 2r1w | * | FMU | |
| 62 | OP_F2INT | f2int, f2intx | f1 | 1r1w | * | ? | |
| 63 | OP_INT2F | int2f, int2fx | f1 | 1r1w | * | ? | |
| 64 | OP_FIAPRX | fiaprx, fiaprxx | f1 | 1r1w | * | FLUT | |
| 65 | OP_FSQRTIAPRX | fsqrtiaprx, fsqrtiaprxx | f1 | 1r1w | * | FLUT | |
| 66 | OP_FDIV | fdiv, fdivx | f2 | 2r1w | * | ? | |
| 67 | OP_FSQRT | fsqrt, fsqrtx | f2 | 1r1w | * | ? | |
| 68 | OP_FLOG | flog, flogx | f3 | 1r1w | * | ? | |
| 69 | OP_FEXP | fexp, fexpx | f3 | 1r1w | * | ? | |
| 70 | OP_FMAC | fmac, fmacx | f3 | 3r1w | * | FASU+FMU | |
| 71 | OP_FADDSUB | faddsub, faddsubx | f3 | 1r1w | * | FASU x2 | |
| 72 | OP_LOAD | load, loade | * | (P)(t0)1r1w | | LSU | |
| 73 | * | load, loade | | (P)(t0)2r2w | | LSU+ASU | |
| 74 | OP_LOADI | loadi, loadie | | (P)[t0]i81r2w | | LSU+ASU | |
| 75 | OP_LOADF | loadf, loadfe | | (P)(t0)2r2w | | LSU+ASU | |
| 76 | OP_LOADIF | loadif, loadife | | (P)[t0]i81r2w | | LSU+ASU | |
| 77 | OP_STORE | store, storee | * | (P)(t0)2r | | LSU | |
| 78 | * | store, storee | | (P)(t0)3r1w | | LSU+ASU | |
| 79 | OP_STOREI | storei, storeie | | (P)(t0)i82r1w | | LSU+ASU | |
| 80 | OP_STOREF | storef, storefe | | (P)(t0)3r1w | | LSU+ASU | |
| 81 | OP_STOREIF | storeif, storeife | | (P)(t0)i82r1w | | LSU+ASU | |
| 82 | OP_CACHEMM | cachemm | | (P,t0)2r | | ? | |
| 83 | OP_MOVE | move -n, -s | * | 1C1r1w | | | #define OP_MOVE 0 |
| 84 | OP_LOADCONS | loadcons | * | i161w | | | 2 or 4 opcodes (range) |
| 85 | OP_LOADCONSX | loadconsx | * | i161w | | | 2 or 4 opcodes (range) |
| 86 | OP_GET | get | * | 1r1w | | SR | |
| 87 | OP_GETI | geti | | i161w | | SR | |
| 88 | OP_PUT | put | * | 2r | | SR | |
| 89 | OP_PUTI | puti | | i161r | | SR | |
| 90 | OP_LOADM | loadm | | (P,t0)3r | | | immediate version needed... |
| 91 | OP_STOREM | storem | | (P,t0)3r | | | (idem, immediate) |
| 92 | OP_JMPA | jmpa | * | h,(P)1C1r1w | | | missing : hints... |
| 93 | OP_LOADADDR | loadaddr, loadaddrd, loopentry | * | (t0)1r1w | | ASU | |
| 94 | OP_LOADADDRI | loadaddri, loadaddrid | * | i161w | | ASU | |
| 95 | OP_LOOP | loop | * | (I)2[1]r1w | | ASU/INC | |
| 96 | OP_SYSCALL | syscall, trap | * | i161r | | | |
| 97 | OP_HALT | halt | * | | | | |
| 98 | OP_RFE | rfe | * | | | | |
| 99 | OP_SRB_SAVE | srb_save | | | | | |
| 100 | OP_SRB_RESTAURE | srb_restaure | | | | | |
| 101 | OP_SRB_RESTAURE | srb_restaure | | | | | |
| 102 | OP_SERIALIZE | serialize | | | | | |

 


part5.html dec. 16 by Whygee