Part 6 :
6.1 Arithmetic Operations
6.1.1 Core Arithmetic operations
6.1.1.1 add, adds, addc, sadd, sadds, saddc
6.1.1.2 sub, subb, subf, ssub, ssubb, ssubf
6.1.1.3 mul, mulh, muls, mulsh, smul, smulh, smuls, smulsh
6.1.1.4 div, divs, divm, divms, sdiv, sdivs, sdivm, sdivms
6.1.2 Optional Arithmetic operations
6.1.2.1 addi, saddi
6.1.2.2 subi, ssubi
6.1.2.3 muli, smuli
6.1.2.4 divi, sdivi
6.1.2.5 mod, mods, smod, smods
6.1.2.6 modi, smodi
6.1.2.7 mac, macs, mach, machs, smac, smacs, smach, smachs
6.1.2.8 addsub, addsubs, saddsub, saddsubs
6.1.2.9 popcount, spopcount
6.1.3 Optional increment-based operations
6.1.3.1 inc, sinc
6.1.3.2 dec, sdec
6.1.3.3 neg, sneg
6.1.3.4 scan, sscan, lsb1, lsb0, msb1, msb0, slsb1, slsb0, smsb1, smsb0
6.1.3.5 cmpl, scmpl
6.1.3.6 cmple, scmple
6.1.3.7 cmpli, scmpli
6.1.3.8 cmplei, scmplei
6.1.3.9 abs, sabs
6.1.3.10 max, smax
6.1.3.11 min, smin
6.1.3.12 maxi, smaxi
6.1.3.13 mini, smini
6.1.3.14 sort, ssort
6.1.4 Optional Logarithmic Number System operations
6.1.4.1 ladd, sladd
6.1.4.2 lsub, slsub
6.1.4.3 l2int, sl2int
6.1.4.4 int2l, sint2l
6.2 Bit Shuffling based operations
6.2.1 Core Shift and Rotate operations
6.2.1.1 shiftl, sshiftl
6.2.1.2 shiftr, sshiftr
6.2.1.3 shiftra, sshiftra
6.2.1.4 rotl, srotl
6.2.1.5 rotr, srotr
6.2.2 Optional Bit Shift and Rotate operations
6.2.2.1 shiftli, sshiftli
6.2.2.2 shiftri, sshiftri
6.2.2.3 shiftrai, sshiftrai
6.2.2.4 rotli, srotli
6.2.2.5 rotri, srotri
6.2.2.6 bitop, sbitop, bchg, bset, bclr, btst, sbchg, sbset, sbclr, sbtst
6.2.2.7 bitopi, sbitopi, bchgi, bseti, bclri, btsti, sbchgi, sbseti, sbclri, sbtsti
6.2.3 Optional Bit Shuffling operations
6.2.3.1 bitrev, bitrevo
6.2.3.2 bitrevi, bitrevio
6.2.3.3 byterev, sbyterev
6.2.3.4 mix (mixl, mixh)
6.2.3.5 expand (expandl, expandh)
6.2.3.6 sdup
6.3 Logic operations
6.3.1 Core Logic operations
6.3.1.1 logic, or, orn, and, andn, xor, nxor, not, nor, nand
6.3.2 Optional Logic operations
6.3.2.1 logici, andi, andni, ori, xori
6.4 Floating Point Operations
6.4.1 Level 1 Floating Point Operations
6.4.1.1 fadd, sfadd, faddx, sfaddx
6.4.1.2 fsub, sfsub, fsubx, sfsubx
6.4.1.3 fmul, sfmul, fmulx, sfmulx
6.4.1.4 f2int, sf2int, f2intx, sf2intx
6.4.1.5 int2f, sint2f, int2fx, sint2fx
6.4.1.6 fiaprx, fiaprxx, sfiaprx, sfiaprxx
6.4.1.7 fsqrtiaprx, fsqrtiaprxx, sfsqrtiaprx, sfsqrtiaprxx
6.4.2 Level 2 Floating Point Operations
6.4.2.1 fdiv, fdivx, sfdiv, sfdivx
6.4.2.2 fsqrt, fsqrtx, sfsqrt, sfsqrtx
6.4.3 Level 3 Floating Point Operations
6.4.3.1 flog, flogx, sflog, sflogx
6.4.3.2 fexp, fexpx, sfexp, sfexpx
6.4.3.3 fmac, fmacx, sfmac, sfmacx
6.4.3.4 faddsub, faddsubx, sfaddsub, sfaddsubx
6.5 Memory Access operations
6.5.1 Core Memory Access operations
6.5.1.1 load, loade
6.5.1.2 store, storee
6.5.2 Optional Memory Access operations
6.5.2.1 load, loade
6.5.2.2 store, storee
6.5.2.3 loadi, loadie
6.5.2.4 storei, storeie
6.5.2.5 loadf, loadfe
6.5.2.6 loadif, loadife
6.5.2.7 storef, storefe
6.5.2.8 storeif, storeife
6.5.2.9 cachemm
6.6 Data move operations
6.6.1 Core Data move operations
6.6.1.1 move
6.6.1.2 loadcons
6.6.1.3 loadconsx
6.6.1.4 get
6.6.1.5 put
6.6.2 Optional Data move operations
6.6.2.1 loadm
6.6.2.2 storem
6.6.2.3 geti
6.6.2.4 puti
6.7 Instruction Flow Control Operations
6.7.1 Core Instruction Flow Control instructions
6.7.1.1 jmpa
6.7.1.2 loadaddr, loadaddrd
6.7.1.3 loopentry
6.7.1.4 loadaddri, loadaddrid
6.7.1.5 loop
6.7.1.6 syscall, trap
6.7.1.7 halt
6.7.1.8 rfe
6.7.2 Optional Instruction Flow Control instructions
6.7.2.1 srb_save
6.7.2.2 srb_restore
6.7.2.3 serialize
ADDition
add r3, r2, r1 adds r3, r2, r1 addc r3, r2, r1 sadd r3, r2, r1 sadds r3, r2, r1 saddc r3, r2, r1 |
Computes r1 = r2 + r3
add performs an integer addition of the two source operands (r3 + r2) and puts the result in the destination operand (r1).
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_ADD | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Saturation flag |
13 | -c postfix | 1 if set | Carry flag ( 2r2w ) |
Examples :
Scalar :
R1 contains 0xF8 (we only consider the lower byte in the registers)
R2 contains 0x0F
add.b r1,r2,r3 : r3 = 0x07 (default behaviour)
adds.b r1,r2,r3 : r3 = 0xFF (saturation)
addc.b r1,r2,r3 : r3 = 0x07, r4= 0x01 (carry)
SIMD :
R1 contains 0x000000F800000001 (in a 64-bit system)
R2 contains 0x0000000F00000002
sadd.b r1,r2,r3 : r3 = 0x0000000700000003 (default behaviour)
sadds.b r1,r2,r3 : r3 = 0x000000FF00000003 (saturation)
saddc.b r1,r2,r3 : r3 = 0x0000000700000003 , r4= 0x0000000100000000 (carry)
Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.
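The scalar and SIMD examples above can be reproduced with a short Python model. This is only an illustrative sketch, not the F-CPU implementation: the names add_lane and simd_add are hypothetical, ".b" is assumed to mean 8-bit lanes, and the carry is assumed to go to the register that follows the destination (r4 in the examples).

```python
def add_lane(a, b, bits=8, mode="wrap"):
    """Return (result, carry) for one unsigned lane of the given width."""
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)
    carry = total >> bits                  # 1 if the sum overflowed the lane
    if mode == "sat" and carry:
        return mask, carry                 # adds: clamp to the largest value
    return total & mask, carry             # add / addc: wrap around


def simd_add(a, b, bits=8, width=64, mode="wrap"):
    """Apply add_lane to every lane; return (packed results, packed carries)."""
    mask = (1 << bits) - 1
    res = car = 0
    for shift in range(0, width, bits):
        r, c = add_lane((a >> shift) & mask, (b >> shift) & mask, bits, mode)
        res |= r << shift
        car |= c << shift
    return res, car
```

With the operands of the scalar example, add_lane(0xF8, 0x0F) returns (0x07, 1), and mode="sat" gives 0xFF as the result.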
SUBtraction
sub r3, r2, r1 subb r3, r2, r1 subf r3, r2, r1 ssub r3, r2, r1 ssubb r3, r2, r1 ssubf r3, r2, r1 |
Computes r1 = r3 - r2
sub performs an integer subtraction of the two source operands (r3 - r2) and puts the result in the destination operand (r1).
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SUB | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -f postfix | 1 if set | Floor flag |
13 | -b postfix | 1 if set | Borrow flag ( 2r2w ) |
Examples :
Scalar :
R1 contains 0x05 (we only consider the lower byte in the registers)
R2 contains 0x07
sub.b r1,r2,r3 : r3 = 0xFE (default behaviour)
subf.b r1,r2,r3 : r3 = 0x00 (floor)
subb.b r1,r2,r3 : r3 = 0xFE, r4= 0xFF (borrow)
SIMD :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
ssub.b r1,r2,r3 : r3 = 0x000000FE00000002 (default behaviour)
ssubf.b r1,r2,r3 : r3 = 0x0000000000000002 (floor)
ssubb.b r1,r2,r3 : r3 = 0x000000FE00000002, r4= 0x000000FF00000000 (borrow)
Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.
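The subb and ssubb examples above can be checked with a similar Python sketch. The names sub_lane and simd_sub are hypothetical. The model assumes that the first listed operand is the minuend (as in the examples, where 0x05 - 0x07 gives 0xFE) and that a borrow is reported as an all-ones lane (0xFF for bytes), which is what the subb example shows.

```python
def sub_lane(a, b, bits=8, mode="wrap"):
    """Return (result, borrow) for a - b in one unsigned lane."""
    mask = (1 << bits) - 1
    diff = (a & mask) - (b & mask)
    borrow = mask if diff < 0 else 0       # all ones on borrow, as in the examples
    if mode == "floor" and diff < 0:
        return 0, borrow                   # subf: clamp at zero
    return diff & mask, borrow             # sub / subb: wrap around


def simd_sub(a, b, bits=8, width=64, mode="wrap"):
    """Apply sub_lane to every lane; return (packed results, packed borrows)."""
    mask = (1 << bits) - 1
    res = bor = 0
    for shift in range(0, width, bits):
        r, w = sub_lane((a >> shift) & mask, (b >> shift) & mask, bits, mode)
        res |= r << shift
        bor |= w << shift
    return res, bor
```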
MULtiplication
mul r3, r2, r1 mulh r3, r2, r1 muls r3, r2, r1 mulsh r3, r2, r1 smul r3, r2, r1 smulh r3, r2, r1 smuls r3, r2, r1 smulsh r3, r2, r1 |
Computes r1 = r2 x r3
mul performs an integer multiplication of the two source operands (r3 x r2) and puts the result in the destination operand (r1). The size flags indicate the size of the source operands.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MUL | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Sign flag |
13 | -h postfix | 1 if set | High flag ( 2r2w ) |
Examples :
Scalar :
R1 contains 0x23 (we only consider the lower byte in the registers)
R2 contains 0x36
mul.b r1,r2,r3 : r3 = 0x62 (default)
mulh.b r1,r2,r3 : r3 = 0x62 , r4 = 0x07 (High flag)
SIMD :
R1 contains 0x00 00 00 00 00 00 00 00 (in a 64-bit system)
R2 contains 0x00 00 00 00 00 00 00 00
smul.b r1,r2,r3 : r3 = 0x00 00 00 00 00 00 00 00
smulh.b r1,r2,r3 : r3 = 0x00 00 00 00 00 00 00 00 , r4 = 0x00 00 00 00 00 00 00 00
[To be completed later, once all the errors have been corrected]
Execution Unit : Integer Multiply Unit
Latency : currently unknown, depends on the size of the operands.
Throughput : currently unknown, probably 1 operation per cycle per IMU (pipelined multiplier).
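The relation between the default result and the High flag in the scalar example can be sketched as follows. The function name mul_lane is hypothetical, and only the unsigned case is modelled (the Sign flag is left out).

```python
def mul_lane(a, b, bits=8):
    """Return (low, high) halves of the unsigned product of two lanes."""
    mask = (1 << bits) - 1
    product = (a & mask) * (b & mask)      # up to 2 * bits wide
    low = product & mask                   # what mul writes to the destination
    high = (product >> bits) & mask        # what mulh additionally writes to the next register
    return low, high
```

For the scalar example, 0x23 x 0x36 = 0x0762, so mul_lane(0x23, 0x36) returns (0x62, 0x07).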
DIVision
div r3, r2, r1 divs r3, r2, r1 divm r3, r2, r1 divms r3, r2, r1 sdiv r3, r2, r1 sdivs r3, r2, r1 sdivm r3, r2, r1 sdivms r3, r2, r1 |
Computes r1 = r3 / r2
div performs an integer division of the two source operands (r3 / r2) and puts the result in destination operand (r1). The size defined by the size flags corresponds to the size of the source operands.
Remark : division is slow and costly. Whenever possible, use power-of-two divisors so that the operation reduces to a shift of the source operand, which takes only one cycle in the FC0.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_DIV | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Sign flag |
13 | -m postfix | 1 if set | Modulo flag ( 2r2w ) |
Examples :
Scalar :
R1 contains 0x10 (we only consider the lower byte in the registers)
R2 contains 0x05
div.b r1,r2,r3 : r3 = 0x03
divm.b r1,r2,r3 : r3 = 0x03 , r4 = 0x01
Execution Unit : Integer Divide Unit
Latency : currently unknown, depends on the size of the operands.
Throughput : currently unknown, probably equal to the latency (not pipelined).
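The divm example and the power-of-two remark can be illustrated with a small Python sketch. The names divm and div_pow2 are hypothetical, and only unsigned operands are modelled.

```python
def divm(a, b, bits=8):
    """Unsigned quotient and remainder of one lane (Python raises if b is 0)."""
    mask = (1 << bits) - 1
    return divmod(a & mask, b & mask)


def div_pow2(a, k, bits=8):
    """Quotient of an unsigned lane by 2**k, using a right shift only."""
    mask = (1 << bits) - 1
    return (a & mask) >> k
```

For example, divm(0x10, 0x05) returns (3, 1), and div_pow2(0x10, 2) gives the same quotient as dividing 0x10 by 4.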
ADDition Immediate
addi Imm8, r2, r1 saddi Imm8, r2, r1 |
Computes r1 = r2 + Imm8.
This instruction is similar to the "add" instruction, but it takes one of the source operands from the opcode (without sign extension). It has less room for options and flags, so the use of the reserved bit is still being discussed.
Remark : with wide operands, the latency may be higher than expected because the adder would use the full pipeline. To add or subtract 1 from a large number (more than 8 bits), it is recommended to use the inc/dec instructions (when available), because they use the increment unit, which has a lower latency.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_ADDI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R2 contains 0x00F80F00F045FF82 (in a 64-bit system)
addi.b 0x87,r2,r3 : r3 = 0x00F80F00F045FF09
addi.d 0x87,r2,r3 : r3 = 0x00F80F00F0450009
saddi.b 0x87,r2,r3 : r3 = 0x877F968777CC8609
saddi.d 0x87,r2,r3 : r3 = 0x017F0F87F0CC0009
Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.
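The four addi examples above can be reproduced with the following Python sketch. The function name addi and its keyword arguments are hypothetical. The model assumes, as the examples suggest, that ".b" and ".d" select 8-bit and 16-bit data, that the scalar form only modifies the lowest lane and leaves the upper bits of the source unchanged, and that the SIMD form applies the same immediate to every lane.

```python
def addi(imm8, src, bits=8, simd=False, width=64):
    """Add a zero-extended 8-bit immediate to the low lane, or to every lane if simd."""
    mask = (1 << bits) - 1
    imm = imm8 & 0xFF                      # no sign extension
    shifts = range(0, width, bits) if simd else [0]
    out = src
    for shift in shifts:
        lane = (((src >> shift) & mask) + imm) & mask   # wrap inside the lane
        out = (out & ~(mask << shift)) | (lane << shift)
    return out
```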
SUBtraction Immediate
subi Imm8, r2, r1 ssubi Imm8, r2, r1 |
Computes r1 = r2 - Imm8.
This instruction is similar to the "sub" instruction, but it takes one of the source operands from the opcode (Imm8), without sign extension (use addi instead to subtract a negative value). It has less room for options and flags, so the use of the reserved bit is still being discussed.
Remark : with wide operands, the latency may be higher than expected because the adder would use the full pipeline. To add or subtract 1 from a large number (more than 8 bits), it is recommended to use the inc/dec instructions (when available), because they use the increment unit, which has a lower latency.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_SUBI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.
MULtiplication Immediate
muli Imm8, r2, r1 smuli Imm8, r2, r1 |
Computes r1 = r2 x Imm8.
This instruction is similar to the "mul" instruction, but it takes one of the source operands from the opcode (Imm8) and sign-extends it. It has less room for options and flags, so the use of the reserved bit is still being discussed.
Remark : multiplication is slow and costly. Whenever possible, use power-of-two multipliers so that the operation reduces to a shift of the source operand, which takes only one cycle in the FC0.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_MULI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Integer Multiply Unit
Latency : currently unknown, depends on the size of the operands.
Throughput : currently unknown, probably 1 operation per cycle per IMU (pipelined multiplier).
DIVision Immediate
divi Imm8, r2, r1 sdivi Imm8, r2, r1 |
Computes r1 = r2 / Imm8.
This instruction is similar to "div", but the second operand is the sign-extended value of Imm8. It triggers a math trap if Imm8 is zero.
Remark : division is slow and costly. Whenever possible, use power-of-two divisors so that the operation reduces to a shift of the source operand, which takes only one cycle in the FC0.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_DIVI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Integer Divide Unit
Latency : currently unknown, depends on the size of the operands.
Throughput : currently unknown, probably equal to the latency (not pipelined).
MODulo
mod r3, r2, r1 mods r3, r2, r1 smod r3, r2, r1 smods r3, r2, r1 |
Computes r1 = r3 % r2
mod performs an integer modulo of the two source operands (r3 % r2) and puts the result in destination operand (r1).
Remark : the modulo computation is slow and costly. Whenever possible, use power-of-two moduli so that the operation reduces to masking off the upper bits of the source operand, which takes only one cycle in the FC0.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MOD | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Signed flag |
13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Integer Divide Unit
Latency : currently unknown, depends on the size of the operands.
Throughput : currently unknown, probably equal to the latency (not pipelined).
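The remark about power-of-two moduli amounts to keeping only the low bits of the operand. A minimal Python sketch, with the hypothetical name mod_pow2 and valid for unsigned values only:

```python
def mod_pow2(a, k):
    """a % 2**k for an unsigned a, computed by keeping only the k low bits."""
    return a & ((1 << k) - 1)
```

For example, mod_pow2(0x2B, 3) keeps the three low bits of 0x2B and returns 3, the same value as 0x2B % 8.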
MODulo Immediate
modi Imm8, r2, r1 smodi Imm8, r2, r1 |
Computes r1 = r2 % Imm8
modi performs an integer modulo of the two source operands (r2 % Imm8) and puts the result in the destination operand (r1). Imm8 is sign-extended (to be confirmed).
Remark : the modulo computation is slow and costly. Whenever possible, use power-of-two moduli so that the operation reduces to masking off the upper bits of the source operand, which takes only one cycle in the FC0.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_MODI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Integer Divide Unit
Latency : currently unknown, depends on the size of the operands.
Throughput : currently unknown, probably equal to the latency (not pipelined).
Multiply and ACcumulate
mac r3, r2, r1 macs r3, r2, r1 mach r3, r2, r1 machs r3, r2, r1 smac r3, r2, r1 smacs r3, r2, r1 smach r3, r2, r1 smachs r3, r2, r1 |
Computes r1 = r1 + ( r2 x r3 )
mac performs an integer multiplication of the two source operands (r3 x r2) and adds the result to the destination operand (r1). The size flags indicate the size of the source operands; the "granularity" of the destination operand is twice this size if the hardware supports it.
Remark : this instruction reads three operands and is therefore a 3r1w operation that is not in the core. Its implementation depends on architectural parameters.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MAC | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Sign flag |
13 | -h postfix | 1 if set | High flag |
Example :
Scalar :
R1 contains 0x23 (we only consider the lower byte in the registers)
R2 contains 0x36
R3 contains 0x0136
mac.b r1,r2,r3 : r3 = 0x0898
[To be completed later, once all the other errors have been corrected]
Execution Unit : Integer Multiply Unit then Add/Sub Unit
Latency : currently unknown, depends on the size of the operands.
Throughput : currently unknown, probably 1 operation per cycle per IMU+ASU (pipelined multiplier and adder).
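The multiply-accumulate behaviour described above can be sketched in Python as follows. The name mac is hypothetical, only the unsigned case is modelled, and the accumulator is assumed to be twice as wide as the sources (the "if the hardware can do it" case).

```python
def mac(acc, a, b, bits=8):
    """Return acc + a * b with sources of `bits` bits and a 2 * bits accumulator."""
    src_mask = (1 << bits) - 1
    acc_mask = (1 << (2 * bits)) - 1
    product = (a & src_mask) * (b & src_mask)
    return (acc + product) & acc_mask      # wrap inside the double-width lane


# 0x23 * 0x36 = 0x0762, and 0x0136 + 0x0762 = 0x0898
result = mac(0x0136, 0x23, 0x36)
```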
ADDition and SUBtraction
addsub r3, r2, r1 addsubs r3, r2, r1 saddsub r3, r2, r1 saddsubs r3, r2, r1 |
Computes r1 = r3 + r2 and r1+1 = r3 - r2
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_ADDSUB | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0x23 (we only consider the lower byte in the registers)
R2 contains 0x36
addsub.b r1,r2,r3 : r3 = 0x59 , r4 = 0xED
Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.
POPulation COUNT
popcount r2, r1 spopcount r2, r1 |
Computes r1 = nb_bits(r2)
popcount counts the number of set bits in r2 and writes the result to the destination operand (r1). The size flags indicate the size of the source operands.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_POPC | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0x0123456789ABCDEF
popcount r1,r2 : r2 = 0x0000000000000020
Execution Unit : Unknown
Latency : unknown, but it is O(log2(size)), if you wanted to know (just in case you're not a spook).
Throughput : unknown.
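The O(log2(size)) figure mentioned above corresponds to the classic pairwise-sum method, where neighbouring bit fields are added in parallel and the field width doubles at each step. The Python sketch below is one common way to do it for 64-bit data and is not taken from the F-CPU design; the name popcount64 is hypothetical.

```python
def popcount64(x):
    """Count the set bits of a 64-bit value in log2(64) = 6 pairwise-sum steps."""
    x &= 0xFFFFFFFFFFFFFFFF
    x = (x & 0x5555555555555555) + ((x >> 1) & 0x5555555555555555)    # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)    # 4-bit sums
    x = (x & 0x0F0F0F0F0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F0F0F0F0F)    # 8-bit sums
    x = (x & 0x00FF00FF00FF00FF) + ((x >> 8) & 0x00FF00FF00FF00FF)    # 16-bit sums
    x = (x & 0x0000FFFF0000FFFF) + ((x >> 16) & 0x0000FFFF0000FFFF)   # 32-bit sums
    return (x & 0xFFFFFFFF) + (x >> 32)                                # 64-bit sum
```

After the third step each byte holds its own bit count, which is presumably what a byte-wise SIMD variant would return.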
INCrement
inc r2, r1 sinc r2, r1 |
Computes r1 = r2 + 1
This instruction increments the source operand in a special unit that is designed for low latency when large data are processed. The value wraps around when reaching the maximum value.
In the future, the increment value could be specified so keep the reserved fields cleared.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_INC | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0xFF05891213450100 (in a 64-bit system)
sinc.b r1,r2 : r2 = 0x00068A1314460201
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
DECrement
dec r2, r1 sdec r2, r1 |
Computes r1 = r2 - 1
This instruction decrements the source operand in a special unit that is designed for low latency when large data are processed. The value wraps around when reaching the minimum value.
In the future, the decrement value could be specified so keep the reserved fields cleared.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_DEC | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0xFF05891213450100 (in a 64-bit system)
sdec.b r1,r2 : r2 = 0xFE048811124400FF
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
NEGation
neg r2, r1 sneg r2, r1 |
Computes r1 = not(r2) + 1
This instruction negates the source operand in a special unit that is designed for low latency when large data are processed.
This instruction is designed to work in the two's-complement number system (signed integer numbers) and is not subject to saturation/overflow problems.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_NEG | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0xFF05891213450100 (in a 64-bit system)
sneg.b r1,r2 : r2 = 0x01FB77EEEDBBFF00
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
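The byte-wise results of the inc, dec and neg examples can be checked with one small Python helper. All names here (simd_unary, inc, dec, neg) are hypothetical, and ".b" is assumed to mean 8-bit lanes that wrap around independently.

```python
def simd_unary(op, src, bits=8, width=64):
    """Apply op to every lane of src and repack, truncating each result to the lane."""
    mask = (1 << bits) - 1
    out = 0
    for shift in range(0, width, bits):
        out |= (op((src >> shift) & mask) & mask) << shift
    return out


def inc(x):
    return x + 1


def dec(x):
    return x - 1


def neg(x):
    return ~x + 1          # not(x) + 1, as in the definition of neg
```

With the source value 0xFF05891213450100, simd_unary(inc, ...), simd_unary(dec, ...) and simd_unary(neg, ...) give the three SIMD results listed in the examples.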
scan[n][r] r2, r1 sscan[n][r] r2, r1 lsb1 r2, r1 lsb0 r2, r1 msb1 r2, r1 msb0 r2, r1 slsb1 r2, r1 slsb0 r2, r1 smsb1 r2, r1 smsb0 r2, r1 |
Computes r1 = scan_for_lsb(r2)
This instruction scans the source operand (r2) for the first set bit, starting from the LSB, and writes the position of this bit to the destination register (r1). If the source is cleared, the result is zero; otherwise bit #0 is counted as position 1.
This instruction has options that bit-reverse the source and/or complement its bits, so it can also search, for example, for the last reset bit.
lsb1 is an alias for scan
lsb0 is an alias for scann
This instruction scans the source operand (r2) for the first reset bit, starting from the LSB, and writes the position of this bit to the destination register (r1). If the source is set (all ones), the result is zero; otherwise bit #0 is counted as position 1.
msb1 is an alias for scanr
This instruction scans the source operand (r2) for the first set bit, starting from the MSB, and writes the position of this bit to the destination register (r1). If the source is cleared, the result is zero; otherwise bit #0 is counted as position 1.
msb0 is an alias for scannr
This instruction scans the source operand (r2) for the first reset bit, starting from the MSB, and writes the position of this bit to the destination register (r1). If the source is set (all ones), the result is zero; otherwise bit #0 is counted as position 1.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SCAN | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -n postfix | 1 if set | Negate the input |
13 | -r postfix | 1 if set | Bit-Reverse the input |
Examples :
R1 contains 0xFF05891213450100 (in a 64-bit system)
lsb1 r1,r2 : r2 = 0x9
lsb0 r1,r2 : r2 = 0x1
msb1 r1,r2 : r2 = 0x40
msb0 r1,r2 : r2 = 0x38
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
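The four example results above can be reproduced with the Python sketch below. The function name scan and its keyword arguments are hypothetical. The model follows the values shown in the examples, where the position is always counted from the LSB (bit #0 is position 1), rather than modelling a literal bit reversal in hardware.

```python
def scan(src, negate=False, reverse=False, width=64):
    """Return the 1-based position of the first bit found, or 0 if there is none."""
    mask = (1 << width) - 1
    x = (~src if negate else src) & mask   # -n: look for reset bits instead of set bits
    if x == 0:
        return 0
    if reverse:
        return x.bit_length()              # -r: first hit when starting from the MSB
    return (x & -x).bit_length()           # first hit when starting from the LSB
```

Here x & -x isolates the lowest set bit, and int.bit_length() turns a single set bit into its 1-based position.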
CoMPare for Lower
cmpl r3, r2, r1 scmpl r3, r2, r1 |
Compares the two source operands and sets or clears the destination register according to the result. This operation is performed in the Increment Unit, so no subtraction is required and it is faster for large data. To compare for greater, simply swap the source operands or negate the result of CMPLE. For now, the comparison is valid only for unsigned values.
Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_CMPL | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
scmpl.b r1,r2,r3 : r3 = 0x00000000000000FF
scmpl.b r2,r1,r3 : r3 = 0x000000FF00000000
cmpl r1,r2,r3 : r3 = 0x0000000000000000
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
CoMPare for Lower or Equal
cmple r3, r2, r1 scmple r3, r2, r1 |
Compares the two source operands and sets or clears the destination register according to the result. This operation is performed in the Increment Unit, so no subtraction is required and it is faster for large data. To compare for greater or equal, simply swap the source operands or negate the result of CMPL. For now, the comparison is valid only for unsigned values.
Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_CMPLE | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
scmple.b r1,r2,r3 : r3 = 0xFFFFFF00FFFFFFFF
scmple.b r2,r1,r3 : r3 = 0xFFFFFFFFFFFFFF00
cmple r1,r2,r3 : r3 = 0x0000000000000000
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
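The lane masks shown in the compare examples can be reproduced with the Python sketch below. The name simd_cmp is hypothetical. Judging from the examples, a lane is set to all ones when the second listed operand is lower than (or lower than or equal to) the first one; this operand order is an assumption read off the example values.

```python
import operator


def simd_cmp(pred, first, second, bits=8, width=64):
    """Set a lane to all ones where pred(second, first) holds, else clear it."""
    mask = (1 << bits) - 1
    out = 0
    for shift in range(0, width, bits):
        if pred((second >> shift) & mask, (first >> shift) & mask):
            out |= mask << shift
    return out


R1 = 0x0000000500000003
R2 = 0x0000000700000001
lower = simd_cmp(operator.lt, R1, R2)           # byte-wise "lower" mask
lower_or_equal = simd_cmp(operator.le, R1, R2)  # byte-wise "lower or equal" mask
```

Passing bits=64 models the scalar 64-bit form, and swapping the two operands gives the "greater" comparisons mentioned in the text.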
CoMPare for Lower with Immediate
cmpli Imm8, r2, r1 scmpli Imm8, r2, r1 |
Like CMPL, but with an immediate operand (which is not sign-extended): compares the two source operands and sets or clears the destination register according to the result. For now, the comparison is valid only for unsigned values.
Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_CMPLI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
scmpli.b 0x04,r1,r2 : r2 = 0xFFFFFF00FFFFFFFF
cmpli 0x04,r1,r2 : r2 = 0x0000000000000000
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
CoMPare for Lower or Equal with Immediate
cmplei Imm8, r2, r1 scmplei Imm8, r2, r1 |
Like CMPLE, but with an immediate operand (which is not sign-extended): compares the two source operands and sets or clears the destination register according to the result. For now, the comparison is valid only for unsigned values.
Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_CMPLEI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
scmplei.b 0x04,r1,r2 : r2 = 0xFFFFFF00FFFFFFFF
cmplei 0x04,r1,r2 : r2 = 0x0000000000000000
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
ABSolute value
abs r2, r1 sabs r2, r1 |
Computes r1 = (not(r2) + 1) if MSB(r2)==1
This instruction negates the source operand in a special unit that is designed for low latency when large data are processed. If the sign bit (MSB) of the source is set (the number is negative), the negated value is written back to the register set; otherwise (the number is already positive) the negated result is cancelled.
This instruction is designed to work in the two's-complement number system (signed integer numbers) and is not subject to saturation/overflow problems.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_ABS | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0xFF05891213450100 (in a 64-bit system)
sabs.b r1,r2 : r2 = 0x0105771213450100
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
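The sabs.b example can be checked with the following Python sketch. The name simd_abs is hypothetical; the model negates a lane only when its sign bit is set and otherwise passes the source lane through, which is what the example result shows.

```python
def simd_abs(src, bits=8, width=64):
    """Replace every lane whose sign bit is set by its two's-complement negation."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    out = 0
    for shift in range(0, width, bits):
        lane = (src >> shift) & mask
        if lane & sign:                    # negative lane: not(lane) + 1
            lane = (~lane + 1) & mask
        out |= lane << shift
    return out
```

In this model the most negative value of a lane (0x80 for bytes) maps to itself, since its negation does not fit in the lane.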
MAXimum
max r3, r2, r1 smax r3, r2, r1 |
Computes r1 = r3 if ( r2 < r3 ) else r1 = r2
Compares the two source operands and writes the maximum of the two values to the destination register. For now, the comparison is valid only for unsigned values, so this instruction can't be used for IEEE floating point data.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MAX | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
smax.b r1,r2,r3 : r3 = 0x0000000700000003
max r1,r2,r3 : r3 = 0x0000000700000001
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
MINimum
min r3, r2, r1 smin r3, r2, r1 |
Computes r1 = r3 if ( r2 > r3 ) else r1 = r2
Compares the two source operands and writes the minimum of the two values to the destination register. For now, the comparison is valid only for unsigned values, so this instruction can't be used for IEEE floating point data.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MIN | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
smin.b r1,r2,r3 : r3 = 0x0000000500000001
min r1,r2,r3 : r3 = 0x0000000500000003
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
MAXimum Immediate
maxi Imm8, r2, r1 smaxi Imm8, r2, r1 |
Computes r1 = Imm8 if ( r2 < Imm8 ) else r1 = r2
Compares the two source operands and writes the maximum of the two values to the destination register. For now, the comparison is valid only for unsigned values, so this instruction can't be used for IEEE floating point data.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_MAXI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R2 contains 0x0000000500000003 (in a 64-bit system)
smaxi.b 0x04,r2,r3 : r3 = 0x0404040504040404
maxi 0x04,r2,r3 : r3 = 0x0000000500000003
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
MINimum Immediate
mini Imm8, r2, r1 smini Imm8, r2, r1 |
Computes r1 = Imm8 if ( r2 > Imm8 ) else r1 = r2
Compares the two source operands and writes the minimum of the two values to the destination register. For now, the comparison is valid only for unsigned values, so this instruction can't be used for IEEE floating point data.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_MINI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R2 contains 0x0000000500000003 (in a 64-bit system)
smini.b 0x04,r2,r3 : r3 = 0x0000000400000003
mini 0x04,r2,r3 : r3 = 0x0000000000000004
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
SORT
sort r3, r2, r1 ssort r3, r2, r1 |
Computes { r1 = r3 , r1+1 = r2 } if ( r2 > r3 ) else { r1 = r2 , r1+1 = r3 }
Compares the two source operands and writes the minimum of the two values to the destination register and the maximum to destination register+1. The comparison is currently valid only for unsigned values, so this instruction can't be used for IEEE floating point data. This instruction is of the 2r2w form.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SORT | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
ssort.b r1,r2,r3 : r3 = 0x0000000500000001 , r4 = 0x0000000700000003
sort r1,r2,r3 : r3 = 0x0000000500000003 , r4 = 0x0000000700000001
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
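The lane-by-lane behaviour of sort and ssort can be modelled in a few lines of Python. This is only a sketch: the helper names (lanes, pack, sort_pair) are hypothetical, a 64-bit register is assumed, and the lane width is passed explicitly (8 for the .b form, 64 for the plain form). It returns the pair that would be written to r1 and r1+1.

```python
def lanes(value, lane_bits, width=64):
    """Split value into unsigned lanes, least significant lane first."""
    mask = (1 << lane_bits) - 1
    return [(value >> shift) & mask for shift in range(0, width, lane_bits)]

def pack(lane_values, lane_bits):
    """Inverse of lanes(): reassemble the lanes into one integer."""
    result = 0
    for i, lane in enumerate(lane_values):
        result |= lane << (i * lane_bits)
    return result

def sort_pair(a, b, lane_bits=64):
    """Return (minimum, maximum), compared lane by lane as unsigned values."""
    la, lb = lanes(a, lane_bits), lanes(b, lane_bits)
    low = pack([min(x, y) for x, y in zip(la, lb)], lane_bits)
    high = pack([max(x, y) for x, y in zip(la, lb)], lane_bits)
    return low, high
```

Called with the register contents of the examples above, sort_pair(R1, R2, 8) models ssort.b and sort_pair(R1, R2) models the 64-bit sort.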
Lns ADDition
ladd r3, r2, r1 sladd r3, r2, r1 |
Computes r1 = r2 + r3 in the LNS format.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_LADD | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance :
Execution Unit : LNS Unit (not implemented)
Latency : unknown
Throughput : unknown.
Lns SUBtract
lsub r3, r2, r1 slsub r3, r2, r1 |
Computes r1 = r2 - r3 in the LNS format.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_LSUB | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance :
Execution Unit : LNS Unit (not implemented)
Latency : unknown
Throughput : unknown.
Lns to INT conversion
l2int r2, r1 sl2int r2, r1 |
Computes the equivalence of r2 in the LNS format to the integer format.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_L2INT | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | rounding mode |
Performance :
Execution Unit : LNS Unit (not implemented)
Latency : unknown
Throughput : unknown.
INT to Lns conversion
int2l r2, r1 sint2l r2, r1 |
Computes the equivalence of r2 in the integer format to the LNS format.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_INT2L | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | reserved |
Performance :
Execution Unit : LNS Unit (not implemented)
Latency : unknown
Throughput : unknown.
6.2 Bit Shuffling based operations :
SHIFT Left logical
shiftl r3, r2, r1 sshiftl r3, r2, r1 |
Computes r1 = r2 << r3.
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SHIFTL | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
SHIFT Right logical
shiftr r3, r2, r1 sshiftr r3, r2, r1 |
Computes r1 = r2 >> r3
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SHIFTR | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
SHIFT Right Arithmetic
shiftra r3, r2, r1 sshiftra r3, r2, r1 |
Computes r1 = r2 >> r3, preserving the sign.
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SHIFTRA | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
ROTation Left
rotl r3, r2, r1 srotl r3, r2, r1 |
Computes r1 = r2 <@ r3
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_ROTL | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
ROTation Right
rotr r3, r2, r1 srotr r3, r2, r1 |
Computes r1 = r2 @> r3
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_ROTR | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
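The five core shift and rotate operations can be sketched in Python as follows. The sketch assumes a 64-bit, non-SIMD register and assumes that "truncated to the number of bits needed" means that only the low 6 bits of the count are used; the function names simply mirror the mnemonics and take the count first, like the assembly syntax.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1
AMOUNT_MASK = WIDTH - 1  # assumption: only the low 6 bits of the count are used

def shiftl(count, value):
    """Logical shift left; bits shifted out are lost."""
    return (value << (count & AMOUNT_MASK)) & MASK

def shiftr(count, value):
    """Logical shift right; zeros enter from the top."""
    return (value & MASK) >> (count & AMOUNT_MASK)

def shiftra(count, value):
    """Arithmetic shift right; the sign bit is replicated."""
    n = count & AMOUNT_MASK
    value &= MASK
    signed = value - (1 << WIDTH) if value >> (WIDTH - 1) else value
    return (signed >> n) & MASK

def rotl(count, value):
    """Rotate left; bits leaving the top re-enter at the bottom."""
    n = count & AMOUNT_MASK
    value &= MASK
    return ((value << n) | (value >> (WIDTH - n))) & MASK

def rotr(count, value):
    """Rotate right, expressed as a left rotation by the complement."""
    return rotl(WIDTH - (count & AMOUNT_MASK), value)
```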
SHIFT Left Immediate
shiftli Imm8, r2, r1 sshiftli Imm8, r2, r1 |
Computes r1 = r2 << Imm8
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_SHIFTLI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
SHIFT Right Immediate logic
shiftri Imm8, r2, r1 sshiftri Imm8, r2, r1 |
Computes r1 = r2 >> Imm8
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_SHIFTRI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
SHIFT Right Arithmetic Immediate
shiftrai Imm8, r2, r1 sshiftrai Imm8, r2, r1 |
Computes r1 = r2 >> Imm8, preserving the sign
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_SHIFTRAI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
ROTate Left Immediate
rotli Imm8, r2, r1 srotli Imm8, r2, r1 |
Computes r1 = r2 <@ Imm8
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_ROTLI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
ROTate Right Immediate
rotri Imm8, r2, r1 srotri Imm8, r2, r1 |
Computes r1 = r2 @> Imm8
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_ROTRI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
single BIT OPeration
bitop[x/s/c/t] r3, r2, r1 sbitop[x/s/c/t] r3, r2, r1 bchg r3, r2, r1 bset r3, r2, r1 bclr r3, r2, r1 btst r3, r2, r1 sbchg r3, r2, r1 sbset r3, r2, r1 sbclr r3, r2, r1 sbtst r3, r2, r1 |
Computes r1 = F(function, r2, 1 << r3)
In the shifter, a 1 is shifted left r3 times and combined with the second operand (r2) according to the function F defined below :
Function number : | Logical function : | Operation : | Opcode : |
00 | OR | Bit Set | bset or bitops |
01 | ANDN | Bit Clear | bclr or bitopc |
10 | XOR | Bit Change | bchg or bitopx |
11 | AND | Bit Mask | btst or bitopt |
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_BITOP | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12-13 | x, c, t or s | 00-11 | F |
Example :
R1 contains 0x08
R2 contains 0xFF05891213450100 (in a 64-bit system)
bchg r1,r2,r3 : r3 = 0xFF05891213450000
bset r1,r2,r3 : r3 = 0xFF05891213450100
bclr r1,r2,r3 : r3 = 0xFF05891213450000
btst r1,r2,r3 : r3 = 0x0000000000000100
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
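As an illustration, the function table above can be modelled in Python. This is a sketch under assumptions: a 64-bit register, a bit position reduced to 6 bits, and hypothetical constant names for the four function numbers.

```python
MASK64 = (1 << 64) - 1

# Function numbers taken from the table above.
BIT_SET, BIT_CLEAR, BIT_CHANGE, BIT_TEST = 0b00, 0b01, 0b10, 0b11

def bitop(function, position, value):
    """Combine value with the single bit (1 << position) according to function."""
    bit = 1 << (position & 63)  # assumption: 64-bit register, 6-bit position
    if function == BIT_SET:      # OR
        return (value | bit) & MASK64
    if function == BIT_CLEAR:    # ANDN
        return value & ~bit & MASK64
    if function == BIT_CHANGE:   # XOR
        return (value ^ bit) & MASK64
    if function == BIT_TEST:     # AND
        return value & bit
    raise ValueError("function must be in 0..3")
```

With position 0x08 and the value 0xFF05891213450100, the four functions give the same results as the bchg, bset, bclr and btst lines of the example.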
single BIT OPeration Immediate
bitop[x/s/c/t]i Imm6, r2, r1 sbitop[x/s/c/t]i Imm6, r2, r1 bchgi Imm6, r2, r1 bseti Imm6, r2, r1 bclri Imm6, r2, r1 btsti Imm6, r2, r1 sbchgi Imm6, r2, r1 sbseti Imm6, r2, r1 sbclri Imm6, r2, r1 sbtsti Imm6, r2, r1 |
Computes r1 = F(function, r2, 1 << Imm6)
In the shifter, a 1 is shifted left Imm6 times and combined with the second operand (r2) according to the function F defined below :
F : | Logical function : | Operation : | Opcode : |
00 | OR | Bit Set | bseti or bitopsi |
01 | ANDN | Bit Clear | bclri or bitopci |
10 | XOR | Bit Change | bchgi or bitopxi |
11 | AND | Bit Mask | btsti or bitopti |
The value of Imm6 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 2 | 6 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 13 | 14 19 | 20 25 | 26 31 |
function : | OP_BITOPI | Flags | F | Imm6 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12-13 | x, c, t or s | 00-11 | F |
Example :
R2 contains 0xFF05891213450100 (in a 64-bit system)
bchgi 0x08,r2,r3 : r3 = 0xFF05891213450000
bseti 0x08,r2,r3 : r3 = 0xFF05891213450100
bclri 0x08,r2,r3 : r3 = 0xFF05891213450000
btsti 0x08,r2,r3 : r3 = 0x0000000000000100
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
BIT REVerse
bitrev r3, r2, r1 bitrevo r3, r2, r1 |
Computes r1 = bit_reverse(r2) >> (size-r3)
or r1+1 = r1 | ( bit_reverse(r2) >> (size-r3) )
R2 is first bit-reversed then shifted right size - r3 times.
If the -o flag is set, the result is combined by an OR with the content of r1 before it is written back to r1+1. This instruction is used to compute pointer updates in butterfly data structures, where r3 is the log2 of the size of the structure, r2 is the current index in the structure (always less than 2^r3) and r1 is the base pointer. It is a 3r1w instruction form and therefore optional.
The value of r3 is truncated to the number of bits needed by the bit shuffler unit. Because it is aimed at pointer manipulation, the SIMD flag is not used. When a base address is used in conjunction with the -o flag, take care to align the base address to a boundary at least as large as the size of the data structure (just in case you weren't aware). If the butterfly buffer is not aligned, an addition must be performed and the bitrev instruction must be used instead. The alignment of the final data is ensured by limiting the index : for example, a 256-byte buffer with 32-bit words requires that the index is between 0 and 63, so the final 2 LSB are always cleared.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_BITREV | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | -o postfix | 1 if set | OR the result with the destination |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0x08 (a 256-byte buffer)
R2 contains 0x48 (the current index)
R3 contains 0xFF05891213450100 (the buffer base address)
bitrev r1,r2,r3 : r3 = 0x12
bitrevo r1,r2,r3 : r4 = 0xFF05891213450112
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
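The formula r1 = bit_reverse(r2) >> (size - r3) can be checked with a short Python model. This is a sketch assuming size = 64; the names bit_reverse, bitrev and bitrevo mirror the text, but the Python signatures are hypothetical.

```python
WIDTH = 64  # assumption: 64-bit registers

def bit_reverse(value, width=WIDTH):
    """Reverse the order of the low `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def bitrev(log2_size, index):
    """Bit-reversed index inside a structure of 2**log2_size entries."""
    return bit_reverse(index) >> (WIDTH - log2_size)

def bitrevo(log2_size, index, base):
    """Same as bitrev, ORed with an aligned base pointer (the -o form)."""
    return base | bitrev(log2_size, index)
```

For an 8-entry structure, [bitrev(3, i) for i in range(8)] yields 0, 4, 2, 6, 1, 5, 3, 7, which is the usual butterfly access order. The OR in bitrevo only behaves like an addition when the base is aligned as described above.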
BIT REVerse Immediate
bitrevi Imm8, r2, r1 bitrevio Imm8, r2, r1 |
Computes r1 = bit_reverse(r2) >> (size-Imm8)
or r1+1 = r1 | ( bit_reverse(r2) >> (size-Imm8) )
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_BITREVI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | -o postfix | 1 if set | OR the result with the destination |
11 | (none yet) | 0 | Reserved |
Example :
R2 contains 0x48 (the current index)
R3 contains 0xFF05891213450100 (the buffer base address)
bitrevi 0x08,r2,r3 : r3 = 0x12
bitrevio 0x08,r2,r3 : r4 = 0xFF05891213450112
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
BYTe REVerse
byterev r2, r1 sbyterev r2, r1 |
Changes the endianness of r2 and puts the result into r1.
Not all versions of the F-CPU may support dual-endianness in the Load/Store unit, and software may simply require internal operations of this kind. This instruction is optional for minimal systems, yet useful in communication software. Remark : byterev.b has no use :-)
size : | 8 | 3 | 9 | 6 | 6 |
bits : | 0 7 | 8 10 | 11 19 | 20 25 | 26 31 |
function : | OP_BYTEREV | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size flag | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
Examples :
R2 contains 0xFF05891213450100 (in a 64-bit system)
byterev.d r2,r3 : r3 = 0xFF05891213450001
byterev.q r2,r4 : r4 = 0xFF05891200014513
sbyterev.d r2,r3 : r3 = 0x05FF128945130001
sbyterev.q r2,r4 : r4 = 0x128905FF00014513
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
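The examples above can be reproduced with the following Python sketch. It assumes a 64-bit register and, as the examples suggest, that the non-SIMD form reverses only the lowest chunk and leaves the upper bytes unchanged; the function signature is hypothetical (chunk_bytes is 2 for .d and 4 for .q).

```python
def byterev(value, chunk_bytes=8, simd=False, width_bytes=8):
    """Reverse the byte order of the lowest chunk, or of every chunk if simd."""
    data = value.to_bytes(width_bytes, "little")  # data[0] is the least significant byte
    if simd:
        chunks = [data[i:i + chunk_bytes][::-1]
                  for i in range(0, width_bytes, chunk_bytes)]
        out = b"".join(chunks)
    else:
        # Assumption taken from the examples: the upper bytes pass through unchanged.
        out = data[:chunk_bytes][::-1] + data[chunk_bytes:]
    return int.from_bytes(out, "little")
```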
MIX
mixl r3, r2, r1 mixh r3, r2, r1 |
Mix two halves of r3 and r2 and puts the result into r1.
Depending on the h flag, the lower or higher part of r3 and r2 are interleaved. The size of the source chunks is determined by the size flags. This instruction is useful to interleave words in a "butterfly" fashion or reverse a little matrix. Or simply it can be used to create an extended form of the result of an addition with carry.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MIX | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size flag | |
10-11 | (none yet) | 0 | Reserved |
12 | -l or -h postfix | 0 for -l 1 for -h | High flag |
13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0001020304050607 (in a 64-bit system)
R2 contains 0x08090A0B0C0D0E0F
mixl.d r1,r2,r3 : r3 = 0x04050C0D06070E0F
mixh.d r1,r2,r4 : r4 = 0x0001080902030A0B
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
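A possible Python model of the interleaving, inferred from the two examples above rather than from a formal definition: within the selected half, chunks of the first operand land in the upper position of each pair and chunks of the second operand in the lower position. A 64-bit register is assumed and the signature is hypothetical (chunk_bits is 16 for .d).

```python
def mix(first, second, chunk_bits, high=False, width=64):
    """Interleave the chunks of the low (or high) halves of two registers."""
    mask = (1 << chunk_bits) - 1
    per_half = width // (2 * chunk_bits)   # number of chunks in one half
    start = per_half if high else 0        # index of the first chunk to read
    result = 0
    for i in range(per_half):
        shift = (start + i) * chunk_bits
        a = (first >> shift) & mask
        b = (second >> shift) & mask
        result |= b << (2 * i * chunk_bits)        # second operand: lower slot
        result |= a << ((2 * i + 1) * chunk_bits)  # first operand: upper slot
    return result
```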
EXPAND
expandl r3, r2, r1 expandh r3, r2, r1 |
Mix chunks of r3 and r2 and puts the result into two halves of r1.
This is the reverse operation of the mix instruction. Depending on the h flag, the lower or higher part of r3 and r2 are interleaved. The size of the source chunks is determined by the size flags.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_EXPAND | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size flag | |
10-11 | (none yet) | 0 | Reserved |
12 | -l or -h postfix | 0 for -l 1 for -h | High flag |
13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0001020304050607 (in a 64-bit system)
R2 contains 0x08090A0B0C0D0E0F
expandl.b r1,r2,r3 : r3 = 0x09010B030D050F07
expandh.b r1,r2,r4 : r4 = 0x08000A020C040E06
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
Simd DUPlication
sdup r2, r1 |
Duplicates the lower part of r2 and puts the result in r1. The size of the destination SIMD chunks is determined by the size flags.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SDUP | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size flag | |
10-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0001020304050607 (in a 64-bit system)
sdup.b r1,r2 : r2 = 0x0707070707070707
sdup.d r1,r3 : r3 = 0x0607060706070607
sdup.q r1,r4 : r4 = 0x0405060704050607
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
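In Python, the duplication can be sketched as below, assuming a 64-bit register; the function signature is hypothetical (chunk_bits is 8 for .b, 16 for .d and 32 for .q).

```python
def sdup(value, chunk_bits, width=64):
    """Replicate the lowest chunk of value across the whole register."""
    chunk = value & ((1 << chunk_bits) - 1)
    result = 0
    for shift in range(0, width, chunk_bits):
        result |= chunk << shift
    return result
```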
6.3 Logic operations :
bitwise LOGIC
logic.xxxx r1, r2, r3 or r1, r2, r3 orn r1, r2, r3 and r1, r2, r3 andn r1, r2, r3 xor r1, r2, r3 nxor r1, r2, r3 not r1, r2, r3 nor r1, r2, r3 nand r1, r2, r3 |
Computes r3 = f(r1,r2) where f is a logic function whose truth table is defined in the flags.
Remark : XOR should be used to compare two numbers for equality, instead of sub.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_LOGIC | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Values | Function |
8-9 | [qdb] | Size flags |
10 | [01] | f(0,0) |
11 | [01] | f(1,0) |
12 | [01] | f(0,1) |
13 | [01] | f(1,1) |
or is an alias for logic.0111
and is an alias for logic.0001
xor is an alias for logic.0110
not is an alias for logic.1010
nor is an alias for logic.1000
nand is an alias for logic.1110
Execution Unit : ROP2 Unit
Latency : 1 cycle
Throughput : 1 result per cycle per ROP2.
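The way the four flag bits act as a truth table can be illustrated in Python. This sketch assumes 64-bit operands and reads the four digits of logic.xxxx in the order of flag bits 10 to 13, that is f(0,0), f(1,0), f(0,1), f(1,1), where the first argument of f is the first operand; only the aliases listed above are included.

```python
MASK64 = (1 << 64) - 1

ALIASES = {"or": "0111", "and": "0001", "xor": "0110",
           "not": "1010", "nor": "1000", "nand": "1110"}

def logic(table, a, b, width_mask=MASK64):
    """Apply the 4-entry truth table to every bit position of a and b."""
    f00, f10, f01, f11 = (int(c) for c in table)
    result = 0
    if f00:
        result |= ~a & ~b   # positions where a = 0 and b = 0
    if f10:
        result |= a & ~b    # positions where a = 1 and b = 0
    if f01:
        result |= ~a & b    # positions where a = 0 and b = 1
    if f11:
        result |= a & b     # positions where a = 1 and b = 1
    return result & width_mask
```

For example, logic(ALIASES["xor"], x, y) returns 0 exactly when x and y are equal, which is the equality test mentioned in the remark above.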
bitwise LOGIC Immediate
logici.xxxx Imm8, r2, r3 andi Imm8, r2, r3 andni Imm8, r2, r3 ori Imm8, r2, r3 xori Imm8, r2, r3 |
Computes r3 = f(Imm8,r2) where f is a logic function whose truth table is defined in the flags.
Because there is less room than in the register form of the instruction, the logic functions are reduced to 4. I have chosen to use the same logic functions as in the bitop instructions. Yet, the SIMD flag is cruelly missing. The function could maybe be included in the opcode.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_LOGICI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Values | Function |
8-9 | [qdb] | Size flags |
10-11 | [xtcs] | logic function |
ori is an alias for logici.s
andi is an alias for logici.t
xori is an alias for logici.x
andni is an alias for logici.c
Execution Unit : ROP2 Unit
Latency : 1 cycle
Throughput : 1 result per cycle per ROP2.
6.4 Floating Point Operations :
Instruction \ Level | 0 | 1 | 2 | 3 |
fadd | | * | * | * |
fsub | | * | * | * |
fmul | | * | * | * |
int2f/f2int | | * | * | * |
fiaprx, fsqrtiaprx | | * | * | * |
fdiv, fsqrt | | | * | * |
flog | | | | * |
fexp | | | | * |
fmac | | | | * |
faddsub | | | | * |
The FP level of a CPU should be read in the associated Special Register before attempting to execute FP instructions.
6.4.1 Level 1 Floating Point Operations :
6.4.1.1 fadd :
Floating point ADDition
fadd r3, r2, r1 sfadd r3, r2, r1 faddx r3, r2, r1 sfaddx r3, r2, r1 |
Computes r1 = r2 + r3 in IEEE-754 compliant format.
fadd performs a floating point addition of the two source operands (r2 + r3) and puts the result in the destination operand (r1). The operation should be compliant with the IEEE-754 format.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FADD | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
12-13 | (none yet) | 0 | Reserved |
Floating point SUBtraction
fsub r3, r2, r1 sfsub r3, r2, r1 fsubx r3, r2, r1 sfsubx r3, r2, r1 |
Computes r1 = r2 - r3 in IEEE-754 compliant format.
fsub performs a floating point subtraction of the two source operands (r2 - r3) and puts the result in the destination operand (r1). The operation should be compliant with the IEEE-754 format.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FSUB | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
12-13 | (none yet) | 0 | Reserved |
Floating point MULtiplication
fmul r3, r2, r1 sfmul r3, r2, r1 fmulx r3, r2, r1 sfmulx r3, r2, r1 |
Computes r1 = r2 x r3 in IEEE-754 compliant format.
fmul performs a floating point multiplication of the two source operands (r2 x r3) and puts the result in the destination operand (r1). The operation should be compliant with the IEEE-754 format.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FMUL | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
12-13 | (none yet) | 0 | Reserved |
Floating point to INTeger conversion
f2int r2, r1 sf2int r2, r1 f2intx r2, r1 sf2intx r2, r1 |
f2int converts a floating point number in register r2 into an integer number, according to the mode flags, and puts it in register r1.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_F2INT | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
12-13 | see below | Rounding modes |
Rounding modes:
Value | Rounding mode |
00 | Nearest (default) |
01 | Towards 0 |
10 | Towards -infinity |
11 | Towards +infinity |
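The four rounding modes can be illustrated with Python's standard functions. This is only a sketch: the mode encoding follows the table above, "Nearest" is assumed to mean round-half-to-even (which is what Python's round does), and the Python function f2int is a hypothetical stand-in that ignores operand size and overflow.

```python
import math

ROUNDING = {
    0b00: round,        # nearest (ties go to the even integer)
    0b01: math.trunc,   # towards 0
    0b10: math.floor,   # towards -infinity
    0b11: math.ceil,    # towards +infinity
}

def f2int(value, mode=0b00):
    """Convert a float to an integer using the selected rounding mode."""
    return ROUNDING[mode](value)
```

The modes differ only for non-integral inputs: for -2.7, modes 01 and 11 both give -2, while modes 00 and 10 give -3.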
INTeger to Floating point conversion
int2f r2, r1 sint2f r2, r1 int2fx r2, r1 sint2fx r2, r1 |
int2f converts an integer number in register r2 into a floating point number and puts it in register r1.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_INT2F | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
fiaprx r2, r1 fiaprxx r2, r1 sfiaprx r2, r1 sfiaprxx r2, r1 |
fiaprx approximates the inverse of r2 (1/r2) with the help of a hardwired lookup table and puts the result into r1. This operation is used at the beginning of a Newton-Raphson algorithm to compute a division. The accuracy of the lookup table depends on the application, and the number of NR iterations also depends on the desired accuracy and the size of the FP number.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_FIAPRX | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
fsqrtiaprx r2, r1 fsqrtiaprxx r2, r1 sfsqrtiaprx r2, r1 sfsqrtiaprxx r2, r1 |
fsqrtiaprx approximates the inverse of the square root of r2 (1/√r2) with the help of a hardwired lookup table and puts the result into r1. This operation is used at the beginning of a Newton-Raphson algorithm to compute a square root. The accuracy of the lookup table depends on the application, and the number of NR iterations also depends on the desired accuracy and the size of the FP number.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_FSQRTIAPRX | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
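The way these approximations seed a Newton-Raphson iteration can be sketched in Python. Everything here is an assumption made for illustration: crude_seed is a hypothetical stand-in for the hardwired lookup table (it keeps only 4 mantissa bits of the operand), the inputs are positive, and the iteration counts are chosen for double precision rather than taken from the F-CPU design.

```python
import math

def crude_seed(x, bits=4):
    """Stand-in for a small lookup table: keep only `bits` mantissa bits of x."""
    mantissa, exponent = math.frexp(x)   # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def reciprocal(d, iterations=4):
    """Approximate 1/d from a rough seed, as after a fiaprx-like lookup."""
    x = 1.0 / crude_seed(d)              # rough estimate of 1/d
    for _ in range(iterations):
        x = x * (2.0 - d * x)            # Newton-Raphson step for f(x) = 1/x - d
    return x

def sqrt_via_inverse(a, iterations=4):
    """Approximate sqrt(a) from a rough 1/sqrt(a), as after fsqrtiaprx."""
    y = 1.0 / math.sqrt(crude_seed(a))   # rough estimate of 1/sqrt(a)
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * a * y * y)  # Newton-Raphson step for f(y) = 1/y**2 - a
    return a * y                         # sqrt(a) = a * (1/sqrt(a))
```

Each step roughly doubles the number of correct bits, which is why a more accurate table allows fewer iterations. A division n/d is then obtained as n * reciprocal(d).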
6.4.2 Level 2 Floating Point Operations :
6.4.2.1 fdiv
Floating point Division
fdiv r3, r2, r1 fdivx r3, r2, r1 sfdiv r3, r2, r1 sfdivx r3, r2, r1 |
fdiv performs a floating division of the two source operands (r3 / r2) and puts the result in destination operand (r1). The operation should be IEEE-754 compliant.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FDIV | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
12-13 | (none yet) | 0 | Reserved |
fsqrt r2, r1 fsqrtx r2, r1 sfsqrt r2, r1 sfsqrtx r2, r1 |
fsqrt performs a floating point square root of the source operand (√r2) and puts the result in the destination operand (r1). The operation should be IEEE-754 compliant.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_FSQRT | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
6.4.3 Level 3 Floating Point Operations :
6.4.3.1 flog
Floating point LOGarithm
flog r3, r2, r1 flogx r3, r2, r1 sflog r3, r2, r1 sflogx r3, r2, r1 |
Computes r1 = log_r3(r2), the logarithm of r2 in base r3
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FLOG | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
12-13 | (none yet) | 0 | Reserved |
fexp r3, r2, r1 fexpx r3, r2, r1 sfexp r3, r2, r1 sfexpx r3, r2, r1 |
Computes r1 = exp_r3(r2), the exponential of r2 in base r3
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FEXP | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
12-13 | (none yet) | 0 | Reserved |
fmac r3, r2, r1 fmacx r3, r2, r1 sfmac r3, r2, r1 sfmacx r3, r2, r1 |
Computes r1 = r1 + ( r2 x r3 )
fmac performs a floating point multiplication of the two source operands (r2 x r3) and adds the result to the destination operand (r1). The operation should be IEEE-754 compliant.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FMAC | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
12-13 | (none yet) | 0 | Reserved |
faddsub r3, r2, r1 faddsubx r3, r2, r1 sfaddsub r3, r2, r1 sfaddsubx r3, r2, r1 |
Computes r1 = r3 + r2 and r1+1 = r3 - r2
faddsub is a 2r2w instruction that performs both the floating point addition and the subtraction of the two operands in IEEE-754 format.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FADDSUB | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .[??] postfix | 00 : 32-bit FP 01 : 64-bit FP | Defines the size parameter |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | -x postfix | 1 to skip the tests | IEEE compliance flag |
12-13 | (none yet) | 0 | Reserved |
6.5 Memory Access operations :
LOAD a memory item into a register and adjust the Endianness
load r2, r1 loade r2, r1 |
Performs r1 = endian(e,mem[r2]).
LOAD fetches the memory item pointed to by r2, changes the endianness according to the endian flag, and puts the result of the specified size into r1.
This instruction can trigger two exceptions (in order of decreasing priority) :
Prefetch :
In the case where the destination register is r0 (the NULL register), none of these
exceptions are raised. This instruction form serves as a prefetch instruction
that is issued several cycles before the actual reference is performed. The prefetch
form prepares the memory hierarchy, the protection mechanisms and all the internal
hidden flags for a possible exception. The CPU can use the time between the prefetch
and the actual fetch to prepare the page fault handler and the memory hierarchy
so that the actual fetch will have almost no latency, whether there is a fault or not.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_LOAD | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size Flag | |
10 | -e postfix | 0 : little endian 1 : big endian | Endian Flag |
11-13 | -0 .. -7 postfix | 000 .. 111 | Reserved for the Stream Hint bits |
Performance (FC0 only) :
Execution Unit : Load/Store Unit
Latency : 2 cycles if the item is already in the memory buffer, undetermined
(but more) otherwise.
Throughput : 1 operation per cycle per LSU (peak).
adjust the Endianness and STORE the result in memory
store r2, r1 storee r2, r1 |
Performs mem[r2] = endian(e,r1).
STORE adjusts the endianness of r1 according to the Endian flag and stores the item of the defined size to memory, at the location pointed to by r2.
This instruction can trigger two exceptions (in order of decreasing priority) :
The L/S Unit of the FC0 can perform the store operation with no latency for the entire pipeline when there is a free line in the memory buffer. If there are too many pending memory access requests, the pipeline must wait at the decoding stage for a memory buffer line to be freed.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_STORE | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size Flag | |
10 | -e postfix | 0 : little endian 1 : big endian | Endian Flag |
11-13 | -0 .. -7 postfix | 000 .. 111 | Reserved for the Stream Hint bits |
Performance (FC0 only) :
Execution Unit : Load/Store Unit
Latency : none if the memory buffer has a free slot, undetermined
(but more) otherwise.
Throughput : 1 operation per cycle per LSU (peak).
LOAD a memory item into a register, adjust the Endianness and update the pointer
load r3, r2, r1 loade r3, r2, r1 |
Performs : r1 = endian(e,mem[r2])
r2 = r2 + r3
LOAD fetches the memory item pointed to by r2, changes the endianness according to the endian flag, puts the result of the specified size into r1. This version uses the same opcode as the core version but differs by the r3 parameter which makes it a 2r2w instruction. In addition to the core version, the r3 parameter is used to update the r2 pointer by adding them in parallel with the memory operation. Note that if r3 contains 0, the core version is executed : the CPU checks the zero flags, instead of checking the register number.
This instruction can trigger two exceptions (in order of decreasing priority) :
Prefetch :
In the case where the destination register is r0 (the NULL register), none of these
exceptions are raised. This instruction form serves as a prefetch instruction
that is issued several cycles before the actual reference is performed. The prefetch
form prepares the memory hierarchy, the protection mechanisms and all the internal
hidden flags for a possible exception. The CPU can use the time between the prefetch
and the actual fetch to prepare the page fault handler and the memory hierarchy
so that the actual fetch will have almost no latency, whether there is a fault or not.
The behaviour of the pointer update obeys the simplest arithmetic rules. No saturation is performed and the pointer will wrap around in memory.
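The combined load and pointer update can be sketched in C as follows. This is only an illustrative model, not the FC0 implementation: the function name `load_update`, the flat byte array standing in for memory and the `size` parameter expressed in bytes are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical model of "load r3, r2, r1" on a flat, byte-addressed memory.
   size is the item size in bytes (1, 2, 4 or 8), big_endian is the Endian flag.
   Returns the value written to r1 and updates the pointer *r2 in place. */
static uint64_t load_update(const uint8_t *mem, uint64_t *r2, uint64_t r3,
                            unsigned size, int big_endian)
{
    uint64_t value = 0;
    for (unsigned i = 0; i < size; i++) {
        uint8_t byte = mem[*r2 + i];
        /* place each byte according to the requested endianness */
        unsigned shift = big_endian ? 8 * (size - 1 - i) : 8 * i;
        value |= (uint64_t)byte << shift;
    }
    *r2 += r3;  /* unsigned addition: wraps around, never saturates */
    return value;
}
```

A negative step is simply a large unsigned value in r3, so stepping backwards relies on the same wrap-around addition.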
After the addition is performed, the result will be submitted to the DTLB (Data virtual address Translation Lookaside Buffer) to check for the pointer validity in advance. As soon as the physical address is known, the processor can also prefetch the data if necessary, issuing a fetch command to the cache or the external memory. At the same time, the processor can check the sign of r3 in order to predict in which direction the pointer advances and prepare the memory buffer.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_LOAD | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size Flag | |
10 | -e postfix | 0 : little endian 1 : big endian | Endian Flag |
11-13 | -0 .. -7 postfix | 000 .. 111 | Reserved for the Stream Hint bits |
Performance (FC0 only) :
Execution Unit : Load/Store Unit and Add/Sub Unit.
Latency : 2 cycles if the item is already in the memory buffer, undetermined
(but more) otherwise. The pointer update takes three cycles (2 ASU + 1 DTLB).
Throughput : 1 operation per cycle per LSU (peak).
adjust the Endianness, STORE the result in memory and update the pointer
store r3, r2, r1 storee r3, r2, r1 |
Performs mem[r2] = endian(e,r1)
r2 = r2 + r3.
STORE adjusts the endianness of r1 according to the Endian flag and stores the item of the defined size to memory, at the location pointed to by r2. This version uses the same opcode as the core version but differs by the r3 parameter which makes it a 3r1w instruction. In addition to the core version, the r3 parameter is used to update the r2 pointer by adding them in parallel with the memory operation. Note that if r3 contains 0, the core version is executed : the CPU checks the zero flags, instead of checking the register number.
This instruction can trigger two exceptions (in order of decreasing priority) :
The L/S Unit of the FC0 can perform the store operation with no latency for the entire pipeline when there is a free line in the memory buffer. If there are too many pending memory access requests, the pipeline must wait at the decoding stage for a memory buffer line to be freed.
The behaviour of the pointer update obeys the simplest arithmetic rules. No saturation is performed and the pointer will wrap around in memory.
After the addition is performed, the result will be submitted to the DTLB (Data virtual address Translation Lookaside Buffer) to check for the pointer validity in advance. At the same time, the processor can check the sign of r3 in order to predict in which direction the pointer advances and prepare the memory buffer.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_STORE | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size Flag | |
10 | -e postfix | 0 : little endian 1 : big endian | Endian Flag |
11-13 | -0 .. -7 postfix | 000 .. 111 | Reserved for the Stream Hint bits |
Performance (FC0 only) :
Execution Unit : Load/Store Unit and Add/Sub Unit.
Latency : 2 cycles if the item is already in the memory buffer, undetermined
(but more) otherwise. The pointer update takes three cycles (2 ASU + 1 DTLB).
Throughput : 1 operation per cycle per LSU (peak).
LOAD a memory item into a register, adjust the Endianness and update the pointer with an Immediate number
loadi Imm8, r2, r1 loadie Imm8, r2, r1 |
Performs r1 = endian(e,mem[r2])
r2 = r2 + Imm8
LOAD fetches the memory item pointed to by r2, changes the endianness according to the endian flag, puts the result of the specified size into r1. Curiously, this is a 1r2w instruction. The Imm8 data is sign-extended with a ninth bit in the instruction word which also serves to predict in which direction the pointer moves.
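The sign extension of Imm8 can be sketched as below. The helper names are hypothetical, and the sketch assumes that the ninth bit acts as the most significant bit of a 9-bit two's complement number, giving offsets from -256 to +255; the text does not spell this out.

```c
#include <stdint.h>

/* Hypothetical helper: combine the sign flag and Imm8 into a signed offset.
   Assumption: the sign flag is the top bit of a 9-bit two's complement value. */
static int32_t imm9_offset(unsigned sign_bit, uint8_t imm8)
{
    return (int32_t)imm8 - (sign_bit ? 256 : 0);
}

/* Pointer update as performed by loadi/storei: r2 = r2 + Imm8 (sign-extended).
   The addition is unsigned, so it wraps around without saturation. */
static uint64_t update_pointer(uint64_t r2, unsigned sign_bit, uint8_t imm8)
{
    return r2 + (uint64_t)(int64_t)imm9_offset(sign_bit, imm8);
}
```

Under this assumption the sign flag alone tells the memory system the direction of the stream, without looking at the eight value bits.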
This instruction can trigger two exceptions (in order of decreasing priority) :
Prefetch :
In the case where the destination register is r0 (the NULL register), none of these
exceptions are raised. This instruction form serves as a prefetch instruction
that is issued several cycles before the actual reference is performed. The prefetch
form prepares the memory hierarchy, the protection mechanisms and all the internal
hidden flags for a possible exception. The CPU can use the time between the prefetch
and the actual fetch to prepare the page fault handler and the memory hierarchy
so that the actual fetch will have almost no latency, whether there is a fault or not.
The behaviour of the pointer update obeys the simplest arithmetic rules. No saturation is performed and the pointer will wrap around in memory.
After the addition is performed, the result will be submitted to the DTLB (Data virtual address Translation Lookaside Buffer) to check for the pointer validity in advance. As soon as the physical address is known, the processor can also prefetch the data if necessary, issuing a fetch command to the cache or the external memory. At the same time, the processor uses the sign bit of Imm8 in order to predict in which direction the pointer advances and prepare the memory buffer.
Because of the width of the immediate data, there is no room to specify the stream hint bits. It is therefore assumed that the processor will "associate" the stream number with the pointer register thanks to a hidden status flag.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_LOADI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size Flag | |
10 | -e postfix | 0 : little endian 1 : big endian | Endian Flag |
11 | Sign bit of Imm8 |
Performance (FC0 only) :
Execution Unit : Load/Store Unit and Add/Sub Unit.
Latency : 2 cycles if the item is already in the memory buffer, undetermined
(but more) otherwise. The pointer update takes three cycles (2 ASU + 1 DTLB).
Throughput : 1 operation per cycle per LSU (peak).
adjust the Endianness, STORE the result in memory and update the pointer with an Immediate number
storei Imm8, r2, r1 storeie Imm8, r2, r1 |
Performs mem[r2] = endian(e,r1)
r2 = r2 + Imm8.
STORE adjusts the endianness of r1 according to the Endian flag and stores the item of the defined size to memory, at the location pointed to by r2 then adds Imm8 to the pointer. This is a 2r1w instruction. The Imm8 data is sign-extended with a ninth bit in the instruction word which also serves to predict in which direction the pointer moves.
This instruction can trigger two exceptions (in order of decreasing priority) :
The L/S Unit of the FC0 can perform the store operation with no latency for the entire pipeline when there is a free line in the memory buffer. If there are too many pending memory access requests, the pipeline must wait at the decoding stage for a memory buffer line to be freed.
The behaviour of the pointer update obeys the simplest arithmetic rules. No saturation is performed and the pointer will wrap around in memory.
After the addition is performed, the result will be submitted to the DTLB (Data virtual address Translation Lookaside Buffer) to check for the pointer validity in advance. At the same time, the processor can check the sign bit of Imm8 in order to predict in which direction the pointer advances and prepare the memory buffer.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_STOREI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size Flag | |
10 | -e postfix | 0 : little endian 1 : big endian | Endian Flag |
11 | Sign bit of Imm8 |
Performance (FC0 only) :
Execution Unit : Load/Store Unit and Add/Sub Unit.
Latency : 2 cycles if the item is already in the memory buffer, undetermined
(but more) otherwise. The pointer update takes three cycles (2 ASU + 1 DTLB).
Throughput : 1 operation per cycle per LSU (peak).
loadf r3, r2, r1 loadfe r3, r2, r1 storef r3, r2, r1 storefe r3, r2, r1 loadif r3, r2, r1 loadife r3, r2, r1 storeif r3, r2, r1 storeife r3, r2, r1 |
These instructions differ from the normal opcodes only by one flag, which does not fit in the flag field (not enough room). This F flag is a hint for the on-chip memory system that influences the caching strategy. F means Flush: the data that is currently being processed (read or written) is not needed anymore, so the CPU doesn't need to keep a copy on-chip. This flag is meant to reduce cache line thrashing whenever possible and to increase the effective memory bandwidth.
More precisely, the semantic behind this flag is : "the data is needed once". This
is achieved inside the CPU by modifying the caching strategy with a cache line
granularity. By default, when the F flag is omitted, the strategy is :
- keep the current line in the memory buffer
- when the line expires, flush it to the internal cache
- when the line expires in cache, flush it to the external memory
When the F flag is used in a load operation, the whole cache line is
retrieved from external memory to the memory buffer. If possible, the succeeding
memory location (it can be the precedent or next memory locations, depending
on the sign of the pointer update) is retrieved. When the content of this second fetch
begins to be used, this frees the first line, which is then used to fetch the third
location. The two memory buffer lines continue this ping-pong as long as the stream
goes on. The cache line is clearly "flushed" but is not written back in memory
because it is not modified.
With the store instruction, the operation doesn't necessarily need to begin with
a fetch from memory. The F flag says that the line is flushed directly to the external
memory instead of going to the internal cache memory.
The behaviour when loading to r0 with the F flag set is undetermined. The semantics don't go together: it would be "prefetch something that will not be used after"... That's what I'd call "wasting time". So stay tuned.
CACHE Memory Management
cachemm r2, r1 |
Controls where a block of data or instructions is cached in the memory hierarchy. The block begins at the location pointed to by r2 and the size of the block is determined by r1.
This instruction should provide a universal way to control the caching mechanism of the FCPU across all the variants that may appear. The instruction may operate on a page or cache line granularity, in an implementation-dependent way. This instruction is purely a hint for the CPU, which may or may not transfer data between different memory levels (that physically exist or not).
The instruction can act in either of these two directions :
- Flush : all the data present in the levels between the CPU and the parameter are flushed
to at most this level. No data in the defined block is left in the above levels.
- Prefetch : loads the data belonging to the block in at least the level defined as
parameter.
In addition, the L flag is used to influence the LRU tags in order to define the importance and the use of the block. L means "Lock" and its absence unlocks the data from the level.
The C flag, when supported, tries to compress the block when it is flushed, or decompress it when it is loaded, with a dedicated hardware.
The status of this instruction could be read from a Special Register. This instruction is very important for memory management, and should be used when performing SMC or DMA for memory coherency. The OS can also lock the main TLB tables and the critical codes so that TLB replacement doesn't thrash the cache.
size : | 8 | 8 | 4 | 6 | 6 |
bits : | 0 7 | 8 15 | 16 19 | 20 25 | 26 31 |
function : | OP_CACHEMM | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size Flag | |
10 | -f postfix -p postfix | 0 : Flush 1 : Prefetch | Direction flag |
11 | [l] | Lock. The data will be used a lot | |
12 | [c] | De/Compress data on the fly | |
13-15 | [0-7] | Memory level (see table below) |
D | 000 | onchip Data L1 cache |
I | 001 | onchip Instructions L1 cache |
C | 010 | onchip unified Cache |
011 | [unused] | |
U | 100 | offchip Unified cache |
L | 101 | offchip Local memory |
G | 110 | offchip Global memory |
V | 111 | Virtual memory (hard disk) |
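An assembler could map the level letters of this table to their 3-bit codes with a small lookup such as the one below. The function name is hypothetical and the sketch covers only the table itself, not the position of the field in the instruction word.

```c
/* Hypothetical lookup: memory level letter to its 3-bit code.
   Returns -1 for a letter that is not in the table (code 011 is unused). */
static int cachemm_level(char letter)
{
    switch (letter) {
    case 'D': return 0;  /* onchip Data L1 cache */
    case 'I': return 1;  /* onchip Instructions L1 cache */
    case 'C': return 2;  /* onchip unified Cache */
    case 'U': return 4;  /* offchip Unified cache */
    case 'L': return 5;  /* offchip Local memory */
    case 'G': return 6;  /* offchip Global memory */
    case 'V': return 7;  /* Virtual memory (hard disk) */
    default:  return -1;
    }
}
```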
Examples :
cachemmfg ra,rb flushes rb bytes starting at address ra from every memory level down to global memory. Any cache (L1, L2, local...) containing data that belongs to the block is written back to main memory and the corresponding cache space is freed (available for future use). This should be executed every time the programmer knows that a block of data won't be used for a while; the cache level is a hint for performance.
cachemmpu ra,rb copies the data block at address ra and size rb that is present in lower memory levels (virtual, global, local) to the unified offchip memory ("at least", which means that some parts may be present closer to the processor).
Execution Unit : Load/Store Unit (?).
Latency : unknown, context dependent.
Throughput : one instruction at a time. And it's slow.
These instructions typically do not use any Execution Unit.
conditionally MOVE a register into another
move r3, r2, r1 |
IF r3 == 0 then r1 = r2
The value of r3 is checked for nullity. If it is zero, r2 is copied to r1 according to the size parameter. The condition is tested on the full register, and only the move uses the size flag. By default, for an unconditional move, r0 is used as the condition (always cleared). There are also other tests that the decoder can check for : sign of r3 (MSB), LSB. Each of these can be negated. Another optional flag can sign-extend the data.
Notice that move r0,r0,r0 is an alias for nop and is encoded as 0x00000000. Moving to r0 has no effect.
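The conditional move can be modelled in C roughly as follows. The names are hypothetical, the size is passed in bytes rather than as the encoded Size flag, and the sign-extension flag is left out. Preserving the upper part of the destination on a partial move is inferred from the move.b examples rather than stated as a rule.

```c
#include <stdint.h>

/* Hypothetical names for the three condition types of the Condition field. */
enum cond_type { COND_NULLITY, COND_MSB, COND_LSB };

/* Evaluates the condition on the full 64-bit r3, then applies the negation flag. */
static int condition_true(enum cond_type type, int negate, uint64_t r3)
{
    int truth;
    switch (type) {
    case COND_MSB: truth = (int)(r3 >> 63); break;
    case COND_LSB: truth = (int)(r3 & 1);   break;
    default:       truth = (r3 == 0);       break;
    }
    return truth ^ (negate != 0);
}

/* Returns the new r1: its low `size` bytes come from r2 when the condition
   holds, and the remaining upper bytes of r1 are preserved. */
static uint64_t cond_move(uint64_t r1, uint64_t r2, uint64_t r3,
                          enum cond_type type, int negate, unsigned size)
{
    if (!condition_true(type, negate, r3))
        return r1;
    if (size >= 8)
        return r2;
    uint64_t mask = (1ULL << (8 * size)) - 1;
    return (r1 & ~mask) | (r2 & mask);
}
```

With r3 = 0, the nullity test and no negation, the function behaves as an unconditional move, which matches the convention of using r0 as the default condition.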
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MOVE | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | * | Defines the size parameter |
10 | -n | 1 if negated | Negation of the condition |
11-12 | (none yet) | 00 : nullity 10 : MSB 11 : LSB | Condition |
13 | -s | 1 if sign-extended | Sign-extension flag |
Examples :
r1 contains 0x0123456789ABCDEF
r2 contains 0xFEDCBA9876543210
move.b r1,r2 ; r2 = 0xFEDCBA98765432EF
if LSB r1 move.b r1,r2 ; r2 = 0xFEDCBA98765432EF
if MSB r1 move.b r1,r2 ; r2 = 0xFEDCBA9876543210 (do nothing)
if r1==0 move.b r1,r2 ; r2 = 0xFEDCBA9876543210 (do nothing)
Execution Unit : none
Latency : 1 cycle (Xbar)
Throughput : 1 per cycle per instruction.
LOAD a CONSTant into a register
loadcons.n Imm16, r1 |
r1(n) = Imm16
This instruction virtually shifts Imm16 by n multiples of 16 bits before writing the value to r1, leaving the other parts unchanged. In the FC0, Imm16 is duplicated on the Xbar on 16-bit boundaries and only the selected part (n) of r1 is written. The constant is not sign-extended (see loadconsx). This instruction is used in groups to build a large constant in a register.
The architecture should ensure that a burst of LOADCONS does not stall the CPU. It is pipelinable in the FC0 so that a 64-bit constant only takes four cycles to complete.
To increase the range of the constants, the 8th bit of the opcode serves as a third bit for n, so a 128-bit CPU can directly load a 128-bit constant without using a shift operation.
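For a 64-bit register, the effect of loadcons can be sketched in C as below. The function name simply mirrors the mnemonic and the 2-bit n of the basic encoding is assumed; this is a behavioural model, not a description of the Xbar.

```c
#include <stdint.h>

/* Behavioural sketch of "loadcons.n Imm16, r1" on a 64-bit register:
   only the 16-bit slice selected by n is replaced, the rest is kept. */
static uint64_t loadcons(uint64_t r1, unsigned n, uint16_t imm16)
{
    unsigned shift = 16 * (n & 3);       /* n selects one of four slices */
    uint64_t mask = 0xFFFFULL << shift;
    return (r1 & ~mask) | ((uint64_t)imm16 << shift);
}
```

Applying it for n = 0 to 3 with the chunks 0x3210, 0x7654, 0xBA98 and 0xFEDC rebuilds 0xFEDCBA9876543210 whatever r1 held before.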
size : | 8 | 2 | 16 | 6 |
bits : | 0 7 | 8 9 | 10 25 | 26 31 |
function : | OP_LOADCONS | N | Imm16 | Reg 1 |
Examples :
r1 contains 0x0123456789ABCDEF, the following instructions load 0xFEDCBA9876543210
loadcons.0 0x3210, r1 ; r1 = 0x0123456789AB3210
loadcons.1 0x7654, r1 ; r1 = 0x0123456776543210
loadcons.2 0xBA98, r1 ; r1 = 0x0123BA9876543210
loadcons.3 0xFEDC, r1 ; r1 = 0xFEDCBA9876543210
Execution Unit : none
Latency : 1 cycle (Xbar)
Throughput : 1 per cycle per instruction.
LOAD a CONSTant into a register with sign eXtension
loadconsx.n Imm16, r1 |
Loads the imm16 constant into the register r1 at the specified location (shifts of 16 bits). The higher part of the register is assigned the value of the most significant bit of the constant. The lower part of the register remains unmodified.
This instruction is similar to loadcons but it sign-extends Imm16 before shifting it by n x 16. The result is written in the higher parts of r1, leaving the lower parts unchanged. This instruction is used at the end of a group of loadcons instructions when the higher part is filled with the sign bit. It is also used alone when the constant is below 2^15.
The architecture should ensure that a burst of LOADCONSX does not stall the CPU. It is pipelinable in the FC0 so that a 64-bit constant only takes four cycles to complete.
To increase the range of the constants, the 8th bit of the opcode serves as a third bit for n, so a 128-bit CPU can directly load a 128-bit constant without using a shift operation.
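The sign-extending variant can be sketched in the same style for a 64-bit register. Again the function name mirrors the mnemonic and a 2-bit n is assumed; the sketch follows the description above (slice n and everything above it are written, the lower slices are kept).

```c
#include <stdint.h>

/* Behavioural sketch of "loadconsx.n Imm16, r1" on a 64-bit register. */
static uint64_t loadconsx(uint64_t r1, unsigned n, uint16_t imm16)
{
    unsigned shift = 16 * (n & 3);
    /* mask of the lower slices that must stay unchanged */
    uint64_t low_mask = (shift == 0) ? 0 : ((1ULL << shift) - 1);
    /* sign-extend the 16-bit constant to 64 bits */
    int64_t value = imm16;
    if (value & 0x8000)
        value -= 0x10000;
    return (r1 & low_mask) | ((uint64_t)value << shift);
}
```

A small negative constant such as -2 therefore needs a single instruction, loadconsx.0 0xFFFE, because the sign bit fills all the upper slices.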
size : | 8 | 2 | 16 | 6 |
bits : | 0 7 | 8 9 | 10 25 | 26 31 |
function : | OP_LOADCONSX | N | Imm16 | Reg 1 |
Examples :
r1 contains 0x0123456789ABCDEF
loadconsx.1 0x7777, r1 ; r1 = 0x000000007777CDEF
Execution Unit : none
Latency : 1 cycle (Xbar)
Throughput : 1 per cycle per instruction.
The following code is an example of how a combination of loadcons/loadconsx instructions can be automatically generated in a compiler or an assembler.
/* LOADCONST.C by WHYGEE, 14 September 1999
   rev. 1.1, Nov. 29 (updated HTML stuff + new syntax)
   To be included in a compiler or an assembler, after some interface
   fixing : it currently outputs to stderr, it will output to a file
   the same way. Placed under GPL. */

#include <stdlib.h>
#include <stdio.h>

#define MAXSIZE (sizeof(long long int)) /* should be ideally 8 */

/* this is the function that is called by the main program */
void emit_constant(unsigned long long int c, unsigned char reg)
{
  unsigned short int data[MAXSIZE>>1]; /* MAXSIZE bytes as 16-bit chunks */
  signed long long int t, u;
  signed int s = 0;

  if (reg == 0) {
    fprintf(stderr, "\n Error : can't write to register 0 \n");
    exit(-1); /* should be performed by an error routine that does this cleanly */
  }

  if (c == 0) {
    fprintf(stderr, "move r0,r%d\n", reg);          /* Clear */
  } else if (c == ~0ULL) {
    fprintf(stderr, "logic.1111 r0,r0,r%d\n", reg); /* Set */
  } else if ((c > 65535) && ((c & -c) == c)) {
    /* a power of two, but the latency of bitset is higher */
    while (c > 1) { /* find the index of the set bit */
      s++;          /* (could be replaced by a bit scan instruction) */
      c >>= 1;
    }
    if (s > 63) {
      /* power of two too large to fit in the constant field (a 256-bit value ?) */
      fprintf(stderr, "loadcons 0x%04X,r%d\n", (unsigned)s, reg);
      fprintf(stderr, "bset r%d,r0,r%d\n", reg, reg);
    } else { /* the constant field is large enough */
      fprintf(stderr, "bseti %d,r0,r%d\n", s, reg);
    }
  } else { /* any kind of number */
    u = (signed long long int)c;
    do { /* put the number into data[] and care for the sign */
      t = u;
      data[s] = t & 0xFFFF;
      u = t >> 16; /* relies on an arithmetic right shift */
      s++;
    } while ((t != u) && (s < (int)(MAXSIZE>>1)));
    s--;
    /* drop the highest chunk when it is only the sign extension
       of the chunk below it */
    if (data[s] == ((data[s-1] & 0x8000) ? 0xFFFF : 0x0000))
      s--;
    /* the highest remaining chunk also fills the upper bits */
    fprintf(stderr, "loadconsx.%d 0x%04X,r%d\n", s, (unsigned)data[s], reg);
    s--;
    while (s >= 0) { /* finish */
      fprintf(stderr, "loadcons.%d 0x%04X,r%d\n", s, (unsigned)data[s], reg);
      s--;
    }
  }
}
GET the value of a special register and write it to a register.
get r2, r1 |
r1 = SPR(r2)
Get the Special Register at index r2 and put its content in register r1. The whole register gets dumped, there is no size flag.
Since protection is enforced through this kind of instruction, it may raise different exceptions if the access rights are not respected or if the SR number is not valid (supervisor or unimplemented). This is highly implementation dependent but a common and flexible definition will appear soon. Please refer to the manual.
Get and Put are atomic "serializing" instructions that block the pipeline at the decoding stage until it is finished or the completion is safe. They are used to configure the CPU and the programming environment during the program start for example. The values of r2 are not yet defined and symbolic names are used instead (like the opcodes).
size : | 8 | 12 | 6 | 6 |
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_GET | 0 | Reg 2 | Reg 1 |
Performance (FC0 only) :
Execution Unit : none
Latency : unknown
Throughput : unknown (usually several cycles)
PUT the value of a register into a special register.
put r2, r1 |
SPR(r2) = r1
Reads r1 and puts its value in the Special Register defined by r2. The whole register is used, there is no size flag.
Since protection is enforced through this kind of instruction, it may raise different exceptions if the access rights are not respected, if the SR number is not valid (supervisor or unimplemented) or if the put value does not correspond to the required format. This is highly implementation dependent but a common and flexible definition will appear soon. Please refer to the manual.
Get and Put are atomic "serializing" instructions that block the pipeline at the decoding stage until it is finished or the completion is safe. They are used to configure the CPU and the programming environment during the program start for example. The values of r2 are not yet defined and symbolic names are used instead (like the opcodes).
size : | 8 | 12 | 6 | 6 |
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_PUT | 0 | Reg 2 | Reg 1 |
Performance (FC0 only) :
Execution Unit : none
Latency : unknown
Throughput : unknown (usually several cycles)
LOAD Multiple registers from memory
loadm r3, r2, r1 |
Loads r1 registers, starting from r3, from the memory location pointed to by r2.
This instruction uses the SRB mechanism to load multiple contiguous registers from memory. This can be used during function epilogs where the classical RISC approach loads one register at a time.
The endianness of the operation is the endianness of the machine and the registers are full-length because it uses the SRB machinery verbatim. It benefits from the SRB reordering mechanism, so when a value is needed but is not yet loaded, the SRB modifies the loading order. The operation is also performed in the background with little overhead for the application. Unlike the natural use of the SRB, this instruction can raise exceptions like any load operation.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_LOADM | 0 | Reg 3 | Reg 2 | Reg 1 |
Performance (FC0 only) :
Execution Unit : L/S Unit
Latency : unknown
Throughput : unknown
STORE Multiple registers to memory
storem r3, r2, r1 |
Stores r1 registers, starting from r3, to the memory location pointed to by r2.
This instruction uses the SRB mechanism to store multiple contiguous registers to memory. This can be used during function prologs where the classical RISC approach stores one register at a time.
The endianness of the operation is the endianness of the machine and the registers are full-length because it uses the SRB machinery verbatim. It benefits from the SRB reordering mechanism, so when a value is needed but is not yet loaded, the SRB modifies the loading order. The operation is also performed in the background with little overhead for the application. Unlike the natural use of the SRB, this instruction can raise exceptions like any other memory operation.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_STOREM | 0 | Reg 3 | Reg 2 | Reg 1 |
Performance (FC0 only) :
Execution Unit : L/S Unit
Latency : unknown
Throughput : unknown
GET the value of a special register defined by an Immediate value and write it to a register.
geti Imm16, r1 |
r1 = SPR(Imm16)
Get the Special Register at index Imm16 and put its content in register r1. The whole register gets dumped, there is no size flag.
Since protection is enforced through this kind of instruction, it may raise different exceptions if the access rights are not respected or if the SR number is not valid (supervisor or unimplemented). This is highly implementation dependent but a common and flexible definition will appear soon. Please refer to the manual.
Get(i) and Put(i) are atomic "serializing" instructions that block the pipeline at the decoding stage until it is finished or the completion is safe. They are used to configure the CPU and the programming environment during the program start for example. The values of Imm16 are not yet defined and symbolic names are used instead (like the opcodes).
This version of GET is a shorthand for the instruction that limits the addressable range to the first 65536 Special Registers. The Core version (get) can address virtually ANY number of Special Registers through the use of a general register.
size : | 8 | 2 | 16 | 6 |
bits : | 0 7 | 8 9 | 10 25 | 26 31 |
function : | OP_GETI | 0 | Imm16 | Reg 1 |
Performance (FC0 only) :
Execution Unit : none
Latency : unknown
Throughput : unknown (usually several cycles)
PUT the value of a register to the special register Imm16.
puti Imm16, r1 |
SPR(Imm16) = r1
Reads r1 and puts its value in the Special Register defined by Imm16. The whole register is read, there is no size flag.
Since protection is enforced through this kind of instruction, it may raise different exceptions if the access rights are not respected or if the SR number is not valid (supervisor or unimplemented). This is highly implementation dependent but a common and flexible definition will appear soon. Please refer to the manual.
Get(i) and Put(i) are atomic "serializing" instructions that block the pipeline at the decoding stage until it is finished or the completion is safe. They are used to configure the CPU and the programming environment during the program start for example. The values of Imm16 are not yet defined and symbolic names are used instead (like the opcodes).
This version of PUT is a shorthand for the instruction that limits the addressable range to the first 65536 Special Registers. The Core version (put) can address virtually ANY number of Special Registers through the use of a general register.
size : | 8 | 2 | 16 | 6 |
bits : | 0 7 | 8 9 | 10 25 | 26 31 |
function : | OP_PUTI | 0 | Imm16 | Reg 1 |
Performance (FC0 only) :
Execution Unit : none
Latency : unknown
Throughput : unknown (usually several cycles)
6.7 Instruction Flow Control instructions :
JuMP Absolute
jmpa [r3,] r2 [, r1] [syntax yet not fully determined] |
IF (negation XOR true(condition,r3)) THEN
r1 = PC
PC = r2
If the condition is verified for r3, the content of the Program Counter is saved to r1 and the CPU branches to the address held in r2. This instruction works like a mix between MOVE and LOAD.
If r1 is not register #0 (which is hardwired to 0), the instruction acts as a function call. The user is responsible for the "stack frame". Otherwise (r1=0) the value of PC is lost and the instruction is a normal jump.
The condition of the instruction is determined by the negation flag (n), the type flag (either cleared when checking for nullity, or MSB or LSB) and the specified condition register (r3). The convention specifies that the branch is taken when all are cleared : the type flag is zero when checking for nullity, the n flag is cleared when the condition is not negated and the register is 0 because it is hardwired to 0.
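The jump/call behaviour and the condition convention can be summarised by the following C sketch. The `struct cpu` layout and the function name are hypothetical. The sketch follows the pseudo-code literally (r1 = PC, then PC = r2) and leaves open which exact PC value the hardware saves.

```c
#include <stdint.h>

/* Hypothetical minimal state: 64 registers (6-bit fields); r[0] is never written. */
struct cpu {
    uint64_t pc;
    uint64_t r[64];
};

/* Values of the Condition type field. */
enum { COND_NULLITY = 0, COND_MSB = 2, COND_LSB = 3 };

/* Returns 1 when the jump is taken, 0 otherwise. */
static int jmpa(struct cpu *c, unsigned r3, unsigned r2, unsigned r1,
                unsigned type, int negate)
{
    uint64_t cond = c->r[r3];
    uint64_t target = c->r[r2];  /* read first, in case r1 and r2 are the same */
    int truth;

    if (type == COND_MSB)
        truth = (int)(cond >> 63);
    else if (type == COND_LSB)
        truth = (int)(cond & 1);
    else
        truth = (cond == 0);

    if (!(truth ^ (negate != 0)))
        return 0;
    if (r1 != 0)
        c->r[r1] = c->pc;        /* saving PC turns the jump into a call */
    c->pc = target;
    return 1;
}
```

With all flags cleared and r3 = r0 the nullity test always succeeds, which is why that encoding gives an unconditional jump.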
For several reasons, it is highly recommended that the destination of the jump is already associated to the register that contains the address, for example through a loadaddr instruction or by preserving r1 (overwriting it would cancel the association, for example when the "stack" in the register set is flushed then loaded from memory). When "association" is not certain or too early, it is recommended to prefetch the destination location a few tens of cycles in advance, otherwise it will result in a processor stall.
The Size flag is not used, all registers are used in full length.
This instruction can trigger two exceptions (in order of decreasing priority) :
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_JMPA | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | 0 | [undefined] branch probability hint | |
10 | -n postfix | 1 if negated | Negation of the condition |
11-12 | (none yet) | 00 : nullity 10 : MSB 11 : LSB | Condition type |
13 | 0 | (reserved) |
Performance (FC0 only) :
Execution Unit : none
Latency : 1 or 2 cycles if the destination is already in the
memory buffer, undetermined (but much more) otherwise.
Throughput : unknown ATM.
LOAD a relative ADDRess to a register
loadaddr r2, r1 loadaddrd r2, r1 |
r1 = PC + 4 + r2, check the result in the D/I TLB and possibly prefetch the data.
If the Data flag is set (1), the Data TLB is used instead of the Instruction TLB to check the pointer validity and the register is "associated" to either the L/S Unit or the Fetcher unit on success. The CPU may then prefetch the pointed data or prefetch the TLB miss code.
The Size flag is not used, all registers are used in full length.
size : | 8 | 1 | 11 | 6 | 6 |
bits : | 0 7 | 8 | 9 19 | 20 25 | 26 31 |
function : | OP_LOADADDR | D | 0 | Reg 2 | Reg 1 |
Performance (FC0 only) :
Execution Unit : Add/Sub Unit
Latency : 2 cycles if r2 != 0.
Throughput : 1 per cycle.
LOOP ENTRY point
loopentry r1 |
r1 = PC + 4 then check the result in the ITLB.
This instruction copies the address of the next instruction into r1 to mark the entry point of a loop. A jmpa instruction will then use r1 instead of recalculating a relative offset at each loop iteration.
This instruction is a special case of the LOADADDR instruction with no D flag and no offset (r2 = 0).
The Size flag is not used, all registers are used in full length.
size : | 8 | 18 | 6 |
bits : | 0 7 | 8 25 | 26 31 |
function : | OP_LOADADDR | 0 | Reg 1 |
Performance (FC0 only) :
Execution Unit : Add/Sub Unit
Latency : none.
Throughput : 1 per cycle.
LOAD a relative ADDRess to a register with an Immediate offset
loadaddri Imm16, r1 loadaddrid Imm16, r1 |
r1 = PC + 4 + Imm16, check the result in the D/I TLB and possibly prefetch the data.
If the Data flag is set (1), the Data TLB is used instead of the Instruction TLB to check the pointer validity and the register is "associated" to either the L/S Unit or the Fetcher unit on success. The CPU may then prefetch the pointed data or prefetch the TLB miss code.
This instruction is similar to loadaddr but uses an immediate offset. The S flag sign-extends the Imm16 data.
The Size flag is not used, all registers are used in full length.
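The effect of the S flag on the immediate offset can be illustrated with a short sketch. It assumes that the immediate is zero-extended when S is clear (the text only states that S sign-extends it) and that registers are 64 bits wide; the function names are hypothetical.

```python
# Sketch of the LOADADDRI address arithmetic with and without the S flag.
REG_BITS = 64  # assumption: register width is not stated in this section


def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def loadaddri(pc, imm16, s_flag):
    """Return the value written to r1: PC + 4 + Imm16."""
    # Assumption: without the S flag the immediate is zero-extended.
    offset = sign_extend16(imm16) if s_flag else (imm16 & 0xFFFF)
    return (pc + 4 + offset) & ((1 << REG_BITS) - 1)


print(hex(loadaddri(0x1000, 0xFFF0, s_flag=True)))   # 0xff4
print(hex(loadaddri(0x1000, 0xFFF0, s_flag=False)))  # 0x10ff4
```

With S set, 0xFFF0 is read as -16 and the result points backwards from PC + 4; with S clear, the same bit pattern is a forward offset of 65520 bytes.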
size : | 8 | 1 | 1 | 16 | 6 |
bits : | 0 7 | 8 | 9 | 10 25 | 26 31 |
function : | OP_LOADADDRI | D | S | Imm16 | Reg 1 |
Performance (FC0 only) :
Execution Unit : Add/Sub Unit
Latency : 2 cycles.
Throughput : 1 per cycle.
LOOP to r2 if r1 has not expired.
loop r2, r1 |
r1 = r1 - 1
// IF r1 != 0 THEN PC = r2
LOOP decrements r1 and, in parallel, checks the old value for nullity. If this old value was not zero, the CPU branches to [r2]. This is the simplest and fastest way to loop: the latency is typically 1 cycle and the operations overlap.
The LOOPENTRY/LOOP pair can code a WHILE or DO/WHILE loop where the loop count is known in advance. An initial value of r1 yields r1+1 iterations in a DO/WHILE loop, and the final value is -1.
The Size flag is not used, all registers are used in full length.
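The iteration count of a DO/WHILE loop built with LOOPENTRY/LOOP can be checked with a small model. This is a behavioural sketch only: the function name is hypothetical, and r1 is modelled as an unbounded signed integer rather than a fixed-width register.

```python
def run_do_while(initial_count):
    """Model a DO/WHILE loop closed by LOOP; return (iterations, final r1)."""
    r1 = initial_count
    iterations = 0
    while True:
        iterations += 1   # the loop body runs here
        old = r1          # LOOP tests the old value of r1 ...
        r1 -= 1           # ... while decrementing r1 in parallel
        if old == 0:
            break         # old value was zero: fall through, no branch
    return iterations, r1


print(run_do_while(3))  # (4, -1)
```

An initial count of 3 therefore gives 4 passes through the body and leaves -1 in r1, matching the r1+1 rule stated above.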
size : | 8 | 12 | 6 | 6 |
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_LOOP | 0 | Reg 2 | Reg 1 |
Performance (FC0 only) :
Execution Unit : Inc Unit (or Add/Sub Unit when unavailable)
Latency : 1 cycle
Throughput : 1 per cycle.
operating SYStem CALL
syscall Imm16, r1 trap Imm16, r1 |
Jump into supervisor mode and execute service number Imm16.
If the Trap flag is set, the user-mode application gives up its current time slice and requests a critical service (the SRB mechanism is triggered).
The r1 operand is not (yet) used; it is cleared. The argument is ignored by the hardware and may be used to encode information for system software. To retrieve the argument, system software must load the instruction word from memory.
Typically, the service's entry point address is computed from the immediate value (shifted left by 6, as it appears in the instruction, so as to have 16-instruction entry points), which is added to a supervisor-mode Special Register. At the same time, the immediate value is compared with another Special Register that specifies the maximum number of implemented services, and a trap is triggered if there is an overflow.
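The typical entry point computation described above can be sketched as follows. The names (syscall_entry, base, max_services, ServiceTrap) are hypothetical, and the exact bound check is an assumption: the sketch treats the second Special Register as a count, so that services 0 to max_services - 1 are valid.

```python
class ServiceTrap(Exception):
    """Stands in for the trap raised when the service number is out of range."""


def syscall_entry(imm16, base, max_services):
    """Return the entry point address for service number imm16.

    base         : value of the supervisor-mode Special Register (assumed name)
    max_services : number of implemented services (assumed name and meaning)
    """
    imm16 &= 0xFFFF
    # Assumption: valid service numbers are 0 .. max_services - 1.
    if imm16 >= max_services:
        raise ServiceTrap(f"service {imm16} is not implemented")
    # Shift left by 6: each entry point spans 64 bytes, i.e. 16 instructions
    # of 4 bytes each.
    return base + (imm16 << 6)


print(hex(syscall_entry(3, 0x8000, 16)))  # 0x80c0
```

In hardware the addition and the comparison happen at the same time; the sequential order here is only a convenience of the model.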
size : | 8 | 1 | 1 | 16 | 6 |
bits : | 0 7 | 8 | 9 | 10 25 | 26 31 |
function : | OP_SYSCALL | T | 0 | Imm16 | Reg 1 |
Performance (FC0 only) :
Execution Unit : Add/Sub Unit
Latency : unknown.
Throughput : unknown.
HALT the CPU
halt |
Goes idle until an exception occurs.
If in user mode, the application gives up its current time slice and the SRB mechanism is triggered to switch to the next task.
size : | 8 | 24 |
bits : | 0 7 | 8 31 |
function : | OP_HALT | 0 |
Return From Exception
rfe |
Restore the previous task.
At the end of an Interrupt Service Routine, an exception handler or a Supervisor service, this instruction flushes the current task and restores the previous one with the SRB mechanism.
size : | 8 | 24 |
bits : | 0 7 | 8 31 |
function : | OP_RFE | 0 |
use the SRB to SAVE the current task's context.
srb_save |
Begins to save the current task in its dedicated CMB.
In anticipation of a system call, or in real-time sensitive conditions where the CPU is about to trigger the SRB and switch to another routine, it is recommended to execute srb_save in advance to speed up the switch.
size : | 8 | 24 |
bits : | 0 7 | 8 31 |
function : | OP_SRB_SAVE | 0 |
use the SRB to RESTORE the last task's context.
srb_restore |
Begins to restore the last task from its dedicated CMB.
In anticipation of a return from exception or of a task switch involving SRB use, it is recommended to execute this instruction in advance so the CPU can prefetch the necessary data and reduce the switch latency.
size : | 8 | 24 |
bits : | 0 7 | 8 31 |
function : | OP_SRB_RESTORE | 0 |
stop the CPU until it is flushed.
serialize[m][s][x] |
Do not execute the next instruction until the internal state of the CPU has reached the specified condition.
This instruction ensures that the specified units have completed processing any previously issued instruction. The current flags consider three conditions :
- Memory operations (all transactions are finished and there are free LSU lines)
- Execution units (there is no operation pending, the scoreboard is clear)
- SRB ready (the scoreboard has no SRB or smooth context switch pending, so a loadm or storem instruction can be issued).
The condition is the "logical product" (AND) of all the individual conditions : execution continues when all individual conditions are met.
size : | 8 | 24 |
bits : | 0 7 | 8 31 |
function : | OP_SERIALIZE | condition |
Flags | Syntax | Values | Function |
8 | -m postfix | 1 if used | Memory operations pending |
9 | -x postfix | 1 if used | Execution Units busy |
10 | -s postfix | 1 if used | SRB pending |
13-31 | | 0 | (reserved) |
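The way the three SERIALIZE flags combine can be illustrated with a small model. It works on booleans rather than on the instruction word, so it makes no assumption about bit ordering; the function and parameter names are hypothetical.

```python
def serialize_may_proceed(m, x, s, mem_idle, exec_idle, srb_idle):
    """Return True when the instruction after SERIALIZE may execute.

    m, x, s   : flags selected by the -m, -x and -s postfixes
    mem_idle  : all memory transactions finished and LSU lines free
    exec_idle : no operation pending, scoreboard clear
    srb_idle  : no SRB or smooth context switch pending
    """
    # Each selected condition must hold; unselected conditions are ignored.
    return ((not m or mem_idle)
            and (not x or exec_idle)
            and (not s or srb_idle))


# serializem: only memory operations are waited for.
print(serialize_may_proceed(True, False, False,
                            mem_idle=True, exec_idle=False, srb_idle=False))
# serializemx: execution units are still busy, so the CPU keeps waiting.
print(serialize_may_proceed(True, True, False,
                            mem_idle=True, exec_idle=False, srb_idle=True))
```

The first call prints True and the second prints False: adding a postfix can only make the wait longer, since the overall condition is the AND of the selected ones.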