Part 6 :
currently under construction.
6.1 Data Manipulation
6.1.1 Core Arithmetic
6.1.1.1 add
6.1.1.2 sub
6.1.1.3 mul
6.1.1.4 div
6.1.2 Optional Arithmetic
6.1.2.1 addi
6.1.2.2 subi
6.1.2.3 muli
6.1.2.4 divi
6.1.2.5 mod
6.1.2.6 modi
6.1.2.7 mac
6.1.2.8 popcount
6.1.3 Optional increment-based
6.1.3.1 inc
6.1.3.2 dec
6.1.3.3 neg
6.1.3.4 bit scan
6.1.3.5 cmpl
6.1.3.6 cmple
6.1.3.7 cmpli
6.1.3.8 cmplei
6.1.3.9 abs
6.1.3.10 max
6.1.3.11 min
6.1.3.12 maxi
6.1.3.13 mini
6.1.3.14 sort
6.1.4 Core Shift and Rotate
6.1.4.1 shiftl
6.1.4.2 shiftr
6.1.4.3 shiftra
6.1.4.4 rotl
6.1.4.5 rotr
6.1.5 Optional Shift and Rotate
6.1.5.1 shiftli
6.1.5.2 shiftri
6.1.5.3 shiftrai
6.1.5.4 rotli
6.1.5.5 rotri
6.1.5.6 bitop
6.1.5.7 bitopi
6.1.6 Core Logic
6.1.6.1 logic
6.1.7 Optional Logic
6.1.7.1 logici
6.1.8 Optional SIMD Packing
6.1.8.1 mix
6.1.8.2 expand
6.1.8.3 sdup
6.1.9 Floating Point Operations
6.1.10 Optional Misc.
ADDition
add r3, r2, r1 adds r3, r2, r1 addc r3, r2, r1 sadd r3, r2, r1 sadds r3, r2, r1 saddc r3, r2, r1 |
Computes r1 = r2 + r3
add performs an integer addition of the two source operands (r3 + r2) and puts the result in the destination operand (r1).
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_ADD | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Saturation flag |
13 | -c postfix | 1 if set | Carry flag ( 2r2w ) |
Examples :
Scalar :
R1 contains 0xF8 (we only consider the lower byte in the registers)
R2 contains 0x0F
add.b r1,r2,r3 : r3 = 0x07 (default behaviour)
adds.b r1,r2,r3 : r3 = 0xFF (saturation)
addc.b r1,r2,r3 : r3 = 0x07, r4= 0x01 (carry)
SIMD :
R1 contains 0x000000F800000001 (in a 64-bit system)
R2 contains 0x0000000F00000002
sadd.b r1,r2,r3 : r3 = 0x0000000700000003 (default behaviour)
sadds.b r1,r2,r3 : r3 = 0x000000FF00000003 (saturation)
saddc.b r1,r2,r3 : r3 = 0x0000000700000003 , r4= 0x0000000100000000 (carry)
Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.
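The three behaviours above (wrap-around, saturation, carry) can be checked with a small software model. This is only a sketch, not the FC0 implementation: the function name `add_lanes` and its parameters are hypothetical, and the scalar form only models the lowest lane, as in the scalar examples.

```python
def add_lanes(a, b, lane_bits=8, width=64, simd=False, mode="wrap"):
    """Return (result, carry) for add / adds / addc and their SIMD forms.

    Hypothetical helper: 'result' models the destination register,
    'carry' models the second register written by the carry form.
    """
    lanes = width // lane_bits if simd else 1
    mask = (1 << lane_bits) - 1
    result = carry = 0
    for i in range(lanes):
        shift = i * lane_bits
        total = ((a >> shift) & mask) + ((b >> shift) & mask)
        if mode == "saturate" and total > mask:
            total = mask                        # adds: clamp to the lane maximum
        result |= (total & mask) << shift       # default: wrap around
        carry |= (total >> lane_bits) << shift  # addc: carry out of each lane
    return result, carry
```

Called with the operands of the examples, `add_lanes(0xF8, 0x0F)` returns the wrapped sum together with the carry, and `simd=True` applies the same rule to each byte lane.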
SUBtraction
sub r3, r2, r1 subb r3, r2, r1 subf r3, r2, r1 ssub r3, r2, r1 ssubb r3, r2, r1 ssubf r3, r2, r1 |
Computes r1 = r3 - r2
sub performs an integer subtraction of the two source operands (r3 - r2) and puts the result in the destination operand (r1).
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SUB | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -f postfix | 1 if set | Floor flag |
13 | -b postfix | 1 if set | Borrow flag ( 2r2w ) |
Examples :
Scalar :
R1 contains 0x05 (we only consider the lower byte in the registers)
R2 contains 0x07
sub.b r1,r2,r3 : r3 = 0xFE (default behaviour)
subf.b r1,r2,r3 : r3 = 0x00 (floor)
subb.b r1,r2,r3 : r3 = 0xFE, r4= 0xFF (borrow)
SIMD :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
ssub.b r1,r2,r3 : r3 = 0x000000FE00000002 (default behaviour)
ssubf.b r1,r2,r3 : r3 = 0x0000000000000002 (floor)
ssubb.b r1,r2,r3 : r3 = 0x000000FE00000002, r4= 0x000000FF00000000 (borrow)
Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.
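A minimal model of the default, floor and borrow behaviours, assuming (as the scalar example suggests) that the first operand is the minuend. The name `sub_lanes` and its parameters are hypothetical, and the borrow register is modelled as an all-ones lane, following the subb.b example.

```python
def sub_lanes(a, b, lane_bits=8, width=64, simd=False, floor=False):
    """Return (result, borrow) for a - b: sub / subf / subb and SIMD forms."""
    lanes = width // lane_bits if simd else 1
    mask = (1 << lane_bits) - 1
    result = borrow = 0
    for i in range(lanes):
        shift = i * lane_bits
        diff = ((a >> shift) & mask) - ((b >> shift) & mask)
        if diff < 0:
            borrow |= mask << shift             # subb: all-ones lane on borrow
            diff = 0 if floor else diff & mask  # subf clamps at zero, sub wraps
        result |= diff << shift
    return result, borrow
```

With the SIMD operands above, the sketch gives 0x000000FE00000002 for the wrapped difference and 0x000000FF00000000 for the borrow, matching the ssubb.b line.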
MULtiplication
mul r3, r2, r1 mulh r3, r2, r1 muls r3, r2, r1 mulsh r3, r2, r1 smul r3, r2, r1 smulh r3, r2, r1 smuls r3, r2, r1 smulsh r3, r2, r1 |
Computes r1 = r2 x r3
mul performs an integer multiplication of the two source operands (r3 x r2) and puts the result in the destination operand (r1). The size flags indicate the size of the source operands.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MUL | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Sign flag |
13 | -h postfix | 1 if set | High flag ( 2r2w ) |
Examples :
Scalar :
R1 contains 0x23 (we only consider the lower byte in the registers)
R2 contains 0x36
mul.b r1,r2,r3 : r3 = 0x62 (default)
mulh.b r1,r2,r3 : r3 = 0x62 , r4 = 0x07 (High flag)
SIMD :
R1 contains 0x00 00 00 00 00 00 00 00 (in a 64-bit system)
R2 contains 0x00 00 00 00 00 00 00 00
smul.b r1,r2,r3 : r3 = 0x00 00 00 00 00 00 00 00
smulh.b r1,r2,r3 : r3 = 0x00 00 00 00 00 00 00 00 , r4 = 0x00 00 00 00 00 00 00 00
[To be completed later, once all the errors have been corrected]
Execution Unit : Integer Multiply Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably 1 operation per cycle per IMU (pipelined multiplier).
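The split of a product into a low and a high half can be sketched as follows. The helper `mul_scalar` is hypothetical; the `signed` parameter stands for the sign flag and is assumed to mean a two's-complement interpretation of both operands.

```python
def mul_scalar(a, b, bits=8, signed=False):
    """Return (low, high) halves of the product of two `bits`-wide operands."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    if signed:                                  # assumed meaning of the sign flag
        a -= (a >> (bits - 1)) << bits          # reinterpret as two's complement
        b -= (b >> (bits - 1)) << bits
    product = (a * b) & ((1 << (2 * bits)) - 1)
    return product & mask, product >> bits      # low half, high half
```

For the scalar example, 0x23 x 0x36 = 0x0762, so the low half is 0x62 and the high half is 0x07.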
DIVision
div r3, r2, r1 divs r3, r2, r1 divm r3, r2, r1 divms r3, r2, r1 sdiv r3, r2, r1 sdivs r3, r2, r1 sdivm r3, r2, r1 sdivms r3, r2, r1 |
Computes r1 = r3 / r2
div performs an integer division of the two source operands (r3 / r2) and puts the result in destination operand (r1). The size defined by the size flags corresponds to the size of the source operands.
Remark : division is slow and costly. When possible, use power-of-two divisors so that the source operand can simply be shifted, which takes only one cycle in the FC0.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_DIV | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Sign flag |
13 | -m postfix | 1 if set | Modulo flag ( 2r2w ) |
Examples :
Scalar :
R1 contains 0x10 (we only consider the lower byte in the registers)
R2 contains 0x05
div.b r1,r2,r3 : r3 = 0x03
divm.b r1,r2,r3 : r3 = 0x03 , r4 = 0x01
Execution Unit : Integer Divide Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably equal to the latency (not pipelined).
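The remark about power-of-two divisors can be illustrated for unsigned operands. Both helpers are hypothetical sketches; raising an exception on a zero divisor is an assumption, since the text only specifies a trap for divi.

```python
def div_scalar(a, b, bits=8):
    """Unsigned div / divm: return (quotient, remainder) of a / b."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    if b == 0:
        # Assumption: the manual only describes a math trap for divi.
        raise ZeroDivisionError("zero divisor")
    return a // b, a % b


def div_pow2(a, k, bits=8):
    """Unsigned quotient of a / 2**k obtained with a single right shift."""
    return (a & ((1 << bits) - 1)) >> k
```

For unsigned values the shift gives exactly the same quotient as the division; signed operands would need extra care with rounding, which this sketch does not cover.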
ADDition Immediate
addi Imm8, r2, r1 saddi Imm8, r2, r1 |
Computes r1 = r2 + Imm8.
This instruction is similar to the ``add'' instruction but it takes one of the source operands from the opcode and sign-extends it (subi ???). It has less room for the options and flags, so the usage of the reserved bit is still being discussed.
Remark : with wide operands, the latency may be higher than expected because the adder would use the full pipeline. To add or subtract 1 from a large number (more than 8 bits), it is recommended to use the inc/dec instructions (when available) because they use the increment unit, which has a lower latency.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_ADDI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R2 contains 0x00F80F00F045FF82 (in a 64-bit system)
addi.b 0x87,r2,r3 : r3 = 0x00F80F00F045FF09
addi.d 0x87,r2,r3 : r3 = 0x00F80F00F0450009
saddi.b 0x87,r2,r3 : r3 = 0x877F968777CC8609
saddi.d 0x87,r2,r3 : r3 = 0x017F0F87F0CC0009
Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.
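Whether Imm8 is sign-extended changes the result for sizes above 8 bits. The sketch below (hypothetical helper `addi_lanes`) supports both conventions. The example values listed above correspond to the zero-extended case; the scalar form keeps the upper bits of the source, as the examples do.

```python
def addi_lanes(imm8, src, lane_bits=8, width=64, simd=False, sign_extend=False):
    """Add an 8-bit immediate to one lane (scalar) or to every lane (SIMD)."""
    mask = (1 << lane_bits) - 1
    imm = imm8 & 0xFF
    if sign_extend and imm & 0x80:
        imm |= mask & ~0xFF                     # replicate the sign bit up to the lane width
    lanes = width // lane_bits if simd else 1
    touched = (1 << (lanes * lane_bits)) - 1
    result = src & ((1 << width) - 1) & ~touched  # bits outside the processed lanes are kept
    for i in range(lanes):
        shift = i * lane_bits
        result |= ((((src >> shift) & mask) + imm) & mask) << shift
    return result
```

With `sign_extend=True`, the 16-bit scalar case gives 0xFF82 + 0xFF87 = 0xFF09 in the low lane instead of 0x0009.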
SUBtraction Immediate
subi Imm8, r2, r1 ssubi Imm8, r2, r1 |
Computes r1 = r2 - Imm8.
This instruction is similar to the ``sub'' instruction but it takes one of the source operands from the opcode (Imm8) and sign-extends it. It has less room for the options and flags, so the usage of the reserved bit is still being discussed.
Remark : with wide operands, the latency may be higher than expected because the adder would use the full pipeline. To add or subtract 1 from a large number (more than 8 bits), it is recommended to use the inc/dec instructions (when available) because they use the increment unit, which has a lower latency.
Problem : it is not certain that Imm8 will be sign-extended before being sent to the Xbar. Other instructions may need an 8-bit operand that is not sign-extended. Otherwise, subi would simply be aliased by the assembler to addi with the immediate data negated.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_SUBI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.
MULtiplication Immediate
muli Imm8, r2, r1 smuli Imm8, r2, r1 |
Computes r1 = r2 x Imm8.
This instruction is similar to the ``mul'' instruction but it takes one of the source operands from the opcode (Imm8) and sign-extends it. It has less room for the options and flags, so the usage of the reserved bit is still being discussed.
Remark : multiplication is slow and costly. When possible, use power-of-two multipliers so that the source operand can simply be shifted, which takes only one cycle in the FC0.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_MULI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Integer Multiply Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably 1 operation per cycle per IMU (pipelined multiplier).
DIVision Immediate
divi Imm8, r2, r1 sdivi Imm8, r2, r1 |
Computes r1 = r2 / Imm8.
This instruction is similar to ``div'' but the second operand is the sign-extended value of Imm8. A math trap is triggered if Imm8 is zero.
Remark : division is slow and costly. When possible, use power-of-two divisors so that the source operand can simply be shifted, which takes only one cycle in the FC0.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_DIVI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Integer Divide Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably equal to the latency (not pipelined).
MODulo
mod r3, r2, r1 mods r3, r2, r1 smod r3, r2, r1 smods r3, r2, r1 |
Computes r1 = r3 % r2
mod performs an integer modulo of the two source operands (r3 % r2) and puts the result in destination operand (r1).
Remark : the modulo computation is slow and costly. When possible, use power-of-two moduli so that the upper bits of the source operand can simply be masked off, which takes only one cycle in the FC0.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MOD | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Signed flag |
13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Integer Divide Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably equal to the latency (not pipelined).
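For unsigned operands, the power-of-two shortcut mentioned in the remark amounts to keeping only the low-order bits. A one-line sketch, with a hypothetical helper name:

```python
def mod_pow2(value, k):
    """value % 2**k for unsigned values, by keeping only the k low-order bits."""
    return value & ((1 << k) - 1)
```

For example, `mod_pow2(0x53, 3)` keeps the three low bits of 0x53, which is the remainder of the division by 8.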
MODulo Immediate
modi Imm8, r2, r1 smodi Imm8, r2, r1 |
Computes r1 = r2 % Imm8
modi performs an integer modulo of the two source operands (r2 % Imm8) and puts the result in destination operand (r1). Imm8 is sign extended (?).
Remark : the modulo computation is slow and costly. When possible, use power-of-two moduli so that the upper bits of the source operand can simply be masked off, which takes only one cycle in the FC0.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_MODI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Integer Divide Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably equal to the latency (not pipelined).
Multiply and ACcumulate
mac r3, r2, r1 macs r3, r2, r1 mach r3, r2, r1 machs r3, r2, r1 smac r3, r2, r1 smacs r3, r2, r1 smach r3, r2, r1 smachs r3, r2, r1 |
Computes r1 = r1 + ( r2 x r3 )
mac performs an integer multiplication of the two source operands (r3 x r2) and adds the result to the destination operand (r1). The size flags indicate the size of the source operands, the "granularity" of the destination operand is twice this size if the hardware can do it.
Remark : this instruction reads three operands and therefore is a 3r1w operation that is not in the core. Its implementation depends on architectural parameters.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MAC | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -s postfix | 1 if set | Sign flag |
13 | -h postfix | 1 if set | High flag |
Example :
Scalar :
R1 contains 0x23 (we only consider the lower byte in the registers)
R2 contains 0x36
R3 contains 0x0136
mac.b r1,r2,r3 : r3 = 0x0898
[To be completed later, once all the other errors have been corrected]
Execution Unit : Integer Multiply Unit then Add/Sub Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably 1 operation per cycle per IMU+ASU (pipelined multiplier and adder).
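A scalar sketch of the accumulate step, assuming the accumulator is twice as wide as the source operands (the "granularity" rule stated above). The helper name `mac_scalar` is hypothetical.

```python
def mac_scalar(acc, a, b, bits=8):
    """Return acc + a*b, with a and b `bits` wide and acc 2*bits wide."""
    src_mask = (1 << bits) - 1
    acc_mask = (1 << (2 * bits)) - 1
    product = (a & src_mask) * (b & src_mask)   # unsigned product, 2*bits wide
    return (acc + product) & acc_mask           # wraps at the accumulator width
```

With the operands of the scalar example, the product is 0x23 x 0x36 = 0x0762 and the accumulated value is 0x0136 + 0x0762 = 0x0898.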
POPulation COUNT
popcount r2, r1 spopcount r2, r1 |
Computes r1 = nb_bits(r2)
popcount counts the number of set bits in r2 and writes the result to the destination operand (r1). The size flags indicate the size of the source operands.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_POPC | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0x0123456789ABCDEF
popcount r1,r2 : r2 = 0x0000000000000020
Execution Unit : Unknown
Latency : unknown, but it's O(log2(size)) if you wanted to know (just in case you're not a spook).
Throughput : unknown.
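The O(log2(size)) behaviour corresponds to a tree of partial sums: adjacent 1-bit counts are added into 2-bit fields, then 4-bit fields, and so on. The sketch below shows this well-known software technique for 64 bits; it is not a description of the actual hardware unit, and the function name is hypothetical.

```python
def popcount64(x):
    """Count the set bits of a 64-bit value in log2(64) = 6 steps."""
    x &= 0xFFFFFFFFFFFFFFFF
    x = (x & 0x5555555555555555) + ((x >> 1) & 0x5555555555555555)    # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)    # 4-bit sums
    x = (x & 0x0F0F0F0F0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F0F0F0F0F)    # per-byte sums
    x = (x & 0x00FF00FF00FF00FF) + ((x >> 8) & 0x00FF00FF00FF00FF)    # 16-bit sums
    x = (x & 0x0000FFFF0000FFFF) + ((x >> 16) & 0x0000FFFF0000FFFF)   # 32-bit sums
    x = (x & 0x00000000FFFFFFFF) + (x >> 32)                          # 64-bit sum
    return x
```

Stopping after the third step leaves one count per byte, which is what a byte-wide SIMD form would be expected to produce.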
INCrement
inc r2, r1 sinc r2, r1 |
Computes r1 = r2 + 1
This instruction increments the source operand in a special unit that is designed for low latency when large data are processed. The value wraps around when reaching the maximum value.
In the future, the increment value could be specified so keep the reserved fields cleared.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_INC | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0xFF05891213450100 (in a 64-bit system)
sinc.b r1,r2 : r2 = 0x00068A1314460201
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
DECrement
dec r2, r1 sdec r2, r1 |
Computes r1 = r2 - 1
This instruction decrements the source operand in a special unit that is designed for low latency when large data are processed. The value wraps around when reaching the minimum value.
In the future, the decrement value could be specified so keep the reserved fields cleared.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_DEC | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0xFF05891213450100 (in a 64-bit system)
sdec.b r1,r2 : r2 = 0xFE048811124400FF
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
NEGation
neg r2, r1 sneg r2, r1 |
Computes r1 = not(r2) + 1
This instruction negates the source operand in a special unit that is designed for low latency when large data are processed.
This instruction is designed to work in the two's-complement number system (signed integer numbers) and is not subject to saturation/overflow problems.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_NEG | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0xFF05891213450100 (in a 64-bit system)
sneg.b r1,r2 : r2 = 0x01FB77EEEDBBFF00
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
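The SIMD forms of inc, dec and neg apply the same operation to every lane independently, each lane wrapping around on its own. A minimal model follows; `map_lanes` and the three wrappers are hypothetical names chosen to mirror the mnemonics.

```python
def map_lanes(func, value, lane_bits=8, width=64):
    """Apply func to each lane of value and reassemble the register."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, width, lane_bits):
        lane = (value >> shift) & mask
        result |= (func(lane) & mask) << shift  # masking makes each lane wrap
    return result


def sinc(value, **kw):
    return map_lanes(lambda lane: lane + 1, value, **kw)


def sdec(value, **kw):
    return map_lanes(lambda lane: lane - 1, value, **kw)


def sneg(value, **kw):
    return map_lanes(lambda lane: -lane, value, **kw)  # same as not(lane) + 1
```

Applied to 0xFF05891213450100 with byte lanes, these three functions give the values shown in the examples of the inc, dec and neg sections.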
Bit SCAN
scan[n][r] r2, r1 sscan[n][r] r2, r1 lsb1 r2, r1 lsb0 r2, r1 msb1 r2, r1 msb0 r2, r1 slsb1 r2, r1 slsb0 r2, r1 smsb1 r2, r1 smsb0 r2, r1 |
Computes r1 = scan_for_lsb(r2)
This instruction scans the source operand (r2) for the first set bit, starting from the LSB, and writes the position of this bit to the destination register (r1). If the source is cleared, the result is zero; otherwise bit #0 is counted as position 1.
Options can bit-reverse the source and/or complement its bits, so the instruction can also search for the last cleared bit, for example.
lsb1 is an alias for scan.
lsb0 is an alias for scann. It scans the source operand (r2) for the first cleared bit, starting from the LSB, and writes the position of this bit to the destination register (r1). If the source is all ones, the result is zero; otherwise bit #0 is counted as position 1.
msb1 is an alias for scanr. It scans the source operand (r2) for the first set bit, starting from the MSB, and writes the position of this bit to the destination register (r1). If the source is cleared, the result is zero; otherwise bit #0 is counted as position 1.
msb0 is an alias for scannr. It scans the source operand (r2) for the first cleared bit, starting from the MSB, and writes the position of this bit to the destination register (r1). If the source is all ones, the result is zero; otherwise bit #0 is counted as position 1.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SCAN | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12 | -n postfix | 1 if set | Complement the input bits |
13 | -r postfix | 1 if set | Bit-Reverse the input |
Examples :
R1 contains 0xFF05891213450100 (in a 64-bit system)
lsb1 r1,r2 : r2 = 0x9
lsb0 r1,r2 : r2 = 0x1
msb1 r1,r2 : r2 = 0x40
msb0 r1,r2 : r2 = 0x38
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
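The four variants can be modelled with one function and two options. This sketch assumes, as the examples suggest, that the reported position is always counted from bit #0 (position 1), even when scanning from the MSB. The function name and its parameters are hypothetical.

```python
def scan(value, negate=False, reverse=False, bits=64):
    """Return the 1-based position of the first matching bit, or 0 if none.

    negate  : look for a cleared bit instead of a set bit (-n)
    reverse : start the search from the MSB instead of the LSB (-r)
    """
    mask = (1 << bits) - 1
    value &= mask
    if negate:
        value ^= mask                   # cleared bits become set bits
    if value == 0:
        return 0                        # no matching bit in the operand
    if reverse:
        return value.bit_length()       # highest set bit, counted from bit #0
    return (value & -value).bit_length()  # isolate the lowest set bit
```

Here `scan(x)` plays the role of lsb1, `scan(x, negate=True)` of lsb0, `scan(x, reverse=True)` of msb1 and `scan(x, negate=True, reverse=True)` of msb0.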
CoMPare for Lower
cmpl r3, r2, r1 scmpl r3, r2, r1 |
Compares the two source operands and sets or clears the destination register according to the result. This operation is performed in the Increment Unit, so no subtraction is required and large data are compared faster. To compare for greater, simply swap the source operands or negate the result of CMPLE. The comparison is currently valid only for unsigned values.
Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_CMPL | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
scmpl.b r1,r2,r3 : r3 = 0x00000000000000FF
scmpl.b r2,r1,r3 : r3 = 0x000000FF00000000
cmpl r1,r2,r3 : r3 = 0x0000000000000000
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
CoMPare for Lower or Equal
cmple r3, r2, r1 scmple r3, r2, r1 |
Compares the two source operands and sets or clears the destination register according to the result. This operation is performed in the Increment Unit, so no subtraction is required and large data are compared faster. To compare for greater or equal, simply swap the source operands or negate the result of CMPL. The comparison is currently valid only for unsigned values.
Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_CMPLE | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
scmple.b r1,r2,r3 : r3 = 0xFFFFFF00FFFFFFFF
scmple.b r2,r1,r3 : r3 = 0xFFFFFFFFFFFFFF00
cmple r1,r2,r3 : r3 = 0x0000000000000000
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
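The compare instructions produce a mask rather than a flag. The model below covers both "lower" and "lower or equal". The operand order (a lane is set when the second operand is lower than the first) is inferred from the example values of these two sections, and returning all ones for a true scalar comparison is an assumption, since the examples only show the false case. The helper name `cmp_lanes` is hypothetical.

```python
def cmp_lanes(first, second, lane_bits=8, width=64, simd=False, or_equal=False):
    """Model of 'cmpl first, second, dest' (or cmple when or_equal is True)."""
    if not simd:
        lane_bits = width               # scalar form: one lane of full width
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, width, lane_bits):
        a = (first >> shift) & mask
        b = (second >> shift) & mask
        if b < a or (or_equal and a == b):
            result |= mask << shift     # true: the lane is set to all ones
    return result
```

Note that in the "lower or equal" case the lanes where both operands are zero also compare as true, which explains the 0xFF bytes in otherwise empty lanes.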
CoMPare for Lower with Immediate
cmpli Imm8, r2, r1 scmpli Imm8, r2, r1 |
Similar to CMPL, but with an immediate operand (that is not sign-extended): compares the two source operands and sets or clears the destination register according to the result. The comparison is currently valid only for unsigned values.
Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_CMPLI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
scmpli.b 0x04,r1,r2 : r2 = 0xFFFFFF00FFFFFFFF
cmpli 0x04,r1,r2 : r2 = 0x0000000000000000
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
CoMPare for Lower or Equal with Immediate
cmplei Imm8, r2, r1 scmplei Imm8, r2, r1 |
Similar to CMPLE, but with an immediate operand (that is not sign-extended): compares the two source operands and sets or clears the destination register according to the result. The comparison is currently valid only for unsigned values.
Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_CMPLEI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
scmplei.b 0x04,r1,r2 : r2 = 0xFFFFFF00FFFFFFFF
cmplei 0x04,r1,r2 : r2 = 0x0000000000000000
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
ABSolute value
abs r2, r1 sabs r2, r1 |
Computes r1 = (not(r2) + 1) if MSB(r2)==1 else r1 = r2
This instruction negates the source operand in a special unit that is designed for low latency when large data are processed. If the sign bit (MSB) of the source is set (the number is negative), the negated value is written back to the register set; otherwise (the number is already positive) the negation is cancelled and the source value is kept.
This instruction is designed to work in the two's-complement number system (signed integer numbers) and is not subject to saturation/overflow problems.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_ABS | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Example :
R1 contains 0xFF05891213450100 (in a 64-bit system)
sabs.b r1,r2 : r2 = 0x0105771213450100
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
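A per-lane sketch of the rule "negate only when the sign bit is set", with a hypothetical function name. Lanes whose MSB is clear are copied unchanged, as in the example.

```python
def sabs(value, lane_bits=8, width=64):
    """Two's-complement absolute value of each lane of value."""
    mask = (1 << lane_bits) - 1
    sign = 1 << (lane_bits - 1)
    result = 0
    for shift in range(0, width, lane_bits):
        lane = (value >> shift) & mask
        if lane & sign:                 # negative lane: not(lane) + 1
            lane = (-lane) & mask
        result |= lane << shift
    return result
```

In this sketch the most negative lane value (0x80 for bytes) maps to itself, because its positive counterpart does not fit in the lane.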
MAXimum
max r3, r2, r1 smax r3, r2, r1 |
Computes r1 = r3 if ( r2 < r3 ) else r1 = r2
Compares the two source operands and writes the maximum of the two values to the destination register. The comparison is currently valid only for unsigned values, so this instruction can't be used for IEEE floating point data.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MAX | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
smax.b r1,r2,r3 : r3 = 0x0000000700000003
max r1,r2,r3 : r3 = 0x0000000700000001
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
MINimum
min r3, r2, r1 smin r3, r2, r1 |
Computes r1 = r3 if ( r2 > r3 ) else r1 = r2
Compares the two source operands and writes the minimum of the two values to the destination register. The comparison is currently valid only for unsigned values, so this instruction can't be used for IEEE floating point data.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MIN | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
smin.b r1,r2,r3 : r3 = 0x0000000500000001
min r1,r2,r3 : r3 = 0x0000000500000003
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
MAXimum Immediate
maxi Imm8, r2, r1 smaxi Imm8, r2, r1 |
Computes r1 = Imm8 if ( r2 < Imm8 ) else r1 = r2
Compares the two source operands and writes the maximum of the two values to the destination register. The comparison is currently valid only for unsigned values, so this instruction can't be used for IEEE floating point data.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_MAXI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R2 contains 0x0000000500000003 (in a 64-bit system)
smaxi.b 0x04,r2,r3 : r3 = 0x0404040504040404
maxi 0x04,r2,r3 : r3 = 0x0000000500000003
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
MINimum Immediate
mini Imm8, r2, r1 smini Imm8, r2, r1 |
Computes r1 = Imm8 if ( r2 > Imm8 ) else r1 = r2
Compares the two source operands and writes the minimum of the two values to the destination register. The comparison is currently valid only for unsigned values, so this instruction can't be used for IEEE floating point data.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_MINI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Examples :
R2 contains 0x0000000500000003 (in a 64-bit system)
smini.b 0x04,r2,r3 : r3 = 0x0000000400000003
mini 0x04,r2,r3 : r3 = 0x0000000000000004
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
SORT
sort r3, r2, r1 ssort r3, r2, r1 |
Computes { r1 = r3 , r1+1 = r2 } if ( r2 > r3 ) else { r1 = r2 , r1+1 = r3 }
Compares the two source operands and writes the minimum of the two values to the destination register and the maximum to destination register+1. The comparison is currently valid only for unsigned values, so this instruction can't be used for IEEE floating point data. This instruction is of the 2r2w form.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SORT | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001
ssort.b r1,r2,r3 : r3 = 0x0000000500000001 , r4 = 0x0000000700000003
sort r1,r2,r3 : r3 = 0x0000000500000003 , r4 = 0x0000000700000001
Performance (FC0 only) :
Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
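sort is equivalent to computing min and max at once. The sketch below returns both registers, using an unsigned comparison per lane in the SIMD form and over the whole register in the scalar form; the function name `sort_lanes` is hypothetical.

```python
def sort_lanes(a, b, lane_bits=8, width=64, simd=False):
    """Return (minimum, maximum), i.e. the values written to r1 and r1+1."""
    if not simd:
        lane_bits = width               # scalar form: compare whole registers
    mask = (1 << lane_bits) - 1
    lo = hi = 0
    for shift in range(0, width, lane_bits):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        lo |= min(x, y) << shift        # unsigned minimum of the lane
        hi |= max(x, y) << shift        # unsigned maximum of the lane
    return lo, hi
```

The first element of the pair is also what min would write, and the second what max would write, for the same operands.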
SHIFT Left logical
shiftl r3, r2, r1 sshiftl r3, r2, r1 |
Computes r1 = r2 << r3.
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SHIFTL | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
SHIFT Right logical
shiftr r3, r2, r1 sshiftr r3, r2, r1 |
Computes r1 = r2 >> r3
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SHIFTR | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
SHIFT Right Arithmetic
shiftra r3, r2, r1 sshiftra r3, r2, r1 |
Computes r1 = r2 >> r3, preserving the sign.
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SHIFTRA | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
ROTation Left
rotl r3, r2, r1 srotl r3, r2, r1 |
Computes r1 = r2 <@ r3
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_ROTL | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
ROTation Right
rotr r3, r2, r1 srotr r3, r2, r1 |
Computes r1 = r2 @> r3
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_ROTR | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11-13 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
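The two rotations (written <@ and @> above) can be modelled in C as shown below. This is a sketch assuming 64-bit scalar registers and a count truncated to 6 bits; the function names are hypothetical.

```c
#include <stdint.h>

/* rotl : bits leaving on the left re-enter on the right.
   ASSUMPTION: 64-bit registers, count truncated to 6 bits. */
uint64_t fc_rotl(uint64_t r2, uint64_t r3)
{
    unsigned n = (unsigned)(r3 & 63u);
    /* n == 0 is handled apart: shifting a 64-bit value by 64 is undefined in C. */
    return n ? (r2 << n) | (r2 >> (64u - n)) : r2;
}

/* rotr : bits leaving on the right re-enter on the left. */
uint64_t fc_rotr(uint64_t r2, uint64_t r3)
{
    unsigned n = (unsigned)(r3 & 63u);
    return n ? (r2 >> n) | (r2 << (64u - n)) : r2;
}
```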
SHIFT Left Immediate
shiftli Imm8, r2, r1 sshiftli Imm8, r2, r1 |
Computes r1 = r2 << Imm8
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_SHIFTLI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
SHIFT Right logical Immediate
shiftri Imm8, r2, r1 sshiftri Imm8, r2, r1 |
Computes r1 = r2 >> Imm8
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_SHIFTRI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
SHIFT Right Arithmetic Immediate
shiftrai Imm8, r2, r1 sshiftrai Imm8, r2, r1 |
Computes r1 = r2 >> Imm8 and preserves the sign
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_SHIFTRAI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
ROTate Left Immediate
rotli Imm8, r2, r1 srotli Imm8, r2, r1 |
Computes r1 = r2 <@ Imm8
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_ROTLI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
ROTate Right Immediate
rotri Imm8, r2, r1 srotri Imm8, r2, r1 |
Computes r1 = r2 @> Imm8
The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_ROTRI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
Performance (FC0 only) :
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
single BIT OPeration
bitop[x/s/c/t] r3, r2, r1 sbitop[x/s/c/t] r3, r2, r1 bchg r3, r2, r1 bset r3, r2, r1 bclr r3, r2, r1 btst r3, r2, r1 sbchg r3, r2, r1 sbset r3, r2, r1 sbclr r3, r2, r1 sbtst r3, r2, r1 |
Computes r1 = F(function, r2, 1 << r3)
In the shifter, a 1 is shifted left r3 times and combined with the second operand (r2) according to the function F defined below :
Function number : | Logical function : | Operation : | Opcode : |
00 | OR | Bit Set | bset or bitops |
01 | ANDN | Bit Clear | bclr or bitopc |
10 | XOR | Bit Change | bchg or bitopx |
11 | AND | Bit Mask | btst or bitopt |
The value of r3 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_BITOP | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12-13 | x, c, t or s | 00-11 | F |
Example :
R1 contains 0x08
R2 contains 0xFF05891213450100 (in a 64-bit system)
bchg r1,r2,r3 : r3 = 0xFF05891213450000
bset r1,r2,r3 : r3 = 0xFF05891213450100
bclr r1,r2,r3 : r3 = 0xFF05891213450000
btst r1,r2,r3 : r3 = 0x0000000000000100
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
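The four bitop functions can be summarised by the following C sketch, which reproduces the example above. It assumes 64-bit scalar registers and a bit index truncated to 6 bits; the enum and function names are hypothetical, and the enum values follow the function numbers of the table.

```c
#include <stdint.h>

/* Function numbers as listed in the table (00 = OR ... 11 = AND). */
enum fc_bitop_fn {
    FC_BITOP_SET = 0,  /* OR   : bset / bitops */
    FC_BITOP_CLR = 1,  /* ANDN : bclr / bitopc */
    FC_BITOP_CHG = 2,  /* XOR  : bchg / bitopx */
    FC_BITOP_TST = 3   /* AND  : btst / bitopt */
};

/* Computes F(function, r2, 1 << r3).
   ASSUMPTION: 64-bit registers, bit index truncated to 6 bits. */
uint64_t fc_bitop(enum fc_bitop_fn f, uint64_t r2, uint64_t r3)
{
    uint64_t mask = (uint64_t)1 << (r3 & 63u);
    switch (f) {
    case FC_BITOP_SET: return r2 | mask;
    case FC_BITOP_CLR: return r2 & ~mask;
    case FC_BITOP_CHG: return r2 ^ mask;
    case FC_BITOP_TST: return r2 & mask;
    }
    return r2; /* not reached for valid function numbers */
}
```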
single BIT OPeration Immediate
bitop[x/s/c/t]i Imm6, r2, r1 sbitop[x/s/c/t]i Imm6, r2, r1 bchgi Imm6, r2, r1 bseti Imm6, r2, r1 bclri Imm6, r2, r1 btsti Imm6, r2, r1 sbchgi Imm6, r2, r1 sbseti Imm6, r2, r1 sbclri Imm6, r2, r1 sbtsti Imm6, r2, r1 |
Computes r1 = F(function, r2, 1 << Imm6)
In the shifter, a 1 is shifted left Imm6 times and combined with the second operand (r2) according to the function F defined below :
F : | Logical function : | Operation : | Opcode : |
00 | OR | Bit Set | bseti or bitopsi |
01 | ANDN | Bit Clear | bclri or bitopci |
10 | XOR | Bit Change | bchgi or bitopxi |
11 | AND | Bit Mask | btsti or bitopti |
One of the great practical advantages of this instruction is that it makes it possible
to create SIMD constants with few instructions. This is why the immediate field is
reduced to 6 bits. For example :
sbseti.d 0x01,r0,r1 ; r1 = 0x0002000200020002
sbseti.d 0x04,r1,r2 ; r2 = 0x0012001200120012
The value of Imm6 is truncated to the number of bits needed by the bit shuffler unit.
size : | 8 | 4 | 2 | 6 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 13 | 14 19 | 20 25 | 26 31 |
function : | OP_BITOPI | Flags | F | Imm6 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Defines the size parameter | |
10 | s- prefix | 1 if set | Defines if the operation is SIMD |
11 | (none yet) | 0 | Reserved |
12-13 | x, c, t or s | 00-11 | F |
Example :
R2 contains 0xFF05891213450100 (in a 64-bit system)
bchgi 0x08,r2,r3 : r3 = 0xFF05891213450000
bseti 0x08,r2,r3 : r3 = 0xFF05891213450100
bclri 0x08,r2,r3 : r3 = 0xFF05891213450000
btsti 0x08,r2,r3 : r3 = 0x0000000000000100
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
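The SIMD constant construction described for this instruction can be checked with a short C model. This sketch covers only the SIMD bit-set form with the .d size, taken here to mean 16-bit chunks (as the example results suggest). It also assumes that the bit index is truncated to the 4 bits needed for a 16-bit chunk. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* SIMD bit set on four 16-bit chunks of a 64-bit register:
   the same bit is set in every chunk.
   ASSUMPTION: the index is truncated to 4 bits for 16-bit chunks. */
uint64_t fc_sbseti_d(unsigned imm6, uint64_t r2)
{
    assert(imm6 < 64u);                 /* the field is 6 bits wide */
    unsigned n = imm6 & 15u;
    /* One set bit per chunk; n <= 15 so no bit crosses a chunk boundary. */
    uint64_t mask = 0x0001000100010001ULL << n;
    return r2 | mask;
}
```

Two calls, first with index 1 on a zero register and then with index 4 on the result, build the constant 0x0012 in every chunk.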
bitwise LOGIC
logic.xxxx r1, r2, r3 or r1, r2, r3 orn r1, r2, r3 and r1, r2, r3 andn r1, r2, r3 xor r1, r2, r3 nxor r1, r2, r3 not r1, r2, r3 nor r1, r2, r3 nand r1, r2, r3 |
Computes r3 = f(r1,r2) where f is a logic function whose truth table is defined in the flags.
Remark : XOR should be used to compare two numbers for equality, instead of sub.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_LOGIC | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Values | Function |
8-9 | [qdb] | Size flags |
10 | [01] | f(0,0) |
11 | [01] | f(1,0) |
12 | [01] | f(0,1) |
13 | [01] | f(1,1) |
or is an alias for logic.0111
and is an alias for logic.0001
xor is an alias for logic.0110
not is an alias for logic.1010
nor is an alias for logic.1000
nand is an alias for logic.1110
Execution Unit : ROP2 Unit
Latency : 1 cycle
Throughput : 1 result per cycle per ROP2.
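The way the four truth-table flags select a logic function can be sketched in C. The sketch assumes that the four digits of logic.xxxx are written in the order of flag bits 10 to 13, that is f(0,0), f(1,0), f(0,1), f(1,1), which is consistent with the aliases listed above. The function name and the packing of the digits into the tt argument are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Applies a two-input logic function bit by bit.
   tt holds the four digits of logic.xxxx read as a binary number,
   e.g. 0x7 for logic.0111 (or).
   ASSUMPTION: digit order is f(0,0), f(1,0), f(0,1), f(1,1). */
uint64_t fc_logic(unsigned tt, uint64_t r1, uint64_t r2)
{
    assert(tt < 16u);
    uint64_t r3 = 0;
    if (tt & 8u) r3 |= ~r1 & ~r2;  /* f(0,0) */
    if (tt & 4u) r3 |=  r1 & ~r2;  /* f(1,0) */
    if (tt & 2u) r3 |= ~r1 &  r2;  /* f(0,1) */
    if (tt & 1u) r3 |=  r1 &  r2;  /* f(1,1) */
    return r3;
}
```

With this model, logic.0110 (xor) returns zero exactly when both operands are equal, which is the equality test recommended in the remark above.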
bitwise LOGIC Immediate
logici.xxxx Imm8, r2, r3 andi Imm8, r2, r3 andni Imm8, r2, r3 ori Imm8, r2, r3 xori Imm8, r2, r3 |
Computes r3 = f(Imm8,r2) where f is a logic function selected by the flags.
Because there is less room than in the register form of the instruction, the logic functions are reduced to four. I have chosen the same logic functions as in the bitop instructions. However, the SIMD flag is sorely missing. The function could perhaps be encoded in the opcode instead.
size : | 8 | 4 | 8 | 6 | 6 |
bits : | 0 7 | 8 11 | 12 19 | 20 25 | 26 31 |
function : | OP_LOGICI | Flags | Imm8 | Reg 2 | Reg 1 |
Flags | Values | Function |
8-9 | [qdb] | Size flags |
10-11 | [xtcs] | logic function |
ori is an alias for logici.s
andi is an alias for logici.t
xori is an alias for logici.x
andni is an alias for logici.c
Execution Unit : ROP2 Unit
Latency : 1 cycle
Throughput : 1 result per cycle per ROP2.
MIX
mixl r3, r2, r1 mixh r3, r2, r1 |
Mixes two halves of r3 and r2 and puts the result into r1.
Depending on the h flag, the lower or higher parts of r3 and r2 are interleaved. The size of the source chunks is determined by the size flags. This instruction is useful to interleave words in a "butterfly" fashion or to reverse a small matrix. It can also simply be used to create an extended form of the result of an addition with carry.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MIX | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size flag | |
10-11 | (none yet) | 0 | Reserved |
12 | -l or -h postfix | 0 for -l, 1 for -h | High flag |
13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0001020304050607 (in a 64-bit system)
R2 contains 0x08090A0B0C0D0E0F
mixl.d r1,r2,r3 : r3 = 0x04050C0D06070E0F
mixh.d r1,r2,r4 : r4 = 0x0001080902030A0B
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
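The interleaving performed by mixl.d and mixh.d can be reproduced with the C sketch below, which matches the two examples above. It assumes 64-bit registers and that the .d size means 16-bit chunks (as the example results suggest). The function name is hypothetical, and a and b stand for the first and second assembler operands.

```c
#include <assert.h>
#include <stdint.h>

/* Interleaves the 16-bit chunks of the low (high == 0) or high (high == 1)
   halves of a and b. In each pair, the chunk of a lands in the upper slot.
   ASSUMPTION: 64-bit registers, .d = 16-bit chunks. */
uint64_t fc_mix_d(uint64_t a, uint64_t b, int high)
{
    assert(high == 0 || high == 1);
    unsigned base = high ? 2u : 0u;  /* index of the first chunk of the half */
    uint64_t out = 0;
    for (unsigned i = 0; i < 2u; i++) {
        uint64_t ca = (a >> (16u * (base + i))) & 0xFFFFu;
        uint64_t cb = (b >> (16u * (base + i))) & 0xFFFFu;
        out |= cb << (32u * i);        /* chunk of b in the lower slot */
        out |= ca << (32u * i + 16u);  /* chunk of a in the upper slot */
    }
    return out;
}
```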
EXPAND
expandl r3, r2, r1 expandh r3, r2, r1 |
Mixes chunks of r3 and r2 and puts the result into two halves of r1.
This is the reverse operation of the mix instruction. Depending on the h flag, the lower or higher part of r3 and r2 are interleaved. The size of the source chunks is determined by the size flags.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_EXPAND | Flags | Reg 3 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size flag | |
10-11 | (none yet) | 0 | Reserved |
12 | -l or -h postfix | 0 for -l, 1 for -h | High flag |
13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0001020304050607 (in a 64-bit system)
R2 contains 0x08090A0B0C0D0E0F
expandl.b r1,r2,r3 : r3 = 0x09010B030D050F07
expandh.b r1,r2,r4 : r4 = 0x08000A020C040E06
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
Simd DUPlication
sdup r2, r1 |
Duplicates the lower part of r2 and puts the result in r1. The size of the destination SIMD chunks is determined by the size flags.
size : | 8 | 6 | 6 | 6 | 6 |
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_SDUP | Flags | 0 | Reg 2 | Reg 1 |
Flags | Syntax | Values | Function |
8-9 | .q, .d or .b postfix | Size flag | |
10-13 | (none yet) | 0 | Reserved |
Examples :
R1 contains 0x0001020304050607 (in a 64-bit system)
sdup.b r1,r2 : r2 = 0x0707070707070707
sdup.d r1,r3 : r3 = 0x0607060706070607
sdup.q r1,r4 : r4 = 0x0405060704050607
Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
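The duplication performed by sdup can be modelled in a few lines of C, matching the three examples above. The sketch assumes 64-bit registers and takes the chunk width in bits (8 for .b, 16 for .d, 32 for .q, as the example results suggest). The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Replicates the lowest chunk of r2 across a 64-bit register.
   ASSUMPTION: chunk_bits is 8 (.b), 16 (.d) or 32 (.q). */
uint64_t fc_sdup(uint64_t r2, unsigned chunk_bits)
{
    assert(chunk_bits == 8u || chunk_bits == 16u || chunk_bits == 32u);
    uint64_t chunk = r2 & (((uint64_t)1 << chunk_bits) - 1u);
    uint64_t out = 0;
    for (unsigned pos = 0; pos < 64u; pos += chunk_bits)
        out |= chunk << pos;
    return out;
}
```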
Load/Store
There are different levels of implementation of floating point operations.
Level | Instructions implemented |
Level 0 | No FP |
Level 1 | fadd, fsub, fmul, int2f/f2int, finv_app, sqrt_inv_app |
Level 2 | fadd, fsub, fmul, int2f/f2int, finv, sqrt |
Level 3 | fadd, fsub, fmul, int2f/f2int, div, finv, sqrt, sqrt_inv |
Floating Point Addition
fadd r1, r2, r3 |
fadd performs a floating addition of the two source operands (r1 + r2) and puts the result in destination operand (r3). The operation should be IEEE-754 compliant.
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FADD | FLAGS | Reg1 | Reg2 | Reg3 |
Flags | Values | Function |
8-9 | [f??] | Defines the size parameter |
10 | [s] | Defines if the operation is SIMD |
11 | [x] | Defines if IEEE compliance isn't required |
Floating Point Subtraction
fsub r1, r2, r3 |
fsub performs a floating subtraction of the two source operands (r1 - r2) and puts the result in destination operand (r3). The operation should be IEEE-754 compliant.
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FSUB | FLAGS | Reg1 | Reg2 | Reg3 |
Flags | Values | Function |
8-9 | [f??] | Defines the size parameter |
10 | [s] | Defines if the operation is SIMD |
11 | [x] | Defines if IEEE compliance isn't required |
Floating Point Multiplication
fmul[f] r1, r2, r3 |
fmul performs a floating multiplication of the two source operands (r1 x r2) and puts the result in destination operand (r3). The operation should be IEEE-754 compliant.
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FMUL | FLAGS | Reg1 | Reg2 | Reg3 |
Flags | Values | Function |
8-9 | [f??] | Defines the size parameter |
10 | [s] | Defines if the operation is SIMD |
11 | [x] | Defines if IEEE compliance isn't required |
Floating Point Division
fdiv r1, r2, r3 |
fdiv performs a floating division of the two source operands (r1 / r2) and puts the result in destination operand (r3). The operation should be IEEE-754 compliant.
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_FDIV | FLAGS | Reg1 | Reg2 | Reg3 |
Flags | Values | Function |
8-9 | [f??] | Defines the size parameter |
10 | [s] | Defines if the operation is SIMD |
11 | [x] | Defines if IEEE compliance isn't required |
Integer to Floating Point and Floating Point to Integer
int2f r1, r2 |
f2int r1, r2 |
int2f converts the integer number in register r1 into a floating point number and puts it in register r2.
f2int converts the floating point number in register r1 into an integer number and puts it in register r2.
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_FCONV | EMPTY | Reg1 | Reg2 |
Flags | Values | Function |
8-9 | [f??] | Defines the size parameter |
10 | | Direction flag |
11 | [s] | Defines if the operation is SIMD (*) |
12 | [x] | Defines if IEEE compliance isn't required (*) |
13-15 | | Rounding mode (see table below) |
Value | Rounding mode |
000 | Nearest (default) |
001 | Towards 0 |
010 | Away from 0 |
011 | Towards -infinity |
100 | Towards +infinity |
Floating Point Inverse
finv r1, r2 |
Computes r2 = 1/r1
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_FINV | EMPTY | Reg1 | Reg2 |
Flags | Values | Function |
8-9 | [f??] | Defines the size parameter |
10 | [s] | Defines if the operation is SIMD |
11 | [x] | Defines if IEEE compliance isn't required |
Floating Point Square Root
fsqrt r1, r2 |
Computes r2 = sqrt(r1)
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_FSQRT | EMPTY | Reg1 | Reg2 |
Flags | Values | Function |
8-9 | [f??] | Defines the size parameter |
10 | [s] | Defines if the operation is SIMD |
11 | [x] | Defines if IEEE compliance isn't required |
Floating Point Inverse Square Root
finvsqrt r1, r2 |
Computes r2 = 1/sqrt(r1)
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_FINVSQRT | EMPTY | Reg1 | Reg2 |
Flags | Values | Function |
8-9 | [f??] | Defines the size parameter |
10 | [s] | Defines if the operation is SIMD |
11 | [x] | Defines if IEEE compliance isn't required |
Reverses the bits of r1, shifts them right by r2 bits and puts the result in r3.
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_BITREV | FLAGS | Reg1 | Reg2 | Reg3 |
Reverses the bits of r1, shifts the result right by imm6 bits and puts the result in r3.
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_BITREVI | FLAGS | imm6 | Reg1 | Reg2 |
Reverses the bytes of r1 (changes the endianness) and stores the result in r2.
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_BYTEREV | EMPTY | Reg1 | Reg2 |
Flags | Values | Function |
8-9 | [qdb] | Defines the size parameter |
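The bit reversal followed by a right shift, and the byte reversal, can be sketched in C as shown below. The sketch assumes 64-bit registers and a shift count truncated to 6 bits (the text above does not state the truncation). The function names are hypothetical.

```c
#include <stdint.h>

/* Reverses the 64 bits of r1, then shifts the result right by r2 bits.
   ASSUMPTION: 64-bit registers, shift count truncated to 6 bits. */
uint64_t fc_bitrev(uint64_t r1, uint64_t r2)
{
    uint64_t rev = 0;
    for (int i = 0; i < 64; i++) {
        rev = (rev << 1) | (r1 & 1u);  /* lowest bit of r1 becomes highest */
        r1 >>= 1;
    }
    return rev >> (r2 & 63u);
}

/* Reverses the 8 bytes of r1 (endianness swap of a 64-bit value). */
uint64_t fc_byterev(uint64_t r1)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        out = (out << 8) | (r1 & 0xFFu);
        r1 >>= 8;
    }
    return out;
}
```

The combined shift is what makes the instruction convenient for index reversal: reversing only the k low bits of an index amounts to a full reversal followed by a right shift of 64-k bits.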
Load
load [r1 + r2 * size], r3 |
r3 = [r1 + r2 * size]
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_LOAD | FLAGS | Reg1 | Reg2 | Reg3 |
Flags | Values | Function |
8-9 | [qdb] | Defines the size parameter |
10 | [e] | Defines the endianness : little-endian if cleared (default), big-endian if set |
11-13 | | RESERVED |
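The scaled addressing and the endianness flag of the load instruction can be illustrated with the following C sketch, which reads from a plain byte array standing in for memory. The function name, the size_bytes parameter (the number of bytes selected by the size flags) and the zero extension of the result are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Models r3 = [r1 + r2 * size] on a byte array.
   ASSUMPTIONS: size_bytes is 1, 2, 4 or 8; the result is zero-extended. */
uint64_t fc_load(const uint8_t *mem, uint64_t r1, uint64_t r2,
                 unsigned size_bytes, int big_endian)
{
    assert(size_bytes == 1u || size_bytes == 2u ||
           size_bytes == 4u || size_bytes == 8u);
    uint64_t addr = r1 + r2 * size_bytes;  /* index scaled by the data size */
    uint64_t value = 0;
    for (unsigned i = 0; i < size_bytes; i++) {
        /* Little-endian: first byte is the least significant one. */
        unsigned shift = big_endian ? 8u * (size_bytes - 1u - i) : 8u * i;
        value |= (uint64_t)mem[addr + i] << shift;
    }
    return value;
}
```

Scaling the index by the data size lets r2 be used directly as an array index, whatever the element width.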
Store
store r1, [r2 + r3 * size] |
[r2 + r3 * size] = r1
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_STORE | FLAGS | Reg1 | Reg2 | Reg3 |
Flags | Values | Function |
8-9 | [qdb] | Defines the size parameter |
10 | [e] | Defines the endianness : little-endian if cleared (default), big-endian if set |
11-13 | | RESERVED |
Move
mov [r1,] r2, r3 |
if (r1) r3 = r2
bits : | 0 7 | 8 13 | 14 19 | 20 25 | 26 31 |
function : | OP_MOV | FLAGS | Reg1 | Reg2 | Reg3 |
Flags | Values | Function |
8-9 | [qdb] | Defines the size parameter |
10-11 | [sz] | Defines how the high part of the destination register is filled (see table below) |
Flag | Values | Function |
(default) | 00 | High part remains unchanged |
z | 01 | Zero extend |
s | 10 | Sign extend |
? | 11 | Reserved |
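The conditional move and its three extension modes can be sketched in C as follows. The sketch assumes 64-bit registers and that the move is taken when r1 is non-zero. The function name, the size_bits parameter (the data width selected by the size flags) and the ext parameter (the value of flag bits 10-11) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models "if (r1) r3 = r2" and returns the new value of r3.
   ext: 0 = high part unchanged, 1 = zero extend, 2 = sign extend.
   ASSUMPTION: 64-bit registers; size_bits is 8, 16, 32 or 64. */
uint64_t fc_mov(uint64_t r1, uint64_t r2, uint64_t r3,
                unsigned size_bits, unsigned ext)
{
    assert(size_bits == 8u || size_bits == 16u ||
           size_bits == 32u || size_bits == 64u);
    assert(ext <= 2u);                      /* 3 is reserved */
    if (r1 == 0) return r3;                 /* condition false: r3 is kept */
    if (size_bits == 64u) return r2;        /* full register, nothing to extend */
    uint64_t low_mask = ((uint64_t)1 << size_bits) - 1u;
    uint64_t low = r2 & low_mask;
    switch (ext) {
    case 1:  return low;                                        /* zero extend */
    case 2:  return (low >> (size_bits - 1u)) ? (low | ~low_mask) : low;
    default: return (r3 & ~low_mask) | low;                     /* unchanged  */
    }
}
```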
Load Constant
loadcons imm16, r1 |
Loads the imm16 constant into the register r1 at the position selected by the shift parameter (in steps of 16 bits). The rest of the register remains unmodified.
Flags | Values | Function |
8-9 | [123] | Defines the shift parameter |
bits : | 0 7 | 8 9 | 10 25 | 26 31 |
function : | OP_LOADCONS | EMPTY | imm16 | Reg1 |
Load Constant with Sign Extension
loadconsx imm16, r1 |
Loads the imm16 constant into the register r1 at the position selected by the shift parameter (in steps of 16 bits). The higher part of the register is filled with copies of the most significant bit of the constant (sign extension). The lower part of the register remains unmodified.
Flags | Values | Function |
8-9 | [123] | Defines the shift parameter |
bits : | 0 7 | 8 9 | 10 25 | 26 31 |
function : | OP_LOADCONSX | EMPTY | imm16 | Reg1 |
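The effect of loadcons and loadconsx on a register can be modelled with the C sketch below. It assumes 64-bit registers and a shift parameter from 0 to 3 counting 16-bit positions (0 being the default, lowest position). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* loadcons : replaces one 16-bit field, the rest of r1 is kept.
   ASSUMPTION: 64-bit registers, shift in 0..3 selects the 16-bit position. */
uint64_t fc_loadcons(uint64_t r1, uint16_t imm16, unsigned shift)
{
    assert(shift < 4u);
    unsigned pos = 16u * shift;
    return (r1 & ~((uint64_t)0xFFFFu << pos)) | ((uint64_t)imm16 << pos);
}

/* loadconsx : same, but the bits above the field receive the sign of imm16
   and only the bits below the field are kept. */
uint64_t fc_loadconsx(uint64_t r1, uint16_t imm16, unsigned shift)
{
    assert(shift < 4u);
    unsigned pos = 16u * shift;
    uint64_t below = ((uint64_t)1 << pos) - 1u;      /* bits under the field */
    uint64_t ext = imm16;
    if (imm16 & 0x8000u) ext |= ~(uint64_t)0xFFFFu;  /* sign extend to 64 bits */
    return (r1 & below) | (ext << pos);
}
```

A sign-extended constant is thus built from its highest significant 16-bit piece with loadconsx, followed by one loadcons per remaining lower piece.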
/* LOADCONST.C by WHYGEE, 14 September 1999
   To be included in a compiler or an assembler, after some interface
   fixing : it currently outputs to stderr, it will output to a file
   the same way. */

#include <stdlib.h>
#include <stdio.h>

#define MAXSIZE (sizeof(long long int)) /* should be ideally 8 */

void emit_constant(unsigned long long int c, unsigned char reg)
{
  unsigned short int data[MAXSIZE>>1];  /* the constant, cut in 16-bit pieces */
  signed long long int t,u;
  signed int s=0;

  if (reg==0) {
    fprintf(stderr,"\n Error : can't write to register 0 \n");
    exit(-1); /* should be performed by an error routine that does this cleanly */
  }

  if (c==0) {
    fprintf(stderr,"mov r%d,r0\n",reg);
  }
  else if (c==~0ULL) {  /* all bits set */
    fprintf(stderr,"logic.1111 r%d,r0,r0\n",reg);
  }
  else if ((c>65535) && ((c & -c)==c)) /* a power of two, but the latency of bitset is higher */
  {
    do {
      s++;
      c>>=1;
    } while (c!=0); /* find the LSB */
    if (s>63) {
      fprintf(stderr,"loadconsts r%d,0x%04X\n",reg,s);
      fprintf(stderr,"bitset r%d,r0,r%d\n",reg,reg);
    }
    else {
      fprintf(stderr,"bitset r%d,r0,%d\n",reg,s);
    }
  }
  else /* any kind of number */
  {
    u=c;
    do {
      t=u;
      data[s]=t & 0xFFFF;
      u=t>>16;
      s++;
    } while ((t!=u) && (s<(int)(MAXSIZE>>1)));
    s--;

    /* handle the case where the MSB of the highest data is not the sign */
    if ((data[s]^data[s-1]) & 0x8000) {
      fprintf(stderr,"loadconsts.%d r%d,0x%04X\n", s,reg,data[s]);
      s--;
      fprintf(stderr,"loadconst.%d r%d,0x%04X\n", s,reg,data[s]);
      s--;
    }
    else {
      s--;
      fprintf(stderr,"loadconsts.%d r%d,0x%04X\n", s,reg,data[s]);
      s--;
    }

    while (s>=0) {
      fprintf(stderr,"loadconst.%d r%d,0x%04X\n", s,reg,data[s]);
      s--;
    }
  }
}
Cache Memory Management
Prefetches or flushes a data block to or from a given memory level.
cachemm r1, r2 |
Flags | Values | Function |
8-9 | [qdb] | Defines the size parameter |
10 | [fp] | Prefetch/Flush |
11 | [l] | Lock. This flag means that the data are static and will be used a lot |
12-14 | [0-7] | Memory level (see table below) |
D | 000 | data L1 cache |
I | 001 | instructions L1 cache |
C | 010 | onchip unified cache |
011 | [unused] | |
U | 100 | offchip unified cache |
L | 101 | local memory |
G | 110 | global memory |
V | 111 | virtual memory (hard disk) |
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_CACHEMM | EMPTY | Reg1 | Reg2 |
Example : "flushg ra,rb" flushes rb bytes starting at address ra from every memory level up to global memory. Any cache (L1, L2, local...) containing data that belongs to the block is written back to main memory and the corresponding cache space is freed (available for future use). This should be executed every time the programmer knows that a block of data will not be used for a while; the cache level is a hint for performance.
"preftchu ra,rb" copies the data block at address ra and of size rb that is present in lower memory levels (virtual, global, local) to the unified offchip memory (at least).
Forms : rr or ri (the size could be an immediate).
These instructions are very important for memory management, and should be used when performing SMC (for memory coherency).
Load Immediate
loadi [r1 + imm9 * size], r2 |
r2 = [r1 + imm9 * size]
bits : | 0 7 | 8 10 | 11 19 | 20 25 | 26 31 |
function : | OP_LOADI | FLAGS | imm9 | Reg1 | Reg2 |
Flags | Values | Function |
8-9 | [qdb] | Defines the size parameter |
10 | [e] | Defines the endianness : little-endian if cleared (default), big-endian if set |
Store Immediate
storei r1, [r2 + imm9 * size] |
[r2 + imm9 * size] = r1
bits : | 0 7 | 8 10 | 11 19 | 20 25 | 26 31 |
function : | OP_STOREI | FLAGS | imm9 | Reg1 | Reg2 |
Flags | Values | Function |
8-9 | [qdb] | Defines the size parameter |
10 | [e] | Defines the endianness : little-endian if cleared (default), big-endian if set |
Get and Put internal register.
R/W | Description |
R | Number of cycles |
R | Number of cycles (countdown) |
R | Number of instructions executed |
R | Number of Pages Faults |
R | Number of traps/interrupts |
R | Number of FPU traps |
R | Number of Cache hit/misses |
R | Number of correct/incorrect branch predictions |
R | Number of pipeline bubbles |
R | Number of TLB hits/misses |
R/W | Description |
RW | Old Program Counter |
RW | Old Machine Status Word |
RW | Exception Vector |
RW | Temporary |
R | Exception Reason |
R | Exception Number/Type |
R/W | Description |
R | Processor ID |
Get Internal Register
get IR[r1], r2 |
Get internal register at index r1 and put its content in register r2. The whole register gets dumped. There is no size flag.
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_GET | EMPTY | Reg1 | Reg2 |
PUT Internal Register
put r1, IR[r2] |
Puts the contents of r1 into the internal register at index r2. The whole register is written. There is no size flag.
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_PUT | EMPTY | Reg1 | Reg2 |
Get Internal Register Immediate
geti IR[imm16], r1 |
Get internal register at index imm16 and put its content in register r1. The whole register gets dumped. There is no size flag.
bits : | 0 7 | 8 9 | 10 25 | 26 31 |
function : | OP_GETI | EMPTY | imm16 | Reg1 |
Put Internal Register Immediate
puti r1, IR[imm16] |
Puts the contents of r1 into the internal register at index imm16. The whole register is written. There is no size flag.
bits : | 0 7 | 8 9 | 10 25 | 26 31 |
function : | OP_PUTI | EMPTY | imm16 | Reg1 |
Absolute Jump.
jmpa [r1,] r2 |
If r1 contains a non-zero value, jumps to the address pointed to by r2.
Flags | Values | Function |
8 | [n] | Negates the condition |
9-10 | [lm] | Test the MSB or the LSB |
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_JMPA | EMPTY | Reg1 | Reg2 |
Load Address
loadaddr imm18, r1 |
Stores PC+imm18 into r1.
The result is a 64-bit address. imm18 is a signed value.
bits : | 0 7 | 8 25 | 26 31 |
function : | OP_LOADADDR | imm18 | Reg1 |
Loop Entry
loopentry r1 |
Stores PC+4 into r1.
bits : | 0 7 | 8 25 | 26 31 |
function : | OP_LOOPENTRY | EMPTY | Reg1 |
Loop
loop r1, r2 |
Performs two parallel things :
This overlapping of the operations allows greater parallelism and lower latency : we can loop fast without compromising security.
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_LOOP | EMPTY | Reg1 | Reg2 |
Absolute Jump Immediate.
jmpi [r1,] r2, imm12 |
If r1 contains a non-zero value, jumps to the address pointed to by r2+4*imm12.
bits : | 0 7 | 8 19 | 20 25 | 26 31 |
function : | OP_JMPI | imm12 | Reg1 | Reg2 |
Relative Jump Immediate.
jmpr [r1,] imm18 |
The imm18 is a signed value. Warning! All code is aligned on a 32-bit boundary, so the imm18 value is shifted left by 2 bits.
bits : | 0 7 | 8 25 | 26 31 |
function : | OP_JMPR | imm18 | Reg1 |
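The target computation of the relative jump can be sketched in C as follows. The sketch sign-extends the 18-bit field and shifts it left by 2 bits, as described above. It assumes that the offset is added to the address of the jump instruction itself, which the text does not specify (it could equally be the address of the next instruction). The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Computes the target of "jmpr imm18" for a jump located at address pc.
   ASSUMPTION: the offset is relative to the jump instruction itself. */
uint64_t fc_jmpr_target(uint64_t pc, uint32_t imm18)
{
    assert(imm18 < (1u << 18));                      /* raw 18-bit field */
    uint64_t off = imm18;
    if (off & 0x20000u) off |= ~(uint64_t)0x3FFFFu;  /* sign extend bit 17 */
    return pc + (off << 2);                          /* instructions are 4 bytes */
}
```

With an 18-bit signed field counted in instructions, the reachable range is 2^17 instructions (512 KB) backward and just under that forward.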
syscall [r1,] imm18 |
trap [r1,] imm18 |
syscall and trap are two names for the same instruction.
[FIX ME] But what does it do ?
The argument is ignored by the hardware and may be used to encode information for system software. To retrieve the argument system software must load the instruction word from memory.
bits : | 0 7 | 8 25 | 26 31 |
function : | OP_SYSTRAP | imm18 | Reg1 |
halt [r1,] [imm18] |
Halts until an External Exception occurs
bits : | 0 7 | 8 25 | 26 31 |
function : | OP_HALT | imm18 | Reg1 |
rfe [r1,] [imm18] |
Return From Exception...
[FIX ME] Little short...
bits : | 0 7 | 8 25 | 26 31 |
function : | OP_RFE | imm18 | Reg1 |
Type 1 | Software exception |
Type 2 | External exception (interrupt) |
Type 3 | Privilege Violation |
Type 4 | Memory Error |
Type 5 | Syscall |
[FIXME] How many different types do we need, each one corresponding to one pointer in the exception vector (except for the hardware, which takes all the rest)?
The exception pointer vector has 64 entries. [FIXME] Confirm that.
SR_OPC | Old Program Counter |
SR_OMSW | Old Machine Status Word |
SR_EV | Exception Vector |
SR_TMP | Temporary |
SR_ER | Exception Reason |
SR_ENT | Exception Number/Type |
1 | External Interrupt |
2 | Illegal Opcode |
3 | Malformed Instruction |
4 | Privilege Violation |
5 | Integer divide by ZERO |
6 | FP divide by ZERO |
7 | FP INF-INF |
8 | FP INF/INF |
9 | FP ZERO/ZERO |
10 | FP ZERO*INF |
11 | FP SQRT(NEG) |
12 | Memory Exception |
Upon the occurrence of an exception, the processor performs the following:
Upon Call to the RFE instruction: