F-CPU Design Team
FCPU MANUAL REV. 0.1

 

       Part 6 :

 

 

 

F-CPU Instruction Set draft

 

 

 

 

 


 

 


 

6.1 Data Manipulation :

6.1.1 Core Arithmetic :

 


6.1.1.1 add :

ADDition

add r3, r2, r1

       Computes r1 = r2 + r3

       add performs an integer addition of the two source operands (r3 + r2) and puts the result in the destination operand (r1).

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_ADD Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12 -s postfix 1 if set Saturation flag
13 -c postfix 1 if set Carry flag ( 2r2w )

 

Examples :

Scalar :

R1 contains 0xF8 (we only consider the lower byte in the registers)
R2 contains 0x0F

add.b r1,r2,r3 : r3 = 0x07 (default behaviour)
adds.b r1,r2,r3 : r3 = 0xFF (saturation)
addc.b r1,r2,r3 : r3 = 0x07, r4= 0x01 (carry)

SIMD :

R1 contains 0x000000F800000001 (in a 64-bit system)
R2 contains 0x0000000F00000002

sadd.b r1,r2,r3 : r3 = 0x0000000700000003 (default behaviour)
sadds.b r1,r2,r3 : r3 = 0x000000FF00000003 (saturation)
saddc.b r1,r2,r3 : r3 = 0x0000000700000003 , r4= 0x0000000100000000 (carry)
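
The flag combinations above can be modelled in a few lines of Python. This is only a sketch of the behaviour shown in the examples, not the FC0 implementation: the name `simd_add` and its parameters are made up for illustration, and every lane is treated as unsigned.

```python
def simd_add(a, b, bits=8, width=64, saturate=False):
    """Lane-wise unsigned add. Returns (result, carry), both packed like the operands."""
    lane_mask = (1 << bits) - 1
    result = carry = 0
    for shift in range(0, width, bits):
        total = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        overflow = total >> bits              # 1 if this lane overflowed
        lane = lane_mask if (saturate and overflow) else (total & lane_mask)
        result |= lane << shift
        carry |= overflow << shift            # carry goes to the same lane of the second result
    return result, carry

r1, r2 = 0x000000F800000001, 0x0000000F00000002
res, carry = simd_add(r1, r2)                 # models sadd.b / saddc.b
sat, _ = simd_add(r1, r2, saturate=True)      # models sadds.b
print(f"{res:#018x} {carry:#018x} {sat:#018x}")
```

Calling it with `width=8` models the scalar byte examples.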

Performance (FC0 only) :

Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.

 


6.1.1.2 sub :

SUBtraction

sub r3, r2, r1

       Computes r1 = r3 - r2

       sub performs an integer subtraction of the two source operands (r3 - r2) and puts the result in the destination operand (r1).

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_SUB Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12 -f postfix 1 if set Floor flag
13 -b postfix 1 if set Borrow flag ( 2r2w )

 

Examples :

Scalar :

R1 contains 0x05 (we only consider the lower byte in the registers)
R2 contains 0x07

sub.b r1,r2,r3 : r3 = 0xFE (default behaviour)
subf.b r1,r2,r3 : r3 = 0x00 (floor)
subb.b r1,r2,r3 : r3 = 0xFE, r4= 0xFF (borrow)

SIMD :

R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001

ssub.b r1,r2,r3 : r3 = 0x000000FE00000002 (default behaviour)
ssubf.b r1,r2,r3 : r3 = 0x0000000000000002 (floor)
ssubb.b r1,r2,r3 : r3 = 0x000000FE00000002, r4= 0x000000FF00000000 (borrow)
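
A small Python model of the floor and borrow flags, written to match the examples only. The name `simd_sub` is hypothetical. Two things are assumptions read off the examples: the result is the first operand minus the second, and a lane that underflows reports an all-ones borrow (0xFF for bytes).

```python
def simd_sub(a, b, bits=8, width=64, floor=False):
    """Lane-wise unsigned a - b. Returns (result, borrow)."""
    lane_mask = (1 << bits) - 1
    result = borrow = 0
    for shift in range(0, width, bits):
        x = (a >> shift) & lane_mask
        y = (b >> shift) & lane_mask
        underflow = x < y
        lane = 0 if (floor and underflow) else (x - y) & lane_mask
        result |= lane << shift
        if underflow:
            borrow |= lane_mask << shift      # all-ones borrow lane, as in the examples
    return result, borrow

r1, r2 = 0x0000000500000003, 0x0000000700000001
print([hex(v) for v in simd_sub(r1, r2)])               # models ssubb.b
print(hex(simd_sub(r1, r2, floor=True)[0]))             # models ssubf.b
```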

Performance (FC0 only) :

Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.

 


6.1.1.3 mul :

MULtiplication

mul r3, r2, r1

       Computes r1 = r2 x r3

       mul performs an integer multiplication of the two source operands (r3 x r2) and puts the result in the destination operand (r1). The size flags indicate the size of the source operands.

       Remark : multiplication is slow and costly. Where possible, use power-of-two multipliers so that the operation reduces to a shift of the source operand, which takes only one cycle in the FC0.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_MUL Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12 -s postfix 1 if set Sign flag
13 -h postfix 1 if set High flag ( 2r2w )

 

Examples :

Scalar :

R1 contains 0x23 (we only consider the lower byte in the registers)
R2 contains 0x36

mul.b r1,r2,r3 : r3 = 0x62 (default)
mulh.b r1,r2,r3 : r3 = 0x62 , r4 = 0x07 (High flag)

SIMD :

R1 contains 0x00 00 00 00 00 00 00 00 (in a 64-bit system)
R2 contains 0x00 00 00 00 00 00 00 00

smul.b r1,r2,r3 : r3 = 0x00 00 00 00 00 00 00 00
smulh.b r1,r2,r3 : r3 = 0x00 00 00 00 00 00 00 00 , r4 = 0x00 00 00 00 00 00 00 00
[To be completed later, once all the errors have been corrected]
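
The scalar example and the power-of-two remark can be checked with the following sketch. The helper names `mul_lane` and `mul_pow2` are invented for this illustration and only the unsigned case is covered (the Sign flag is not modelled).

```python
def mul_lane(a, b, bits=8):
    """Unsigned multiply of two `bits`-wide values. Returns (low half, high half)."""
    mask = (1 << bits) - 1
    product = (a & mask) * (b & mask)
    return product & mask, (product >> bits) & mask

def mul_pow2(a, k, bits=8):
    """Multiply by 2**k with a single shift, keeping only the low half."""
    return (a << k) & ((1 << bits) - 1)

low, high = mul_lane(0x23, 0x36)      # models mulh.b: low half in r3, high half in r4
print(hex(low), hex(high))
```

Here `mul_pow2(a, k)` gives the same low half as `mul_lane(a, 1 << k)`, which is why a shift can replace the multiplier in that case.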

Performance (FC0 only) :

Execution Unit : Integer Multiply Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably 1 operation per cycle per IMU (pipelined multiplier).

 


6.1.1.4 div :

DIVision

div r3, r2, r1

       Computes r1 = r3 / r2

       div performs an integer division of the two source operands (r3 / r2) and puts the result in destination operand (r1). The size defined by the size flags corresponds to the size of the source operands.

       This instruction triggers a math fault if the Reg2 operand is cleared (=0). This behaviour could be avoided with saturating arithmetic.

       Remark : division is slow and costly. Where possible, use power-of-two divisors so that the operation reduces to a shift of the source operand, which takes only one cycle in the FC0.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_DIV Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12 -s postfix 1 if set Sign flag
13 -m postfix 1 if set Modulo flag ( 2r2w )

 

Examples :

Scalar :

R1 contains 0x10 (we only consider the lower byte in the registers)
R2 contains 0x05

div.b r1,r2,r3 : r3 = 0x03
divm.b r1,r2,r3 : r3 = 0x03 , r4 = 0x01
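
As a rough model of the example, the sketch below returns the quotient and the remainder together and raises an exception where the text speaks of a math fault. `MathFault` and `div_lane` are hypothetical names; the divisor is the second operand, as in the example, and only unsigned values are handled.

```python
class MathFault(Exception):
    """Stand-in for the math fault raised on a zero divisor."""

def div_lane(dividend, divisor, bits=8):
    """Unsigned division on `bits`-wide values. Returns (quotient, remainder)."""
    mask = (1 << bits) - 1
    dividend &= mask
    divisor &= mask
    if divisor == 0:
        raise MathFault("divisor is cleared")
    return dividend // divisor, dividend % divisor

print(div_lane(0x10, 0x05))   # models divm.b: quotient in r3, remainder in r4
```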

Performance (FC0 only) :

Execution Unit : Integer Divide Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably equal to the latency (not pipelined).

 


6.1.2 Optional Arithmetic :


6.1.2.1 addi :

ADDition Immediate

addi Imm8, r2, r1

       Computes r1 = r2 + Imm8.

       This instruction is similar to the ``add'' instruction but it takes one of the source operands from the opcode and sign-extends it (subi ???). It has less room for the options and flags, so the usage of the reserved bit is still being discussed.

       Remark : with wide operands, the latency may be higher than expected because the adder would use the full pipeline. To add or subtract 1 from a large number (more than 8 bits), it is recommended to use the inc/dec instructions (when available) because they use the increment unit, which has a lower latency.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_ADDI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved

 

Examples :

R2 contains 0x00F80F00F045FF82 (in a 64-bit system)

addi.b 0x87,r2,r3 : r3 = 0x00F80F00F045FF09
addi.d 0x87,r2,r3 : r3 = 0x00F80F00F0450009
saddi.b 0x87,r2,r3 : r3 = 0x877F968777CC8609
saddi.d 0x87,r2,r3 : r3 = 0x017F0F87F0CC0009
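
The four results above can be reproduced with the sketch below. Note the assumption: the immediate is zero-extended here, because that is what the example values imply, even though the description speaks of sign-extension (the question is still open). The name `addi` and its keyword arguments are illustrative only; `.b` is taken as 8 bits and `.d` as 16 bits.

```python
def addi(imm8, src, bits=64, simd=False, width=64):
    """Add an 8-bit immediate to the low `bits` bits (scalar) or to every lane (SIMD)."""
    lane_mask = (1 << bits) - 1
    imm = imm8 & 0xFF                      # zero-extended: an assumption taken from the examples
    if not simd:
        upper = src & ~lane_mask & ((1 << width) - 1)   # bits above the operand size pass through
        return upper | ((src + imm) & lane_mask)
    result = 0
    for shift in range(0, width, bits):
        lane = ((src >> shift) & lane_mask) + imm
        result |= (lane & lane_mask) << shift
    return result

r2 = 0x00F80F00F045FF82
for bits, simd in ((8, False), (16, False), (8, True), (16, True)):
    print(f"{addi(0x87, r2, bits=bits, simd=simd):#018x}")
```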

Performance (FC0 only) :

Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.

 


6.1.2.2 subi :

SUBtraction Immediate

subi Imm8, r2, r1

       Computes r1 = r2 - Imm8.

       This instruction is similar to the ``sub'' instruction but it takes one of the source operands from the opcode (Imm8) and sign-extends it. It has less room for the options and flags, so the usage of the reserved bit is still being discussed.

       Remark : with wide operands, the latency may be higher than expected because the adder would use the full pipeline. To add or subtract 1 from a large number (more than 8 bits), it is recommended to use the inc/dec instructions (when available) because they use the increment unit, which has a lower latency.

       Problem : it is not yet certain that Imm8 will be sign-extended before being sent to the Xbar, since other instructions may need an 8-bit operand that is not sign-extended. If it is sign-extended, the assembler could simply alias subi to addi with the immediate data negated.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_SUBI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Add/Sub Unit
Latency : 1 cycle for 8-bit data, 2 cycles for 16-bit to 64-bit data
Throughput : 1 operation per cycle per ASU.

 


6.1.2.3 muli :

MULtiplication Immediate

muli imm8, r2, r1

       Computes r1 = r2 x imm8.

       This instruction is similar to the ``mul'' instruction but it takes one of the source operands from the opcode (Imm8) and sign-extends it. It has less room for the options and flags, so the usage of the reserved bit is still being discussed.

       Remark : multiplication is slow and costly. Where possible, use power-of-two multipliers so that the operation reduces to a shift of the source operand, which takes only one cycle in the FC0.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_MULI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Integer Multiply Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably 1 operation per cycle per IMU (pipelined multiplier).

 


6.1.2.4 divi :

DIVision Immediate

divi imm8, r2, r1

       Computes r1 = r2 / Imm8.

       This instruction is similar to ``div'' but the second operand is the sign-extended value of Imm8. It triggers a math fault if Imm8 is cleared (=0).

       Remark : division is slow and costly. Where possible, use power-of-two divisors so that the operation reduces to a shift of the source operand, which takes only one cycle in the FC0.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_DIVI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Integer Divide Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably equal to the latency (not pipelined).

 


6.1.2.5 mod :

MODulo

mod r3, r2, r1

       Computes r1 = r3 % r2

       mod performs an integer modulo of the two source operands (r3 % r2) and puts the result in destination operand (r1).

       This instruction triggers a math fault if the Reg2 operand is cleared (=0). This behaviour could be avoided with saturating arithmetic.

       Remark : the modulo computation is slow and costly. Where possible, use power-of-two moduli so that the operation reduces to masking off the upper bits of the source operand, which takes only one cycle in the FC0.
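
For unsigned values, the power-of-two shortcut in the remark amounts to keeping the low bits. A minimal sketch (the name `mod_pow2` is made up for illustration):

```python
def mod_pow2(x, k):
    """x mod 2**k for an unsigned x: keep the k low bits, clear everything above."""
    return x & ((1 << k) - 1)

print(mod_pow2(0x36, 4))   # 0x36 mod 16
```

This identity holds for unsigned operands only; signed operands would need separate handling.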

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_MOD Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12 -s postfix 1 if set Signed flag
13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Integer Divide Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably equal to the latency (not pipelined).

 


6.1.2.6 modi :

MODulo Immediate

modi Imm8, r2, r1

       Computes r1 = r2 % Imm8

       modi performs an integer modulo of the two source operands (r2 % Imm8) and puts the result in destination operand (r1). Imm8 is sign extended (?).

       This instruction triggers a math fault if the Imm8 operand is cleared (=0). This behaviour could be avoided with saturating arithmetic.

       Remark : the modulo computation is slow and costly. Where possible, use power-of-two moduli so that the operation reduces to masking off the upper bits of the source operand, which takes only one cycle in the FC0.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_MODI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Integer Divide Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably equal to the latency (not pipelined).

 


6.1.2.7 mac :

Multiply and ACcumulate

mac r3, r2, r1

       Computes r1 = r1 + ( r2 x r3 )

       mac performs an integer multiplication of the two source operands (r3 x r2) and adds the result to the destination operand (r1). The size flags indicate the size of the source operands; the "granularity" of the destination operand is twice this size if the hardware can do it.

       Remark : This instruction is mostly used in computation kernels that involve some kind of convolution or frequency analysis. It will be extended later as the needs get clearer. The behaviour of the accumulation on overflow is still undefined, so scale the input values so that the dynamic range is not exceeded in the computation loop. There is no "sticky saturation" either.

       Remark 2 : this instruction reads three operands and therefore is a 3r1w operation that is not in the core. Its implementation depends on architectural parameters.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_MAC Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12 -s postfix 1 if set Sign flag
13 -h postfix 1 if set High flag

 

Example :

Scalar :

R1 contains 0x23 (we only consider the lower byte in the registers)
R2 contains 0x36
R3 contains 0x0136

mac.b r1,r2,r3 : r3 = 0x0898

[To be completed later, once all the other errors have been corrected]
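
The accumulation in the scalar example can be worked through with the sketch below. The name `mac` is used here for a plain Python function, not the instruction itself. The accumulator is twice as wide as the sources, as described above; wrapping on overflow is an assumption of this sketch, since the real overflow behaviour is still undefined.

```python
def mac(acc, a, b, bits=8):
    """acc + a * b with `bits`-wide unsigned sources and a 2*bits-wide accumulator."""
    src_mask = (1 << bits) - 1
    acc_mask = (1 << (2 * bits)) - 1          # destination granularity is twice the source size
    product = (a & src_mask) * (b & src_mask)
    return (acc + product) & acc_mask         # wrap-around is an assumption, not specified behaviour

product = 0x23 * 0x36
print(hex(product), hex(mac(0x0136, 0x23, 0x36)))
```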

Performance (FC0 only) :

Execution Unit : Integer Multiply Unit then Add/Sub Unit
Latency : unknown ATM, depends on the size of the operands.
Throughput : unknown ATM, probably 1 operation per cycle per IMU+ASU (pipelined multiplier and adder).

 


6.1.2.8 popcount :

POPulation COUNT

popcount r2, r1

       Computes r1 = nb_bits(r2)

       popcount counts the number of set bits in r2 and writes the result to the destination operand (r1). The size flags indicate the size of the source operands.

       Remark : This instruction is not going to be supported by the first F-CPU chips because it requires a specialized unit that is not yet designed and integrated in the FC0. It requires a separate Execution Unit that is a crossover between the Inc Unit and the Add/Sub Unit, but it does not provide enough useful instructions (as the Inc Unit does) to justify the high transistor count in FC0. Anyway, it is going to be implemented at one time or another and a lot of algorithms benefit from this instruction so the opcode is reserved for the future.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_POPC Flags 0 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Example :

R1 contains 0x0123456789ABCDEF

popcount r1,r2 : r2 = 0x0000000000000020
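
One classic way to count bits in a logarithmic number of steps is to sum neighbouring fields of doubling width. The sketch below shows that software technique for 64-bit values; it is not a description of the future hardware unit, and the name `popcount64` is chosen for illustration.

```python
def popcount64(x):
    """Count set bits by pairwise summing: 6 steps for a 64-bit value."""
    x &= (1 << 64) - 1
    masks = [0x5555555555555555, 0x3333333333333333, 0x0F0F0F0F0F0F0F0F,
             0x00FF00FF00FF00FF, 0x0000FFFF0000FFFF, 0x00000000FFFFFFFF]
    for step, mask in enumerate(masks):
        width = 1 << step                     # field width being summed: 1, 2, 4, ... 32
        x = (x & mask) + ((x >> width) & mask)
    return x

print(hex(popcount64(0x0123456789ABCDEF)))
```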

Performance (FC0 only) :

Execution Unit : Unknown
Latency : unknown, but it's O(log2(size)) if you wanted to know (just in case you're not a spook).
Throughput : unknown.

 


6.1.3 Optional increment-based :


6.1.3.1 inc :

INCrement

inc r2, r1

       Computes r1 = r2 + 1

       This instruction increments the source operand in a special unit that is designed for low latency when large data are processed. The value wraps around when reaching the maximum value.

       In the future, the increment value could be specified so keep the reserved fields cleared.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_INC Flags 0 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Example :

R1 contains 0xFF05891213450100 (in a 64-bit system)

sinc.b r1,r2 : r2 = 0x00068A1314460201
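
The wrap-around in each lane can be seen in the following sketch. The name `simd_inc` is illustrative, and the `step` parameter is a hypothetical convenience of this model (a step of -1 gives the behaviour of a lane-wise decrement); it is not a field of the instruction.

```python
def simd_inc(x, bits=8, width=64, step=1):
    """Add `step` to every lane, wrapping inside the lane."""
    lane_mask = (1 << bits) - 1
    result = 0
    for shift in range(0, width, bits):
        lane = ((x >> shift) & lane_mask) + step
        result |= (lane & lane_mask) << shift  # 0xFF + 1 wraps to 0x00, 0x00 - 1 wraps to 0xFF
    return result

r1 = 0xFF05891213450100
print(f"{simd_inc(r1):#018x}")
print(f"{simd_inc(r1, step=-1):#018x}")
```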

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.2 dec :

DECrement

dec r2, r1

       Computes r1 = r2 - 1

       This instruction decrements the source operand in a special unit that is designed for low latency when large data are processed. The value wraps around when reaching the minimum value.

       In the future, the decrement value could be specified so keep the reserved fields cleared.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_DEC Flags 0 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Example :

R1 contains 0xFF05891213450100 (in a 64-bit system)

sdec.b r1,r2 : r2 = 0xFE048811124400FF

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.3 neg :

NEGation

neg r2, r1

       Computes r1 = not(r2) + 1

       This instruction negates the source operand in a special unit that is designed for low latency when large data are processed.

       This instruction is designed to work in the two's-complement number system (signed integer numbers) and is not subject to saturation/overflow problems.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_NEG Flags 0 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Example :

R1 contains 0xFF05891213450100 (in a 64-bit system)

sneg.b r1,r2 : r2 = 0x01FB77EEEDBBFF00

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.4 lsb1 :

lsb1 r2, r1

       Computes r1 = scan_for_lsb(r2)

       This instruction scans the source operand (r2) for the first set bit, starting from the LSB, and writes the position of this bit to the destination register (r1). If the source is cleared, the result is zero, otherwise the bit #0 is counted as position 1.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_LSB1 Flags 0 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Example :

R1 contains 0xFF05891213450100 (in a 64-bit system)

lsb1 r1,r2 : r2 = 0x9

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.5 lsb0 :

lsb0 r2, r1

       Computes r1 = scan_for_lsb(not(r2))

       This instruction scans the source operand (r2) for the first reset bit, starting from the LSB, and writes the position of this bit to the destination register (r1). If the source is set (all ones), the result is zero, otherwise the bit #0 is counted as position 1.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_LSB0 Flags 0 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Example :

R1 contains 0xFF05891213450100 (in a 64-bit system)

lsb0 r1,r2 : r2 = 0x1

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
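
The two scans from the LSB side can be sketched as follows, using the convention from the text (bit #0 is position 1, result 0 when nothing is found). The function names mirror the mnemonics but are plain Python helpers written for this illustration.

```python
def lsb1(x, bits=64):
    """Position of the lowest set bit (bit #0 counts as 1), or 0 if x is cleared."""
    x &= (1 << bits) - 1
    if x == 0:
        return 0
    return (x & -x).bit_length()      # x & -x isolates the lowest set bit

def lsb0(x, bits=64):
    """Position of the lowest reset bit, or 0 if x is all ones."""
    return lsb1(~x, bits)             # scan the complement for a set bit

r1 = 0xFF05891213450100
print(hex(lsb1(r1)), hex(lsb0(r1)))
```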

 


6.1.3.6 msb1 :

msb1 r2, r1

       Computes r1 = scan_for_lsb(bitrev(r2))

       This instruction scans the source operand (r2) for the first set bit, starting from the MSB, and writes the position of this bit to the destination register (r1). If the source is cleared, the result is zero, otherwise the bit #0 is counted as position 1.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_MSB1 Flags 0 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Example :

R1 contains 0xFF05891213450100 (in a 64-bit system)

msb1 r1,r2 : r2 = 0x40

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.7 msb0 :

msb0 r2, r1

       Computes r1 = scan_for_lsb(not(bitrev(r2)))

       This instruction scans the source operand (r2) for the first reset bit, starting from the MSB, and writes the position of this bit to the destination register (r1). If the source is set (all ones), the result is zero, otherwise the bit #0 is counted as position 1.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_MSB0 Flags 0 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Example :

R1 contains 0xFF05891213450100 (in a 64-bit system)

msb0 r1,r2 : r2 = 0x38

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.
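
A sketch of the two scans from the MSB side. It follows the prose and the examples, where the reported position is still counted from bit #0 (so a set bit 63 gives 0x40); that reading is an interpretation of the draft. The helper names are illustrative.

```python
def msb1(x, bits=64):
    """Position of the highest set bit (bit #0 counts as 1), or 0 if x is cleared."""
    return (x & ((1 << bits) - 1)).bit_length()

def msb0(x, bits=64):
    """Position of the highest reset bit, or 0 if x is all ones."""
    return msb1(~x, bits)             # scan the complement for a set bit

r1 = 0xFF05891213450100
print(hex(msb1(r1)), hex(msb0(r1)))
```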

 


6.1.3.8 cmpl :

CoMPare for Lower

cmpl r3, r2, r1

       Compares the two source operands and sets or clears the destination register according to the result. The operation is performed in the Increment Unit, so no subtraction is required and large data are compared faster. To compare for greater, simply swap the source operands or negate the result of CMPLE. For now, the comparison is valid only for unsigned values.

       Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_CMPL Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Examples :

R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001

scmpl.b r1,r2,r3 : r3 = 0x00000000000000FF
scmpl.b r2,r1,r3 : r3 = 0x000000FF00000000
cmpl r1,r2,r3 : r3 = 0x0000000000000000
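
The examples suggest that a lane is set to all ones when the second operand is lower than the first, and cleared otherwise. The sketch below encodes that reading, which is inferred from the example values rather than stated in the text. `simd_cmpl` and the `or_equal` switch (modelling the "lower or equal" variant) are hypothetical names.

```python
def simd_cmpl(first, second, bits=8, width=64, or_equal=False):
    """Per lane: all ones if second < first (or <= with or_equal), else zero. Unsigned."""
    lane_mask = (1 << bits) - 1
    result = 0
    for shift in range(0, width, bits):
        a = (first >> shift) & lane_mask
        b = (second >> shift) & lane_mask
        if b < a or (or_equal and b == a):
            result |= lane_mask << shift
    return result

r1, r2 = 0x0000000500000003, 0x0000000700000001
print(f"{simd_cmpl(r1, r2):#018x}")           # models scmpl.b r1,r2,r3
print(f"{simd_cmpl(r2, r1):#018x}")           # swapped operands
print(f"{simd_cmpl(r1, r2, bits=64):#018x}")  # full-width compare
```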

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.9 cmple :

CoMPare for Lower or Equal

cmple r3, r2, r1

       Compares the two source operands and sets or clears the destination register according to the result. The operation is performed in the Increment Unit, so no subtraction is required and large data are compared faster. To compare for greater or equal, simply swap the source operands or negate the result of CMPL. For now, the comparison is valid only for unsigned values.

       Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_CMPLE Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Examples :

R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001

scmple.b r1,r2,r3 : r3 = 0xFFFFFF00FFFFFFFF
scmple.b r2,r1,r3 : r3 = 0xFFFFFFFFFFFFFF00
cmple r1,r2,r3 : r3 = 0x0000000000000000

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.10 cmpli :

CoMPare for Lower with Immediate

cmpli Imm8, r2, r1

       Like CMPL, but with an immediate operand (that is not sign-extended): compares the two source operands and sets or clears the destination register according to the result. For now, the comparison is valid only for unsigned values.

       Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_CMPLI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved

 

Examples :

R1 contains 0x0000000500000003 (in a 64-bit system)

scmpli.b 0x04,r1,r2 : r2 = 0xFFFFFF00FFFFFFFF
cmpli 0x04,r1,r2 : r2 = 0x0000000000000000

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.11 cmplei :

CoMPare for Lower or Equal with Immediate

cmplei Imm8, r2, r1

       Like CMPLE, but with an immediate operand (that is not sign-extended): compares the two source operands and sets or clears the destination register according to the result. For now, the comparison is valid only for unsigned values.

       Remark : this instruction can't be used for IEEE floating point data (the comparison is not signed).

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_CMPLEI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved

 

Examples :

R1 contains 0x0000000500000003 (in a 64-bit system)

scmplei.b 0x04,r1,r2 : r2 = 0xFFFFFF00FFFFFFFF
cmplei 0x04,r1,r2 : r2 = 0x0000000000000000

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.12 abs :

ABSolute value

abs r2, r1

       Computes r1 = (not(r2) + 1) if MSB(r2)==1, else r1 = r2

       This instruction negates the source operand in a special unit that is designed for low latency when large data are processed. If the sign bit (MSB) of the source is set (the number is negative), the negated value is written back to the register set; otherwise (it is already positive) the negated result is cancelled and the source value is kept unchanged.

       This instruction is designed to work in the two's-complement number system (signed integer numbers) and is not subject to saturation/overflow problems.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_ABS Flags 0 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Example :

R1 contains 0xFF05891213450100 (in a 64-bit system)

sabs.b r1,r2 : r2 = 0x0105771213450100
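
The lane-wise behaviour of the example can be modelled as below. `simd_abs` is an illustrative name; lanes are read as two's-complement values of the given size.

```python
def simd_abs(x, bits=8, width=64):
    """Negate every lane whose sign bit is set; leave the other lanes unchanged."""
    lane_mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    result = 0
    for shift in range(0, width, bits):
        lane = (x >> shift) & lane_mask
        if lane & sign_bit:
            lane = ((lane ^ lane_mask) + 1) & lane_mask   # not(lane) + 1
        result |= lane << shift
    return result

print(f"{simd_abs(0xFF05891213450100):#018x}")
```

In this sketch the most negative lane value (0x80 for bytes) maps to itself, since its magnitude does not fit in a signed lane of the same size.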

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.13 max :

MAXimum

max r3, r2, r1

       Computes r1 = r3 if ( r2 < r3 ) else r1 = r2

       Compares the two source operands and writes the maximum of the two values to the destination register. For now, the comparison is valid only for unsigned values, so this instruction can't be used for IEEE floating-point data.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_MAX Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Examples :

R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001

smax.b r1,r2,r3 : r3 = 0x0000000700000003
max r1,r2,r3 : r3 = 0x0000000700000001

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.14 min :

MINimum

min r3, r2, r1

       Computes r1 = r3 if ( r2 > r3 ) else r1 = r2

       Compares the two source operands and writes the minimum of the two values to the destination register. For now, the comparison is valid only for unsigned values, so this instruction can't be used for IEEE floating-point data.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_MIN Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Examples :

R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001

smin.b r1,r2,r3 : r3 = 0x0000000500000001
min r1,r2,r3 : r3 = 0x0000000500000003

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.15 maxi :

MAXimum Immediate

maxi Imm8, r2, r1

       Computes r1 = Imm8 if ( r2 < Imm8 ) else r1 = r2

       Compares the two source operands and writes the maximum of the two values to the destination register. For now, the comparison is valid only for unsigned values, so this instruction can't be used for IEEE floating-point data.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_MAXI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved

 

Examples :

R2 contains 0x0000000500000003 (in a 64-bit system)
smaxi.b 0x04,r2,r3 : r3 = 0x0404040504040404
maxi 0x04,r2,r3 : r3 = 0x0000000500000003

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.16 mini :

MINimum Immediate

mini Imm8, r2, r1

       Computes r1 = Imm8 if ( r2 > Imm8 ) else r1 = r2

       Compares the two source operands and writes the minimum of the two values to the destination register. For now, the comparison is valid only for unsigned values, so this instruction can't be used for IEEE floating-point data.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_MINI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved

 

Examples :

R2 contains 0x0000000500000003 (in a 64-bit system)
smini.b 0x04,r2,r3 : r3 = 0x0000000400000003
mini 0x04,r2,r3 : r3 = 0x0000000000000004

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 


6.1.3.17 sort :

SORT

sort r3, r2, r1

       Computes { r1 = r3 , r1+1 = r2 } if ( r2 > r3 ) else { r1 = r2 , r1+1 = r3 }

       Compares the two source operands and writes the minimum of the two values to the destination register and the maximum to destination register+1. The comparison is currently valid only for unsigned values, so this instruction cannot be used on IEEE floating point data. This instruction is of the 2r2w form.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_SORT Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Examples :

R1 contains 0x0000000500000003 (in a 64-bit system)
R2 contains 0x0000000700000001

ssort.b r1,r2,r3 : r3 = 0x0000000500000001 , r4 = 0x0000000700000003
sort r1,r2,r3 : r3 = 0x0000000500000003 , r4 = 0x0000000700000001
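
A minimal C sketch of the scalar form, under the assumption that the two results can be modelled as two output variables (lo for the destination register, hi for destination register+1). The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of scalar sort: *lo receives the minimum (r1),
   *hi receives the maximum (r1+1). Both comparisons are unsigned. */
void sort_model(uint64_t r3, uint64_t r2, uint64_t *lo, uint64_t *hi)
{
    if (r2 > r3) {
        *lo = r3;
        *hi = r2;
    } else {
        *lo = r2;
        *hi = r3;
    }
}
```

Fed with the two register values of the scalar example, the model yields 0x0000000500000003 as the minimum and 0x0000000700000001 as the maximum.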

Performance (FC0 only) :

Execution Unit : Increment Unit
Latency : 1 cycle
Throughput : 1 per cycle per IU.

 

 


6.1.4 Core Shift and Rotate :

 


6.1.4.1 shiftl :

SHIFT Left logical

shiftl r3, r2, r1

       Computes r1 = r2 << r3.

       The value of r3 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_SHIFTL Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.4.2 shiftr :

SHIFT Right logical

shiftr r3, r2, r1

       Computes r1 = r2 >> r3

       The value of r3 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_SHIFTR Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.4.3 shiftra :

SHIFT Right Arithmetic

shiftra r3, r2, r1

       Computes r1 = r2 >> r3, preserving the sign.

       The value of r3 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_SHIFTRA Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
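
The three register shifts can be modelled in C as below. This is a sketch under stated assumptions: the operation is scalar on a 64-bit register, and "truncated to the number of bits needed" is read as keeping the low 6 bits of the shift count. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: a 64-bit shifter keeps only the low 6 bits of the count. */
#define SHIFT_MASK 63u

uint64_t shiftl_model(uint64_t count, uint64_t value)
{
    return value << (count & SHIFT_MASK);
}

uint64_t shiftr_model(uint64_t count, uint64_t value)
{
    return value >> (count & SHIFT_MASK);
}

uint64_t shiftra_model(uint64_t count, uint64_t value)
{
    unsigned n = (unsigned)(count & SHIFT_MASK);
    uint64_t logical = value >> n;
    /* Mask covering the n topmost bits vacated by the shift. */
    uint64_t vacated = ~(~UINT64_C(0) >> n);
    /* Fill the vacated bits with copies of the sign bit. */
    return (value >> 63) ? (logical | vacated) : logical;
}
```

The arithmetic shift is written with unsigned operations only, because right-shifting a negative signed value is implementation-defined in C.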

 


6.1.4.4 rotl :

ROTation Left

rotl r3, r2, r1

       Computes r1 = r2 <@ r3

       The value of r3 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_ROTL Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.4.5 rotr :

ROTation Right

rotr r3, r2, r1

       Computes r1 = r2 @> r3

       The value of r3 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_ROTR Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.
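
The <@ and @> rotation operators used above have no direct C equivalent. A possible C model, assuming a scalar 64-bit register and a rotation count truncated to 6 bits (function names are hypothetical):

```c
#include <stdint.h>

/* Hypothetical model of rotl: value <@ count on 64 bits. */
uint64_t rotl_model(uint64_t count, uint64_t value)
{
    unsigned n = (unsigned)(count & 63u);
    /* The "& 63u" keeps the second shift in range when n is 0. */
    return (value << n) | (value >> ((64u - n) & 63u));
}

/* Hypothetical model of rotr: value @> count on 64 bits. */
uint64_t rotr_model(uint64_t count, uint64_t value)
{
    unsigned n = (unsigned)(count & 63u);
    return (value >> n) | (value << ((64u - n) & 63u));
}
```

Rotating left and then right by the same count returns the original value, which is a convenient sanity check for such a model.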

 


6.1.5 Optional Shift and Rotate :

 


6.1.5.1 shiftli :

SHIFT Left Immediate

shiftli Imm8, r2, r1

       Computes r1 = r2 << Imm8

       The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_SHIFTLI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.5.2 shiftri :

SHIFT Right logical Immediate

shiftri Imm8, r2, r1

       Computes r1 = r2 >> Imm8

       The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_SHIFTRI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.5.3 shiftrai :

SHIFT Right Arithmetic Immediate

shiftrai Imm8, r2, r1

       Computes r1 = r2 >> Imm8, preserving the sign

       The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_SHIFTRAI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.5.4 rotli :

ROTate Left Immediate

rotli Imm8, r2, r1

       Computes r1 = r2 <@ Imm8

       The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_ROTLI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.5.5 rotri :

ROTate Right Immediate

rotri Imm8, r2, r1

       Computes r1 = r2 @> Imm8

       The value of Imm8 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_ROTRI Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11-13 (none yet) 0 Reserved

 

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.5.6 bitop :

single BIT OPeration

bitop[x/s/c/t] r3, r2, r1

       Computes r1 = F(function, r2, 1 << r3)

       In the shifter, a 1 is shifted left r3 times and combined with the second operand (r2) according to the function F defined below :

Function number : Logical function : Operation : Opcode :
00 OR Bit Set bset or bitops
01 ANDN Bit Clear bclr or bitopc
10 XOR Bit Change bchg or bitopx
11 AND Bit Mask btst or bitopt

       The value of r3 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_BITOP Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12-13 x, c, t or s 00-11 F

 

Example :

R1 contains 0x08
R2 contains 0xFF05891213450100 (in a 64-bit system)

bchg r1,r2,r3 : r3 = 0xFF05891213450000
bset r1,r2,r3 : r3 = 0xFF05891213450100
bclr r1,r2,r3 : r3 = 0xFF05891213450000
btst r1,r2,r3 : r3 = 0x0000000000000100
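
The function table can be written as a small C model. It is a sketch for the scalar 64-bit case, with a hypothetical function name, and it assumes the bit index is truncated to 6 bits.

```c
#include <stdint.h>

/* Hypothetical model of bitop: f is the 2-bit function number,
   r3 the bit index, r2 the value the mask is combined with. */
uint64_t bitop_model(unsigned f, uint64_t r3, uint64_t r2)
{
    uint64_t mask = UINT64_C(1) << (r3 & 63u);

    switch (f & 3u) {
    case 0:  return r2 | mask;   /* 00 : OR   (bset) */
    case 1:  return r2 & ~mask;  /* 01 : ANDN (bclr) */
    case 2:  return r2 ^ mask;   /* 10 : XOR  (bchg) */
    default: return r2 & mask;   /* 11 : AND  (btst) */
    }
}
```

With a bit index of 0x08 and r2 = 0xFF05891213450100, the four function numbers reproduce the four results listed in the example.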

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.5.7 bitopi :

single BIT OPeration Immediate

bitop[x/s/c/t]i Imm6, r2, r1

       Computes r1 = F(function, r2, 1 << Imm6)

       In the shifter, a 1 is shifted left Imm6 times and combined with the second operand (r2) according to the function F defined below :

F : Logical function : Operation : Opcode :
00 OR Bit Set bseti or bitopsi
01 ANDN Bit Clear bclri or bitopci
10 XOR Bit Change bchgi or bitopxi
11 AND Bit Mask btsti or bitopti

       One of the great practical advantages of this instruction is that it lets SIMD constants be built with few instructions. This is why the immediate field is reduced to 6 bits. For example :
sbset.d 0x01,r0,r1 ; r1 = 0x0002000200020002
sbset.d 0x04,r1,r2 ; r2 = 0x0012001200120012
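
The two lines above can be checked with a short C model. It is only a sketch: it assumes that .d selects 16-bit lanes (four per 64-bit register) and that the bit index is truncated to 4 bits in that case; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of a SIMD bit set on 16-bit lanes:
   the same bit is set in each of the four lanes of r2. */
uint64_t sbset_d_model(unsigned imm6, uint64_t r2)
{
    /* Assumption: with 16-bit lanes only 4 index bits are used. */
    uint64_t lane_mask = UINT64_C(1) << (imm6 & 15u);

    /* Multiplying by 0x0001000100010001 copies the mask into every lane. */
    return r2 | (lane_mask * UINT64_C(0x0001000100010001));
}
```

Starting from zero (the value of r0), setting bit 1 and then bit 4 builds 0x0012 in every lane, as in the example.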

       The value of Imm6 is truncated to the number of bits needed by the bit shuffler unit.

size : 8 4 2 6 6 6
bits : 0                 7 8         11 12 13 14             19 20             25 26             31
function : OP_BITOPI Flags F Imm6 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12-13 x, c, t or s 00-11 F

 

Example :

R2 contains 0xFF05891213450100 (in a 64-bit system)

bchgi 0x08,r2,r3 : r3 = 0xFF05891213450000
bseti 0x08,r2,r3 : r3 = 0xFF05891213450100
bclri 0x08,r2,r3 : r3 = 0xFF05891213450000
btsti 0x08,r2,r3 : r3 = 0x0000000000000100

Performance (FC0 only) :

Execution Unit : Bit Shuffling Unit
Latency : 1 cycle
Throughput : 1 per cycle per BSU.

 


6.1.6 Core Logic :


6.1.6.1 logic :

bitwise LOGIC

logic.xxxx r1, r2, r3

       Computes r3 = f(r1,r2) where f is a logic function whose truth table is defined in the flags.

Remark : XOR should be used to compare two numbers for equality, instead of sub.

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_LOGIC Flags Reg 3 Reg 2 Reg 1

Flags Values Function
8-9 [qdb] Size flags
10 [01] f(0,0)
11 [01] f(1,0)
12 [01] f(0,1)
13 [01] f(1,1)

or is an alias for logic.0111
and is an alias for logic.0001
xor is an alias for logic.0110
not is an alias for logic.1010
nor is an alias for logic.1000
nand is an alias for logic.1110

Performance (FC0 only) :

Execution Unit : ROP2 Unit
Latency : 1 cycle
Throughput : 1 result per cycle per ROP2.
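
The truth-table encoding of logic can be illustrated with a C model. This is a sketch with a hypothetical function name; it assumes the four digits of logic.xxxx are written in flag order, that is f(0,0), f(1,0), f(0,1), f(1,1), which is the reading under which all the aliases listed above come out right.

```c
#include <stdint.h>

/* Hypothetical model of logic.xxxx. tt holds the four digits as a 4-bit
   number, first digit most significant: f(0,0) f(1,0) f(0,1) f(1,1).
   a is the first operand of f, b the second. */
uint64_t logic_model(unsigned tt, uint64_t a, uint64_t b)
{
    uint64_t r = 0;

    if (tt & 8u) r |= ~a & ~b;  /* f(0,0) = 1 */
    if (tt & 4u) r |=  a & ~b;  /* f(1,0) = 1 */
    if (tt & 2u) r |= ~a &  b;  /* f(0,1) = 1 */
    if (tt & 1u) r |=  a &  b;  /* f(1,1) = 1 */
    return r;
}
```

For instance, logic.0110 corresponds to tt = 0x6 and gives a XOR b, while logic.1010 corresponds to tt = 0xA and gives the complement of the first operand.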

 


6.1.7 Optional Logic :

 


6.1.7.1 logici :

bitwise LOGIC Immediate

logici.xxxx Imm8, r2, r1

       Computes r1 = f(Imm8,r2) where f is a logic function selected by the flags.

       Because there is less room than in the register form of the instruction, the logic functions are reduced to four. I have chosen the same logic functions as in the bitop instructions. The SIMD flag is sorely missing, however; the function could perhaps be encoded in the opcode instead. Applications : bitmap graphics and text processing.

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_LOGICI Flags Imm8 Reg 2 Reg 1

Flags Values Function
8-9 [qdb] Size flags
10-11 [xtcs] logic function

ori is an alias for logici.s
andi is an alias for logici.t
xori is an alias for logici.x
andni is an alias for logici.c

Performance (FC0 only) :

Execution Unit : ROP2 Unit
Latency : 1 cycle
Throughput : 1 result per cycle per ROP2.

 


6.1.8 Optional SIMD Packing :

 

 


6.1.9 Floating Point Operations :

 

 


6.1.10 Optional Misc. :

 

Load/Store


4.5  Core Logic

4.5.1  

4.6  Optional SIMD Packing

4.6.1  mix

mix r1, r2, r3

Mix r1 and r2 into r3.

7 8 13 14 19 20 25 26 31
OP_MIX FLAGS Reg1 Reg2 Reg3

Flags Values Function
8 [hl] Defines which part of the words
should be mixed (high, low).

4.6.2  expand

expand r1, r2

Expand r1 into r2.

7 8 19 20 25 26 31
OP_EXPAND EMPTY Reg1 Reg2

Flags Values Function
8 [hl] Defines in which part of the word
the result should be put (high, low).

4.7  Floating Point Operations

4.7.1  Introduction

There are different levels of implementation of floating point operations.

Level Instructions implemented
Level 0 No FP
Level 1 fadd, fsub, fmul, int2f/f2int, finv_app, sqrt_inv_app
Level 2 fadd, fsub, fmul, int2f/f2int, finv, sqrt
Level 3 fadd, fsub, fmul, int2f/f2int, div, finv, sqrt, sqrt_inv

4.7.2  fadd

Floating Point Addition

fadd r1, r2, r3

fadd performs a floating addition of the two source operands (r1 + r2) and puts the result in destination operand (r3). The operation should be IEEE-754 compliant.

7 8 13 14 19 20 25 26 31
OP_FADD FLAGS Reg1 Reg2 Reg3

Flags Values Function
8-9 [f??] Defines the size parameter
10 [s] Defines if the operation is SIMD
11 [x] Defines if IEEE compliance isn't required

4.7.3  fsub

Floating Point Subtraction

fsub r1, r2, r3

fsub performs a floating point subtraction of the two source operands (r1 - r2) and puts the result in the destination operand (r3). The operation should be IEEE-754 compliant.

7 8 13 14 19 20 25 26 31
OP_FSUB FLAGS Reg1 Reg2 Reg3

Flags Values Function
8-9 [f??] Defines the size parameter
10 [s] Defines if the operation is SIMD
11 [x] Defines if IEEE compliance isn't required

4.7.4  fmul

Floating Point Multiplication

fmul r1, r2, r3

fmul performs a floating multiplication of the two source operands (r1 x r2) and puts the result in destination operand (r3). The operation should be IEEE-754 compliant.

7 8 13 14 19 20 25 26 31
OP_FMUL FLAGS Reg1 Reg2 Reg3

Flags Values Function
8-9 [f??] Defines the size parameter
10 [s] Defines if the operation is SIMD
11 [x] Defines if IEEE compliance isn't required

4.7.5  fdiv

Floating Point Division

fdiv r1, r2, r3

fdiv performs a floating division of the two source operands (r1 / r2) and puts the result in destination operand (r3). The operation should be IEEE-754 compliant.

7 8 13 14 19 20 25 26 31
OP_FDIV FLAGS Reg1 Reg2 Reg3

Flags Values Function
8-9 [f??] Defines the size parameter
10 [s] Defines if the operation is SIMD
11 [x] Defines if IEEE compliance isn't required

4.7.6  int2f and f2int

Integer to Floating Point and Floating Point to Integer

int2f r1, r2

f2int r1, r2

"int2f" converts the integer in register r1 into a floating point number and puts it in register r2.

"f2int" converts the floating point number in register r1 into an integer and puts it in register r2.

7 8 19 20 25 26 31
OP_FCONV EMPTY Reg1 Reg2

Flags Values Function
8-9 [f??] Defines the size parameter
10 Direction flag.
11 [s] Defines if the operation is SIMD (*)
12 [x] Defines if IEEE compliance isn't required (*)
13-15 Rounding modes see table below.

Rounding modes:

Value Rounding mode
000 Nearest (default)
001 Towards 0
010 Away from 0
011 Towards -infinity
100 Towards +infinity

4.7.7  finv

Floating Point Inverse

finv r1, r2

Computes r2 = 1 / r1

7 8 19 20 25 26 31
OP_FINV EMPTY Reg1 Reg2

Flags Values Function
8-9 [f??] Defines the size parameter
10 [s] Defines if the operation is SIMD
11 [x] Defines if IEEE compliance isn't required

4.7.8  fsqrt

Floating Point Square Root

fsqrt r1, r2

Computes r2 = sqrt(r1)

7 8 19 20 25 26 31
OP_FSQRT EMPTY Reg1 Reg2

Flags Values Function
8-9 [f??] Defines the size parameter
10 [s] Defines if the operation is SIMD
11 [x] Defines if IEEE compliance isn't required

4.7.9  finvsqrt

Floating Point Inverse Square Root

finvsqrt r1, r2

Computes r2 = 1 / sqrt(r1)

7 8 19 20 25 26 31
OP_FINVSQRT EMPTY Reg1 Reg2

Flags Values Function
8-9 [f??] Defines the size parameter
10 [s] Defines if the operation is SIMD
11 [x] Defines if IEEE compliance isn't required

4.8  Optional Misc.

4.8.1  bitrev r1, r2, r3

Reverses the bits of r1, shifts the reversed value right by r2 bits and puts the result in r3.

7 8 13 14 19 20 25 26 31
OP_BITREV FLAGS Reg1 Reg2 Reg3
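
A C sketch of the bitrev operation described above, assuming a 64-bit register and a shift count truncated to 6 bits (the function name is hypothetical). The shift makes it possible to reverse only the low k bits of a value by shifting right by 64 - k.

```c
#include <stdint.h>

/* Hypothetical model of bitrev: reverse all 64 bits of r1,
   then shift the reversed value right by r2 bits. */
uint64_t bitrev_model(uint64_t r1, uint64_t r2)
{
    uint64_t reversed = 0;
    int i;

    for (i = 0; i < 64; i++) {
        /* Move the lowest remaining bit of r1 into the result. */
        reversed = (reversed << 1) | (r1 & 1u);
        r1 >>= 1;
    }
    return reversed >> (r2 & 63u);
}
```

For example, reversing the 4-bit value 0011 is modelled by a shift of 60 and gives 1100.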

4.8.2  bitrevi r1, imm6, r3

Reverses the bits of r1, shifts the reversed value right by imm6 bits and puts the result in r3.

7 8 13 14 19 20 25 26 31
OP_BITREVI FLAGS imm6 Reg1 Reg2

4.8.3  byterev r1, r2

Reverses the bytes in r1 (changes the endianness) and stores the result in r2.

7 8 19 20 25 26 31
OP_BYTEREV EMPTY Reg1 Reg2

Flags Values Function
8-9 [qdb] Defines the size parameter
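
As an illustration, the full-width case of byterev can be modelled in C as follows. The sketch assumes a 64-bit register and ignores the size flag; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of byterev on a full 64-bit register:
   byte 0 becomes byte 7, byte 1 becomes byte 6, and so on. */
uint64_t byterev_model(uint64_t r1)
{
    uint64_t out = 0;
    int i;

    for (i = 0; i < 8; i++) {
        out = (out << 8) | (r1 & 0xFFu);
        r1 >>= 8;
    }
    return out;
}
```

Applying the model twice returns the original value, as expected for an endianness swap.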

5  Load/Store

5.1  Core Load/Store

5.1.1  load

Load

load [r1 + r2 * size], r3

r3 = [r1 + r2 * size]

7 8 13 14 19 20 25 26 31
OP_LOAD FLAGS Reg1 Reg2 Reg3

Flags Values Function
8-9 [qdb] Defines the size parameter
10 [e] Defines the endianness
little-endian if cleared (default)
big-endian if set
11-13 RESERVED

5.1.2  store

Store

store r1, [r2 + r3 * size]

[r2 + r3 * size] = r1

7 8 13 14 19 20 25 26 31
OP_STORE FLAGS Reg1 Reg2 Reg3

Flags Values Function
8-9 [qdb] Defines the size parameter
10 [e] Defines the endianness
little-endian if cleared (default)
big-endian if set
11-13 RESERVED

5.1.3  mov

Move

mov [r1,] r2, r3

if (r1) r3 = r2

7 8 13 14 19 20 25 26 31
OP_MOV FLAGS Reg1 Reg2 Reg3

Flags Values Function
8-9 [qdb] Defines the size parameter
10-11 [sz] Defines how the high part of the
destination register is filled.
(See table below)

Flag Values Function
(default) 00 High part remains unchanged
z 01 Zero extend
s 10 Sign extend
? 11 Reserved

Remark: mov r0, r0, r0 is an alias for NOP.

5.1.4  loadcons

Load Constant

loadcons imm16, r1

Loads the imm16 constant into the register r1 at the specified location (shifts of 16 bits). The rest of the register remains unmodified.

Flags Values Function
8-9 [123] Defines the shift parameter

7 8 9 10 25 26 31
OP_LOADCONS EMPTY imm16 Reg1

5.1.5  loadconsx

Load Constant with Sign Extension

loadconsx imm16, r1

Loads the imm16 constant into the register r1 at the specified location (shifts of 16 bits). The higher part of the register is assigned the value of the most significant bit of the constant. The lower part of the register remains unmodified.

Flags Values Function
8-9 [123] Defines the shift parameter

7 8 9 10 25 26 31
OP_LOADCONSX EMPTY imm16 Reg1
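
The effect of loadcons and loadconsx on a 64-bit register can be sketched in C. This is a model under assumptions: the shift parameter is read as the index (0 to 3) of the 16-bit field that receives the constant, and the function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of loadcons: replace one 16-bit field,
   leave the rest of the register unmodified. */
uint64_t loadcons_model(uint64_t reg, uint16_t imm16, unsigned shift)
{
    unsigned pos = 16u * (shift & 3u);
    uint64_t field = UINT64_C(0xFFFF) << pos;

    return (reg & ~field) | ((uint64_t)imm16 << pos);
}

/* Hypothetical model of loadconsx: same field, but every bit above it
   takes the value of the constant's most significant bit. */
uint64_t loadconsx_model(uint64_t reg, uint16_t imm16, unsigned shift)
{
    unsigned pos = 16u * (shift & 3u);
    /* Part of the register below the field, kept unmodified. */
    uint64_t low = pos ? (reg & ((UINT64_C(1) << pos) - 1)) : 0;
    /* The constant sign-extended to 64 bits. */
    uint64_t ext = (imm16 & 0x8000u) ? ~UINT64_C(0xFFFF) : 0;
    uint64_t value = ext | imm16;

    return low | (value << pos);
}
```

Under this reading, a loadconsx of 0x0001 at field 1 followed by a loadcons of 0x2345 at field 0 builds the constant 0x12345 in two instructions.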


/*
  LOADCONST.C by WHYGEE, 14 September 1999

  To be included in a compiler or an assembler after some
  interface fixing : it currently outputs to stderr, and it will
  output to a file the same way.
*/

#include <stdlib.h>
#include <stdio.h>

#define MAXSIZE (sizeof(long long int))
/* should ideally be 8 */

void emit_constant(unsigned long long int c, unsigned char reg)
{
  unsigned short int data[MAXSIZE>>1];
  unsigned short int sign;
  signed long long int t,u;
  signed int s=0;

  if (reg==0)
    {
      fprintf(stderr,"\n Error : can't write to register 0 \n");
      exit(-1);  /* should be performed by an error routine that does this cleanly */
    }
  if (c==0)
    {
      fprintf(stderr,"mov r%d,r0\n",reg);
    }
  else if (c==~0ULL) /* all bits set, i.e. -1 */
    {
      fprintf(stderr,"logic.1111 r%d,r0,r0\n",reg);
    }
  else if ((c>65535)&&((c & -c)==c))
    /* a power of two, but the latency of bitset is higher */
    {
      while ((c>>=1)!=0) s++; /* s = zero-based index of the set bit */
      if (s>63) /* the index does not fit in the 6-bit immediate */
        {
          fprintf(stderr,"loadconsts r%d,0x%04X\n",reg,s);
          fprintf(stderr,"bitset r%d,r0,r%d\n",reg,reg);
        }
      else
        {
          fprintf(stderr,"bitset r%d,r0,%d\n",reg,s);
        }
    }
  else /* any kind of number */
    {
      /* split the constant into 16-bit words, lowest word first */
      u=c;
      do {
        t=u;
        data[s]=t & 0xFFFF;
        u=t>>16;  /* relies on an arithmetic shift of negative values */
        s++;
      } while ((t!=u) && (s<(signed int)(MAXSIZE>>1)));

      s--;

      /* handle the case where the highest word is not a plain sign
         extension of the word below it */
      sign=(data[s-1] & 0x8000) ? 0xFFFF : 0x0000;
      if (data[s]!=sign)
        {
          fprintf(stderr,"loadconsts.%d r%d,0x%04X\n", s,reg,data[s]);
          s--;
          fprintf(stderr,"loadconst.%d r%d,0x%04X\n", s,reg,data[s]);
          s--;
        }
      else
        {
          s--;
          fprintf(stderr,"loadconsts.%d r%d,0x%04X\n", s,reg,data[s]);
          s--;
        }

      /* the remaining words are loaded without sign extension */
      while (s>=0)
        {
          fprintf(stderr,"loadconst.%d r%d,0x%04X\n", s,reg,data[s]);
          s--;
        }
    }
}

5.1.6  cachemm

Cache Memory Management

Prefetches or flushes a data block to/from a memory level.

cachemm r1, r2

Flags Values Function
8-9 [qdb] Defines the size parameter
10 [fp] Prefetch/Flush
11 [l] Lock. This flag means that the data
are static and will be used a lot
12-14 [0-7] Memory level (see table below)

D 000 data L1 cache
I 001 instructions L1 cache
C 010 onchip unified cache
011 [unused]
U 100 offchip unified cache
L 101 local memory
G 110 global memory
V 111 virtual memory (hard disk)

7 8 19 20 25 26 31
OP_CACHEMM EMPTY Reg1 Reg2

[subject to changes as discussions go] also possible: ask for compression on the fly.

Example : "flushg ra,rb" flushes rb bytes starting at address ra from every memory level up to global memory. Any cache (L1, L2, local...) holding data that belong to the block is written back to main memory, and the corresponding cache space is freed (available for future use). This should be executed every time the programmer knows that a block of data will not be used for a while; the cache level is a performance hint.

"preftchu ra,rb" copies the data block at address ra and of size rb that is present in lower memory levels (virtual, global, local) to the unified offchip memory (at least).

Forms : rr or ri (the size could be immediate)

These instructions are very important for memory management, and should be used when performing SMC (self-modifying code) to keep memory coherent.

5.2  Optional Load/Store

5.2.1  loadi

Load Immediate

loadi [r1 + imm9 * size], r2

r2 = [r1 + imm9 * size]

7 8 10 11 19 20 25 26 31
OP_LOADI FLAGS imm9 Reg1 Reg2

Flags Values Function
8-9 [qdb] Defines the size parameter
10 [e] Defines the endianness
little-endian if cleared (default)
big-endian if set

5.2.2  storei

Store Immediate

storei r1, [r2 + imm9 * size]

[r2 + imm9 * size] = r1

7 8 10 11 19 20 25 26 31
OP_STOREI FLAGS imm9 Reg1 Reg2

Flags Values Function
8-9 [qdb] Defines the size parameter
10 [e] Defines the endianness
little-endian if cleared (default)
big-endian if set

5.3  Internal registers info

Get and Put internal register.

R/W Description
R Number of cycles
R Number of cycles (countdown)
R Number of instructions executed
R Number of Page Faults
R Number of traps/interrupts
R Number of FPU traps
R Number of Cache hit/misses
R Number of correct/incorrect branch predictions
R Number of pipeline bubbles
R Number of TLB hits/misses

Table 1: Performance Counters

R/W Description
RW Old Program Counter
RW Old Machine Status Word
RW Exception Vector
RW Temporary
R Exception Reason
R Exception Number/Type

Table 2: Special Register for Exceptions

R/W Description
R Processor ID

Table 3: Diverse Special Registers

5.4  Core Internal registers

5.4.1  get

Get Internal Register

get IR[r1], r2

Get internal register at index r1 and put its content in register r2. The whole register gets dumped. There is no size flag.

7 8 19 20 25 26 31
OP_GET EMPTY Reg1 Reg2

5.4.2  put

PUT Internal Register

put r1, IR[r2]

Puts the contents of r1 into the internal register at index r2. The whole register is written. There is no size flag.

7 8 19 20 25 26 31
OP_PUT EMPTY Reg1 Reg2

5.5  Optional Internal Registers

5.5.1  geti

Get Internal Register Immediate

geti IR[imm16], r1

Get internal register at index imm16 and put its content in register r1. The whole register gets dumped. There is no size flag.


7 8 9 10 25 26 31
OP_GETI EMPTY imm16 Reg1

5.5.2  puti

Put Internal Register Immediate

puti r1, IR[imm16]

Puts the contents of r1 into the internal register at index imm16. The whole register is written. There is no size flag.

7 8 9 10 25 26 31
OP_PUTI EMPTY imm16 Reg1

6  Flow Control

6.1  Core Branch

6.1.1  jmpa

Absolute Jump.

jmpa [r1,] r2

If r1 contains a non-zero value, jumps to the address held in r2.

Flags Values Function
8 [n] Negates the condition
9-10 [lm] Test the MSB or the LSB

7 8 19 20 25 26 31
OP_JMPA EMPTY Reg1 Reg2

6.1.2  loadaddr

Load Address

loadaddr imm18, r1

Stores PC+imm18 into r1.

The result is a 64 bit address. imm18 is a signed value.

7 8 25 26 31
OP_LOADADDR imm18 Reg1

6.1.3  loopentry

Loop Entry

loopentry r1

Stores PC+4 into r1.

7 8 25 26 31
OP_LOOPENTRY EMPTY Reg1

This instruction is a special form of loadaddr.

6.1.4  loop

Loop

loop r1, r2

Performs two parallel things :

This overlapping of the operations allows greater parallelism and lower latency : we can loop fast without compromising security.

7 8 19 20 25 26 31
OP_LOOP EMPTY Reg1 Reg2

6.2  Optional Branch

6.2.1  jmpi

Absolute Jump Immediate.

jmpi [r1,] r2, imm12

If r1 contains a non-zero value, jumps to the address r2+4*imm12.

7 8 19 20 25 26 31
OP_JMPI imm12 Reg1 Reg2

6.2.2  jmpr

Relative Jump Immediate.

jmpr [r1,] imm18

The imm18 is a signed value. Warning! All code is aligned on a 32-bit boundary, so the imm18 value is shifted left by 2 bits before it is used.

7 8 25 26 31
OP_JMPR imm18 Reg1
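
The target computation can be sketched in C. Two points are assumptions that the text does not settle: the offset is taken relative to the address of the jmpr instruction itself, and the condition on r1 is left out. The function names are hypothetical.

```c
#include <stdint.h>

/* Sign-extend an 18-bit field to a 32-bit signed integer. */
int32_t sign_extend18(uint32_t imm18)
{
    int32_t v = (int32_t)(imm18 & 0x3FFFFu);

    /* Flip the sign bit, then subtract its weight. */
    return (v ^ 0x20000) - 0x20000;
}

/* Hypothetical model of the jmpr target.
   Assumption: the offset is relative to the address of the jmpr itself. */
uint64_t jmpr_target(uint64_t pc, uint32_t imm18)
{
    /* Multiply by 4 rather than shifting, since left-shifting a
       negative signed value is undefined in C. */
    int64_t offset = (int64_t)sign_extend18(imm18) * 4;

    return pc + (uint64_t)offset;
}
```

With this encoding the reachable range is 2^17 instructions backwards and 2^17 - 1 instructions forwards.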

6.3  Core CPU Control

6.3.1  syscall and trap

syscall [r1,] imm18

trap [r1,] imm18

syscall and trap are two names for the same instruction.

[FIX ME] But what does it do ?

The argument is ignored by the hardware and may be used to encode information for system software. To retrieve the argument system software must load the instruction word from memory.

7 8 25 26 31
OP_SYSTRAP imm18 Reg1

6.3.2  halt

halt [r1,] [imm18]

Halts until an External Exception occurs

7 8 25 26 31
OP_HALT imm18 Reg1

6.3.3  rfe

rfe [r1,] [imm18]

Return From Exception...

[FIX ME] Little short...

7 8 25 26 31
OP_RFE imm18 Reg1

A  Exception Handling (out of date)

Type 1 Software exception
Type 2 External exception (interrupt)
Type 3 Privilege Violation
Type 4 Memory Error
Type 5 Syscall

Table 4: Types of exceptions

[FIXME] How many different types do we need? Each one corresponds to one pointer in the exception vector (except for the hardware, which takes all the rest).

Exception pointer vector has 64 entries. [FIXME] Confirm that.

SR_OPC Old Program Counter
SR_OMSW Old Machine Status Word
SR_EV Exception Vector
SR_TMP Temporary
SR_ER Exception Reason
SR_ENT Exception Number/Type

Table 5: Special Registers for Exception handling

1 External Interrupt
2 Illegal Opcode
3 Malformed Instruction
4 Privilege Violation
5 Integer divide by ZERO
6 FP divide by ZERO
7 FP INF-INF
8 FP INF/INF
9 FP ZERO/ZERO
10 FP ZERO*INF
11 FP SQRT(NEG)
12 Memory Exception

Table 6: Possible Exception Reasons Values

Upon the occurrence of an exception, the processor performs the following:

Upon Call to the RFE instruction:

 

 


6.1.1. :

Operation

r1 , r2, r3

      

size : 8 4 8 6 6
bits : 0                 7 8         11 12                 19 20             25 26             31
function : OP_ Flags Imm8 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12 - postfix 1 if set flag
13 - postfix 1 if set ( 2r2w )

 

Examples :

Scalar :

R1 contains (we only consider the lower byte in the registers)
R2 contains

r1,r2,r3 : r3 =

SIMD :

R1 contains (in a 64-bit system)
R2 contains

r1,r2,r3 : r3 =

Performance (FC0 only) :

Execution Unit : Unit
Latency :
Throughput : .

 




6.1.1. :

Operation

r1 , r2, r3

      

size : 8 6 6 6 6
bits : 0                 7 8             13 14             19 20             25 26             31
function : OP_ Flags Reg 3 Reg 2 Reg 1

    Flags     Syntax Values Function
8-9 .q, .d or .b postfix   Defines the size parameter
10 s- prefix 1 if set Defines if the operation is SIMD
11 (none yet) 0 Reserved
12 - postfix 1 if set flag
13 - postfix 1 if set ( 2r2w )

 

Examples :

Scalar :

R1 contains (we only consider the lower byte in the registers)
R2 contains

r1,r2,r3 : r3 =

SIMD :

R1 contains (in a 64-bit system)
R2 contains

r1,r2,r3 : r3 =

Performance (FC0 only) :

Execution Unit : Unit
Latency :
Throughput : .


 


6.1.11 :

 

 


6.1.11 :

 

 


6.1.12 :

 

 


6.1.13 :

 

 


6.1.14 :

 



part6.html nov.17 by Whygee