(C) Yann Guidon 2001 (whygee@f-cpu.org)
"Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved."
created : Fri Feb 2 19:54:43 MET 2001
rev. Nov. 7th 2001

Paper submission

This is the text submitted to the managers of the Embedded Processor Forum 2001, June 11 - 15, in San Jose.


1. The name of the chip, product or technology 

the F-CPU Instruction Set Architecture and the F-CPU Core #0 ("FC0")

2. General features of the chip, product or technology

* Fully parameterizable VHDL'93 source code distributed under GPL
  and F-CPU Charter.
* retargetable to a broad range of technologies
* SIMD, "undetermined register size" (32, 64, 128, 256 ...)
* 64 registers (64-bit wide in the first implementation)
* 32-bit instructions with 4 instruction formats only
* superpipelined (very short matched pipeline stages)
* 1 instruction issued per cycle, in order,
   with out-of-order completion (for the FC0)
* Execution pipeline freed from exceptions (no rollback possible)
* Zero-delay branch instruction (in the ideal case)
* Smooth Register Backup ("SRB") mechanism lowers context-switch time

3. Specific features that will be discussed in detail at
   Embedded Processor Forum 

* instruction scheduling
* memory interface

4. Whether the chip will be formally announced or
   disclosed for the first time at Embedded Processor
   Forum (if not, what prior disclosures or announcements
   are planned) 

This will be the first public presentation at a conference in the US.
The F-CPU project was presented during two conferences in Berlin
(17C3 : http://www.ccc.de/congress/fahrplan/event/153.en.html
 16C3 : http://www.ccc.de/events/congress99/doku/fcpu.html)
but those included only a short technical introduction, whereas
this talk will discuss some key details (see 3.).

5. For non-chip announcements, please provide details
   on the product to be announced, its relevance to the
   event, and particulars that will help evaluate the

Today, the F-CPU is the only configurable SIMD 64-bit
general-purpose RISC core aimed at high-performance
applications (multimedia or scientific computation).
It is a very recent design (1999), originally meant to
provide a free, unencumbered alternative to the Merced.

Contrary to the LEON, the F-CPU defines a new instruction
set and is designed to last much longer. It is "by design"
extensible and scalable in several ways, providing forward
and backward compatibility for several decades. The first
core (FC0) could be used as a replacement for the aging MIPS.
Contrary to the ARC Cores and similar "configurable core"
offerings, the F-CPU is not a "black box" but a completely
transparent design that gives the engineer full visibility.
The F-CPU is based on a clean ISA and a few simple
scheduling rules, leaving a lot of freedom to the implementor.

6. Other descriptive information about the presentation 

{not defined yet. maybe a free distribution of CDROMs}

7. Company name 

It is a community development on the Internet, not a company "product".

8. Speaker's name (should be the chief architect or lead
   designer of the product) 

Yann Guidon

9. Speaker's job title 

 F-CPU contributor.

10. Speaker's biography (30 words or less)

Joined the F-CPU Design Team on the Internet in Dec. 1998,
influenced the design when the F-CPU ISA became RISC and
defined the FC0 in mid-1999.

11. Speaker's address, phone, fax, email address and
    administrator (if any) 

email : whygee@f-cpu.org (personal)

12. Abstract title

proposed title :
Instruction scheduling and the memory interface of the F-CPU Core #0

13. P.R. or Marcom contact (include phone, fax, and email) 

no marketing. see http://www.f-cpu.org or email me.

14. Your abstract or outline should not exceed 1000 words. 

This technical presentation will focus on one of the many distinctive
characteristics of the F-CPU, a superpipelined SIMD RISC CPU that is
being developed by a community on the Web and distributed under the GPL.
The first implementation of the F-CPU is called FC0, and the strong
constraints of the F-CPU ISA have led to a from-scratch redesign of
the way instructions are decoded and memory is accessed. Although it
almost tastes like a classic "MIPS" RISC core, the memory units
are completely upside-down compared to usual designs. The
instruction path is uncommon and strongly influences the instruction
set. However, the number of critical or difficult cases
is greatly reduced, a strong memory-protection policy is enforced,
and jumps/calls can be done with no overhead. Finally, the instruction
set is kept orthogonal and easy to schedule on almost any kind
of CPU core, even in the presence of faults and interrupts.

(will be converted to HTML & illustrated).

Slide 0  Logo

Slide 1  Title
 "Instruction scheduling and the memory interface of the F-CPU Core #0"

  Disclaimers :

*  This is a technical (not marketing) overview of some major
characteristics of the F-CPU design philosophy and the
implementation in the FC0. 

*  It presents the backgrounds and design choices of an ongoing
(unfinished) work.

*  Not all the details are presented here !
    RTFM @ http://www.f-cpu.seul.org/manual

Slide 2  Short introduction to the F-CPU
  F-CPU Project goals (voted 1999) :
     To develop and make freely available an architecture, and all
other intellectual property necessary to fabricate one or more
implementations of that architecture, with the following priorities,
in decreasing order of importance:
   1)  Versatility and usefulness in as wide a range of applications
       as possible
   2)  Performance, emphasizing user-level parallelism and derived
       through intelligent architecture rather than advanced silicon process
   3)  Architecture lifespan and forward compatibility
   4)  Cost, including monetary and thermal considerations

In the end, the goal is not to make the fastest ever CPU, because it's an
endless race. The purpose of this project is more to design the
"coolest/sexiest" CPU possible.

Note the difference between "design/develop" and "build/market/sell/distribute..."

Slide 3  Short introduction to the F-CPU (2)
  Some features of the F-CPU Instruction Set Architecture  :

* 32-bit instructions with 4 instruction formats only
   [illustration of the formats]
* Instruction set census : ~110 reserved opcodes as of today,
  with several variations, "core" and "optional" features.
* 64 registers (64-bit wide in the first implementation) with R0=0
* SIMD flag in most instructions allows parallel operation
  on registers with "undetermined" width (32, 64, 128, 256 bits ...)
* Support for saturated arithmetic, IEEE floating point, IEEE LNS
  and fractional integers (optional)
* 2r1w, 3r1w and 2r2w instructions, 3r2w register set
* memory : 3 "stream hint bits" + cacheability bit (7 "memory streams"
   can be implemented -> more bandwidth because the different accesses
   are kept separate)
* One addressing mode only for code and data : register
   (post-increment possible for data)

Examples of some instructions : [binary format follows]
  Opcode Example                  Semantics
- add    add r1,r2,r3             r3=r1+r2
- load   load r1,r2,r3            r3=[r2], r2+=r1
- jump   if r1=0 jump r2 link r3  if cond, r3=PC, PC=r2

Slide 4  Starting point of the FC0 : Clock speed issue
  Technology handicap ---> "carpaccio" CPU

* faster clock -> shorter Critical DataPath -> 6 4-input gates in the CDP,
   or approximately 10 transistors between two pipeline flip-flops.
   (This goal emphasizes the simplicity of the pipeline stages;
    it is not a "golden rule" because many other parameters must also be considered.)

* a stage is deep enough to perform an 8-bit add/sub or a 64-bit increment.
   When a unit becomes too complex, it must be split.

* CDP determined from the start :
   - no "hidden CDP" that slows the clock down in the middle of the project
   - very simple and easy design of the units (fortunately because
     everything must be redesigned)
   - emphasis on short stage pipelining for every part of the project
     favours "matched" and balanced pipeline stage design.
   - the number of pipeline stages is not the issue.

Slide 5  Variable latency
  Every Execution Unit of the F-CPU is dedicated to
a certain kind of operation :

* ROP2 : bit-to-bit boolean operations : 1 cycle
* SHL  : bit/byte shuffling, 1 or 2 cycles
* ASU  : SIMD integer addition and subtraction with
   carry and saturation : 2 cycles in the general case,
   1 cycle for 8-bit operations
* IMU  : SIMD integer multiply and MAC : 6 cycles / 64-bit (MR design)

EUs are "LEGO bricks" with a particular, deterministic latency.
They are all SIMD-capable and are duplicated when the register
width increases. The SIMD size and "chunk" size are implementation-dependent.

Slide 6  Connection of the EUs around the Xbar


Slide 7  General layout of the FC0


Slide 8  Scheduling of one instruction


Fetcher : 1 cycle (transparent) {bypass possible on cache miss}
decoder/R7 read : 1 cycle (more if stalled)
Issue/Xbar read : 1 cycle (unless bypass possible)
EU : 1+ cycle (latency depends on the unit, see slide #5)
Xbar : 1 cycle (here we don't care about the scheduling...)
Register write : 1 cycle
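Adding up the stages listed above gives the minimum fetch-to-writeback latency of one instruction in the absence of stalls. This small tally is an illustration of that arithmetic, with the EU latency as the only variable (per slide #5):

```python
# Per-stage cycle counts of the FC0 pipeline, as listed on this slide.
STAGES = {
    "fetcher": 1,          # transparent; bypass possible on cache miss
    "decode/reg read": 1,
    "issue/xbar read": 1,  # unless a bypass is possible
    "xbar": 1,
    "register write": 1,
}

def pipeline_latency(eu_cycles: int) -> int:
    """Total cycles from fetch to register write for a given EU latency."""
    return sum(STAGES.values()) + eu_cycles

# e.g. a 1-cycle ROP2 operation: 6 cycles end to end,
# a 6-cycle IMU multiply: 11 cycles end to end
```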

Slide 9  Scheduling of one instruction (2)
 hazard detection with a scoreboard


* Every register is associated with a "not ready" bit
* The bit is set when the instruction is issued
* The bit is reset when the scheduler commands a write
  to the register.
* The decode pipeline is stalled until all the bits needed
  by the current instruction are cleared.
* This scheme scales almost in O(n) with the number of
    concurrently decoded/executed instructions

Slide 10  Scheduling of one instruction (3)
 allocation of the Xbar slots with a FIFO


* The opcode and the flags of the current instruction are the
  inputs of a hardwired lookup table. The output indicates
  how many slots are required (1 or 2 write buses ?)
  and how many cycles are required for the completion
  of the operation.
* When the operation is issued (if no hazard is detected
  on the scoreboard or in the FIFO), the selected
  slot is filled with the number of the register that must
  be written to.
* When the FIFO shifts down, the number reaches the bottom
  and commands the control wires of the Xbar and the Register Set.
* The depth of the FIFO is 8 cycles. It corresponds to
  the highest latency of the execution units (IMU: 6 cycles)
  plus the 2 x Xbar cycles.
* An additional down-counter extends the FIFO for
  the high-latency, non-pipelined divide unit.
  When the counter elapses, the behaviour is normal for the
  rest of the FIFO (the register number is injected at the top
  of the FIFO).
* Load instructions require special handling when the data
  is not present in the LSU's buffer.

Slide 11  Memory protection
  All the pointers are available from the Xbar and checked by a TLB
that translates the task's logical address into a physical address.


* TLB entries are SW-replaced through Special Registers with the help of LRU HW
* 4 page sizes : 4KB, 32KB, 256KB and 2MB, plus probably a set of
   "fence registers" for larger pages.
* TLB entries contain cacheability information, access rights (R/W/RW/X),
   access counter and subdomain usage counters, VMID tag ...
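The four page sizes are each 8x the previous one (three more offset bits per step). The split of a virtual address into page number and offset for a TLB lookup can be illustrated as follows; the size-selector encoding here is an assumption for the sketch, not the documented F-CPU format:

```python
# Page sizes from slide 11: 4KB, 32KB, 256KB, 2MB (each 8x the previous).
PAGE_SIZES = {0: 4 << 10, 1: 32 << 10, 2: 256 << 10, 3: 2 << 20}

def split_address(vaddr: int, size_sel: int):
    """Split a virtual address into (page number, page offset).

    The page number is what the TLB translates; the offset passes
    through unchanged.
    """
    size = PAGE_SIZES[size_sel]
    return vaddr // size, vaddr % size

# e.g. with 4KB pages, address 0x12345 is page 0x12, offset 0x345
```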

Slide 12  Connection of the address bus


* If there is a TLB miss, the pointer register is marked as invalid
  and a future use of this register as pointer will trigger a trap
  at the decode stage.

* The address output by the TLB is compared to the address tags of the
  Fetcher, the LSU, the internal Icache and Dcache.

* If there is a cache or buffer hit
   - the register is marked as valid
   - the data is brought to the LSU or the Fetcher (depending on the
     access type) if the data is not available in the required buffer.

Slide 13  LSU/Fetcher/Decoder coupling to detect faulty pointers
  The R2 field of the instruction is compared against a set of 8*2 6-bit comparators.


* Each 256-bit line of the Fetcher and the LSU is tagged with an address and
the number of the associated register.
* The association of a line and a register is determined by two rules :
   - LRU
   - double-buffering (to avoid thrashing when scanning a large linear
       region of memory) with a pair of lines
* The register number is compared with all the entries, and a signal is
  sent to the decoder indicating whether the data is available.
   - if the data is present AND the instruction is a load/store or jump
     (plus all the other scheduling, scoreboarding etc. rules),
     the instruction is issued and can complete in 2 cycles.
   - if the data is not present in any buffer AND it is a load/store or jump,
     a "prefetch" instruction is simulated in the decoder.
       * the pointer's value is present on the Xbar because it is accessed
         in parallel with the decode
       * the pointer is checked in the TLB and the result goes the usual way.
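The decoder's decision can be modeled as a simple associative check. This toy model (an assumption about the structure, for illustration) mirrors the 8 tags per unit: if the instruction's R2 field matches a line tag in either the LSU or the Fetcher, the access issues and completes quickly; otherwise the decoder behaves as if a prefetch had been inserted:

```python
def decode_memory_access(r2_field, lsu_tags, fetcher_tags):
    """Model of the slide-13 comparators: match R2 against all line tags.

    lsu_tags / fetcher_tags: register numbers tagging the 8 buffer
    lines of each unit (LRU + double-buffered pairs).
    """
    if r2_field in lsu_tags or r2_field in fetcher_tags:
        return "issue"      # data present: the access can complete in 2 cycles
    return "prefetch"       # simulate a prefetch; the pointer goes to the TLB
```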

Slide 14  Pipelining of a load instruction
  Due to the relatively high latency of the load and store instructions,
some particular coding techniques must be used to benefit from the FC0's
performance :
* loop unrolling and interleaving (2x or 3x depending on the size
  of the loop, the available registers...)
* explicit and aggressive prefetch. Example :
   usual code :                asm code :
                                 loadimm @item1, r4   ; prefetch
                                 load [r4],r0         ; prefetch (2)
      load [item1],r1            loadi item2-item1,[r4], r1
      load [item2],r2            loadi item3-item2,[r4], r2
      load [item3],r3            loadi item4-item3,[r4], r3

* pointer duplication : a pointer with post-increment takes approx.
   6 cycles to complete, other pointers must be used and interleaved
   if the same memory block must be accessed within this 6-cycle period.

* Of course, a memory access without prefetch will work, but at the cost
  of a substantial stall (this is unavoidable because a task switch flushes
  the "register tags" in the LSU and the Fetcher)

* Similar rules apply for the jump/call/return instructions but the
  constraints are looser (no pointer update usually).

Slide 15  Code example : vector loop
   These are examples of the typical code that runs best on the F-CPU :

   1) Vector style

Pseudocode :
   #define N 1024
   char A[N], B[N], S[N]
   loop (N)
     S[i]=A[i]+B[i]

   ; r4 = base address of A
   ; r6 = base address of B
   ; r8 = base address of S
   load.s1 r4,r0 ; prefetch
   load.s2 r6,r0 ; prefetch
   load.s3 r8,r0 ; prefetch
   get SR_MAX_SIZE,r1     ; r1 is loaded with 8, 16, 32...
    ; depending on the chip version (number of bytes per register)
   get SR_LOG_MAX_SIZE,r2 ; ie r2 = 3 if MAX_SIZE=8 (64-bit)
   loadimm N/2,r3         ; N/2 because we unroll the loop twice
   add r1,r4,r5           ; r5 = r4+max_size (pointer duplication)
   add r1,r6,r7           ; idem
   add r1,r8,r9           ; idem
   shl 1,r1,r1            ; r1 = r1*2
   shr r2,r3,r3           ; r3 = final loop count
   loopentry r2           ; r2 = PC+4
     loadf.s1 r1,r4,r10
     loadf.s1 r1,r5,r12
     loadf.s2 r1,r6,r11
     loadf.s2 r1,r7,r13
     sadd.8  r10,r11,r11 ; SIMD add on 8-bit chunks
     sadd.8  r12,r13,r13
     ; stall
     storef.s3 r1,r8,r11          
     storef.s3 r1,r9,r13
   loop r2,r3      ; decrement r3 // jump to r2 if r3 is not zero

   {remainder of the loop here, in case of an odd loop count}

   2) Packed style

Pseudocode :
   #define N 1024
   struct {char A,B} M[N]
   char S[N]
   loop (N)
     S[i]=M[i].A + M[i].B

   ; r4 = base address of A
   ; r6 = base address of S
   load.s1 r4,r0 ; prefetch
   load.s2 r6,r0 ; prefetch
   get SR_MAX_SIZE,r1     ; r1 is loaded with 8, 16, 32...
    ; depending on the chip version (number of bytes per register)
   get SR_LOG_MAX_SIZE,r2 ; ie r2 = 3 if MAX_SIZE=8 (64-bit)
   loadimm N,r7
   add r1,r4,r5           ; r5 = r4+max_size (pointer duplication)
   shr r2,r7,r7           ; loop count
   shl 1,r1,r2            ; r2 = r1*2 (load stride is twice the store stride)

   loopentry r3
     loadf.s1 r2,r4,r10
     loadf.s1 r2,r5,r11
     mixhi.8 r10,r11,r12
     mixlo.8 r10,r11,r13
     ; stall
     sadd.8 r13,r12,r12
     ; stall
     storef.s2 r1,r6,r12
   loop r3,r7      ; decrement r7 // jump to r3 if r7 is not zero

Slide 16  The future of the FC0

* Adding new execution units : LNS Add/Sub/Conversion, IEEE floating point units,
  Popcount/ECC ... 
* adding other memory interfaces, e.g. DDR SDRAM
* Extension to 2-issue then 4-issue
* Split/pipelining of the decode/issue unit
* Coding constraint for a superscalar F-CPU :
   2-issue : 64/(2*3) = 10 instructions without dependency
   3-issue : 64/(3*3) = 7 ...
   4-issue : 64/(4*3) = 5   <-- current limit of the FC0
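The rule of thumb above (64 registers, roughly 3 registers consumed per instruction) can be expressed as a one-liner, shown here only to make the arithmetic explicit:

```python
def independent_insns(issue_width: int, nregs: int = 64,
                      regs_per_insn: int = 3) -> int:
    """Dependency-free instructions an n-issue FC0 can keep in flight:
    the register file must cover issue_width * regs_per_insn registers
    per cycle."""
    return nregs // (issue_width * regs_per_insn)

# 2-issue -> 10, 3-issue -> 7, 4-issue -> 5
```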

Slide 17  Evolution of the F-CPU

* explicit register renaming -> more usable registers

* Possible directions :
   - SMT (Simultaneous Multi Threading),
   - multi-core chips
   - RISC->TTA translator
   - eDRAM (embedded DRAM)
   - F-BUS, ...

* New instructions : Scatter-Gather, Permute ...

* F-CPU II ?

Slide 18  Conclusion

*  URL of the slides  http://www.f-cpu.de/epf2001/ (not yet created)
*  Coming milestones : complete VHDL and manual v1 on CVS
*  F-CPU official web site : http://www.f-cpu.org
*  F-CPU manual available from http://www.f-cpu.seul.org