F-CPU Design Team
FCPU MANUAL REV. 0.1

 

       Part 7 :

Programming the F-CPU

Summary

Previous part (6)

currently under construction.

7.1 Introduction :

       As written before, programming the F-CPU has a different "taste" or "feeling" because of the particular processor structure and design philosophy. Not only is the scheduling of individual instructions important : scheduling the use of each unit and of the memory accesses matters more than ever before. Here, the key to performance, architectural simplicity and security in the FC0 is the use of many "speculative flags" that are not accessible to the user but that influence the behaviour of the whole CPU.

       The F-CPU goes even further by allowing the user to explicitly indicate some "hints" such as the "stream flags". An individual F-CPU can ignore these flags, but their use will dramatically enhance the performance of the application if a few simple rules are respected, whatever CPU type or core is used.

 

7.2 Pseudo-superscalar :

       The FC0 uses a crossbar ("Xbar") in order to reduce the number of register ports and to provide a fast and universal register bypass mechanism. This central part of the FC0 is not complex but spans a large part of the CPU. Each port has a relatively high fanout and drives long wires, which by itself justifies giving the Xbar its own cycle in the pipeline, when the operands are brought to the Execution Units and when the results are written back to the register set. This last path is used when "bypassing" the register, with the help of the scoreboard that keeps track of the use of the different Xbar channels.

       In practice, the Xbar adds a 1-cycle latency to any normal computation instruction. This means that at least one other, independent instruction must be interleaved between two dependent instructions. From this point of view, programming a single-issue FC0 is similar to programming a 2- or 3-way superscalar processor, because of the very short pipeline stages.
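       For example (the mnemonics and register numbers are only indicative), the Xbar latency between two dependent additions can be hidden by interleaving an unrelated instruction:

add r1,r2,r3      ; r3 = r1 + r2
add r7,r8,r9      ; independent instruction, fills the latency slot
add r3,r4,r5      ; r5 = r3 + r4 : r3 arrives through the bypass just in time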

       While this applies to computational instructions, it does not apply to data movement instructions, which typically use the Xbar only once : they can be pipelined and do not suffer from instruction pairing restrictions as in superscalar CPUs.

       The scoreboard checks the data dependencies and prevents multiple-cycle-latency instructions from giving wrong results. It is therefore very interesting to unroll loops at least twice and, if possible, to "dephase" the different copies, so as to get the most out of the FC0 architecture. On the other hand, this reduces the number of available registers, and unrolling with more than 4 copies might not yield a significant additional win.
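       As a sketch (the mnemonics, strides and register numbers are only indicative, and the loop control is not shown), a simple accumulation loop unrolled twice with the two copies dephased could look like this:

loop:
loadi 8,r2,r4     ; fetch the next operand, post-increment the pointer
add r4,r10,r10    ; first partial sum
loadi 8,r3,r5     ; second copy, dephased : this load overlaps the first add
add r5,r11,r11    ; second partial sum, independent from the first
..(loop control here)..

The two partial sums in r10 and r11 are added together after the loop exits.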

       Curiously, loop unrolling also applies to the pointers. Each new address value must be valid before entering the execution pipeline, and the pointer registers must be duplicated because the [register+immediate offset] addressing mode is potentially dangerous. The "pointer duplication" technique must be used when a high memory bandwidth must be sustained, because it benefits from the fully pipelined pointer update and checking mechanism. Again, at least 4 pointers are necessary to achieve the peak instantaneous bandwidth. Since only two registers can point into the same cache line at a time, the four registers must reference two different streams.
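       Pointer duplication can be sketched as follows (strides and register numbers are only an example); the four pointers alternate between two streams so that no more than two of them point into the same cache line:

loadi 16,r2,r10   ; stream A, first pointer
loadi 16,r4,r12   ; stream B, first pointer
loadi 16,r3,r11   ; stream A, second pointer
loadi 16,r5,r13   ; stream B, second pointer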

       Because of the previously explained mechanisms (speculative and background checking of the pointers in order to catch faulty instructions at the decode stage), only post-increment addressing and direct register jump[/call] are supported, because the address is then known before the instruction is executed. One must prefetch the locations from memory before use, by "associating" a pointer register with a memory location. When this prefetch is scheduled enough in advance, it gives the CPU time to check the pointer in the TLB, prefetch the necessary data from the memory hierarchy, or prefetch the TLB replacement code if the pointer is invalid.
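       Such an association can be sketched like this (the loadaddri mnemonic and the distances are only indicative): the pointer is set up long before the data is needed, so the TLB check and the prefetch proceed in the background:

loadaddri buffer_offset,r1,r2   ; associate r2 with the buffer, the checks start here
..(several unrelated instructions here)..
loadi 8,r2,r3                   ; by now the data should be ready in the L/S Unit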

       In "vector loops" where linear arrays of data are processed, the prefetch mechanism is helped by the "stream hint" which help the CPU determine (following the architecture) which L/S Unit contains the data and/or which memory stride (or SDRAM bank) must be used. The "cache hint"ed L/S instructions further reduce the cache memory thrashing by specifying which data should reside on-chip, which data can be flushed after use and which data must bypass the cache and go to the main memory.

       It is also recommended to use the post-incremented L/S instructions to prefetch data that are not accessed linearly. For example, a program that reads non-contiguous operands from memory with only one pointer register (r2) can do the following :

loadi (operand2-operand1),r2,r3   ; r3 = [r2] (operand1), then r2 advances to operand2
..(several instructions here)..
loadi (operand3-operand2),r2,r3   ; reads operand2, r2 now points to operand3
..(several instructions here)..
loadi (operand4-operand3),r2,r3
..(several instructions here)..
loadi (operand5-operand4),r2,r3

Of course, several conditions must be met : the differences between the addresses must be known in advance and must fit in the immediate field. If a difference does not fit there but is below 2^16, one can load it with a loadconsx during a stall cycle. Ultimately, the data addresses or the access order can be changed.
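       As a sketch of this fallback (the loadconsx syntax is only indicative), the stride is built in a register during an otherwise lost cycle, then the pointer is updated explicitly before the load:

loadconsx.0 stride,r4   ; sign-extended 16-bit constant into r4
add r4,r2,r2            ; explicit pointer update, checked like any pointer write
load r2,r3              ; plain load through the freshly validated pointer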

 

[ to be continued ! yg. ]

 


part7.html dec. 16 by Whygee