/*

DCT1D_05.c
version mar jan  1 08:45:57 GMT 2002
whygee@f-cpu.org

1* 8-bin DCT for a "baseline" JPG compressor.

Optimisation : part 5 = register reallocation

Once the data dependencies are free from
pipeline stalls, we can "optimise" the
register allocation. It is not a critical part
in F-CPU because there are 64 registers :
it's well enough to avoid complex algorithms.

We could associate each register to a temporary
variable but this would reduce the available
ressources for the rest of the program, particularly
for the previous and next code blocks. We have
to keep data pointers, constants (index advance
and twiddle factors) and some headroom, in case
loop unrolling appears necessary in the future.
For example, a dual-issue FC0 would probably
require a special 2x unrolling : the duplication
of every instruction (not forgetting to rename
the registers), forming perfectly independent and
scheduled pairs. This is the simplest solution if
you don't want to recalculate all the data
dependecy distances and the register numbers.
Of course, this only works when all the execution
units are duplicated and the multiply unit
is too heavy to accept two instructions per cycle.
Some local reordering would then be necessary by
"shifting" the code stream by a certain amount
of instructions, so that there is no conflict.

Back to register re-allocation.
This algorithm is very simple (once you have
understood it). Furthermore there is no register
pressure so we can follow it quietly.

The basic principle is to identify a
datum, from the time it is created until the
last use. The end of a datum's life is
identified when the associated register
is written to, that is : when it is associated
to another datum (or more exactly, before, and
this detail becomes meaningful later).

For example :
  a = b + c       <-- c contains a first value
--- anything ---  <-- time during which c is unused
  c = d + e       <-- c contains a new value 

The algorithm uses a "scoreboard" approach
where each register has a flag (either allocated
and to which old register, or free).
In a forward scan, as in the P6 core family,
the scoreboard's entry is cleard whenever
the register is written to. If it is used before
being overwritten, this means that the value
it contained has died and we have thus detected
a datum's death. In order to detect this case,
it is more efficient to run the algorithm
from the end of the block (backwards),
otherwise the registers will contain unused
values during a significant fraction of the
program. This is an important difference
with out-of-order cores such as the P6 where
the instruction flow must be decoded in-order
for obvious reasons.

This register re-allocation method is optimal
when there is no register pressure and if
the coded algorithm is already optimised.
If there are not enough registers for a
given code block, things are getting much
more difficult because one has to "swap"
data to memory or split/slice the block
into pipelineable chunks. If it is not
possible, it is necessary to count
which data are most used to keep them in
registers. This allocation algorithm is
not funny at all because its complexity depends
on the instruction set's orthogonality...

No such problem here : we have enough registers
in the F-CPU to run an even more complex computation.
When our scoreboard needs a new entry and all the
entries are "busy", we can simply create a new entry.


The reallocation algorithm goes as follows:
 1- point at the end of the instruction stream
 2- identify the source operands as "dying data" 
 3- for each "dying data" :
      4 allocate a register for each data
     (set the corresponding scoreboard entry)
      5 go to the previous instruction
      6 if the datum is written
        7  replace the old reference with the
          new register name
        8 leave this loop (this is the end of datum)
      9 replace all source references with the
        new register name (carefully checking
        that there is no numbering conflict)
      10 loop back to 5
 11- If we have reached the beginning of
     the code block, exit this algorithm
 12- When all the curent instruction's references
     are replaced, the instruction can be "committed".
 13- go to the previous instruction (upper in the stream)
 14- identify the yet already re-allocated register
      references.
 15- if all the sources are already allocated, go to 11
 16- all the unallocated sources are "dying data", go to 3

There is a necessity to distinguish newly allocated
references from the original register references.
It is safer to use different symbolic names, such
as a1-a7 for the original names and R1-R63 for the
new names. While this algorithm is easy, it is also
very easy to make fatal errors, particularly if you
are sleep-deprived... You are warned : from experience
i can say that it can hurt very very harshly.


Here is a small example block :
  a0 = a1 + a2  (1)
  a1 = a2 + a3  (2)
  a2 = a3 + a4  (3)
  a5 = a6 + a1  (4)

We start at the end, where a5 = a6 + a1 and the scoreboard
is initially cleared. a5 can be replaced by anything a priori,
let's say R1. Then we propagate the new names of a1=R2 and a6=R3
(we have created new scoreboard entries) until these values are
"associated" with their value. 

  R1=a4  R2=a1  R3=a6

  a0 = a1 + a2  (1')
  R2 = a2 + a3  (2')
  a2 = a3 + a4  (3')
  R1 = R3 + R2  (4')

Now that all the references are replaced in the instruction,
we go up and see what references are not yet replaced. For
each replacement, we look in the scoreboard for unused entries :
R1 is free (since it is written later), we need to create two new
entries. The scoreboard then contains :

  R1=a4  R2=a1  R3=a6  R4=a3  R5=a2

  a0 = a1 + a2  (1'')
  R2 = a2 + R4  (2'')
  R5 = R4 + R1  (3'')

The process continues until all instructions are processed.
However we have to process inputs and outputs (terminals)
differently : here a2 is used both internally and externally.
Depending on the case, it is maybe necessary to allocate a new
register.

  R1=a4  R2=a1  R3=a6  R4=a3  R5=a2(terminal) R6=a2(internal)

  a0 = a1 + R6  (1''')
  R2 = R6 + R4  (2''')

and finally, the result is

  R1=a4  R2=a1  R3=a6  R4=a3  R5=a2(terminal) R6=a2(internal)
  R7=a1  R8=a0

  R8 = R7 + R6  (1'''')
  R2 = R6 + R4  (2''')
  R5 = R4 + R1  (3'')
  R1 = R3 + R2  (4')

This small example shows how it works but it becomes efficient
for only larger code blocks, such as our DCT. By changing
the allocation policy, we can require more or less registers
than previously. In our DCT where a lot of temporary variables
are used, this will "compress" the number used registers.
If we omit m1-m4, a1-a7 and S0-S7, we needed originally 30
registers and now we only need 10 after two hours of hacking.
This means that despite the instruction reordering which
increases the lifelength of some variables, we are still
near the minimum of 9 as written in the original algorithm.

* Note : the recoded algorithm is now completely
obfuscated ! You will not be able to understand it unless
you perform a complex and exhaustive analysis.
* Note 2: The algorithm must be modified if the CPU has
2-address instructions (ie x86). It can't be applied on the
instruction stream and it requires a higher level of
optimisation, using dataflow graph analysis. This is
much more complex because we have to determine which data
must be copied before the register is overwritten by a
result. It results in a higher register pressure and
a greater programming difficulty.

Do not forget that you have to read from the end of the file
if you want to understand the comments ;-)

Warning :
This manual algorithm is not difficult but any error can
be fatal. It is relatively easy to implement it in software
but no source processor exists yet. After some tens of lines,
it is impossible to avoid one or two errors, but here are
some advices :
 - run regression tests during all the process to detect
   errors as soon as possible
 - save the result under a new name often, in order to always
   have a failsafe and working source code.
 - keep the old and new values in two consecutive lines,
   one being commented out, this helps to understand what
   happens and visually detect errors.
 - implement some "gards" that will help detect the errors.
   In our example, it is done with the variable's naming :
   it is illogical to find a name like "e4" in your scoreboard
   when you are manipulating the b block, because e4 appears
   later and should not have been created. This means that
   you have forgotten to discard it from the scoreboard
   or erased the line where it is defined, or something even
   more difficult to trace back.
 - another example of error detection is when an instruction
   writes to a register that is not present in the scoreboard
   but this is not a "terminal".

*/


sample
  R1, R2, R3, R4, R5, R6, R7, R8, R9, R10; /* reallocated temporary registers */
/*b0, b1, b2, b3, b4, b5, b6, b7,  deprecated
  c0, c1, c2, c3, c4, c5, c6,
  d0, d1,     d3, d4,
          e2, e3, e4,     e6, e7,
          f2, f3, f4, f5, f6, f7; */

/*
 here, we assume that there is no pending operation
   that deals with aX.
 */

/* Step 1 */
/* b0 = a0 + a7; */
R7 = a0 + a7;
/* b1 = a1 + a6; */
R6 = a1 + a6;
/* b2 = a2 - a4; */
R4 = a2 - a4;
/* b3 = a1 - a6; */
R5 = a1 - a6; /* and here it's a simple substitution :-) */
/* b4 = a2 + a5; */
R9 = a2 + a5;
/* b5 = a3 + a4; */
R10 = a3 + a4;
/* b6 = a2 - a5; */
R3 = a2 - a5;
/* b7 = a0 - a7; */
R8 = a0 - a7;

/* Step 2 */
/* c0 = b0 + b5; R1=(c0) R2=free R3=b6 R4=b2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
   R1 = R7 + R10; /* c0 is removed/"created, b0 and b5 survive */
/* c1 = b1 - b4; R1=c0 R2=(c1) R3=b6 R4=b2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
   R2 = R6 - R9; /* c1 is removed <= c1 is "created, b1 and b4 survive */
/* c2 = b2 + b6; R1=c0 R2=c1 R3=b6 R4=c2/b2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
   R4 = R4 + R3; /* replacement b2/c2/R4 <= c2 is "created", b2 dies, b6 survives */
/* c6 = b3 + b6; R1=c0 R2=c1 R3=c6/b6 R4=c2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
   R3 = R5 + R3; /* replacement c6/b6/R3 <= b3 survives, b6 dies, c6 is created */
/* c4 = b0 - b5; R1=c0 R2=c1 R3=c6 R4=c2 R5=b3 R6=b1 R7=c4/b0 R8=b7 R9=b4 R10=b5 */
   R7 = R7 - R10; /* replacement c4/b0/R7 <= c4 is "created", b0 and b5 die */
/* c5 = b3 + b7; R1=c0 R2=c1 R3=c6 R4=c2 R5=c5/b3 R6=b1 R7=c4 R8=b7 R9=b4 */
   R5 = R5 + R8; /* replacement c5/b3/R5 <= c5 is "created", b3 dies, b7 survives */
/* c3 = b1 + b4; R1=c0 R2=c1 R3=c6 R4=c2 R5=c5 R6=b1 R7=c4 R8=b7 R9=c3/b4 */
   R9 = R6 + R9; /* c3 is "created", b4 and b1 die */

/* Step 3 */
/* e3 = m1 * c6; R1=c0 R2=c1 R3=c6/e3 R4=c2 R5=c5 R6=free R7=c4 R8=b7 R9=c3 */
   R3 = m1 * R3; /* replacement c6/e3/R3 */
/* d3 = c1 + c4; R1=c0 R2=c1/d3 R3=e3 R4=c2 R5=c5 R6=free R7=c4 R8=b7 R9=c3 */
   R2 = R2 + R7; /* replacement c1/d3/R2 */
/* d4 = c2 - c5; R1=c0 R2=d3 R3=e3 R4=c2 R5=c5 R6=free R7=c4 R8=b7 R9=c3 */
   R6 = R4 - R5; /* d4/R6 dies */

/* Step 4 */
/* e2 = m3 * c2; R1=c0 R2=d3 R3=e3 R4=c2 R5=c5 R6=d4 R7=c4 R8=b7 R9=c3 */
   R4 = m3 * R4; /* same as next line with e2/c2/R4 */
/* e4 = m4 * c5; R1=c0 R2=d3 R3=e3 R4=e2 R5=e4/c5 R6=d4 R7=c4 R8=b7 R9=c3 */
   R5 = m4 * R5; /* same as next line with e4/c5/R5 */

/* e6 = m1 * d3; R1=c0 R2=e6/d3 R3=e3 R4=e2 R5=e4 R6=d4 R7=c4 R8=b7 R9=c3 */
   R2 = m1 * R2; /* same as next line with e6/d3/R2 */
/* e7 = m2 * d4; R1=c0 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7/d4 R7=c4 R8=b7 R9=c3 */
   R6 = m2 * R6; /* d4 is "created", its entry is replaced by d4 which dies. */

/* S0 = c0 + c3; R1=c0 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 R9=c3 */
   S0 = R1 + R9; /* nothing new. */
/* S4 = c0 - c3; R1=c0 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 R9=c3 */
   S4 = R1 - R9; /* c0 needs a slot, it replaces f4. new slot created for c3 */

/* Step 5 */
/* f4 = e3 + b7; R1=f4 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 */
   R1 = R3 + R8;  /* R1/f4 created and nothing new is allocated :
                     R1 disappears from the scoreboard */
/* f5 = b7 - e3; R1=f4 R2=e6 R3=f5/e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 */
   R3 = R8 - R3;  /* f5 appears, e3 dies, R3 is associated to both */
/* one free slot here : the multiply is not ready */
/* S2 = c4 + e6; R1=f4 R2=e6 R3=f5 R4=e2 R5=e4 R6=e7 R7=c4 */
   S2 = R7 + R2;
/* f6 = e2 + e7; R1=f4 R2=f7 R3=f5 R4=f6/e2 R5=e4 R6=e7 R7=c4 */
   R4 = R4 + R6;  /* e7 exists but not e2. f6 is being created.
                     R4 is both associated to f6 (new) and e2 (old).
                     We spare one new register allocation !
		     However, in CPUs with imprecise exception
                     (ALPHA for example) this trick is not allowed
                     because on FPU fault the exception handler can't
                     know which value (old or new) the register contains. */
/* S6 = c4 - e6; R1=f4 R2=e6 R3=f5 R4=f6 R5=e4 R6=e7 R7=c4 */
   S6 = R7 - R2; /* e6 replaces f7 */
/* f7 = e4 + e7; R1=f4 R2=f7 R3=f5 R4=f6 R5=e4 R6=e7 */
   R2 = R5 + R6; /* f7 is "created" so we remove it from the scoreboard */

/* Step 6 */
/* S3 = f5 - f6; R1=f4 R2=f7 R3=f5 R4=f6 */
   S3 = R3 - R4;
/* S5 = f5 + f6; R1=f4 R2=f7 R3=f5 R4=f6 */
   S5 = R3 + R4;
/* S1 = f4 + f7; R1=f4 R2=f7 */
   S1 = R1 + R2;
/* S7 = f4 - f7; R1=f4 R2=f7 */
   S7 = R1 - R2;
