/*

DCT1D_03.c
version Tue Jan  1 08:45:57 GMT 2002
whygee@f-cpu.org

1D 8-bin DCT for a "baseline" JPEG compressor.

Optimisation : part 3 = distance verification

Before the registers are renamed (down to around
16 registers), we have to check that the original
order preserves all the dependency distances.

Here we use the following rules of thumb :
 * addition or subtraction : the result is not
   available before 2 cycles
 * multiplication : "mulh" is used, and we assume
   that it takes 6 cycles.

Here, "cycles" means the number of instructions
that separate a result's producer from its consumer. For example :
 a = b + c  <= takes 2 cycles
 d = e + f
 g = h + i
 k = l + a  <= a is ready now.

If there is a conflict, we have to swap several
operations. If the swap's range is too large,
there is an avalanche effect on the dependencies,
which forces us to re-analyse everything through
a global dataflow-graph reordering and flattening.

If you don't have the courage to do all these
heavy operations, then you have to accept
the pipeline stalls. However,
note that no compiler is able to do all our
manual operations automatically :-P

*/
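
/*

Sketch : checking the distances mechanically

The rule of thumb above can also be checked by a small
helper. This is only a minimal sketch and not part of
the original DCT code : the "insn" record, its fields
and count_conflicts() are hypothetical names, and the
latencies are the assumptions stated above (2 for an
addition or subtraction, 6 for "mulh").

*/

#include <stdio.h>
#include <string.h>

/* One instruction : dst = src1 op src2.
   "latency" is the number of instructions that must
   separate it from any consumer of dst. */
typedef struct {
  const char *dst, *src1, *src2;
  int latency;
} insn;

/* Returns the number of operands that are read before
   the latency of their producer has elapsed. */
static int count_conflicts(const insn *prog, int n)
{
  int conflicts = 0;
  for (int c = 0; c < n; c++) {
    const char *src[2] = { prog[c].src1, prog[c].src2 };
    for (int s = 0; s < 2; s++) {
      /* look backwards for the most recent producer */
      for (int p = c - 1; p >= 0; p--) {
        if (strcmp(prog[p].dst, src[s]) == 0) {
          int dist = c - p - 1;  /* instructions in between */
          if (dist < prog[p].latency) {
            printf("%s : %s=%d, needs %d\n",
                   prog[c].dst, src[s], dist, prog[p].latency);
            conflicts++;
          }
          break;
        }
      }
    }
  }
  return conflicts;
}

/*
 With the sequence
   e6 = m1 * d3 ; e7 = m2 * d4 ; S2 = c4 + e6
 and a latency of 6 for the two multiplications,
 count_conflicts() reports one conflict : e6=1.
 */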

sample
  b0, b1, b2, b3, b4, b5, b6, b7,
  c0, c1, c2, c3, c4, c5, c6,
  d0, d1,     d3, d4,
          e2, e3, e4,     e6, e7,
          f2, f3, f4, f5, f6, f7;

/*
 Here, we assume that no pending operation
   still involves aX.
 */

/* Step 1 */
b0 = a0 + a7;
b1 = a1 + a6;
b2 = a3 - a4;
b3 = a1 - a6;
b4 = a2 + a5;
b5 = a3 + a4;
b6 = a2 - a5;
b7 = a0 - a7;  /***/

/* Step 2 */
c0 = b0 + b5;        /* b5 = 2 cycles : right at the limit.
  If there were a conflict, we could swap it with c1. */
c1 = b1 - b4;        /* b4 = 4 cycles */
c2 = b2 + b6;  /***/ /* b6 = 3 cycles */
c3 = b1 + b4;     /* no more dangerous dependencies here. */
c4 = b0 - b5;  /***/
c5 = b3 + b7;  /***/
c6 = b3 + b6;  /***/

/* Step 3 */
S0 = c0 + c3;  /***/ /* c3=3 */
S4 = c0 - c3;  /***/ /* c3=4 */
d3 = c1 + c4;        /* c4=4 */
d4 = c2 - c5;        /* c5=4 */

/* Step 4 */
e2 = m3 * c2;  /*c2*/
e3 = m1 * c6;  /*c6*/
e4 = m4 * c5;  /*c5*/
e6 = m1 * d3;         /* d3=4 */
e7 = m2 * d4;         /* d4=4 */

/* Step 5 */
S2 = c4 + e6;  /*c4*/ /* e6=1 : OUCH !!! */
S6 = c4 - e6;  /*c4*/ /* e6=2 : ... */
f4 = e3 + b7;  /*b7*/ /* e3=5 */
f5 = b7 - e3;  /*b7*/ /* e3=6 */
f6 = e2 + e7;         /* e7=4 */
f7 = e4 + e7;         /* e7=5 */

/* Step 6 */
/*S0 = d0;*/
S1 = f4 + f7;
/*S2 = f2;*/
S3 = f5 - f6;
/*S4 = d1;*/
S5 = f5 + f6;
/*S6 = f3;*/
S7 = f4 - f7;
