/*

DCT1D_04.c
version Tue Jan  1 08:45:57 GMT 2002
whygee@f-cpu.org

1-D 8-bin DCT for a "baseline" JPEG compressor.

Optimisation : part 4 = rescheduling

It appears that there is a problem at the end
of the block, mainly because of the multiply's
latency and a slight order problem at the beginning
of step 5. We have to reorder the operations
to reduce the overall latency.

End-of-block and start-of-block contentions
are common because the pool of independent
operations runs out there. On top of that,
Winograd's algorithm is highly optimised but its
dependency diagram is not symmetrical, unlike
the usual FFT "butterfly". The problem that arises
here is not surprising because the dependency
tree narrows.

In a "real" case, there are other things to
interleave with, such as load/stores, counter
or pointer manipulations... However we will
reorder the instructions manually.

When we can't recover the lost cycles,
the solution is to interleave some other "useful
work" to fill the pipeline stalls. As noted before,
the first way is to interleave data movement
or transformations from the next code block.
The other (desperate) way is to unroll the loop :
  1) copy-paste the block
  2) rename all the registers in the second block
  3) interleave the two blocks in such a position
     that the stalls are filled.
  4) copy-paste the resulting block to form
     the starting and ending of the loop
  5) don't forget to divide the loop count by two

Fortunately, in the current case, it is possible
to manually reorder the instructions. It took a few
minutes and it spans over a larger part of the block
than expected but only one cycle is lost in the end,
which is not enough to justify loop unrolling.
This deep swapping increases the register usage
and a few more temporary variables are necessary,
maybe 20 or 22 instead of 16 or 18.

*/
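
/*
A minimal sketch of the unrolling recipe described above,
on a hypothetical kernel rather than on the DCT itself:
one multiply followed by two dependent operations.
The names kernel_rolled and kernel_unrolled, and the kernel
they compute, are made up for this illustration.
Step 4 is not shown: both copies start in the same
iteration here, so no separate start and end code is needed.
A C compiler is free to reorder these statements anyway;
the point is only to show the renaming and interleaving.
*/

```c
#include <stddef.h>

/* Hypothetical kernel, one element per iteration.
   Each statement depends on the previous one, so an in-order
   pipeline has nothing to issue while the multiply completes. */
void kernel_rolled(const int *in, int *out, size_t n, int m)
{
    for (size_t i = 0; i < n; i++) {
        int t = m * in[i];   /* long-latency multiply */
        int u = t + in[i];   /* must wait for t */
        out[i] = u - 1;      /* must wait for u */
    }
}

/* Same work, unrolled by two. n is assumed to be even.
   The second copy uses renamed temporaries (t1, u1) and is
   interleaved so that it fills the latency of the first copy.
   The loop now runs n/2 times. */
void kernel_unrolled(const int *in, int *out, size_t n, int m)
{
    for (size_t i = 0; i < n; i += 2) {
        int t0 = m * in[i];       /* copy A: multiply */
        int t1 = m * in[i + 1];   /* copy B: issued while A's multiply runs */
        int u0 = t0 + in[i];      /* copy A */
        int u1 = t1 + in[i + 1];  /* copy B */
        out[i]     = u0 - 1;      /* copy A */
        out[i + 1] = u1 - 1;      /* copy B */
    }
}
```

/*
The price is the one noted above: twice as many live
temporaries, which is why unrolling is kept as a last resort.
*/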

typedef signed short int sample;

sample
  b0, b1, b2, b3, b4, b5, b6, b7,
  c0, c1, c2, c3, c4, c5, c6,
  d0, d1,     d3, d4,
          e2, e3, e4,     e6, e7,
          f2, f3, f4, f5, f6, f7;  /* the outputs */

/*
 Here, we assume that there is no pending operation
 that deals with the inputs aX (a0 to a7).
 */

/* Step 1 */
b0 = a0 + a7;
b1 = a1 + a6;
b2 = a3 - a4;
b3 = a1 - a6;
b4 = a2 + a5;
b5 = a3 + a4;
b6 = a2 - a5;
b7 = a0 - a7;  /***/

/* Step 2 */
c0 = b0 + b5;
c1 = b1 - b4;
c2 = b2 + b6;  /***/
c6 = b3 + b6;         /* moved */
c4 = b0 - b5;  /***/
c5 = b3 + b7;  /***/
c3 = b1 + b4;         /* moved */

/* Step 3 */
e3 = m1 * c6;          /*3*/ /* moved */
d3 = c1 + c4;          /*3*/
d4 = c2 - c5;          /*3*/

/* Step 4 */
e2 = m3 * c2;  /*c2*/  /*7*/
e4 = m4 * c5;  /*c5*/  /*5*/

e6 = m1 * d3;          /*3*/
e7 = m2 * d4;          /*3*/
S0 = c0 + c3;          /*7*/
S4 = c0 - c3;          /*8*/

/* Step 5 */
f4 = e3 + b7;  /*b7*/ /* e3=8 */
f5 = b7 - e3;  /*b7*/ /* e3=9 */
/* one free slot here : the multiply is not ready */
S2 = c4 + e6;  /*c4*/ /* e6=5 : OUCH !!! */
f6 = e2 + e7;         /* e7=5(6 after the stall) */
S6 = c4 - e6;  /*c4*/ /* e6=already ok */
f7 = e4 + e7;         /* e7=already ok */

/* Step 6 (reordered) */
S3 = f5 - f6;         /* f6=2 */   
S5 = f5 + f6;         /* f6=3 */
S1 = f4 + f7;         /* f7=2 */
S7 = f4 - f7;         /* f7=3 */
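
/*
A hand reschedule is only valid if it keeps every data
dependency. The sketch below is one possible way to check
that: it computes the block once in plain step order and
once in the rescheduled order, so that the two results can
be compared. Assumptions made for this illustration only:
plain int arithmetic, multipliers passed in a hypothetical
dct_mul struct (their values are not given in this file),
the names dct8_plain and dct8_resched, and first-stage pairs
(0,7), (1,6), (2,5), (3,4).
*/

```c
/* Hypothetical container for the four multipliers m1..m4. */
typedef struct { int m1, m2, m3, m4; } dct_mul;

/* Plain step order. */
void dct8_plain(const int a[8], int S[8], dct_mul k)
{
    /* Step 1: pairs (0,7), (1,6), (2,5), (3,4) assumed */
    int b0 = a[0] + a[7], b1 = a[1] + a[6];
    int b2 = a[3] - a[4], b3 = a[1] - a[6];
    int b4 = a[2] + a[5], b5 = a[3] + a[4];
    int b6 = a[2] - a[5], b7 = a[0] - a[7];
    /* Step 2 */
    int c0 = b0 + b5, c1 = b1 - b4, c2 = b2 + b6, c3 = b1 + b4;
    int c4 = b0 - b5, c5 = b3 + b7, c6 = b3 + b6;
    /* Step 3 */
    int d3 = c1 + c4, d4 = c2 - c5;
    /* Step 4 */
    int e2 = k.m3 * c2, e3 = k.m1 * c6, e4 = k.m4 * c5;
    int e6 = k.m1 * d3, e7 = k.m2 * d4;
    /* Step 5 */
    int f4 = e3 + b7, f5 = b7 - e3, f6 = e2 + e7, f7 = e4 + e7;
    /* Step 6 */
    S[0] = c0 + c3;  S[4] = c0 - c3;
    S[2] = c4 + e6;  S[6] = c4 - e6;
    S[1] = f4 + f7;  S[7] = f4 - f7;
    S[3] = f5 - f6;  S[5] = f5 + f6;
}

/* Same operations, issued in the rescheduled order. */
void dct8_resched(const int a[8], int S[8], dct_mul k)
{
    int b0 = a[0] + a[7], b1 = a[1] + a[6];
    int b2 = a[3] - a[4], b3 = a[1] - a[6];
    int b4 = a[2] + a[5], b5 = a[3] + a[4];
    int b6 = a[2] - a[5], b7 = a[0] - a[7];

    int c0 = b0 + b5, c1 = b1 - b4, c2 = b2 + b6;
    int c6 = b3 + b6;                 /* moved up */
    int c4 = b0 - b5, c5 = b3 + b7;
    int c3 = b1 + b4;                 /* moved down */

    int e3 = k.m1 * c6;               /* multiply started early */
    int d3 = c1 + c4, d4 = c2 - c5;

    int e2 = k.m3 * c2, e4 = k.m4 * c5;
    int e6 = k.m1 * d3, e7 = k.m2 * d4;
    S[0] = c0 + c3;  S[4] = c0 - c3;  /* fills multiply latency */

    int f4 = e3 + b7, f5 = b7 - e3;
    S[2] = c4 + e6;
    int f6 = e2 + e7;
    S[6] = c4 - e6;
    int f7 = e4 + e7;

    S[3] = f5 - f6;  S[5] = f5 + f6;
    S[1] = f4 + f7;  S[7] = f4 - f7;
}
```

/*
Integer arithmetic keeps the comparison exact. A C compiler
may schedule both functions identically, so this checks the
dependencies of the hand schedule, not its cycle count.
*/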
