Compiler Optimizations

Compiler Optimizations#

  • Assume the following latencies (# of cycles between instructions)
    • FP ALU Op to Store double (2)
    • Load double to FP ALU Op (1)
    • Load double to Store double (0)
    • FP ALU Op to another FP ALU Op (3)
    • ALU to ALU (1)

Example 1#

for (i=1000; i>0; i--) {
x[i] = x[i] + s; // s is stored in F2
}
LOOP: ; 9 cycles per interation
L.D F0, 0 (R1) ; 1 cycle delay inserted below this instruction
ADD.D F4, F0, F2 ; 2 cycle delay inserted below this instruction
S.D F4, 0 (R1)
DADDI R1, R1, -8 ; 1 cycle delay inserted below this instruction
BNE R1, R2, LOOP

Loop Unrolling#

LOOP: ; 15 cycles over two iterations
L.D F0, 0 (R1) ; first iteration here
ADD.D F4, F0, F2
S.D F4, 0 (R1)
L.D F0, -8 (R1) ; second interation here
ADD.D F4, F0, F2
S.D F4, -8 (R1)
DADDI R1, R1, -16
BNE R1, R2, LOOP

This optimization is called loop unrolling: generate code with multiple copies of the loop

Code Scheduling#

LOOP: ; 8 cycles over two iterations
L.D F0, 0 (R1)
L.D F10, -8 (R1)
ADD.D F4, F0, F2
ADD.D F12, F10, F2
DADDI R1, R1, -16
S.D F4, 16 (R1)
S.D F12, 8 (R1)
BNE R1, R2, LOOP

This is called code scheduling: Reorder instructions to eliminate stalls. A limitation is the number of registers

Example 2: Software Pipelining#

Running instructions for different iterations of the original loop in one iteration of the optimized loop

for (i=1000; i>0; i++) {
x[i] = x[i] + s; // s is stored in F2
}
LOOP: ; 9 cycles per interation
L.D F0, 0 (R1) ; 1 cycle delay inserted below this instruction
ADD.D F4, F0, F2 ; 2 cycle delay inserted below this instruction
S.D F4, 0 (R1)
DADDI R1, R1, 8 ; 1 cycle delay inserted below this instruction
BNE R1, R2, LOOP

Load - Add - Store

Each time you run the loop, you run a SAL. In the very beginning, you need to do the first Load and add. This technique is called software pipelining.

PROLOG:
L.D F0, 0 (R1)
ADD.D F4, F0, F2
L.D F0, 8 (R1)
DADDI R1, R1, 16
LOOP:
S.D F4, -16 (R1)
ADD.D F4, F0, F2
L.D F0, 0 (R1)
DADDI R1, R1, 8
BNE R1, R2, LOOP
EPILOG:
S.D F4, -16 (R1)
ADD.D F4, F0, F2
S.D. F4, -8 (R1)
Last updated on