In this lecture, I talked about loop unrolling. The code that I gave was:

    void wcopy(int *dp, int *sp, int imax)
    {
        int i;

        for (i = 0; i < imax; i++)
            *dp++ = *sp++;
    }

which is equivalent to

    void wcopy(int *dp, int *sp, int imax)
    {
        int i;

        i = 0;
        while (i < imax) {
            *dp++ = *sp++;
            i++;
        }
    }

and is naively translated into

            li      $t0, 0
    loop:   bge     $t0, $a2, loopdone      # overhead
            lw      $t1, 0($a1)             # real work
            add     $a1, 4                  # real work
            sw      $t1, 0($a0)             # real work
            add     $a0, 4                  # real work
            add     $t0, 1                  # overhead
            j       loop                    # overhead
    loopdone:
            jr      $ra

which has 3 instructions of overhead for 4 instructions of work, giving a loop overhead of 3/7 or 43%. The total loop is 7 cycles, ignoring for the moment delay slots.

Most compilers are smarter, and would translate the code into

            li      $t0, 0
            j       test
    top:    lw      $t1, 0($a1)             # real work
            add     $a1, 4                  # real work
            sw      $t1, 0($a0)             # real work
            add     $a0, 4                  # real work
            add     $t0, 1                  # overhead
    test:   blt     $t0, $a2, top           # overhead
            jr      $ra

This gives 2 instructions of overhead out of 6 total (4 of real work), or an overhead percentage of 33%. The total loop length is 6 cycles.

The C code itself could be optimized:

    void wcopy(int *dp, int *sp, int imax)
    {
        int *dpend = dp + imax;

        while (dp != dpend)
            *dp++ = *sp++;
    }

which translates to

            sll     $a2, 2                  # imax * sizeof(int)
            add     $t0, $a0, $a2           # dpend
            j       test
    top:    lw      $t1, 0($a1)             # real work
            add     $a1, 4                  # real work
            sw      $t1, 0($a0)             # real work
            add     $a0, 4                  # real work
    test:   bne     $a0, $t0, top           # overhead
            jr      $ra

giving an overhead percentage of 20%, a body length of 5 cycles.
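The overhead percentages quoted so far are simply the overhead instructions divided by the total instructions in the loop body. As a quick check of that arithmetic, here is a minimal C sketch; the function name `overhead_percent` is made up for illustration and is not part of the lecture code.

```c
/* Hypothetical helper (not from the lecture): percentage of loop-body
 * instructions that are loop-control overhead, rounded to the nearest
 * whole percent. */
int overhead_percent(int overhead, int work)
{
    int total = overhead + work;

    return (100 * overhead + total / 2) / total;
}
```

With the instruction counts from the three translations, `overhead_percent(3, 4)` gives 43 for the naive version, `overhead_percent(2, 4)` gives 33 for the test-at-the-bottom version, and `overhead_percent(1, 4)` gives 20 for the pointer-comparison version.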

With *loop unrolling*, we can achieve a much lower overhead:

    void wcopy(int *dp, int *sp, int imax)
    {
        int *dpend;

        switch (imax & 0x3) {   /* imax % 4 */
        case 3: *dp++ = *sp++;
        case 2: *dp++ = *sp++;
        case 1: *dp++ = *sp++;
        }
        imax &= ~0x3;   /* adjust imax for the work we did above */
        dpend = dp + imax;
        while (dpend != dp) {
            dp[0] = sp[0];  /* 4 copies of basic loop body */
            dp[1] = sp[1];  /* but using small indices instead */
            dp[2] = sp[2];  /* of bumping the pointer each time */
            dp[3] = sp[3];
            dp += 4;
            sp += 4;
        }
    }

This translates to:

            andi    $t1, $a2, 0x3
            beq     $t1, 0, L0              # switch
            beq     $t1, 1, L1
            beq     $t1, 2, L2
            lw      $t0, 0($a1)
            add     $a1, 4
            sw      $t0, 0($a0)
            add     $a0, 4
    L2:     lw      $t0, 0($a1)
            add     $a1, 4
            sw      $t0, 0($a0)
            add     $a0, 4
    L1:     lw      $t0, 0($a1)
            add     $a1, 4
            sw      $t0, 0($a0)
            add     $a0, 4
    L0:     and     $a2, $a2, 0xfffffffc
            sll     $a2, 2
            add     $t0, $a0, $a2
            j       test
    loop:   lw      $t1, 0($a1)             # real work
            sw      $t1, 0($a0)             # real work
            lw      $t1, 4($a1)             # real work     # small indices require no extra
            sw      $t1, 4($a0)             # real work     # work to be done, since the
            lw      $t1, 8($a1)             # real work     # instruction allows it
            sw      $t1, 8($a0)             # real work
            lw      $t1, 0xc($a1)           # real work
            sw      $t1, 0xc($a0)           # real work
            add     $a1, 0x10               # real work     # 4 * sizeof(int)
            add     $a0, 0x10               # real work
    test:   bne     $a0, $t0, loop          # overhead
            jr      $ra

giving an overhead of 1 instruction per loop iteration, or only 1/11 or 9%. Extending the unroll to 8 times instead of 4 will drive the overhead per iteration even lower. Note that to schedule this code to take load delay and branch delay slots into account, it would really look like:

    #
    # initial portion the same
    #
            andi    $t1, $a2, 0x3
            beq     $t1, 0, L0              # switch
            beq     $t1, 1, L1
            beq     $t1, 2, L2
            lw      $t0, 0($a1)
            add     $a1, 4
            sw      $t0, 0($a0)
            add     $a0, 4
    L2:     lw      $t0, 0($a1)
            add     $a1, 4
            sw      $t0, 0($a0)
            add     $a0, 4
    L1:     lw      $t0, 0($a1)
            add     $a1, 4
            sw      $t0, 0($a0)
            add     $a0, 4
    L0:     and     $a2, $a2, 0xfffffffc
            sll     $a2, 2
            add     $t0, $a0, $a2
    #
    # more optimization below
    #
            .set    noreorder
            j       test
            sub     $a1, 0x10               # added to permit motion into delay slot
                                            # itself in delay slot of the j
    loop:   lw      $t1, 0($a1)             # real work
            lw      $t2, 4($a1)             # real work
            lw      $t3, 8($a1)             # real work
            lw      $t4, 0xc($a1)           # real work
            sw      $t1, 0($a0)             # real work
            sw      $t2, 4($a0)             # real work
            sw      $t3, 8($a0)             # real work
            sw      $t4, 0xc($a0)           # real work
            add     $a0, 0x10               # real work
    test:   bne     $a0, $t0, loop          # overhead
            add     $a1, 0x10               # real work, 4 * sizeof(int), in delay slot
                                            # of the bne
            .set    reorder
            jr      $ra
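The main change in the scheduled loop is that all four loads are issued into separate registers before any store, so no store has to wait on the load immediately before it. The same shape can be written in C, as in the rough sketch below. The name `wcopy_sched` and the temporaries `t1` to `t4` are invented here to mirror the registers, and the sketch assumes the two buffers do not overlap, since hoisting the loads above the stores changes the result for overlapping copies. A C compiler is free to reorder these statements anyway, so this only illustrates the idea.

```c
/* Hypothetical C rendering of the scheduled loop (not from the lecture).
 * Assumes the source and destination buffers do not overlap. */
void wcopy_sched(int *dp, int *sp, int imax)
{
    int *dpend;
    int t1, t2, t3, t4;

    /* Copy the imax % 4 leftover words first, as in the switch above. */
    switch (imax & 0x3) {
    case 3: *dp++ = *sp++;      /* fall through */
    case 2: *dp++ = *sp++;      /* fall through */
    case 1: *dp++ = *sp++;
    }
    imax &= ~0x3;
    dpend = dp + imax;
    while (dp != dpend) {
        /* all four loads first, each into its own temporary */
        t1 = sp[0];
        t2 = sp[1];
        t3 = sp[2];
        t4 = sp[3];
        /* then all four stores */
        dp[0] = t1;
        dp[1] = t2;
        dp[2] = t3;
        dp[3] = t4;
        dp += 4;
        sp += 4;
    }
}
```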

In loop unrolling, the idea is to make copies of the loop body so that the loop control overhead per basic copy shrinks by a factor equal to the number of copies made. The unrolling factor may be anything, but it is usually a power of two, which makes the rounding part (handling the leftover iterations before the main loop) a matter of simple masking. The goal is that if we run the loop several million times, the loop itself will be consuming most of the cycles: the extra work before we enter the loop is comparatively small and may be ignored.
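To make the role of the power-of-two factor concrete, here is a sketch of the same routine unrolled 8 times instead of 4. The name `wcopy8` is hypothetical and this variant was not given in the lecture; it just follows the pattern of the 4-way version, with the mask widened from `0x3` to `0x7`.

```c
/* Hypothetical unroll-by-8 variant of wcopy (not from the lecture). */
void wcopy8(int *dp, int *sp, int imax)
{
    int *dpend;

    switch (imax & 0x7) {       /* imax % 8: the leftover words */
    case 7: *dp++ = *sp++;      /* every case falls through to the next */
    case 6: *dp++ = *sp++;
    case 5: *dp++ = *sp++;
    case 4: *dp++ = *sp++;
    case 3: *dp++ = *sp++;
    case 2: *dp++ = *sp++;
    case 1: *dp++ = *sp++;
    }
    imax &= ~0x7;               /* round down to a multiple of 8 */
    dpend = dp + imax;
    while (dp != dpend) {
        dp[0] = sp[0];          /* 8 copies of the basic loop body */
        dp[1] = sp[1];
        dp[2] = sp[2];
        dp[3] = sp[3];
        dp[4] = sp[4];
        dp[5] = sp[5];
        dp[6] = sp[6];
        dp[7] = sp[7];
        dp += 8;
        sp += 8;
    }
}
```

Assuming it is translated the same way as the 4-way version, the loop body would be 16 loads and stores, 2 pointer adds, and 1 branch, so the overhead would be 1 instruction in 19, or about 5%.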

Let's see what the effects of the optimization are. If we were to
copy twenty million words using the original, naively translated
version, we need to run through the loop twenty million times, so
would use 140,000,000 cycles (ignoring delay slots). The output of a
real compiler would get the job done in 120,000,000 cycles. Either is
an appreciable fraction of a second -- or over a second -- on many
processors (there are d-cache and memory bandwidth effects too, but
we're ignoring these for now; these effects would slow this down). In
the best tuned version, we need to run through the loop five million
times (each iteration copies 4 elements), with a loop body of 11
cycles, so we use a total of 55,000,000 cycles; this is more than
**twice** as fast as the original.
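The totals above are just the number of loop iterations times the length of the loop body. A small C sketch of that calculation follows; the function `total_cycles` is a made-up helper, and it keeps the same simplifications as the text (one cycle per instruction, no setup code, no delay slots, no cache effects).

```c
/* Hypothetical helper (not from the lecture): total cycles to copy
 * `words` words with a loop that handles `per_iter` words per iteration
 * and whose body takes `body_cycles` cycles.  Assumes `words` is a
 * multiple of `per_iter` and ignores everything outside the loop. */
long total_cycles(long words, long per_iter, long body_cycles)
{
    return (words / per_iter) * body_cycles;
}
```

For twenty million words, `total_cycles(20000000, 1, 7)` gives 140,000,000 for the naive loop, `total_cycles(20000000, 1, 6)` gives 120,000,000 for the compiler's loop, and `total_cycles(20000000, 4, 11)` gives 55,000,000 for the unrolled loop, a ratio of about 2.5 between the first and the last.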

(D-cache and memory bandwidth limitations will mask some of the benefit of loop unrolling. Both a naive implementation and a clever implementation would be memory-bandwidth limited, and the total per-iteration time has to include the cache miss overhead, so the percentage reduction gained from unrolling will be smaller.)


bsy@cse.ucsd.edu, last updated