# CSE 30 -- Lecture 12 -- Nov 10

## Loop Unrolling

The loop to copy an array, in C,
```
for (i = 0; i < N; i++) {
    dst[i] = src[i];
}
```
is translated into MIPS as
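```
		li	$t0,0		# i = 0
		b	test
bod:		sll	$t1,$t0,2	# byte offset = 4*i
		addu	$t3,$a1,$t1	# address of src[i]
		addu	$t2,$a0,$t1	# address of dst[i]
		lw	$t4,0($t3)
		sw	$t4,0($t2)
		addi	$t0,$t0,1	# i++
test:		blt	$t0,$a2,bod	# loop while i < N
```
with the obvious register assignments (`$a0` = dst, `$a1` = src, `$a2` = N). Counting one cycle per instruction, the runtime of this code is 3 + 7 N cycles: three for `li`, `b`, and the final `blt` that falls through, plus seven per iteration.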
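The 3 + 7 N count can be checked by stepping through the loop and charging one cycle per instruction. The sketch below does that in C. The function name `simple_loop_cycles` is made up for illustration, and the six-instruction body (scale the index, two address computations, load, store, increment) is an assumption about what the loop body contains.

```c
/* Counts the instructions executed by the simple copy loop for n elements,
 * assuming one cycle per instruction and a six-instruction loop body. */
long simple_loop_cycles(long n)
{
    long cycles = 2;        /* li, b */
    long i = 0;

    for (;;) {
        cycles += 1;        /* blt at test, taken or not */
        if (i >= n)
            break;
        cycles += 6;        /* loop body, including i++ */
        i++;
    }
    return cycles;
}
```

For example, `simple_loop_cycles(100)` returns 703, which agrees with 3 + 7 * 100.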

To unroll this loop, we first assume that N is divisible by 4:

```
for (i = 0, sp = src, dp = dst; i < N; i += 4) {
    dp[0] = sp[0];
    dp[1] = sp[1];
    dp[2] = sp[2];
    dp[3] = sp[3];
    dp += 4; sp += 4;
}
```
which would be translated into MIPS code as
```
		li	$t0,0		# i = 0
		move	$t8,$a0		# dp = dst
		move	$t9,$a1		# sp = src
		b	test
bod:		lw	$t1,0($t9)
		lw	$t2,4($t9)
		lw	$t3,8($t9)
		lw	$t4,12($t9)
		sw	$t1,0($t8)
		sw	$t2,4($t8)
		sw	$t3,8($t8)
		sw	$t4,12($t8)
		addi	$t8,$t8,16	# dp += 4
		addi	$t9,$t9,16	# sp += 4
		addi	$t0,$t0,4	# i += 4
test:		blt	$t0,$a2,bod	# loop while i < N
```
which has a runtime of 5 + 12 (N/4) = 5 + 3 N cycles. This could still be improved a little, without unrolling any further:
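```
		move	$t8,$a0		# dp = dst
		move	$t9,$a1		# sp = src
		sll	$t1,$a2,2	# 4*N bytes
		addu	$t0,$t9,$t1	# endptr = sp + N
		b	test
bod:		lw	$t1,0($t9)
		lw	$t2,4($t9)
		lw	$t3,8($t9)
		lw	$t4,12($t9)
		sw	$t1,0($t8)
		sw	$t2,4($t8)
		sw	$t3,8($t8)
		sw	$t4,12($t8)
		addi	$t8,$t8,16	# dp += 4
		addi	$t9,$t9,16	# sp += 4
test:		blt	$t9,$t0,bod	# loop while sp < endptr
```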
What did I do there?
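One way to read the change: the counter `i` is gone, and the loop instead compares the source pointer against a precomputed end address, so each iteration has one fewer update to perform. A rough C equivalent follows, still assuming N is a multiple of 4. The function name `copy_by4_endptr` and the `int` element type are illustrative choices, not part of the notes.

```c
#include <stddef.h>

/* Copies n ints from src to dst, four per iteration.
 * Assumes n is a multiple of 4 (no remainder handling here). */
void copy_by4_endptr(int *dst, const int *src, size_t n)
{
    const int *sp = src;
    int *dp = dst;
    const int *endptr = src + n;    /* one past the last source element */

    while (sp < endptr) {
        dp[0] = sp[0];
        dp[1] = sp[1];
        dp[2] = sp[2];
        dp[3] = sp[3];
        dp += 4;
        sp += 4;
    }
}
```

The loop condition now needs only the pointer that is being advanced anyway, which is why the separate counter update can be dropped.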

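To handle the cases where N is not a multiple of 4, we first copy the N mod 4 leftover words through a jump table and then run the unrolled loop on the rest:

```
		move	$t8,$a0		# dp = dst
		move	$t9,$a1		# sp = src
		#
		and	$t1,$a2,3	# N % 4
		sll	$t1,$t1,2	# scale to a word offset into jtbl
		lw	$t1,jtbl($t1)
		jr	$t1
L3:		lw	$t1,0($t9)
		sw	$t1,0($t8)
		addi	$t9,$t9,4
		addi	$t8,$t8,4
L2:		lw	$t1,0($t9)
		sw	$t1,0($t8)
		addi	$t9,$t9,4
		addi	$t8,$t8,4
L1:		lw	$t1,0($t9)
		sw	$t1,0($t8)
		addi	$t9,$t9,4
		addi	$t8,$t8,4
L0:		and	$t1,$a2,~3	# 0xfffffffc
		sll	$t1,$t1,2	# bytes left to copy
		addu	$t0,$t9,$t1	# endptr = sp + (N & ~3)
		b	test
		.data
jtbl:		.word	L0
		.word	L1, L2, L3
		.text
		#
bod:		lw	$t1,0($t9)
		lw	$t2,4($t9)
		lw	$t3,8($t9)
		lw	$t4,12($t9)
		sw	$t1,0($t8)
		sw	$t2,4($t8)
		sw	$t3,8($t8)
		sw	$t4,12($t8)
		addi	$t8,$t8,16	# dp += 4
		addi	$t9,$t9,16	# sp += 4
test:		blt	$t9,$t0,bod	# loop while sp < endptr
```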
This is roughly the C code:
```
sp = src; dp = dst;
switch (N % 4) {
case 3:	*dp++ = *sp++;	/* fall through */
case 2:	*dp++ = *sp++;	/* fall through */
case 1:	*dp++ = *sp++;
}
N = N & ~3;
for (endptr = sp + N; sp < endptr; ) {
    dp[0] = sp[0];
    dp[1] = sp[1];
    dp[2] = sp[2];
    dp[3] = sp[3];
    dp += 4; sp += 4;
}
```
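To try this out, the fragment can be wrapped in a function with the declarations filled in. The version below is a sketch: the name `copy_words`, the `int` element type, and the `size_t` count are assumptions made for the example.

```c
#include <stddef.h>

/* Copies n ints from src to dst: first the n % 4 leftover elements,
 * then the remaining elements four at a time. */
void copy_words(int *dst, const int *src, size_t n)
{
    const int *sp = src;
    const int *endptr;
    int *dp = dst;

    switch (n % 4) {            /* cases fall through on purpose */
    case 3: *dp++ = *sp++;      /* fall through */
    case 2: *dp++ = *sp++;      /* fall through */
    case 1: *dp++ = *sp++;
    }

    n &= ~(size_t)3;            /* round down to a multiple of 4 */
    for (endptr = sp + n; sp < endptr; ) {
        dp[0] = sp[0];
        dp[1] = sp[1];
        dp[2] = sp[2];
        dp[3] = sp[3];
        dp += 4;
        sp += 4;
    }
}
```

Because the leftover elements are copied first, the main loop always sees a multiple of 4 and needs no bounds check inside its body.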