# CSE 30 -- Lecture 13 -- Nov 13

In this lecture, I talked about loop unrolling. The code that I gave was:
```
void wcopy(int *dp, int *sp, int imax)
{
	int	i;

	for (i = 0; i < imax; i++)
		*dp++ = *sp++;
}
```
which is equivalent to
```
void wcopy(int *dp, int *sp, int imax)
{
	int	i;

	i = 0;
	while (i < imax) {
		*dp++ = *sp++;
		i++;
	}
}
```
and is naively translated into
```
	li $t0, 0
loop:	bge $t0, $a2, loopdone	# overhead
	lw $t1, 0($a1)		#  real work
	add $a1, 4		#  real work
	sw $t1, 0($a0)		#  real work
	add $a0, 4		#  real work
	add $t0, 1		# overhead (i++)
	j loop			# overhead
loopdone:
	jr $ra
```
which has 3 instructions of overhead for 4 instructions of work, giving a loop overhead of 3/7 or 43%. The total loop is 7 cycles, ignoring for the moment delay slots.

Most compilers are smarter, and would translate the code into

```
	li $t0, 0
	j test
top:	lw $t1, 0($a1)		#  real work
	add $a1, 4		#  real work
	sw $t1, 0($a0)		#  real work
	add $a0, 4		#  real work
	add $t0, 1		# overhead (i++)
test:	blt $t0, $a2, top	# overhead
	jr $ra
```
This gives 2 instructions of overhead against 4 of real work, 6 in total, for an overhead of 2/6 or 33%. The loop body is 6 cycles long.

The C code itself could be optimized:

```
void wcopy(int *dp, int *sp, int imax)
{
	int	*dpend = dp + imax;

	while (dp != dpend)
		*dp++ = *sp++;
}
```
which translates to
```
	sll $a2, 2		# imax * sizeof(int)
	add $t0, $a0, $a2	# dpend = dp + imax
	j test
top:	lw $t1, 0($a1)		#  real work
	add $a1, 4		#  real work
	sw $t1, 0($a0)		#  real work
	add $a0, 4		#  real work
test:	bne $a0, $t0, top	# overhead
	jr $ra
```
giving an overhead of 1/5 or 20% and a body length of 5 cycles.

With loop unrolling, we can achieve a much lower overhead:

```
void wcopy(int *dp, int *sp, int imax)
{
	int	*dpend;

	switch (imax & 0x3) {	/* imax % 4 */
	case 3:	*dp++ = *sp++;
	case 2:	*dp++ = *sp++;
	case 1:	*dp++ = *sp++;
	}
	imax &= ~0x3;		/* adjust imax for the work we did above */
	dpend = dp + imax;
	while (dpend != dp) {
		dp[0] = sp[0];	/* 4 copies of basic loop body */
		dp[1] = sp[1];	/* but using small indices instead */
		dp[2] = sp[2];	/* of bumping the pointer each time */
		dp[3] = sp[3];
		dp += 4;
		sp += 4;
	}
}
```
This translates to:
```
	andi $t1, $a2, 0x3
	beq $t1, 0, L0		# switch
	beq $t1, 1, L1
	beq $t1, 2, L2
	lw $t0, 0($a1)		# case 3
	add $a1, 4
	sw $t0, 0($a0)
	add $a0, 4
L2:	lw $t0, 0($a1)
	add $a1, 4
	sw $t0, 0($a0)
	add $a0, 4
L1:	lw $t0, 0($a1)
	add $a1, 4
	sw $t0, 0($a0)
	add $a0, 4
L0:	and $a2, $a2, 0xfffffffc
	sll $a2, 2
	add $t0, $a0, $a2	# dpend = dp + imax
	j test
loop:	lw $t1, 0($a1)		#  real work
	sw $t1, 0($a0)		#  real work
	lw $t1, 4($a1)		#  real work # small indices require no extra
	sw $t1, 4($a0)		#  real work # work to be done, since the lw/sw
	lw $t1, 8($a1)		#  real work # offset field allows it
	sw $t1, 8($a0)		#  real work
	lw $t1, 0xc($a1)	#  real work
	sw $t1, 0xc($a0)	#  real work
	add $a1, 0x10		#  real work # 4 * sizeof(int)
	add $a0, 0x10		#  real work
test:	bne $a0, $t0, loop	# overhead
	jr $ra
```
giving an overhead of 1 instruction per loop iteration, or only 1/11 or 9%. Unrolling 8 times instead of 4 drives the overhead per element copied even lower. Note that once this code is scheduled to take load delay and branch delay slots into account, it would really look like:
```
	#
	# initial portion the same
	#
	andi $t1, $a2, 0x3
	beq $t1, 0, L0		# switch
	beq $t1, 1, L1
	beq $t1, 2, L2
	lw $t0, 0($a1)		# case 3
	add $a1, 4
	sw $t0, 0($a0)
	add $a0, 4
L2:	lw $t0, 0($a1)
	add $a1, 4
	sw $t0, 0($a0)
	add $a0, 4
L1:	lw $t0, 0($a1)
	add $a1, 4
	sw $t0, 0($a0)
	add $a0, 4
L0:	and $a2, $a2, 0xfffffffc
	sll $a2, 2
	add $t0, $a0, $a2	# dpend = dp + imax
	#
	# more optimization below
	#
	.set noreorder
	j test
	sub $a1, 0x10		# cancels the add in the bne delay slot on entry;
				# itself in delay slot of the j
loop:	lw $t1, 0($a1)		#  real work
	lw $t2, 4($a1)		#  real work
	lw $t3, 8($a1)		#  real work
	lw $t4, 0xc($a1)	#  real work
	sw $t1, 0($a0)		#  real work
	sw $t2, 4($a0)		#  real work
	sw $t3, 8($a0)		#  real work
	sw $t4, 0xc($a0)	#  real work
	add $a0, 0x10		#  real work
test:	bne $a0, $t0, loop	# overhead
	add $a1, 0x10		#  real work, 4 * sizeof(int), in delay slot
				#  of the bne
	.set reorder
	jr $ra
```

In loop unrolling, the idea is to make copies of the loop body so that the loop control overhead per basic copy shrinks by a factor of the number of copies made. The unrolling factor may be anything, but it is usually a power of two, which makes handling the leftover elements easier (a mask like imax & 0x3 instead of a remainder operation). The goal is that if we run the loop several million times, the loop will be consuming most of the cycles: the extra work before we enter the loop is comparatively small and may be ignored.
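As a sketch of the 8-way unrolling mentioned earlier, the same pattern can be written with a mask of 0x7 and eight copies of the body. The names wcopy8 and wcopy8_ok are made up for this illustration and were not part of the lecture, and the code assumes imax is never negative. Counting as before, the loop body would be 8 loads, 8 stores, 2 pointer adds and 1 branch, so the overhead would be about 1/19 or 5%.

```c
#include <assert.h>

/* Hypothetical 8-way unrolled variant of wcopy (not from the lecture). */
void wcopy8(int *dp, const int *sp, int imax)
{
	int	*dpend;

	assert(imax >= 0);	/* assumption: the count is never negative */
	switch (imax & 0x7) {	/* imax % 8 leftover words, copied first */
	case 7:	*dp++ = *sp++;	/* every case falls through to the next */
	case 6:	*dp++ = *sp++;
	case 5:	*dp++ = *sp++;
	case 4:	*dp++ = *sp++;
	case 3:	*dp++ = *sp++;
	case 2:	*dp++ = *sp++;
	case 1:	*dp++ = *sp++;
	}
	imax &= ~0x7;		/* what remains is a multiple of 8 */
	dpend = dp + imax;
	while (dp != dpend) {
		dp[0] = sp[0];	/* 8 copies of the basic loop body */
		dp[1] = sp[1];
		dp[2] = sp[2];
		dp[3] = sp[3];
		dp[4] = sp[4];
		dp[5] = sp[5];
		dp[6] = sp[6];
		dp[7] = sp[7];
		dp += 8;
		sp += 8;
	}
}

/* Returns 1 if wcopy8 copies exactly the first n words (0 <= n <= 64). */
int wcopy8_ok(int n)
{
	int	src[64], dst[64];
	int	i;

	assert(n >= 0 && n <= 64);
	for (i = 0; i < 64; i++) {
		src[i] = 1000 + i;
		dst[i] = -1;	/* marker for "never written" */
	}
	wcopy8(dst, src, n);
	for (i = 0; i < 64; i++)
		if (dst[i] != (i < n ? src[i] : -1))
			return 0;
	return 1;
}
```

Calling wcopy8_ok with lengths on both sides of a multiple of 8 (say 0, 7, 8 and 9) exercises both the switch and the main loop.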

Let's see what the effects of the optimization are. If we were to copy twenty million words using the original, naively translated version, we would run through the loop twenty million times and so use 140,000,000 cycles (ignoring delay slots). The output of a real compiler would get the job done in 120,000,000 cycles. Either is an appreciable fraction of a second -- or over a second -- on many processors (there are d-cache and memory bandwidth effects too, but we're ignoring these for now; these effects would slow this down). In the best tuned version, we need to run through the loop only five million times (each iteration copies 4 elements), with a loop body of 11 cycles, so we use a total of 55,000,000 cycles; this is more than twice as fast as the original.
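The arithmetic above can be spelled out in a few lines of C. This is only a sketch: loop_cycles and print_cycle_table are names made up here, one cycle per instruction is assumed as in the text, and delay slots and memory effects are ignored. The 5-cycle entry is the pointer-compare version, whose total is not quoted above but follows from the same counting.

```c
#include <stdio.h>

/*
 * Cycles spent in the loop, assuming one cycle per instruction:
 * (words to copy / words copied per pass) * instructions per pass.
 * Assumes nwords is a multiple of words_per_pass.
 */
long loop_cycles(long nwords, long words_per_pass, long cycles_per_pass)
{
	return nwords / words_per_pass * cycles_per_pass;
}

/* Print the loop cost of each version in these notes for nwords words. */
void print_cycle_table(long nwords)
{
	long	naive    = loop_cycles(nwords, 1, 7);	/* 7-instruction loop */
	long	compiled = loop_cycles(nwords, 1, 6);	/* test at the bottom */
	long	pointer  = loop_cycles(nwords, 1, 5);	/* compare against dpend */
	long	unrolled = loop_cycles(nwords, 4, 11);	/* 4 words per pass */

	printf("naive translation:   %ld cycles\n", naive);
	printf("typical compiler:    %ld cycles\n", compiled);
	printf("pointer compare:     %ld cycles\n", pointer);
	printf("unrolled 4 times:    %ld cycles\n", unrolled);
	printf("naive / unrolled:    %.2f\n", (double)naive / (double)unrolled);
}
```

Calling print_cycle_table(20000000L) reproduces the 140,000,000, 120,000,000 and 55,000,000 cycle figures from the paragraph above.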

(D-cache and memory bandwidth limitations will mask some of the effects of the loop unrolling, since both a naive implementation and a clever implementation would be memory-bandwidth limited. The total time per iteration has to include the cache miss overhead, so the percentage reduction from removing loop overhead will be smaller.)
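To get a feel for how much smaller, here is a deliberately crude model in C. It charges every word copied a fixed number of extra stall cycles on top of the instruction counts from these notes. The stall figure is a made-up parameter, not a measurement, and the function names are hypothetical; real cache behavior is far less uniform than this.

```c
/*
 * Cycles per word copied when each word also pays miss_cycles of
 * memory stall.  miss_cycles is an invented average for illustration.
 */
double cycles_per_word(double body_cycles, double words_per_pass,
    double miss_cycles)
{
	return body_cycles / words_per_pass + miss_cycles;
}

/*
 * Speedup of the 4-way unrolled loop (11 cycles per 4 words) over the
 * naive loop (7 cycles per word) under the model above.
 */
double unroll_speedup(double miss_cycles)
{
	return cycles_per_word(7.0, 1.0, miss_cycles) /
	    cycles_per_word(11.0, 4.0, miss_cycles);
}
```

With no stalls the model gives 7 / 2.75, about 2.5 times faster, matching the cycle counts above. With an assumed 10 stall cycles per word it gives 17 / 12.75, only about 1.3 times faster, which is the masking effect described in the parenthetical.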

bsy@cse.ucsd.edu, last updated Thu Nov 14 00:27:05 PST 1996.