CSE 30 -- Lecture 12 -- Nov 10

Loop Unrolling

The loop to copy an array, in C,
for (i = 0; i < N; i++) {
	dst[i] = src[i];
}
is translated into MIPS as
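		li $t0, 0		# i = 0
		b test
bod:		sll $t1,$t0,2		# byte offset 4*i
		add $t2,$t1,$a0		# &dst[i]
		add $t3,$t1,$a1		# &src[i]
		lw $t4,0($t3)
		sw $t4,0($t2)
		add $t0,$t0,1		# i++
test:		blt $t0,$a2,bod		# loop while i < N
with dst in $a0, src in $a1, N in $a2, and i in $t0. Counting one cycle per instruction, the runtime of this code is 3 + 7 N cycles.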
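Where the 3 + 7 N comes from, with one cycle per instruction and no branch delay slots, stalls, or cache misses (an idealization): two setup instructions, seven instructions per iteration (sll, two adds, lw, sw, the increment, and the taken blt), and one final blt that falls through.

```latex
T(N) = \underbrace{2}_{\texttt{li},\ \texttt{b}}
     + \underbrace{7N}_{7 \text{ instructions} \times N \text{ iterations}}
     + \underbrace{1}_{\text{final, untaken } \texttt{blt}}
     = 3 + 7N
```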

To unroll this loop by a factor of 4, first assume that N is divisible by 4:

for (i = 0, sp = src, dp = dst; i < N; i += 4) {
	dp[0] = sp[0];
	dp[1] = sp[1];
	dp[2] = sp[2];
	dp[3] = sp[3];
	dp += 4; sp += 4;
}
which would be translated into MIPS code as
		li $t0, 0
		move $t8,$a0
		move $t9,$a1
		b test
bod:		lw $t1,0($t9)
		lw $t2,4($t9)
		lw $t3,8($t9)
		lw $t4,12($t9)
		sw $t1,0($t8)
		sw $t2,4($t8)
		sw $t3,8($t8)
		sw $t4,12($t8)
		add $t0,$t0,4
		add $t9,$t9,16
		add $t8,$t8,16
test:		blt $t0,$a2,bod
which has a run time of 5 + 12 (N/4) = 5 + 3 N cycles. This can be improved a little further without unrolling any more:
		move $t8,$a0
		move $t9,$a1
		sll $t1,$a2,2
		add $t0,$t9,$t1
		b test
bod:		lw $t1,0($t9)
		lw $t2,4($t9)
		lw $t3,8($t9)
		lw $t4,12($t9)
		sw $t1,0($t8)
		sw $t2,4($t8)
		sw $t3,8($t8)
		sw $t4,12($t8)
		add $t9,$t9,16
		add $t8,$t8,16
test:		blt $t9,$t0,bod
What did I do there?
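One way to read the change, as a minimal C sketch assuming int is 4 bytes as on MIPS: the loop counter is gone. $t0 now holds an end pointer, src + 4N bytes, computed once by the sll and add before the loop, and the loop test compares the moving source pointer $t9 against it, which saves one add per iteration. Counting one cycle per instruction, that is 5 setup instructions, 11 per iteration, and one final blt: 6 + 11 (N/4) = 6 + 2.75 N cycles, versus 5 + 3 N before. The function name copy4_endptr is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled copy with no loop counter: the source pointer is compared
   against a precomputed end pointer, like blt $t9,$t0,bod above.
   Precondition (as in the text): n is a multiple of 4. */
void copy4_endptr(int *dst, const int *src, size_t n) {
    const int *end = src + n;       /* computed once, before the loop */
    assert(n % 4 == 0);
    while (src < end) {             /* test at the top, like "b test" */
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;                   /* only two pointer bumps per pass */
        src += 4;
    }
}
```

Like the MIPS version, it relies on N being a multiple of 4; the assert only documents that precondition.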

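To handle the case when N is not a multiple of 4, we first copy the N % 4 leftover words, dispatching through a jump table, and then run the unrolled loop on the rest: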

		move $t8,$a0
		move $t9,$a1
		and $t1,$a2,3		# N % 4: number of leftover words
		sll $t1,$t1,2		# scale to a byte offset into jtbl
		lw $t1,jtbl($t1)
		jr $t1
L3:		lw $t1,0($t9)
		sw $t1,0($t8)
		add $t9,$t9,4
		add $t8,$t8,4
L2:		lw $t1,0($t9)
		sw $t1,0($t8)
		add $t9,$t9,4
		add $t8,$t8,4
L1:		lw $t1,0($t9)
		sw $t1,0($t8)
		add $t9,$t9,4
		add $t8,$t8,4
L0:		and $t1,$a2,~3	# 0xfffffffc
		sll $t1,$t1,2
		add $t0,$t9,$t1
		b test		# skip over the jump table
jtbl:		.word L0	# N % 4 == 0: $t0 must still be set up
		.word L1, L2, L3
bod:		lw $t1,0($t9)
		lw $t2,4($t9)
		lw $t3,8($t9)
		lw $t4,12($t9)
		sw $t1,0($t8)
		sw $t2,4($t8)
		sw $t3,8($t8)
		sw $t4,12($t8)
		add $t9,$t9,16
		add $t8,$t8,16
test:		blt $t9,$t0,bod
This is roughly the C code:
		sp = src; dp = dst;
		switch (N % 4) {
		case 3:	*dp++ = *sp++;	/* fall through */
		case 2:	*dp++ = *sp++;	/* fall through */
		case 1:	*dp++ = *sp++;
		}
		N = N & ~3;
		for (endptr = sp + N; sp < endptr; ) {
			dp[0] = sp[0];
			dp[1] = sp[1];
			dp[2] = sp[2];
			dp[3] = sp[3];
			dp += 4; sp += 4;
		}
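For comparison, here is a sketch of Duff's device, Tom Duff's well-known C idiom, which arranges the same idea differently: instead of copying the N % 4 leftovers before the loop, the switch jumps into the middle of the unrolled do-while body, so the first pass copies N % 4 words (or 4 if N is a multiple of 4) and every later pass copies 4. Duff's original wrote every word to a single memory-mapped output register; this version, under the hypothetical name duff_copy, increments both pointers so that it copies an array.

```c
#include <assert.h>
#include <stddef.h>

/* Duff's device adapted to array copy: the switch enters the unrolled
   body part way through, so the first pass handles the leftovers. */
void duff_copy(int *dst, const int *src, size_t n) {
    size_t passes;
    if (n == 0)
        return;                     /* the do-while body would run at least once */
    passes = (n + 3) / 4;           /* trips through the 4-copy body */
    switch (n % 4) {
    case 0: do { *dst++ = *src++;   /* fall through */
    case 3:      *dst++ = *src++;   /* fall through */
    case 2:      *dst++ = *src++;   /* fall through */
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

The MIPS above corresponds more closely to the remainder-first version; Duff's form avoids the separate leftover code at the cost of a harder-to-read control flow.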

Multitasking, multithreading

Operating system concepts: virtual memory, the MMU, translation of virtual addresses into physical addresses, and address spaces. VM is used for protection, and it lets programs be oblivious to the actual memory size. The idea of VM is to use physical memory as a cache for disk, just as the cache is a cache for RAM; locality of reference again is what makes this work.
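A toy sketch of the translation step an MMU performs, under made-up parameters (4 KB pages and a 16-entry single-level page table; real MMUs typically use multi-level tables and a TLB). The page table maps virtual page numbers to physical frame numbers, and the page offset passes through unchanged. An invalid entry is a page fault: that is where the OS would fetch the page from disk (the "cache miss" case) or stop a process touching memory it does not own (protection). The names translate and page_entry are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy parameters (assumptions, not any real MMU). */
#define PAGE_SHIFT 12u                  /* 4 KB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  16u                  /* tiny virtual address space */

typedef struct {
    int      valid;   /* 0: page not resident, or not this process's */
    uint32_t frame;   /* physical frame number */
} page_entry;

/* Returns 0 and stores the physical address in *paddr, or -1 for a
   page fault (invalid or out-of-range virtual page). */
int translate(const page_entry *table, uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;           /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1u);   /* byte within the page */
    assert(table != NULL && paddr != NULL);
    if (vpn >= NUM_PAGES || !table[vpn].valid)
        return -1;                                /* the OS takes over here */
    *paddr = (table[vpn].frame << PAGE_SHIFT) | offset;
    return 0;
}
```

For example, with entry 1 mapped to frame 7, virtual address 0x1234 (page 1, offset 0x234) translates to physical address 0x7234, while 0x2000 (page 2, unmapped) faults.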

Multitasking is the ability to run several (usually unrelated) programs at once; the programs typically have separate address spaces. Multithreading is having several virtual CPUs, typically sharing the same address space.
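A sketch of the difference on a POSIX system (an assumption; compile with -pthread): a second thread's store to a global is seen by the creating thread because both run in one address space, while a child created with fork() gets its own copy of the address space, so its store never reaches the parent. The function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One global variable: one copy per address space. */
static int shared_value;

static void *thread_body(void *arg) {
    (void)arg;
    shared_value = 42;              /* same address space as the creator */
    return NULL;
}

/* What the caller sees after a second thread stores 42 (-1 on error). */
int value_after_thread(void) {
    pthread_t t;
    shared_value = 0;
    if (pthread_create(&t, NULL, thread_body, NULL) != 0)
        return -1;
    pthread_join(t, NULL);          /* after the join, the store is visible */
    return shared_value;
}

/* What the parent sees after a child process stores 42 (-1 on error). */
int value_after_child(void) {
    pid_t pid;
    shared_value = 0;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                 /* child: writes its own copy only */
        shared_value = 42;
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return shared_value;            /* parent's copy is untouched */
}
```

Here value_after_thread() returns 42 and value_after_child() returns 0.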


bsy+www@cs.ucsd.edu, last updated Mon Nov 30 21:53:17 PST 1998.
