CSE 30 -- Lecture 12 -- Nov 10

Loop Unrolling

The loop to copy an array, in C,

for (i = 0; i < N; i++) {
	dst[i] = src[i];
}

is translated into MIPS as

		li $t0, 0
		b test
bod:		sll $t1,$t0,2
		add $t2,$t1,$a0
		add $t3,$t1,$a1
		lw $t4,0($t3)
		sw $t4,0($t2)
		add $t0,$t0,1
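test:		blt $t0,$a2,bod

with the obvious register assignments ($a0 = dst, $a1 = src, $a2 = N). Counting one cycle per instruction, the runtime of this code is 3 + 7 N cycles: 3 for the li, the b, and the first test, then 7 for each trip through the body and the test.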
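The count can be checked by walking the same control flow in C and tallying one cycle per executed instruction. This is only a sketch: the one-cycle cost for every instruction (including the blt pseudo-instruction) is an assumption, and count_simple_loop is a name made up for this illustration.

```c
#include <assert.h>

/* Tally executed instructions for the simple copy loop, charging one
 * cycle each (an assumption; real pipelines and caches differ). */
long count_simple_loop(long n)
{
    assert(n >= 0);
    long cycles = 0;
    long i = 0;

    cycles += 2;            /* li, b test */
    for (;;) {
        cycles += 1;        /* blt at test */
        if (!(i < n))
            break;
        cycles += 6;        /* sll, add, add, lw, sw, add */
        i++;
    }
    return cycles;
}
```

For N = 10 the tally is 2 + 11 + 60 = 73, which agrees with 3 + 7 N.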

To unroll this loop, first we assume that N is divisible by 4:

for (i = 0, sp = src, dp = dst; i < N; i += 4) {
	dp[0] = sp[0];
	dp[1] = sp[1];
	dp[2] = sp[2];
	dp[3] = sp[3];
	dp += 4; sp += 4;
}

which would be translated into MIPS code as

		li $t0, 0
		move $t8,$a0
		move $t9,$a1
		b test
bod:		lw $t1,0($t9)
		lw $t2,4($t9)
		lw $t3,8($t9)
		lw $t4,12($t9)
		sw $t1,0($t8)
		sw $t2,4($t8)
		sw $t3,8($t8)
		sw $t4,12($t8)
		add $t0,$t0,4
		add $t9,$t9,16
		add $t8,$t8,16
test:		blt $t0,$a2,bod

which has a run time of 5 + 12 (N/4) = 5 + 3 N cycles. This can still be improved a little without unrolling any further:

		move $t8,$a0
		move $t9,$a1
		sll $t1,$a2,2
		add $t0,$t9,$t1
		b test
bod:		lw $t1,0($t9)
		lw $t2,4($t9)
		lw $t3,8($t9)
		lw $t4,12($t9)
		sw $t1,0($t8)
		sw $t2,4($t8)
		sw $t3,8($t8)
		sw $t4,12($t8)
		add $t9,$t9,16
		add $t8,$t8,16
test:		blt $t9,$t0,bod

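What did I do there? The index in $t0 is replaced by an end pointer (src plus 4 N bytes), and the loop test compares $t9 against it. With no separate counter to increment, each pass is one instruction shorter: 6 + 11 (N/4) cycles.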
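A C rendering of this last version may make the change easier to see. It is a sketch that still assumes N is a multiple of 4; the function name copy_by_4 and the int element type (standing in for a 4-byte MIPS word) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n words, four per iteration; n must be a multiple of 4. */
void copy_by_4(int *dst, const int *src, size_t n)
{
    assert(n % 4 == 0);
    const int *endptr = src + n;    /* sll/add: one past the last source word */

    while (src < endptr) {          /* blt $t9,$t0,bod */
        int a = src[0], b = src[1], c = src[2], d = src[3];  /* four loads */
        dst[0] = a;                                          /* four stores */
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;
        src += 4;                   /* add $t9,$t9,16 */
        dst += 4;                   /* add $t8,$t8,16 */
    }
}
```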

To handle the cases when N is not a multiple of 4, we do

		move $t8,$a0
		move $t9,$a1
		#
		and $t1,$a2,3	# N mod 4
		sll $t1,$t1,2	# scale to a byte offset into jtbl
		lw $t1,jtbl($t1)
		jr $t1
L3:		lw $t1,0($t9)
		sw $t1,0($t8)
		add $t9,$t9,4
		add $t8,$t8,4
L2:		lw $t1,0($t9)
		sw $t1,0($t8)
		add $t9,$t9,4
		add $t8,$t8,4
L1:		lw $t1,0($t9)
		sw $t1,0($t8)
		add $t9,$t9,4
		add $t8,$t8,4
L0:		and $t1,$a2,~3	# 0xfffffffc; N mod 4 == 0 enters here
		sll $t1,$t1,2
		add $t0,$t9,$t1	# end pointer for the unrolled loop
		b test		# test before the first unrolled pass
		.data
jtbl:		.word L0
		.word L1, L2, L3
		.text
		#
bod:		lw $t1,0($t9)
		lw $t2,4($t9)
		lw $t3,8($t9)
		lw $t4,12($t9)
		sw $t1,0($t8)
		sw $t2,4($t8)
		sw $t3,8($t8)
		sw $t4,12($t8)
		add $t9,$t9,16
		add $t8,$t8,16
test:		blt $t9,$t0,bod

This is roughly the C code:

		sp = src; dp = dst;
		switch (N % 4) {
		case 3:	*dp++ = *sp++;
		case 2:	*dp++ = *sp++;
		case 1:	*dp++ = *sp++;
		}
		N = N & ~3;
		for (endptr = sp + N; sp < endptr; ) {
			dp[0] = sp[0];
			dp[1] = sp[1];
			dp[2] = sp[2];
			dp[3] = sp[3];
			dp += 4; sp += 4;
		}
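Put together as a compilable function, the fragment might look like the sketch below. The name copy_words, the int element type, and the size_t count are assumptions for illustration; the switch relies on deliberate fall-through, just as the assembly falls from L3 through L2 and L1.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n words: first the n % 4 leftover words, then groups of four. */
void copy_words(int *dst, const int *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    const int *sp = src;
    int *dp = dst;
    const int *endptr;

    switch (n % 4) {            /* cases fall through on purpose */
    case 3: *dp++ = *sp++;      /* fall through */
    case 2: *dp++ = *sp++;      /* fall through */
    case 1: *dp++ = *sp++;
    }

    n &= ~(size_t)3;            /* round down to a multiple of 4 */
    for (endptr = sp + n; sp < endptr; sp += 4, dp += 4) {
        dp[0] = sp[0];
        dp[1] = sp[1];
        dp[2] = sp[2];
        dp[3] = sp[3];
    }
}
```

Calling copy_words(dst, src, 10) copies two words in the switch and the remaining eight in two passes of the loop.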

Multitasking, multithreading

Operating system concepts: virtual memory, MMU, translation of virtual addresses into physical addresses, address spaces. VM is used for protection, and it lets programs be oblivious to the actual memory size. The idea behind VM is to use physical memory as a cache for disk, just as the cache is a cache for RAM. Locality of reference again.
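The translation step can be pictured with a toy single-level page table. Everything concrete here is an assumption for illustration: the 4 KB page size, the 16-entry table, the translate function, and the use of -1 to signal a page that is not resident (where a real MMU would raise a page fault for the operating system to handle).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12                    /* assumed 4 KB pages */
#define PAGE_SIZE (1u << PAGE_BITS)
#define NUM_PAGES 16                    /* toy address space: 64 KB */
#define NOT_PRESENT (-1)

/* page_table[v] holds the physical frame number for virtual page v,
 * or NOT_PRESENT if the page is currently out on disk. */
int64_t translate(const int page_table[NUM_PAGES], uint32_t vaddr)
{
    uint32_t vpage  = vaddr >> PAGE_BITS;        /* which virtual page */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* position within it */

    assert(vpage < NUM_PAGES);
    if (page_table[vpage] == NOT_PRESENT)
        return -1;                      /* would be a page fault */
    return ((int64_t)page_table[vpage] << PAGE_BITS) | offset;
}
```

With virtual page 2 mapped to frame 7, virtual address 0x2abc translates to physical address 0x7abc: the page number changes and the offset within the page does not.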

Multitasking is the ability to run several (usually unrelated) programs at once; the programs typically have separate address spaces. Multithreading is having several virtual CPUs, typically sharing the same address space.
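Separate address spaces can be observed directly on a POSIX system, which is an assumption of this sketch; parent_counter_after_child is a name invented for the example. The child process gets its own copy of counter, so its increments never reach the parent.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;     /* lives in this process's address space */

/* Fork a child that increments its copy of counter 'times' times.
 * Returns the parent's counter afterwards, or -1 if fork/wait failed. */
int parent_counter_after_child(int times)
{
    assert(times >= 0);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                 /* child: separate address space */
        for (int i = 0; i < times; i++)
            counter++;
        _exit(0);                   /* the child's counter vanishes with it */
    }
    if (waitpid(pid, NULL, 0) < 0)  /* parent: wait for the child */
        return -1;
    return counter;                 /* still the parent's own value */
}
```

The parent sees 0 no matter how many times the child increments. A thread running the same loop would change the one shared counter, since threads share the address space.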

bsy+www@cs.ucsd.edu, last updated Mon Nov 30 21:53:17 PST 1998.
