CSE 30 -- Lecture 9 -- Oct 27

Endianism

Big-endian byte order: on a 32-bit machine, the most significant byte of a 4 byte word is stored into the lowest address byte of a consecutive 4 byte region of memory. (On a 64-bit processor, a word is 8 bytes wide, so it would be a consecutive 8 byte region.)

Little-endian byte order: the reverse. The least significant byte is stored in the lowest address.

The MIPS architecture is bi-endian -- the processor chip can be configured to be big-endian (as in the SGI machines) or little-endian (as in the old DEC machines).
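
One way to see which byte order a machine uses is to store a known 32-bit value and look at the byte that lands at its lowest address. The following is a minimal C sketch of that idea; the function names lowest_address_byte and host_is_big_endian and the test value 0x11223344 are made up for illustration and are not from the lecture.

```c
#include <stdint.h>
#include <string.h>

/* Return the byte stored at the lowest address of a 32-bit word. */
unsigned lowest_address_byte(uint32_t word)
{
	unsigned char first;

	memcpy(&first, &word, 1);	/* copy only the lowest-address byte */
	return first;
}

/* Return 1 on a big-endian host, 0 otherwise. */
int host_is_big_endian(void)
{
	/* 0x11 is the most significant byte of the test value. */
	return lowest_address_byte(0x11223344u) == 0x11;
}
```

On a little-endian host lowest_address_byte(0x11223344) should give 0x44, and on a big-endian host 0x11.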

Word alignment

RISC processors typically require that the load/store instructions reference memory with "aligned" addresses. On a 32-bit machine, a word is 4 bytes wide, and alignment means that the address must be a multiple of 4 -- or equivalently, the low order 2 bits of the address are 0.

The reason for doing this is performance/simplicity: the data bus is typically the width of a cache line (or a small integer fraction of a cache line for lower cost implementations), and aligned memory references mean that in all cases a single bus transaction will suffice to obtain the word on a cache miss. If words did not have to be aligned, then two bus transfers would be needed whenever a word spanned two cache lines. Requiring the hardware to detect when this is needed and handle such transfers makes the processor implementation more complex and thus slower.
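
The alignment test and the cache-line argument above can be sketched in C as follows. This is only an illustration: the function names are invented here, addresses are modeled as 32-bit unsigned integers, and the cache line size is passed in as a parameter because the notes do not give one.

```c
#include <stdint.h>

/* Nonzero if addr is a multiple of 4, i.e. its low order 2 bits are 0. */
int is_word_aligned(uint32_t addr)
{
	return (addr & 0x3u) == 0;
}

/* Round addr up to the next multiple of 4 (unchanged if already aligned). */
uint32_t align_up_word(uint32_t addr)
{
	return (addr + 3u) & ~(uint32_t)0x3u;
}

/*
 * Nonzero if the 4-byte word starting at addr occupies two different
 * cache lines of line_size bytes (line_size is an assumed parameter).
 */
int word_spans_two_lines(uint32_t addr, uint32_t line_size)
{
	return addr / line_size != (addr + 3u) / line_size;
}
```

With a line size that is a multiple of 4, an aligned word never spans two lines: for example, with 32-byte lines the word at address 28 stays in one line, while an unaligned word at address 30 would need two transfers.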

Recursion assembly example

Factorial example:

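fact:		sub $sp,$sp,12		# allocate a 12-byte frame
		sw $fp,4($sp)		# save caller's frame pointer
		add $fp,$sp,12		# $fp = caller's $sp (top of our frame)
		sw $ra,-4($fp)		# save return address
		bgt $a0,1,rec_fact	# n > 1: recursive case
		li $v0,1		# base case: return 1
		b rec_done
rec_fact:	sw $a0,0($fp)		# save n across the call
		sub $a0,$a0,1
		jal fact		# $v0 = fact(n-1)
		lw $a0,0($fp)		# restore n
		mult $v0,$v0,$a0	# $v0 = n * fact(n-1)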
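rec_done:	move $t0,$fp		# remember caller's $sp
		lw $ra,-4($fp)		# restore $ra while $fp still points at our frame
		lw $fp,-8($fp)		# restore caller's $fp
		move $sp,$t0		# pop the frame
		jr $ra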

This is the same as:

int	fact(int	n)
{
	if (n <= 1) return 1;
	else return n * fact(n-1);
}

Loop invariant as proof technique

int	fact2(int	n)
{
	int	v = 1 ,i;

	for (i = n; i > 1; i--)
		v = v * i;
	return v;
}

The loop invariant -- at the test -- is n! = v * i!. We prove this using induction: the base case is the first test, when i = n and v = 1, which obviously satisfies the invariant expression. We do induction counting down: assume that the invariant holds at the test, so after some k iterations we have i = i_k and v = v_k satisfying n! = v_k * i_k!. We run the loop body once and see that the variable v is updated to contain v_k * i_k, and the variable i is updated to contain i_k - 1. We check whether the new values in the variables satisfy the invariant:

v * i! = (v_k * i_k) * i!
       = v_k * i_k * (i_k - 1)!
       = v_k * i_k!
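       = n!

When the test finally fails we have i = 1 (taking n >= 1), so the invariant reads n! = v * 1! = v, which is the value returned.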
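
The invariant can also be checked mechanically by asserting it at every loop test. The sketch below is one way to do that in C; fact_ref and fact2_checked are names invented here, and long is used only to keep small test values from overflowing. It is a sanity check for small n, not a substitute for the proof.

```c
#include <assert.h>

/* Reference n!, used only to state the invariant. */
static long fact_ref(int n)
{
	return n <= 1 ? 1 : n * fact_ref(n - 1);
}

/* fact2 with the invariant n! == v * i! asserted at every loop test. */
long fact2_checked(int n)
{
	long v = 1;
	int i = n;

	for (;;) {
		assert(fact_ref(n) == v * fact_ref(i));	/* invariant at the test */
		if (!(i > 1))
			break;		/* test failed: leave the loop */
		v = v * i;
		i--;
	}
	return v;
}
```

Calling fact2_checked(5) runs the assertion five times (once per test, including the final failing test) and returns 120.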

Compiling loops into assembler

We look at a for loop as in the above example, and first convert it into a while loop:

	i = n;
	while (i > 1) {
		v = v * i;
		i--;
	}

The naive way to compile this loop into assembly is to write it as:

	move $t0,$a0	# $t0 is i, $a0 is n, $t1 is v
top:	ble $t0,1,loop_done
	mult $t1,$t0	# mult $t1,$t1,$t0 expands into this
	mflo $t1	# two-instruction sequence
	sub $t0,$t0,1
	b top
loop_done:

but it is more efficiently translated as:

	move $t0,$a0
	b test
top:	mult $t1,$t0
	mflo $t1
	sub $t0,$t0,1
test:	bgt $t0,1,top

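This is more efficient because the unconditional branch is executed only once, before the loop, instead of on every iteration: each pass through the loop now executes four instructions (mult, mflo, sub, bgt) rather than five.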
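
To make the comparison concrete, the two control-flow shapes can be mimicked in C with goto, counting one unit for each assembly instruction the corresponding path would execute. This is a rough model under stated assumptions: the function names and the executed counter are invented here, v is initialized to 1 as in fact2, and the initial move is not counted.

```c
/* Test at the top, unconditional branch at the bottom (naive shape). */
int fact_top_test(int n, int *executed)
{
	int i = n, v = 1;
top:
	(*executed)++;			/* ble $t0,1,loop_done */
	if (i <= 1)
		goto loop_done;
	v = v * i;
	(*executed) += 2;		/* mult, mflo */
	i = i - 1;
	(*executed)++;			/* sub */
	(*executed)++;			/* b top */
	goto top;
loop_done:
	return v;
}

/* One branch to the test up front, conditional branch at the bottom. */
int fact_bottom_test(int n, int *executed)
{
	int i = n, v = 1;

	(*executed)++;			/* b test */
	goto test;
top:
	v = v * i;
	(*executed) += 2;		/* mult, mflo */
	i = i - 1;
	(*executed)++;			/* sub */
test:
	(*executed)++;			/* bgt $t0,1,top */
	if (i > 1)
		goto top;
	return v;
}
```

For k iterations the first shape counts 5k + 1 and the second 4k + 2, so with n = 5 (four iterations) the counts are 21 and 18. The second shape loses by one instruction only when the loop body never runs.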

Strength Reduction

We next looked at an algorithmic way to improve the speed of your code. The example code is for initializing an array so that each element contains the square of its index:

	int	i;
	for (i = 0; i < N; i++) {
		tbl[i] = i * i;
	}

We take advantage of the algebraic identity (i + 1)² = i² + 2 i + 1 :

	int	i, isq;
	for (i = isq = 0; i < N; ) {
		tbl[i] = isq;
		isq = isq + 2 * i + 1;
		i++;
	}

We have gotten rid of the general multiplication and replaced it with a multiplication by a power of 2 and two adds. The multiplication by 2 is implemented as a simple left shift by 1 bit, so all three operations are single cycle operations. The run time for mult, in cycles, was given in the following table:

Implementation   mult   multu   div   divu
R2000              12      12    35     35
R3000              12      12    35     35
R4000              10      10    69     69
R6000              17      18    38     37

Thus, we replaced a 10-18 cycle multiply with a 3 cycle instruction sequence. The overall speedup is not, however, 3-6 times, since the store to tbl[i] requires two cycles for the address calculation, plus the store-to-memory overhead, and there is other loop overhead as well (testing and incrementing the loop control variable i).
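
A quick way to convince yourself that the rewritten loop fills the table with the same values is to run both versions side by side. The sketch below does that in C; the function names squares_mult and squares_reduced are invented here, and the table and its length are passed as parameters rather than using the tbl and N of the lecture code.

```c
/* Fill tbl[0..n-1] with squares using one multiply per element. */
void squares_mult(int *tbl, int n)
{
	int i;

	for (i = 0; i < n; i++)
		tbl[i] = i * i;
}

/* Same table using a shift and two adds per element. */
void squares_reduced(int *tbl, int n)
{
	int i, isq;

	for (i = isq = 0; i < n; i++) {
		tbl[i] = isq;
		isq = isq + (i << 1) + 1;	/* (i+1)^2 = i^2 + 2i + 1 */
	}
}
```

A modern compiler may well perform this transformation, and turn 2 * i into a shift, on its own; writing it out by hand here is just to show the idea.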

At the end of the class, I asked what you would do for

	int	i;
	for (i = 0; i < N; i++) {
		tbl[i] = i * i * i;
	}

to eliminate the multiplications. You should think this through.

bsy@cse.ucsd.edu, last updated Wed Oct 29 14:40:52 PST 1997.
