CSE 30 -- Lecture 9 -- Oct 27


Endianism

Big-endian byte order: on a 32-bit machine, the most significant byte of a 4-byte word is stored at the lowest-addressed byte of a consecutive 4-byte region of memory. (On a 64-bit processor, a word is 8 bytes wide, so it would be a consecutive 8-byte region.)

Little-endian byte order: the reverse. The least significant byte is stored in the lowest address.

The MIPS architecture is bi-endian -- the processor chip can be configured to be big-endian (as in the SGI machines) or little-endian (as in the old DEC machines).

Word alignment

RISC processors typically require that the load/store instructions reference memory with "aligned" addresses. On a 32-bit machine, a word is 4 bytes wide, and alignment means that the address must be a multiple of 4 -- or equivalently, the low-order 2 bits of the address are 0.

The reason for doing this is performance/simplicity: the data bus is typically the width of a cache line (or a small integer fraction of a cache line for lower cost implementations), and aligned memory references mean that in all cases a single bus transaction will suffice to obtain the word in a cache miss. If words do not have to be aligned, then two bus transfers would be needed if the word spans two cache lines. Requiring the hardware to detect when this is needed and handle such transfers makes the processor implementation more complex and thus slower.

Recursion assembly example

Factorial example:
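fact:		sub $sp,$sp,12		# allocate a 12-byte stack frame
		sw $fp,4($sp)		# save the caller's frame pointer
		add $fp,$sp,12		# $fp = value of $sp at entry
		sw $ra,-4($fp)		# save the return address
		bgt $a0,1,rec_fact	# n > 1: recursive case
		li $v0,1		# base case: return 1
		b rec_done
rec_fact:	sw $a0,0($fp)		# save n across the call
		sub $a0,$a0,1		# argument is n - 1
		jal fact		# $v0 = fact(n - 1)
		lw $a0,0($fp)		# restore n
		mul $v0,$v0,$a0		# $v0 = fact(n - 1) * n
rec_done:	move $t0,$fp		# remember the value of $sp at entry
		lw $ra,-4($fp)		# restore $ra while $fp still points at this frame
		lw $fp,-8($fp)		# restore the caller's frame pointer
		move $sp,$t0		# pop the frame
		jr $ra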
This is the same as:
int	fact(int	n)
{
	if (n <= 1) return 1;
	else return n * fact(n-1);
}

Loop invariant as proof technique

int	fact2(int	n)
{
	int	v = 1 ,i;

	for (i = n; i > 1; i--)
		v = v * i;
	return v;
}
The loop invariant -- at the test -- is n! = v * i!. We prove this using induction on the number of iterations. The base case is the first test, when i = n and v = 1, which obviously satisfies the invariant expression. We do induction counting down: assume that the invariant holds at the test, so after some k iterations we have i = i_k and v = v_k satisfying n! = v_k * i_k!. We run the loop body once, and see that the variable v is updated to contain v_k * i_k, and the variable i is updated to contain i_k - 1. We check whether the new values in the variables satisfy the invariant:
v * i! = (v_k * i_k) * i!
       = v_k * i_k * (i_k - 1)!
       = v_k * i_k!
       = n!
For n >= 0 the loop exits with i equal to 0 or 1, so i! = 1 and the invariant gives v = n!.

Compiling loops into assembler

We look at a for loop as in the above example, and first convert it into a while loop:
	i = n;
	while (i > 1) {
		v = v * i;
		i--;
	}
The naive way to compile this loop into assembly is to write it as:
	move $t0,$a0	# $t0 is i, $a0 is n, $t1 is v
top:	ble $t0,1,loop_done
	mult $t1,$t0	# mul $t1,$t1,$t0 expands into this
	mflo $t1	# two-instruction sequence
	sub $t0,$t0,1
	b top
loop_done:
but it is more efficiently translated as:
	move $t0,$a0
	b test
top:	mult $t1,$t0
	mflo $t1
	sub $t0,$t0,1
test:	bgt $t0,1,top
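This is more efficient because each pass through the loop executes only one branch (the bgt at the bottom) instead of two (the ble at the top and the b at the bottom), so the loop body is one instruction shorter. The price is a single extra branch (b test) that is executed once, on entry to the loop.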

Strength Reduction

We next looked at an algorithmic way to improve the speed of your code. The example code is for initializing an array so that each element contains the square of its index:
	int	i;
	for (i = 0; i < N; i++) {
		tbl[i] = i * i;
	}
We take advantage of the algebraic identity (i + 1)^2 = i^2 + 2i + 1:
	int	i, isq;
	for (i = isq = 0; i < N; ) {
		tbl[i] = isq;
		isq = isq + 2 * i + 1;
		i++;
	}
We have gotten rid of the general multiplication and replaced it with a multiplication by a power of 2 and two adds. The multiplication by 2 is implemented as a simple left shift by 1 bit, so all three operations are single-cycle operations. The run time (in cycles) for mult and the related instructions was given in the following table:
Implementation  mult  multu  div  divu
R2000             12     12   35    35
R3000             12     12   35    35
R4000             10     10   69    69
R6000             17     18   38    37
Thus, we replaced a 10-18 cycle multiply with a 3-cycle instruction sequence. The overall speedup is not, however, 3-6 times, since the store to tbl[i] requires two cycles for the address calculation plus the store-to-memory overhead, and there is other loop overhead as well (testing and incrementing the loop control variable i).

At the end of the class, I asked what you would do for

	int	i;
	for (i = 0; i < N; i++) {
		tbl[i] = i * i * i;
	}
to eliminate the multiplications. You should think this through.

bsy@cse.ucsd.edu, last updated Wed Oct 29 14:40:52 PST 1997.
