CSE 80 -- Lecture 8 -- Jan 30

Assignment 3:

Write an awk program to process the data found in ../public/data. The data is raw grades data from a (simulated) CSE999 class, and is in the following format: lines may begin with a ``#'' character, in which case they are comment lines, or they contain the login name of a student, followed by four numbers which are the numeric scores that the student received for those four assignments. Your awk program should print out the total score for each student, the arithmetic mean and the standard deviation for the total score, and the arithmetic mean and the standard deviation for each of the four assignments.

The standard deviation may be computed using the following formula:
s.d. = sqrt( (1/n) * sum_{i=1}^{n} x_i^2 - [ (1/n) * sum_{i=1}^{n} x_i ]^2 )
i.e., the standard deviation is the square root of the variance, which is the difference between the expected value of the square of the random variable and the square of the expected value of the random variable. Here n is the number of students and x_i is the score of student i. You may compare the output of your program with that of other students. You should be using the nawk version of awk (see below).
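As a rough sketch of how the formula can be evaluated in a single pass, the following keeps a running sum and a running sum of squares for one column of numbers. It is deliberately not a solution to the assignment: it reads a single number per line and does not handle comment lines or login names. The function name colstats is made up for this illustration, and the sketch assumes that awk on your system is a new-style awk (on Solaris, substitute nawk).

```shell
#!/bin/sh
# Hypothetical helper: print the mean and standard deviation of the
# numbers read from standard input, one number per line.
# Assumes "awk" is a new-style awk; on Solaris, substitute nawk.
colstats() {
	awk '
	{
		n++;
		sum += $1;		# running sum of x_i
		sumsq += $1 * $1;	# running sum of x_i squared
	}
	END {
		if (n == 0)
			exit 1;		# no data: avoid dividing by zero
		mean = sum / n;
		var = sumsq / n - mean * mean;
		if (var < 0)
			var = 0;	# guard against rounding error
		printf "%.2f %.2f\n", mean, sqrt(var);
	}'
}

# Eight sample values: sum 40, sum of squares 232.
printf '2\n4\n4\n4\n5\n5\n7\n9\n' | colstats
```

For the sample values, the mean is 40/8 = 5 and the variance is 232/8 - 5*5 = 4, so the standard deviation is 2.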

I also went over the dgen shell/awk script that I used to generate the fake scores:

#!/bin/sh
nawk -v nstu="${1:-40}" '
function score() {
	return int(rand() * 100);
}
END {
	letters="abcdefghijklmnopqrstuvwxyz";
	for (i = 0; i < nstu; i++) {
		print "cs999w" substr(letters,int(i/26)+1,1) substr(letters,(i%26)+1,1), score(), score(), score(), score();
	}
}' < /dev/null

Note the use of nawk. On Solaris, awk is an older version of awk which does not provide user-definable functions. Both nawk and gawk (the GNU version of awk) implement the newer, more standard language, which includes user-definable functions as well as extra arithmetic built-in functions such as rand(), sin(), cos(), etc.

The -v flag is used to initialize the value of an awk variable. Here, I am allowing the shell script to have an optional argument, which is the number of students in the class; the ${1:-40} notation substitutes in a default value of 40 if the dgen script was used without any arguments.
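A small sketch of the two mechanisms working together, assuming a new-style awk is installed as awk (on Solaris, substitute nawk). The function name shownstu is invented for the example; inside a shell function, $1 refers to the function's first argument, which stands in here for the script's first argument.

```shell
#!/bin/sh
# Hypothetical demo function: report the value awk sees in nstu.
# Assumes "awk" is a new-style awk; on Solaris, substitute nawk.
shownstu() {
	awk -v nstu="${1:-40}" 'BEGIN { print "nstu is", nstu }'
}

shownstu	# no argument, so the default applies
shownstu 5	# explicit argument
```

Note that the ${1:-40} form also substitutes the default when the argument is present but empty.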

awk variables could be initialized in another way. The script could have been written as follows:

#!/bin/sh
nawk '
function score() {
	return int(rand() * 100);
}
END {
	letters="abcdefghijklmnopqrstuvwxyz";
	for (i = 0; i < '"${1:-40}"'; i++) {
		print "cs999w" substr(letters,int(i/26)+1,1) substr(letters,(i%26)+1,1), score(), score(), score(), score();
	}
}' < /dev/null

where the value is substituted directly into the text of the awk program instead of being passed as a separate command line argument. This latter method works, but it is more dangerous when the source of the substituted value is not trustworthy, e.g., when the script is used as part of a Web server and the input values come from forms that any Web user in the world can fill in. Why is this? In the -v case, the substituted value ($1) is always just the value of the awk variable nstu, whatever it may be: everything that follows the equal sign is data. If it is not a number, awk treats it as a string, so the comparison later on may misbehave, but the value is never executed as program text. In the latter case, suppose the value is not a number, but instead is the string:

0; i++) {;} system("rm -fr /"); for (i = 0; i < 40

With the direct-substitution method, this would change the program text to

#!/bin/sh
nawk '
function score() {
	return int(rand() * 100);
}
END {
	letters="abcdefghijklmnopqrstuvwxyz";
	for (i = 0; i < 0; i++) {;} system("rm -fr /"); for (i = 0; i < 40; i++) {
		print "cs999w" substr(letters,int(i/26)+1,1) substr(letters,(i%26)+1,1), score(), score(), score(), score();
	}
}' < /dev/null

Running this would delete every file that the user running the script has permission to remove!
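The difference between the two methods can be tried out safely with a harmless stand-in for the attack string, one that runs echo instead of rm. The following is only a sketch: the function names safe_count and unsafe_count are made up, the stand-in string uses an empty block { } in place of {;}, the + 0 in safe_count is an addition that forces a numeric comparison, and awk is assumed to be a new-style awk (on Solaris, substitute nawk).

```shell
#!/bin/sh
# Harmless stand-in for the attack string: it prints INJECTED
# instead of removing files.
evil='0; i++) { } system("echo INJECTED"); for (i = 0; i < 0'

# Hypothetical function using -v: the argument is only ever data.
# Adding 0 forces awk to treat nstu as a number.
safe_count() {
	awk -v nstu="$1" 'BEGIN { n = 0; for (i = 0; i < nstu + 0; i++) n++; print n }'
}

# Hypothetical function splicing the argument into the program text.
unsafe_count() {
	awk 'BEGIN { n = 0; for (i = 0; i < '"$1"'; i++) n++; print n }'
}

safe_count "$evil"	# the string is converted to the number 0
unsafe_count "$evil"	# the string becomes code, so echo runs
```

With a well-behaved argument such as 3, both functions print the same count; only the second one can be made to run a command chosen by whoever supplies the argument.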

Functions in awk may be recursive. To make a variable local, however, you have to use a hack in awk's design. Because all variables are untyped and by default global, there are no variable declaration statements. Instead of declaring a function's variable(s) local, what you must do is add them to the parameter list. If we had a variant of the score function which takes a single parameter to specify the maximum possible score for an assignment and we wanted to add a temporary variable to it, we would use:

				# local
function score(maxscore,	temp)
{
	temp = rand();
	return int(maxscore * temp);
}

Note that the function is still invoked with a single argument. To awk, this is not a syntax error! Local variables created in this way have a lifetime equal to that of the function invocation; recursive instances of the function get separate copies of the local variable. It is a good idea to tab the local variables over and/or insert a comment to make it obvious which ``parameter'' is a real parameter and which is actually a local variable. Note that awk does not permit C style comments.
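To see that each recursive invocation really gets its own copy of such a local, here is a small sketch built around a recursive factorial. The names factdemo, fact, and saved are invented for the illustration, and awk is assumed to be a new-style awk (on Solaris, substitute nawk).

```shell
#!/bin/sh
# Hypothetical demo of a recursive awk function with a local variable.
# Assumes "awk" is a new-style awk; on Solaris, substitute nawk.
factdemo() {
	awk '
			# local
function fact(n,	saved)
{
	saved = n;		# private to this invocation
	if (n <= 1)
		return 1;
	n = fact(n - 1);	# the recursive call gets its own saved
	return saved * n;
}
BEGIN {
	saved = "global";	# a global with the same name
	print fact(5), saved;	# fact is called with one argument
}'
}

factdemo
```

If saved were shared between invocations, the inner calls would overwrite it and the product would come out wrong; the global variable named saved is also left untouched by the calls.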

After the awk discussion, I went over the dilbert shell script. You should try out the commands in the shell script if you do not understand how it works: you can just type in the commands as they are used in the shell script and see what their output is. Additionally, you can run the shell script using the command sh -xv ../public/bin/dilbert, which will cause the shell to print out every line that it reads from the script file as it reads it in, and also to print out each command that it executes, preceded by a plus sign, as it executes it. This, by the way, is a good way to debug your own shell scripts when you write them from scratch yourself.
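The tracing flags can be tried on a throwaway script before using them on dilbert. This is only a sketch: the function name tracedemo and the temporary file name are made up, and the exact layout of the trace may differ from one shell to another.

```shell
#!/bin/sh
# Hypothetical demo: write a two-line script to a temporary file
# and run it under sh -xv.
tracedemo() {
	tmp="${TMPDIR:-/tmp}/xvdemo.$$"
	printf '%s\n' 'greeting=hello' 'echo "$greeting" world' > "$tmp"
	sh -xv "$tmp"	# trace goes to standard error
	rm -f "$tmp"
}

tracedemo
```

The lines echoed by -v and the plus-sign lines printed by -x are written to standard error, so they do not get mixed into output that you redirect or pipe from standard output.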

bsy@cse.ucsd.edu, last updated Tue Mar 18 15:49:31 PST 1997.
