University of Alaska Fairbanks
Geophysical Institute

Beyond the Mouse 2010 - The geoscientist's computational chest.

Lab 9: Unix / Shell Introduction II

"Programming is legitimate and necessary academic endeavor."
Donald E. Knuth

Lab slides

none.

Note

As your solution, send me your scripts and the answers to the questions. No data files please, I have plenty of those :)

Running the VirtualBox

Check here if you forgot how that works. Really. Go there if you forgot something.

Exercise 0: Permanently changing your Path and stuff

To make your shell scripts accessible on all machines, they should go into a special directory on the shared drive: N:\ on the Windows machines, which is mounted at /home/btm_user/N_DRIVE inside the VirtualBox. Since there is only one user, btm_user, for the VirtualBox but many Windows users who will use this login, everybody has to use the same directory name for the scripts.

Go to /home/btm_user/N_DRIVE and create a directory btm_unix_scripts. Check here to see how to do this.

Now that you all have this directory, you will have to edit the .tcshrc file. This is the "Run Command" (rc) file for the tcsh shell. It is executed every time you log into a shell or open a new terminal window or subshell. All environment variables, aliases, etc. defined in it will therefore be available in any shell session you start on this system. Here is a brief description of things that happen during the login process (for a shell). As you can see, you can easily configure your working environment using this file. If it does not exist, you will have to create it. (The leading dot is important: it is part of the filename and 'hides' the file in normal ls listings. This is generally used for configuration files and directories that have to be in your home directory but that you don't have much business messing around with, or so the developers thought. To see all the stuff that's in your directory, try ls -lisa. The options l, i, s, a are explained in the man pages of ls.)

After all this talk, here's what to do (assuming you created /home/btm_user/N_DRIVE/btm_unix_scripts):

go to your home directory (a cd w/o arguments will get you there)
> gedit .tcshrc & to open the run command file in an editor
You will find that earlier I already modified the path so that MATLAB can be found
recall the syntax: setenv VARIABLE value
first add another environment variable called BTM_BIN and set the value to /home/btm_user/N_DRIVE/btm_unix_scripts
now modify the value of PATH: put .:${BTM_BIN}: before all the other stuff
the ':' is a field separator which the shell uses to tell the different directories in the path apart.
the '.' adds the current directory to the search path.
save the file
for these changes to take effect you need to open a new terminal window
TEST 1: > cd $BTM_BIN should beam you into ~/N_DRIVE/btm_unix_scripts
TEST 2: > env | grep btm_unix will show you whether your changes were successful for the path ... and you learn something about variables
If either test fails, fix it!
Question: What does env | grep btm_unix do? Consult man-pages, lecture notes, the Internet.
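
To see what the ':' separator does, it can help to print a search path one directory per line. The following is only a small sketch, written as it might appear in a script file (tcsh treats '#' as a comment only in scripts). The string in the first command is a made-up example of what PATH could look like after the edit; the commands rely on tr, so they should behave the same in tcsh and other shells.

```shell
# Split a PATH-like string at the ':' field separator, one directory per line.
# The string is a made-up example, not the actual PATH of the VirtualBox.
echo ".:/home/btm_user/N_DRIVE/btm_unix_scripts:/usr/bin" | tr ':' '\n'

# The same for the real search path. After the edit to .tcshrc, '.' and the
# script directory should show up as the first two lines.
echo "$PATH" | tr ':' '\n'
```

The first command prints three lines: '.', the script directory, and /usr/bin.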

Exercise 1: Exercise 1 of last week's lab, altered by a little bit -- Writing a Shell script (Commented solution)

go to your home directory in a Terminal
retrieve a package with data files using wget:
wget http://www.gps.alaska.edu/programming/lectures/lab08.tar.gz
unpack a *.tar.gz archive using tar: tar xfz lab08.tar.gz
check whether you have a directory lab08
keep the downloaded tar file as a backup in case you mess your data directory up further down.

In this directory you will find GPS data for a certain day. That's not essential. The key is that there are many, many files. Some are gzipped; others are duplicates that exist in both gzipped and unzipped form. What I want you to do now is find all the duplicates and rename the unzipped files to all upper case:

open a text editor (gedit)
write a shell script which you will save to $BTM_BIN that will:

loop over all qm files in the current directory (use foreach and the backtick operator, inside which you do an ls)
check whether the file exists in gzipped version (hint: if statement)
echo the duplicate

While developing this script you might wanna test it. To test it, you will have to make it executable:

> cd $BTM_BIN
> chmod u+x NAME_OF_YOUR_SCRIPT
Question: What does > chmod u+x NAME_OF_YOUR_SCRIPT do? Consult man-pages, lecture notes, the Internet.

You will have to repeat this for any script you want to be executed on the command line. Otherwise you'll get a "could not find ..." response.

The testing happens in ~/lab08! Open a new terminal window (to refresh the path contents with your new executable), go to ~/lab08, and execute whatever you called your script. And yes, I will do exactly that and expect your script to work, no matter where it is stored.

Now the funky part: Rename the file such that all lower case letters are upper case. You will find almost a full solution at this website. You will have to do the conversion from lower case to upper case, since they convert from upper case to lower case. I find this task challenging and yet rewarding enough that giving the solution away is fine with me. However, you still have to find the correct line on the website, copy it correctly into your script, modify it, and explain to me what this line does (use man pages and Internet to find answers).
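
As a hedged illustration of the kind of tool involved (not necessarily the line used on the website), tr translates one set of characters into another. The snippet is written as it might appear in a script file, and the file name in it is hypothetical.

```shell
# Translate every lower case letter of a (hypothetical) file name to upper case.
# tr reads from standard input, so the name is piped in with echo.
echo "fair0010.qm" | tr '[:lower:]' '[:upper:]'
```

This prints FAIR0010.QM. Wrapped in backticks, the output of such a pipeline can serve as the second argument of mv.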

As a guideline: my neatly formatted, yet uncommented solution script is 8 lines long.

Once you're done try > ls *QM | wc -l
The result should be 944.

You've just changed the name of 944 files. Given the boredom caused by doing the actual conversion by hand and the number of files, writing the script, testing, failing, fixing, testing, succeeding was still a lot faster.

Now you could go ahead and remove all files that are upper case using rm ./*.QM. Renaming first is a safe way to mark files and, hopefully, avoid deleting precious data accidentally. Good thing you have backups. The result should be:
> ls *qm
ls: No match.
Good! All the unnecessary stuff has been removed.

To be fair though, in real life you would simply call: gzip *qm and let gzip complain about existing files. But the point of this exercise was to introduce you to a few unix tools, get you to do some scripting and do a simple task on many, many files. I hope this objective was accomplished.

Exercise 2: Data Handling with awk (Commented solution, Commented log file of some runs of the code)

Hopefully you remember Exercise 2 of Lab 05. If not, you will most likely remember that at some point in the past we had you fiddle with pesky formatting strings to extract some data from a file that contained a lot more data. That was Exercise 2 of Lab 05. Now we'll go back to the FAIR.pfiles text file and treat it with Unix tools to extract the information we want.

Download the file FAIR.pfiles to somewhere in your home directory
On the command line, use awk to extract the epoch, longitude, latitude, and height from the FAIR.pfiles file (HINT: separate variables with a comma in the print statement and they will be separated by a space in the output).
Now figure out how to redirect this output into a new file FAIR2.llh
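
The two actions above can be sketched on a tiny stand-in file. Everything here is an assumption made for illustration: the file sample.txt, its placeholder entries, and the column positions are invented and do NOT reflect the real layout of FAIR.pfiles, so you still have to find the right columns yourself. The snippet is written as it might appear in a script file.

```shell
# Create a made-up two-line input with five whitespace-separated columns.
# The entries are placeholders, not real GPS data.
echo "epoch1 lon1 lat1 h1 other1" > sample.txt
echo "epoch2 lon2 lat2 h2 other2" >> sample.txt

# Print columns 1 to 4. The commas in the print statement put a single
# space between the fields; '>' redirects the output into a new file.
awk '{print $1, $2, $3, $4}' sample.txt > sample.llh

# Show the result.
cat sample.llh
```

The file sample.llh then contains the first four columns of each line, and the fifth column is dropped.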

Now that you know how to do those two key actions, create a new tcsh script pfiles2llh in $BTM_BIN that generalizes this for any .pfiles file that is handed to it as a command line argument. The format for executing this script at the command line should be like this:

> pfiles2llh STATION_NAME.pfiles

Command line arguments are given to a script in various forms. ONE is using the built-in variables $0, $1 ... $N. Inside your script $0 is the program name that has been called. $1, $2, ..., $N are the first argument, 2nd, ..., n-th argument for the program that has been called. This convention is generally used when you have a few arguments that you expect to be handed to the script in a certain order. Next week, we'll do something more fancy. Here is an interesting article that tells you how to find the maximum number of arguments for a shell command.

Here's what your script is expected to do:

Your script should expect ONE file to be the FIRST argument when it is called. Bonus exercise: use $# to test for the existence of an argument (i.e. give an error/usage message when there is NO argument).
use awk to extract the epoch, longitude, latitude, and height from the .pfiles file that has been passed (you basically did this above)
redirect the output into a new file STATION_NAME.llh:

you will assume that a pfiles file is named following the convention STATION_NAME.pfiles
Great, so you can extract the STATION_NAME using the program basename; save the value to a variable inside your script (e.g. sta_name)
The output file should end up in the same directory as the input file. You can extract the path or directory name of the input file using dirname. Again, save the result to a variable in the script (e.g. sta_path).
now redirect the output of your awk call into ${sta_path}/${sta_name}.llh

HINT: echo the values of the variables you set for testing, to make sure you're doing things right!
You should be able to call your script from anywhere (e.g. from other scripts), with the data sitting in any other place.
BONUS: Allow for an optional second argument that will be the output directory (not the filename though).
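
The two helper programs mentioned above can be tried on their own before you use them in the script. This is a minimal sketch written as it might appear in a script file; the path /data/gps/FAIR.pfiles is hypothetical and the file does not need to exist, since both programs only work on the text of the path.

```shell
# basename strips the directory part and, if a suffix is given as the
# second argument, also removes that suffix from the end of the name.
basename /data/gps/FAIR.pfiles .pfiles

# dirname keeps only the directory part of the path.
dirname /data/gps/FAIR.pfiles
```

The first command prints FAIR, the second prints /data/gps. Inside your script you would capture each result with backticks and store it in a variable.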

ronni <at> gi <dot> alaska <dot> edu | last changed: December 27, 2010
