Ronni Grapenthin - Notes

twitter: @rngrp
New Mexico Tech
Dept. of Earth &
Environmental Science
801 Leroy Place
Socorro, NM-87801

When in grad school - fill your tool chest.

Published: 2015/09/03

Grad school puts a new level of strain on students compared to the good old undergraduate times. In between taking classes, figuring out research targets, and outlining a career (never too early), you should not forget to convert the code / scripts you'll be writing over the coming 2-N years into tools that are applicable to future problems. These problems could be popping up in chapters 2 and 3 of your thesis/dissertation or during your post-doc / whatever comes after. You should also put some time and effort into understanding software you didn't write at a level that enables you to install and run it at your new job. Your tools should become transportable and accompany you to future endeavors; much like a pirate's parrot.

This post is a natural expansion of my previous call to teach students how to make tools. Here, I go into how students should approach the -at first- intimidating task of using existing research-style software and writing their own.

The central questions are:

  • Will you be able to use your work after you leave your alma mater?
  • Is your code generalized to a degree that allows analysis of similar datasets without altering the source code?
  • Does your code solve a problem others will encounter? Consider making it available!

Research software continues, for the most part, to live far outside of traditional software development wisdom. When I transitioned from computer science into geophysics, I was exposed to some FORTRAN code and couldn't believe what a desolate condition software development was in. Fast-forward 10 years and I've learned that code that answers a specific question operates under a different set of rules. Often someone develops a theory and provides a well-tested FORTRAN code along with the journal article - and people will keep using this until a better theory accompanied by code comes along. Note that theories without accompanying code have a harder time establishing themselves! The reason? It is very expensive to sit down and (re)write someone's code, or transform math into software.

There's no point in rewriting old code if you can dig up a compiler that translates it and somehow get it to work. Particularly since academic hero-code often comes without comments and with creative variable names. Neither boosts confidence when touching the code. And so our virtual labs turn into duct-taped scaffolding that may very well collapse with the next compiler or OS upgrade. Once out of the protective shelter that grad school is (I know, it doesn't feel like it; enjoy it anyway) and drowning in -hopefully?- a tenure-track position, plenty of other tasks compete for your attention. Withdrawal to your coding lair will be a rare treat.

One of the worst things that can happen at that point is to lose access to the analysis tools you used when working on your dissertation. This can happen quite easily: your school provides proprietary / licensed software, which you depend on like addicts on their meth. Once you leave, you may face thousands of dollars in license fees, or you have to downgrade to a partial version of the software; either way it is as if you dropped your bucket of tools in a glacial stream. At least your source code is still yours; maybe there's a source-to-source translator to another, free language; or a free interpreter/compiler that runs directly on your code. Else, you're out of luck and you get to translate it yourself; manually. (How about creating such a translator and making it available for others to use?)

This gets at the first critical question you should ask when it comes to tool choices (see list above): pick a development environment that you can carry home when you leave, along with your degree. Sure, the lab you're working in may set tight constraints (policy, highly specialized commercial software package, ...). If so, operate as freely in these constraints as you can. More than likely, your adviser will not care too much about the minute details of your if-statements and instead debug your code with you via figures that illustrate the tests you've run. Then choose whatever is best for the future you.

The second, no less important point, lies with generalization and concurrent organization of your tools:

  • Turn that big fat script that does a dozen things into a dozen functions, each well parameterized, and reassemble the functionality of the original script through calls to these functions. Use variables at the top to set problem specific values and/or file names. If those values are the only thing that needs to change, you're set to analyze a new data set without having to touch the bulk of the code at all - less risk for new bugs!
  • This applies to compiled codes as well, maybe even more so. The worst that could happen is a separate binary of the same code base for each project just because you needed to change some parameters. Well, a copy of the entire source tree for each project may be worse (I've seen it!). As above, project-specific parameters are easily recorded persistently in a calling script.
  • If your code writes out results, it's a good idea to attach the critical parameters to the output. In a text file, add one or more header lines (e.g., with the command line that created the output); some binary formats like netCDF invite this practice naturally. And for crying out loud, add the units!
  • Place all your generalized code into a central directory in your file system. Avoid having your code spread all over your hard drive or you'll lose track of what you actually possess (particularly those handy helper scripts).
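A minimal sketch of the first and third points, in Python; the task (removing a linear rate from a short displacement series), all names, and all numbers are made up for illustration and stand in for whatever your script actually does. The problem-specific values sit at the top, the work is split into small parameterized functions, and the output carries the command line, the parameters, and the units in header lines.

```python
import sys

# Problem-specific values live at the top; a new data set should only
# require edits here. All names and numbers are hypothetical.
RATE_MM_PER_YR = 5.0                   # linear rate to remove
TIMES_YR = [2015.0, 2015.5, 2016.0]    # stand-in for values read from a file
DISPLACEMENT_MM = [0.0, 2.5, 5.0]


def detrend(times_yr, values_mm, rate_mm_per_yr):
    """Remove a linear trend, referenced to the first epoch."""
    t0 = times_yr[0]
    return [v - rate_mm_per_yr * (t - t0) for t, v in zip(times_yr, values_mm)]


def format_output(times_yr, values_mm, rate_mm_per_yr, command_line):
    """Return the result table as text, with parameters and units in header lines."""
    header = [
        f"# command: {command_line}",
        f"# rate removed: {rate_mm_per_yr} mm/yr",
        "# columns: time [yr], displacement [mm]",
    ]
    rows = [f"{t:.4f} {v:.3f}" for t, v in zip(times_yr, values_mm)]
    return "\n".join(header + rows) + "\n"


def run(times_yr, values_mm, rate_mm_per_yr, command_line):
    """Reassemble the original script's behavior from the functions above."""
    cleaned = detrend(times_yr, values_mm, rate_mm_per_yr)
    return format_output(times_yr, cleaned, rate_mm_per_yr, command_line)


if __name__ == "__main__":
    sys.stdout.write(run(TIMES_YR, DISPLACEMENT_MM, RATE_MM_PER_YR, " ".join(sys.argv)))
```

Because the functions take everything they need as arguments, they can move into your central tool directory unchanged; only the few lines at the top stay with the project.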

Once you've generalized and organized your tools you're fairly well set. Superstars go one step further and place their (academic) codes into publicly accessible repositories. Why? Because you may have solved a problem others encounter, and you can ultimately make their lives easier: they can use your tool to find solutions rather than wasting time, energy, and resources on reinventing the wheel. Such reinvention tends to be particularly infuriating in academia, where the math is usually communicated through papers and the reader is left to write the code on their own.

If you don't want to deal with code hosting yourself, GitHub is one service that does it for you, for free; Bitbucket is another one. GitHub, in collaboration with Zenodo, provides a mechanism to attach DOIs to your code. That makes it citable; the ultimate motivation for an academic, I guess. Bitbucket allows you to keep your repositories non-public without payment. I use both for different purposes.

Now you're a little more ready for grad school.

rg <at> nmt <dot> edu | Created: 2015/09/03 | Last modified: September 04 2015 04:18.