New Mexico Tech
Earth and Environmental Science

ERTH 401 / GEOP 501 - Computational Methods & Tools

Lab 7: Data Input and Output

Being able to import data from user input, text files, Excel spreadsheets, etc., is an important component of the computing you are likely to do. This lab will give you practice with a variety of Python input and output tools. You will also have an opportunity to try out many of the tools covered in previous labs on larger datasets.

Deliverables for this lab are outlined in section 5. These will be well-commented Python scripts that contain the various pieces of Python code needed to handle the tasks in section 5. The earlier sections (1-4) are practice with the various input/output methods - we encourage you to do these before jumping into section 5.

0) Package installation as an individual user

Sometimes you want to expand the functionality of your existing Python installation. Maybe you're interested in converting physical addresses to lat/lon coordinates and vice versa. The Python Package Index (PyPI) is a good place to look. One recommended tool for package installation (among the plethora of ways to do this) is pip. It's usually as easy as:

    $>pip install PACKAGE
  

which you would call on the command line. However, you often need administrator privileges to perform this installation, which we don't have. There's an easy way around this: installation in your user directory, to which you definitely have write access:

    $>pip install --user PACKAGE
  

For today, you actually need to do this for the Excel reader library that pandas depends upon (for some reason this dependency wasn't resolved during the pandas installation). It's a good opportunity to give this a test drive so you can expand your own Python environment in the future (if a package is really worthwhile, you should still give the admin a nudge to install it for everyone). On the command line, please run:

    $>pip install --user xlrd
  

1) Screen Inputs/Outputs (raw_input(), input(), print())

Documentation:
raw_input
print

One of the simplest ways to get data into Python is using user input in the terminal. Similarly, unformatted output to the screen is also easy, and you've already used this in earlier labs.

The raw_input() function reads text from standard input and returns the input value as a string. The print() function writes values to the screen. An example is

>>> txt = raw_input("Enter text here: ")
Enter text here:      #where you will be prompted to enter some value

You can then check the contents of your variable txt

>>> txt   # whatever you entered above

Similarly, you can use the print() function to display the value of txt:

>>> print(txt)  #whatever you entered above

There is also the input() function. In Python 2.7, input() differs from raw_input() in that Python evaluates the input (it tries to run it as an expression; note this changes in Python 3). Explore the differences (using the type() function) and the resulting errors with raw_input() vs input() by entering:

a) a string
b) an integer
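The contrast can be sketched without an interactive prompt by noting that, in Python 2, input() is essentially eval(raw_input()). Here we simulate the entry 42 with a fixed string (the variable names are just for illustration):

```python
# Simulating Python 2's raw_input() vs input() with a fixed entry.
# raw_input() always returns the raw string typed at the prompt;
# input() passes that string through eval(), so "42" becomes an int.
entry = "42"           # stands in for what the user typed
as_raw = entry         # what raw_input() would give you
as_eval = eval(entry)  # what input() would give you in Python 2

print(type(as_raw))    # a string
print(type(as_eval))   # an integer
```

Try the same simulation with a string entry such as "hello" and notice that eval() fails unless the name is defined; this is exactly the error you will see from input() in Python 2.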

2) File Reading and Writing (open(), read(), write(), close(), readlines())

Documentation

In many cases, you will have more data than you want to type in manually. Instead you should store this data in files. There are a variety of options for reading data files, and we'll step through several of them here.

Using the Python built-in functions requires two separate steps: first open a file, then read it. Similarly, you can write to a file, and then close the file at the end.

Open

>>> f1 = open(file_name, mode)

where file_name is a string (so include it in single quotes) and mode is what you want to do with the file (also a string in single quotes).

Mode options include:
'r': reading only (note this is the default if you omit the mode)
'w': writing only (note that this will erase any existing file with the same name)
'a': appending a file with new data at the end
'r+': reading and writing

Read

Once you have a file open, you can read all its contents at once:

f1.read()

This command returns the entire file's contents as one string; at the interactive prompt, this appears as an ugly dump with '\n' shown for the line breaks. It is also probably not what you want if you are trying to process data line by line.

There are a few options for reading lines of a file:
Single line: readline()
Each line individually: loop over the file contents

>>> f1 = open('W7_P2_file.txt','r')
>>> for line in f1:
...     print(line)

Write

Here again you need to first have a file open, now in writing mode, before writing to it.

>>> f2 = open('file_write.txt','w')
>>> f2.write('Practice at writing to a new file')

If you want to write something that isn't a string, you need to convert it to a string first before writing it.
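For example, a number must go through str() before it can be written. A minimal sketch (the filename file_write2.txt is just for illustration):

```python
f3 = open('file_write2.txt', 'w')
value = 3.14159
# write() only accepts strings, so convert the float first
f3.write('pi is roughly ' + str(value) + '\n')
f3.close()
```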

Close

After reading from or writing to files, you need to close them using close()

>>> f1.close()
>>> f2.close()

You should now have a new file in the working directory called "file_write.txt"

with

Finally, remember the with shortcut we used in Lab 4. This allows you to open, read, and close the file with very little code.
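As a reminder, the write-then-read steps above condense nicely with with, which closes each file automatically when its block ends:

```python
# The file is closed automatically at the end of each with-block,
# even if an error occurs inside it.
with open('file_write.txt', 'w') as f:
    f.write('Practice at writing to a new file\n')

with open('file_write.txt', 'r') as f:
    for line in f:
        print(line.rstrip('\n'))  # strip the trailing newline before printing
```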

3) NumPy Tools (loadtxt(), genfromtxt(), savetxt())

Documentation:
loadtxt
genfromtxt
savetxt

For reading in files using NumPy, there are two main options: loadtxt() and genfromtxt(). Using genfromtxt() allows more flexibility in handling missing data and mixed data types.

An example here:

>>> import numpy as np
>>> example_array = np.genfromtxt('station.txt', dtype=None, delimiter=" ", skip_header=1)

This will import all the values in file station.txt, starting after the first line, and put them into a NumPy array called example_array. You now have all the NumPy tools available for use on this NumPy array. Look through the documentation to find the other parameters used by genfromtxt() - many of these are quite helpful for pulling data out of files (like unpack, usecols, names). Another example using the same datafile:

>>> example_array = np.genfromtxt('station.txt', dtype=None, delimiter=" ", names=True)

where the difference is the "names" parameter set to True, meaning that it pulls the column names from the header line in the file. You can then access various columns using the names, for example:

>>> print(example_array['Name'])

gives you an array of the values in the Name column:

['ANMO' 'BAR' 'BMT' 'CAR' 'CBET' 'CL2B' 'CL7' 'CPRX' 'DAG' 'GDL2' 'HTMS' 'LAZ' 'LEM' 'LPM' 'MLM' 'SBY' 'SMC' 'SRH' 'SSS' 'Y22A' 'Y22D' 'WTX']

To write a file using NumPy, you can use savetxt() to save arrays using defined formats. An example to write out the example_array to a new file called file.out:

>>> np.savetxt('file.out', example_array, fmt='%s %f %f %i %i %i')

In this example, fmt defines the format for each column of example_array as it is saved to file.out (%s = string, %f = float, %i = integer).
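Putting genfromtxt() and savetxt() together, here is a self-contained round trip using a tiny file in the same layout as station.txt (the two data rows are copied from the listing below; the filenames demo_station.txt and demo_out.txt are just for illustration):

```python
import numpy as np

# Build a tiny stand-in for station.txt: a header line plus two rows.
with open('demo_station.txt', 'w') as f:
    f.write('Name Lat Lon Elevation Type Number\n')
    f.write('ANMO 34.95 -106.46 1820 1 2\n')
    f.write('LEM 34.166 -106.972 1698 1 1\n')

# names=True takes the column names from the header line;
# dtype=None lets genfromtxt infer a type for each column.
arr = np.genfromtxt('demo_station.txt', dtype=None, delimiter=' ',
                    names=True, encoding=None)
print(arr['Name'])   # array of the station names
print(arr['Lat'])    # array of the latitudes

# Write the structured array back out, one format code per column.
np.savetxt('demo_out.txt', arr, fmt='%s %f %f %i %i %i')
```

Note the encoding=None argument: under Python 3 it makes genfromtxt() return the Name column as ordinary strings rather than bytes.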

4) Pandas (read_csv(), to_csv(), read_excel())

Documentation:
read_csv
read_excel
to_csv

We spent time in the last lab working with DataFrames from the pandas library, so it's worth knowing how to bring data from files directly into a DataFrame. Doing so is similar to what we've used in the earlier sections, and pandas can read text files and Excel spreadsheets (as well as a lot of other file formats).

>>> import pandas as pd
>>> station1_df = pd.read_csv('station.txt', sep=" ", header=0)

You now have a DataFrame called station1_df that contains all the information from the file station.txt.

>>> station1_df
    Name      Lat      Lon  Elevation  Type  Number
0   ANMO  34.9500 -106.460       1820     1       2
1    BAR  34.1500 -106.628       2121     1       3
2    BMT  34.2750 -107.260       1987     1       4
3    CAR  33.9525 -106.734       1658     1       5
4   CBET  32.4200 -103.990       1042     1       6
5   CL2B  32.2300 -103.880       2121     1       7
6    CL7  32.4400 -103.810       1032     1       8
7   CPRX  33.0308 -103.867       1356     1       9
8    DAG  32.5913 -104.691       1277     1      10
9   GDL2  32.2003 -104.364       1213     1      11
10  HTMS  32.4700 -103.600       1192     1      12
11   LAZ  34.4020 -107.139       1878     1      13
12   LEM  34.1660 -106.972       1698     1       1
13   LPM  34.3117 -106.632       1737     1      14
14   MLM  34.8100 -107.145       2088     1      15
15   SBY  33.9752 -107.181       3230     1      16
16   SMC  33.7787 -107.019       1560     1      17
17   SRH  32.4914 -104.515       1276     1      18
18   SSS  32.3500 -103.410       1072     1      19
19  Y22A  33.9370 -106.965       1674     1      20
20  Y22D  34.0739 -106.921       1436     1      21
21   WTX  34.0722 -106.946       1555     1      22

You can then work with the DataFrame as we discussed last week, pulling out values in the named columns as needed, using indexing, labels, etc.

Importing data from Excel files is similarly easy (make sure you installed xlrd in Step 0):

>>> station2_df = pd.read_excel('station.xlsx', sheetname='Sheet1')

with the sheetname parameter selecting which sheet of the workbook to read.

Saving the dataframe contents to a new, maybe comma-separated file can be done using to_csv, for example:

>>> station1_df.to_csv('station_out.txt', sep=" ")

will write the station1_df DataFrame to a new space-delimited file called station_out.txt.
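The read and write steps combine into a short round trip. This sketch builds a small stand-in file with two rows copied from the listing above (the filenames demo_station.txt and demo_out.txt are for illustration only):

```python
import pandas as pd

# A tiny space-delimited stand-in for station.txt.
with open('demo_station.txt', 'w') as f:
    f.write('Name Lat Lon Elevation Type Number\n')
    f.write('ANMO 34.95 -106.46 1820 1 2\n')
    f.write('LEM 34.166 -106.972 1698 1 1\n')

# header=0 takes the column names from the first line of the file.
df = pd.read_csv('demo_station.txt', sep=' ', header=0)
print(df['Name'])

# index=False leaves pandas' row index out of the output file.
df.to_csv('demo_out.txt', sep=' ', index=False)
```

The index=False argument is worth remembering: without it, to_csv() prepends the DataFrame's integer index as an extra first column in the output file.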

5) Integration of Previous Tools with File I/O

You now have a large suite of Python tools to deploy on various datasets. The deliverables for this lab will be 4 Python scripts that will build on all the material covered so far to perform some typical tasks you might face.

A) Use built-in Python tools (input, open, read, write, close) to request user input for a filename (here W7_P5A.txt), read in that file, create a dictionary from the entries (first column is the key, second is the value), and write out a new file that contains only the entries whose keys start with the letter A. Before doing anything, you may want to look at the file in a text editor.

B) Use the station.txt file to identify all stations within the region defined below and output the results to a new file, with the output filename including the latitude and longitude values that define the range. Use either NumPy or pandas to do this, and feel free to use functions defined in previous weeks.

Region:
max_lat = 34.
max_lon = -103.
min_lat = 32.
min_lon = -106.

C) Download the earthquake.txt file that I downloaded from the USGS earthquakes website, which contains earthquake information for a recent sequence in Mexico. Use Python tools to read in the file and do the following:

D) Work through importing data from a research file of your choosing and apply at least one data-processing function to the data (including plotting). Submit a Python script showing your import technique, analysis, and output of results, along with the data file used.

rg <at> nmt <dot> edu | Last modified: October 16 2017 17:33.