New Mexico Tech
Earth and Environmental Science

ERTH 401 / GEOP 501 - Computational Methods & Tools

Lab 6: Data Structures

As mentioned in the Data Structures lecture, Python provides a number of built-in and third-party containers (i.e., data structures) to hold data. Some structures are better suited than others to a given task; this lab gives you hands-on experience with most of them.

The deliverable for this lab is one Python script containing the code needed to answer the questions below. Please label the sections of your script using the numbering of the topics below.

0) Built-in: Lists

Documentation

We covered lists in Lab 3, so please review them as needed. We'll spend the bulk of today's lab working on the new data structures.

1) Built-in: Sets

Documentation

Sets are containers that hold unique values. To define a set, provide comma-separated values between {}. (Note: to run this in Python 2.7 on the Linux machines, first run scl enable python27 bash in the terminal.)
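
The uniqueness property described above can be seen in a minimal sketch. The values below are hypothetical examples and are not the sets used in the exercise; the symmetric difference operator is shown only as an illustration of set arithmetic.

```python
# Hypothetical example values (not the sets used in the exercise).
rock_types = {'basalt', 'granite', 'basalt', 'shale'}  # 'basalt' is repeated

n_unique = len(rock_types)              # the duplicate 'basalt' is stored once
has_granite = 'granite' in rock_types   # membership test
print(n_unique)
print(has_granite)

# {} on its own creates an empty dict, so an empty set needs set().
empty = set()
print(type(empty))
print(type({}))

# Symmetric difference: elements that are in exactly one of the two sets.
a = {1, 2}
b = {2, 3}
sym = a ^ b
print(sym)
```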

Create 2 sets:

s = {1, 2, 3} and t = {3, 4, 5}

Compute the following:
a) Union of the sets
b) Intersection of the sets
c) Difference (elements in s but not in t)
d) Difference (elements in t but not in s)
e) Add the value 6 to set s and check the output
f) Add the value 2 to set s and check the output (what happened?)

2) Built-in: Tuple

Documentation

Tuples hold comma-separated values. A tuple acts like a list except that you can't change it (it is immutable).
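
As a rough illustration of immutability, the sketch below tries to overwrite one element of a tuple. The variable names and coordinate values are hypothetical and are not part of the exercise.

```python
# Hypothetical coordinate pair; the parentheses around a tuple are optional.
coords = 34.07, -106.92

# Unpacking assigns each element to its own name.
lat, lon = coords
print(lat)

# Assigning to an element of a tuple raises a TypeError.
try:
    coords[0] = 35.0
except TypeError as err:
    message = str(err)
    print(message)
```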

Create a tuple:

t2 = 'c','d','e'

a) Add 'f' to t2 (describe what happens)
b) If adding the value directly caused a problem, how else could you add the value to t2?

3) Built-in: Dict

Documentation

We introduced dictionaries in Lab 4. Dictionaries are mutable (meaning you can change them). Importantly, in Python 2.7 they are also unordered (since Python 3.7, dictionaries preserve insertion order). They are built from groups of key:value pairs (the Python version of a hash table, for those of you familiar with them). You can access entries by the key (which needs to be unique).
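
A minimal sketch of key-based access follows. The dictionary is an illustrative example (Mohs hardness values) and is not the one used in the exercise; iterating over sorted keys is just one way to get a predictable order.

```python
# Example dictionary: mineral name -> Mohs hardness.
hardness = {'talc': 1, 'quartz': 7, 'diamond': 10}

print(hardness['quartz'])        # look up a value by its key
print(hardness.get('gypsum'))    # .get() returns None for a missing key
print('talc' in hardness)        # membership tests check the keys

# Keys are unique: assigning to an existing key overwrites its value.
hardness['quartz'] = 7.0
print(len(hardness))

# Sorting the keys gives a predictable order on any Python version.
for mineral in sorted(hardness):
    print('{}: {}'.format(mineral, hardness[mineral]))
```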

Create a dictionary:

topics = {'lab1':'flowchart','lab2':'data types','lab3':'flow control','lab4':'functions'}

a) Report the topic covered in lab2.
b) Write a for loop that will print out all the keys and values, followed by a marker line "---". Describe the order of the output.
c) Update the dictionary by adding the topic for lab5 - matplotlib.

4) NumPy Array

For this, you'll need to access tools within the NumPy library, so please import the package at this point. (import numpy as np)

Documentation

Create a NumPy array:

n1 = np.array([[4,5,6,7],[8,9,10,11],[12,13,14,15],[16,17,18,19]])

a) Use slicing to extract the values in the even-indexed rows and odd-indexed columns (the result should be [[5,7],[13,15]]). This must be done in one step, so you cannot use print n1[0][1], print n1[0][3], etc.
b) Compute the mean of all the elements in your array, then the mean of each column and the mean of each row. As in part a, each must be done in one step. The column means and the row means should each be an array of 4 numbers.
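
The slicing and per-axis ideas in this section can be sketched on a different, smaller array. The array a below is a hypothetical example, and sum is used in place of mean purely for illustration.

```python
import numpy as np

# Hypothetical 3x4 example array (not the n1 array from the exercise):
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
a = np.arange(12).reshape(3, 4)

# Each axis takes its own start:stop:step slice, separated by a comma.
every_other_col = a[:, ::2]   # all rows, columns 0 and 2
lower_left = a[1:, :2]        # rows 1 and 2, columns 0 and 1
print(every_other_col)
print(lower_left)

# Reductions take an axis argument:
# axis=0 collapses the rows (one result per column),
# axis=1 collapses the columns (one result per row).
col_sums = a.sum(axis=0)
row_sums = a.sum(axis=1)
print(col_sums)
print(row_sums)
```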

5) Pandas: Series

First, import the pandas library as pd into your script (import pandas as pd). While you are at it, also import matplotlib.pyplot as plt.

Documentation

Let's create a series that we can query using a variety of tools. First, define a dictionary called foods that contains the food as the key and the country for which it is a popular food as the value.

>> foods = {'chile': 'Mexico',
'potato': 'Ireland',
'sushi': 'Japan',
'poutine': 'Canada'}

Then create a pandas series using that dictionary:

>> f = pd.Series(foods)
>> f

Using the loc and iloc attributes (use both of them for each problem), query your series to find:
a) The 3rd country in the series
b) Which country has poutine as its national food
c) The first two countries (use indexing)
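
The difference between loc (label-based) and iloc (position-based) can be sketched on a separate series. The series depths and its values are hypothetical; an explicit index is passed so the order does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical series with an explicit index (not the foods series).
depths = pd.Series([10.5, 22.0, 35.2], index=['well_a', 'well_b', 'well_c'])

by_label = depths.loc['well_b']      # loc selects by index label
by_position = depths.iloc[1]         # iloc selects by integer position
print(by_label)
print(by_position)

# Position slices exclude the stop value; label slices include it.
first_two_pos = depths.iloc[0:2]
first_two_lab = depths.loc['well_a':'well_b']
print(first_two_pos)
print(first_two_lab)
```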

Next, let's create a series with a set of numbers:
>> s = pd.Series([100.00, 120.00, 101.00, 3.00])
d) Write code that sums the values in the series. There are several ways to do it, such as a basic for loop or NumPy tools.

6) Pandas: DataFrames

Documentation

We'll work on setting up a DataFrame using a couple of series objects built for some local seismic and GPS stations.

station_1 = pd.Series({'Name': 'SBY', 'Lat': 33.975, 'Lon': -107.181})
station_2 = pd.Series({'Name': 'LEM', 'Lat': 34.166, 'Lon': -106.972})
station_3 = pd.Series({'Name': 'SC01', 'Lat': 34.068, 'Lon': -106.967})

df = pd.DataFrame([station_1, station_2, station_3], index=['Seismic', 'Seismic', 'GPS'])
print(df.head())

Next, take a look at the .loc and .iloc attributes for ways to extract certain data from the DataFrame.
a) Query the DataFrame to list the seismic stations contained in the DataFrame.
b) Next query the DataFrame to just get the seismic station names.
c) Finally, query the DataFrame to get the latitudes for all the stations.
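
As a rough sketch of how .loc and .iloc behave when index labels repeat, consider the small DataFrame below. The table rocks, its column names, and its contents are hypothetical illustrations, not the station data from the exercise.

```python
import pandas as pd

# Hypothetical table with a repeated index label.
rocks = pd.DataFrame(
    {'Rock': ['granite', 'basalt', 'sandstone'],
     'Grain': ['coarse', 'fine', 'medium']},
    index=['igneous', 'igneous', 'sedimentary'])

# A repeated label returns every matching row as a DataFrame.
igneous_rows = rocks.loc['igneous']
print(igneous_rows)

# Adding a column label narrows the result to one column (a Series).
igneous_names = rocks.loc['igneous', 'Rock']
print(igneous_names)

# iloc ignores the labels and uses integer positions instead.
first_row = rocks.iloc[0]        # first row, as a Series
first_col = rocks.iloc[:, 0]     # first column, all rows
print(first_row)
print(first_col)
```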

Create a DataFrame that contains 5000 random numbers, split into 5 columns of 1000 numbers each. To generate the numbers, you will need the random library (import random); see its documentation for the available functions.
d) Report the mean value in each column
e) Plot the 2nd and 4th columns.
f) In a separate plot, plot only the negative values of the 1st column.
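
One possible way to build a table of random numbers is sketched below at a much smaller size. The choice of random.gauss, the fixed seed, the table size, and the column names col0 to col2 are all assumptions made for illustration; the lab does not specify which random function to use.

```python
import random
import pandas as pd

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Assumed sizes for illustration only (the exercise asks for 1000 x 5).
n_rows, n_cols = 5, 3

# Build one list of normally distributed numbers per (hypothetical) column name.
data = {}
for j in range(n_cols):
    data['col{}'.format(j)] = [random.gauss(0, 1) for _ in range(n_rows)]

small = pd.DataFrame(data)
print(small.shape)

# A boolean mask keeps only the rows where the condition is True.
positives = small['col0'][small['col0'] > 0]
print(positives)
```

Note that random.random() returns values in the range [0, 1) and so never produces negative numbers, whereas random.gauss(0, 1) draws from a normal distribution centered on zero.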

7) Redesign

After finishing all this, add a comment at the end of your program explaining how you could use one of these new data types (e.g., a Pandas Series) in an exercise from a previous lab. Rewrite the exercise and any code you would use based on this new data type. You don't need to test any code; keep your writing commented out. We're interested in a sketch of the solution.

rg <at> nmt <dot> edu | Last modified: October 02 2017 03:42.