Exercise 2.4

Objectives:

  • Experiment with various sequence operations.

  • Learn how to use iteration features.

  • Continued practice with data handling.

Files Created: None.

Files Modified: report.py

(a) Counting

Try some basic counting examples:

>>> for n in xrange(10):            # Count 0 ... 9
       print n,

0 1 2 3 4 5 6 7 8 9
>>> for n in xrange(10,0,-1):       # Count 10 ... 1
       print n,

10 9 8 7 6 5 4 3 2 1
>>> for n in xrange(0,10,2):        # Count 0, 2, ... 8
       print n,

0 2 4 6 8
>>>

(b) More sequence operations

Interactively experiment with some of the sequence reduction operations:

>>> data = [4, 9, 1, 25, 16, 100, 49]
>>> min(data)
1
>>> max(data)
100
>>> sum(data)
204
>>>

Try looping over the data:

>>> for x in data:
       print x

4
9
...
>>> for n,x in enumerate(data):
       print n,x

0 4
1 9
2 1
...
>>>

An Antipattern

Sometimes the for statement, len(), and range() get used by novices in some kind of horrible code fragment that looks like it emerged from the depths of a rusty C program. For example:

>>> for n in range(len(data)):
        print data[n]

4
9
1
...
>>>

Don’t do that! Not only does reading it make everyone’s eyes bleed, it’s inefficient with memory (range() builds a complete list of indices first) and it runs slower (every data[n] is an extra index lookup). Just use a normal for loop if you want to iterate over data. Use enumerate() if you happen to need the index for some reason.

(c) Another enumerate() example

The file Data/missing.csv contains data for a stock portfolio, but has some rows with missing data. Try the following code sample that loops over all of the lines of the file, but prints a warning message for all bad rows along with the associated row number.

>>> import csv
>>> f = open('Data/missing.csv')
>>> f_csv = csv.reader(f)
>>> headers = next(f_csv)
>>> for rowno, row in enumerate(f_csv, start=1):
        try:
            name = row[0]
            shares = int(row[1])
            price = float(row[2])
        except ValueError:
            print "Row %d: Couldn't convert: %s" % (rowno, row)

Row 4: Couldn't convert: ['MSFT', '', '51.23']
Row 7: Couldn't convert: ['IBM', '', '70.44']
>>>

In this example, the start=1 argument to enumerate() sets the starting value for the count, so the first data row is reported as row number 1. If you don’t specify a starting value, enumerate() starts counting from 0.
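As a quick check of the two starting values, here is a minimal sketch that numbers the same list both ways. The list rows is made-up sample data, and list() is only there so the result displays the same way on Python 2 and Python 3.

```python
rows = ['AA', 'IBM', 'CAT']                   # hypothetical sample data

from_zero = list(enumerate(rows))             # no start given: counting begins at 0
from_one = list(enumerate(rows, start=1))     # start=1: counting begins at 1

print(from_zero)    # [(0, 'AA'), (1, 'IBM'), (2, 'CAT')]
print(from_one)     # [(1, 'AA'), (2, 'IBM'), (3, 'CAT')]
```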

(d) Using the zip() function

In the file Data/portfolio.csv, the first line contains column headers. In all previous code, we’ve simply been discarding them. For example:

>>> f = open('Data/portfolio.csv')
>>> f_csv = csv.reader(f)
>>> headers = next(f_csv)
>>> headers
['name', 'shares', 'price']
>>>

However, what if you could use the headers for something useful? This is where the zip() function enters the picture. First try this to pair the file headers with a row of data:

>>> row = next(f_csv)
>>> row
['AA', '100', '32.20']
>>> zip(headers, row)
[('name', 'AA'), ('shares', '100'), ('price', '32.20')]
>>>

Notice how zip() paired the column headers with the column values. This pairing is just an intermediate step to building a dictionary. Now try this:

>>> record = dict(zip(headers, row))
>>> record
{'price': '32.20', 'name': 'AA', 'shares': '100'}
>>>

This transformation is one of the most useful tricks to know about when processing a lot of data files. For example, suppose you wanted to make your report program work with various input files, but without regard for the actual column number where the name, shares, and price appear. Modify the read_portfolio() function in report.py so that it looks like this:

# report.py
import csv

def read_portfolio(filename):
    '''
    Read a stock portfolio file into a list of dictionaries with keys
    name, shares, and price.
    '''
    portfolio = []
    f = open(filename)
    f_csv = csv.reader(f)
    headers = next(f_csv)

    for row in f_csv:
        record = dict(zip(headers, row))         # Turn the row into a dict
        stock = {                                # Pick out fields of interest
            'name': record['name'],
            'shares' : int(record['shares']),
            'price' : float(record['price'])
            }
        portfolio.append(stock)
    f.close()
    return portfolio

Now, try your function on a completely different data file, Data/portfoliodate.csv, which looks like this:

name,date,time,shares,price
"AA","6/11/2007","9:50am",100,32.20
"IBM","5/13/2007","4:20pm",50,91.10
"CAT","9/23/2006","1:30pm",150,83.44
"MSFT","5/17/2007","10:30am",200,51.23
"GE","2/1/2006","10:45am",95,40.37
"MSFT","10/31/2006","12:05pm",50,65.10
"IBM","7/9/2006","3:15pm",100,70.44

>>> portfolio = read_portfolio('Data/portfoliodate.csv')
>>> portfolio
[{'price': 32.2, 'name': 'AA', 'shares': 100}, {'price': 91.1, 'name': 'IBM', 'shares': 50}, {'price': 83.44, 'name': 'CAT', 'shares': 150}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}, {'price': 40.37, 'name': 'GE', 'shares': 95}, {'price': 65.1, 'name': 'MSFT', 'shares': 50}, {'price': 70.44, 'name': 'IBM', 'shares': 100}]
>>>

Modify your report.py program so that it reads data from Data/portfoliodate.csv instead of Data/portfolio.csv. Amazingly, you’ll find that your program still works even though the data file has a completely different column format than before. That’s cool!

Discussion

The change made here is subtle, but significant. Instead of read_portfolio() being hardcoded to read a single fixed file format, the new version simply reads any CSV file and picks the values of interest out of it. As long as the file has the required columns, the code will work.
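To see that only the column names matter, not their positions, here is a small sketch that applies the same dict(zip(headers, row)) step to two differently ordered layouts. The helper rows_to_records() and both sample layouts are hypothetical and are not part of report.py. Lists of text lines stand in for open files, which csv.reader() accepts.

```python
import csv

def rows_to_records(lines):
    '''
    Turn CSV text lines (header line first) into a list of dictionaries
    with keys name, shares, and price. Hypothetical helper for illustration.
    '''
    f_csv = csv.reader(lines)
    headers = next(f_csv)
    records = []
    for row in f_csv:
        record = dict(zip(headers, row))      # Pair each header with its value
        records.append({
            'name': record['name'],
            'shares': int(record['shares']),
            'price': float(record['price']),
        })
    return records

# Same data, two made-up column layouts
layout_a = ['name,shares,price', '"AA",100,32.20']
layout_b = ['price,date,name,shares', '32.20,"6/11/2007","AA",100']

records_a = rows_to_records(layout_a)
records_b = rows_to_records(layout_b)
print(records_a == records_b)    # True
```

If a required column is absent from the header line, the lookup (for example record['shares']) raises KeyError, which is the "required columns" condition mentioned above.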

(e) Inverting a dictionary

A dictionary maps keys to values. For example, a dictionary of stock prices:

>>> prices = {
        'GOOG' : 490.1,
        'AA' : 23.45,
        'IBM' : 91.1,
        'MSFT' : 34.23
    }
>>>

If you use the items() method, you can get a list of (key,value) pairs:

>>> prices.items()
[('GOOG', 490.1), ('AA', 23.45), ('IBM', 91.1), ('MSFT', 34.23)]
>>>

However, what if you wanted to get a list of (value, key) pairs instead? Easy: use zip().

>>> pricelist = zip(prices.values(),prices.keys())
>>> pricelist
[(490.1, 'GOOG'), (23.45, 'AA'), (91.1, 'IBM'), (34.23, 'MSFT')]
>>>

Why would you do this? For one, it allows you to perform certain kinds of data processing on the dictionary data. For example:

>>> min(pricelist)
(23.45, 'AA')
>>> max(pricelist)
(490.1, 'GOOG')
>>> sorted(pricelist)
[(23.45, 'AA'), (34.23, 'MSFT'), (91.1, 'IBM'), (490.1, 'GOOG')]
>>>

This also illustrates an important feature of tuples. When used in comparisons, tuples are compared element-by-element starting with the first item (similar to how strings are compared character-by-character).
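A short sketch of that comparison rule follows. The tuples are made-up values, chosen so that one pair ties on price and the second element has to decide.

```python
# Tuples compare element by element, left to right.
first_differs = (23.45, 'AA') < (34.23, 'MSFT')    # 23.45 < 34.23 decides it
tie_on_price = (91.1, 'IBM') < (91.1, 'MSFT')      # prices tie, so 'IBM' < 'MSFT' decides it

# min() relies on the same rule (hypothetical data with a tied price)
lowest = min([(91.1, 'MSFT'), (91.1, 'IBM'), (490.1, 'GOOG')])

print(first_differs)    # True
print(tie_on_price)     # True
print(lowest)           # (91.1, 'IBM')
```

This is why the price goes first in each pair: min(), max(), and sorted() then order the data by price and only look at the name to break ties.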

Discussion

zip() is often used in situations like this where you need to pair up data from different places. For example, you can pair up column names with column values in order to make a dictionary of named values.

zip() is not limited to pairs. For example, you can use it with any number of input lists:

>>> a = [1, 2, 3, 4]
>>> b = ['w', 'x', 'y', 'z']
>>> c = [0.2, 0.4, 0.6, 0.8]
>>> zip(a, b, c)
[(1, 'w', 0.2), (2, 'x', 0.4), (3, 'y', 0.6), (4, 'z', 0.8)]
>>>

Also, be aware that zip() stops once the shortest input sequence is exhausted. For example:

>>> a = [1, 2, 3, 4, 5, 6]
>>> b = ['x', 'y', 'z']
>>> zip(a,b)
[(1, 'x'), (2, 'y'), (3, 'z')]
>>>
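This stopping rule has a consequence for the header-pairing trick from part (d): a row with fewer fields than there are headers produces a dictionary that is silently missing keys. A minimal sketch, using a made-up short row:

```python
headers = ['name', 'shares', 'price']
short_row = ['AA', '100']                  # hypothetical row with no price field

record = dict(zip(headers, short_row))     # zip() stops after two pairs

print(sorted(record))          # ['name', 'shares']
print('price' in record)       # False
```

Comparing len(row) with len(headers) before pairing is one simple way to notice such rows.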