Exercise 3.2
Overview
A central part of your report.py
program focuses on the reading of CSV files. For
example, the function read_portfolio()
reads a file containing rows of portfolio data and
the function read_prices()
reads a file containing rows of price data. In both of those
functions, there are a lot of low-level "fiddly" bits and similar features. For example,
they both open a file and wrap it with the csv
module and they both convert various fields into
new types. If you were doing a lot of file parsing for real, you’d
probably want to clean some of this up and make it more general purpose.
Start this exercise by creating a new file called fileparse.py
. This is where we will be
doing our work.
(a) Reading CSV Files
To start, let’s just focus on the problem of reading a CSV file into a list of dictionaries.
In the file fileparse.py
, define a simple function that looks like this:
# fileparse.py
import csv
def parse_csv(filename):
'''
Parse a CSV file into a list of records
'''
f = open(filename)
f_csv = csv.reader(f)
# Read the file headers
headers = next(f_csv)
records = []
for row in f_csv:
if not row: # Skip rows with no data
continue
record = dict(zip(headers, row))
records.append(record)
f.close()
return records
This function reads a CSV file into a list of dictionaries while
hiding the details of opening the file, wrapping it with the csv
module, ignoring blank lines, and so forth. Try it out:
>>> portfolio = parse_csv('Data/portfolio.csv')
>>> portfolio
[{'price': '32.20', 'name': 'AA', 'shares': '100'}, {'price': '91.10', 'name': 'IBM', 'shares': '50'}, {'price': '83.44', 'name': 'CAT', 'shares': '150'}, {'price': '51.23', 'name': 'MSFT', 'shares': '200'}, {'price': '40.37', 'name': 'GE', 'shares': '95'}, {'price': '65.10', 'name': 'MSFT', 'shares': '50'}, {'price': '70.44', 'name': 'IBM', 'shares': '100'}]
>>>
This is great except that you can’t do any kind of useful calculation with the data because everything is represented as a string. We’ll fix this shortly, but let’s keep building on it.
(b) Building a Column Selector
In many cases, you’re only interested in selected columns from a CSV file, not all of the data. Modify the
parse_csv()
function so that it optionally allows user-specified columns to be picked out as follows:
>>> # Read all of the data
>>> portfolio = parse_csv('Data/portfolio.csv')
>>> portfolio
[{'price': '32.20', 'name': 'AA', 'shares': '100'}, {'price': '91.10', 'name': 'IBM', 'shares': '50'}, {'price': '83.44', 'name': 'CAT', 'shares': '150'}, {'price': '51.23', 'name': 'MSFT', 'shares': '200'}, {'price': '40.37', 'name': 'GE', 'shares': '95'}, {'price': '65.10', 'name': 'MSFT', 'shares': '50'}, {'price': '70.44', 'name': 'IBM', 'shares': '100'}]
>>> # Read some of the data
>>> shares_held = parse_csv('Data/portfolio.csv', select=['name','shares'])
>>> shares_held
[{'name': 'AA', 'shares': '100'}, {'name': 'IBM', 'shares': '50'}, {'name': 'CAT', 'shares': '150'}, {'name': 'MSFT', 'shares': '200'}, {'name': 'GE', 'shares': '95'}, {'name': 'MSFT', 'shares': '50'}, {'name': 'IBM', 'shares': '100'}]
>>>
An example of a column selector was given in Exercise 2.5. However, here’s one way to do it:
# fileparse.py
import csv
def parse_csv(filename, select=None):
'''
Parse a CSV file into a list of records
'''
f = open(filename)
f_csv = csv.reader(f)
# Read the file headers
headers = next(f_csv)
# If a column selector was given, find indices of the specified columns.
# Also set the output columns used for the resulting dictionaries
if select:
indices = [headers.index(colname) for colname in select]
output_columns = select
else:
indices = []
output_columns = headers
records = []
for row in f_csv:
if not row: # Skip rows with no data
continue
# Filter the row if specific columns were selected
if indices:
row = [ row[index] for index in indices ]
# Make a dictionary
record = dict(zip(output_columns, row))
records.append(record)
f.close()
return records
(c) Performing Type Conversion
Modify the parse_csv()
function so that it optionally allows type-conversions to be applied to the
returned data. For example:
>>> portfolio = parse_csv('Data/portfolio.csv', types=[str, int, float])
>>> portfolio
[{'price': 32.2, 'name': 'AA', 'shares': 100}, {'price': 91.1, 'name': 'IBM', 'shares': 50}, {'price': 83.44, 'name': 'CAT', 'shares': 150}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}, {'price': 40.37, 'name': 'GE', 'shares': 95}, {'price': 65.1, 'name': 'MSFT', 'shares': 50}, {'price': 70.44, 'name': 'IBM', 'shares': 100}]
>>> shares_held = parse_csv('Data/portfolio.csv', select=['name', 'shares'], types=[str, int])
>>> shares_held
[{'name': 'AA', 'shares': 100}, {'name': 'IBM', 'shares': 50}, {'name': 'CAT', 'shares': 150}, {'name': 'MSFT', 'shares': 200}, {'name': 'GE', 'shares': 95}, {'name': 'MSFT', 'shares': 50}, {'name': 'IBM', 'shares': 100}]
>>>
You already explored this in Exercise 2.6. You you just need to insert the following fragment of code into your solution:
...
if types:
row = [func(val) for func, val in zip(types, row) ]
...
(d) Working without headers
Some CSV files don’t include any header information. For example, the file Data/prices.csv
looks like this:
"AA",9.22
"AXP",24.85
"BA",44.85
"BAC",11.27
...
Modify the parse_csv()
function so that it can work with such files by creating a list of tuples instead.
For example:
>>> prices = parse_csv('Data/prices.csv', types=[str,float], has_headers=False)
>>> prices
[('AA', 9.22), ('AXP', 24.85), ('BA', 44.85), ('BAC', 11.27), ('C', 3.72), ('CAT', 35.46), ('CVX', 66.67), ('DD', 28.47), ('DIS', 24.22), ('GE', 13.48), ('GM', 0.75), ('HD', 23.16), ('HPQ', 34.35), ('IBM', 106.28), ('INTC', 15.72), ('JNJ', 55.16), ('JPM', 36.9), ('KFT', 26.11), ('KO', 49.16), ('MCD', 58.99), ('MMM', 57.1), ('MRK', 27.58), ('MSFT', 20.89), ('PFE', 15.19), ('PG', 51.94), ('T', 24.79), ('UTX', 52.61), ('VZ', 29.26), ('WMT', 49.74), ('XOM', 69.35)]
>>>
To make this change, you’ll need to modify the code so that the first line of data isn’t interpreted as a header line. Also, you’ll need to make sure you don’t create dictionaries as there are no longer any column names to use for keys.
(e) Picking a different column delimiter
Although CSV files are pretty common, it’s also possible that you could encounter a file
that uses a different column separator such as a tab or space. For example, the file
Data/portfolio.dat
looks like this:
name shares price
"AA" 100 32.20
"IBM" 50 91.10
"CAT" 150 83.44
"MSFT" 200 51.23
"GE" 95 40.37
"MSFT" 50 65.10
"IBM" 100 70.44
The csv.reader()
function allows a different delimiter to be given as follows:
f_csv = csv.reader(f, delimiter=' ')
Modify your parse_csv()
function so that it also allows the delimiter to be changed. For example:
>>> portfolio = parse_csv('Data/portfolio.dat', types=[str, int, float], delimiter=' ')
>>> portfolio
[{'price': '32.20', 'name': 'AA', 'shares': '100'}, {'price': '91.10', 'name': 'IBM', 'shares': '50'}, {'price': '83.44', 'name': 'CAT', 'shares': '150'}, {'price': '51.23', 'name': 'MSFT', 'shares': '200'}, {'price': '40.37', 'name': 'GE', 'shares': '95'}, {'price': '65.10', 'name': 'MSFT', 'shares': '50'}, {'price': '70.44', 'name': 'IBM', 'shares': '100'}]
>>>