Exercise 4.4
(a) Simple XML Parsing
The file Data/allroutes.xml contains an XML snapshot of the latitude and longitude of every GPS-equipped city bus in Chicago (the data is available as a real-time download). The file looks something like this:
<?xml version="1.0"?>
<buses>
  <bus>
    <id>7574</id>
    <route>147</route>
    <color>#3300ff</color>
    <revenue>true</revenue>
    <direction>North Bound</direction>
    <latitude>41.925682067871094</latitude>
    <longitude>-87.63092803955078</longitude>
    <pattern>2499</pattern>
    <patternDirection>North Bound</patternDirection>
    <run>P675</run>
    <finalStop><![CDATA[Paulina & Howard Terminal]]></finalStop>
    <operator>42493</operator>
  </bus>
  <bus>
    <id>6842</id>
    <route>81</route>
    <color>#996633</color>
    <revenue>true</revenue>
    <direction>East Bound</direction>
    <latitude>41.96847915649414</latitude>
    <longitude>-87.71509087085724</longitude>
    <pattern>1649</pattern>
    <patternDirection>East Bound</patternDirection>
    <run>F259</run>
    <finalStop><![CDATA[Marine Drive & Wilson]]></finalStop>
    <operator>40641</operator>
  </bus>
  ...
</buses>
Let’s use the ElementTree library to find out where all of the Route-22 buses are currently located, along with their direction.
>>> from xml.etree.ElementTree import parse
>>> buses = parse('Data/allroutes.xml')
>>> for bus in buses.findall('bus'):
        if bus.findtext('route') == '22':
            lat = bus.findtext('latitude')
            lon = bus.findtext('longitude')
            direction = bus.findtext('direction')
            print lat, lon, direction

41.99031664530436 -87.67012786865234 South Bound
41.87956511974335 -87.63079524040222 South Bound
41.880481123924255 -87.62948191165924 North Bound
... more output ...
>>>
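The same lookup can be sketched as a self-contained script. This is only an illustration, written for Python 3 (so print is a function), and it parses a small inline document instead of Data/allroutes.xml. The SAMPLE data and the find_route() helper are hypothetical names introduced here, and the bus values are made up.

```python
import xml.etree.ElementTree as ET

# Made-up two-bus document in the same shape as the listing above
# (not taken from Data/allroutes.xml).
SAMPLE = """<?xml version="1.0"?>
<buses>
  <bus>
    <id>1</id>
    <route>22</route>
    <direction>South Bound</direction>
    <latitude>41.99</latitude>
    <longitude>-87.67</longitude>
  </bus>
  <bus>
    <id>2</id>
    <route>81</route>
    <direction>East Bound</direction>
    <latitude>41.96</latitude>
    <longitude>-87.71</longitude>
  </bus>
</buses>
"""

def find_route(root, route):
    """Return (latitude, longitude, direction) for each bus on the given route."""
    found = []
    for bus in root.findall('bus'):
        # findtext() returns the element text as a string, so compare to a string
        if bus.findtext('route') == route:
            found.append((float(bus.findtext('latitude')),
                          float(bus.findtext('longitude')),
                          bus.findtext('direction')))
    return found

root = ET.fromstring(SAMPLE)          # parse from a string instead of a file
positions = find_route(root, '22')
for lat, lon, direction in positions:
    print(lat, lon, direction)
```

Note that the route is compared as the string '22', since element text is never converted to a number automatically.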
(b) Encoding/Decoding JSON
JSON is a common data encoding used in distributed systems and web applications. It is relatively easy to work with JSON if your Python program uses standard data structures such as lists and dictionaries. Just use the json module to perform the necessary encoding and decoding.
To illustrate, use your fileparse module to read some data:
>>> import fileparse
>>> portfolio = fileparse.parse_csv('Data/portfolio.csv', types=[str,int,float])
>>> portfolio
[{'price': 32.2, 'name': 'AA', 'shares': 100}, {'price': 91.1, 'name': 'IBM', 'shares': 50}, {'price': 83.44, 'name': 'CAT', 'shares': 150}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}, {'price': 40.37, 'name': 'GE', 'shares': 95}, {'price': 65.1, 'name': 'MSFT', 'shares': 50}, {'price': 70.44, 'name': 'IBM', 'shares': 100}]
>>>
Now, encode the data as JSON:
>>> import json
>>> encoded = json.dumps(portfolio)
>>> encoded
'[{"price": 32.2, "name": "AA", "shares": 100}, {"price": 91.1, "name": "IBM", "shares": 50}, {"price": 83.44, "name": "CAT", "shares": 150}, {"price": 51.23, "name": "MSFT", "shares": 200}, {"price": 40.37, "name": "GE", "shares": 95}, {"price": 65.1, "name": "MSFT", "shares": 50}, {"price": 70.44, "name": "IBM", "shares": 100}]'
>>>
Once you have the JSON encoding, you can store it in files, send it somewhere, store it in a database, or perform any number of similar operations. For example:
>>> f = open('data.json', 'w')
>>> f.write(encoded)
>>> f.close()
To decode JSON and turn it back into Python data, just use json.loads(). For example:
>>> data = json.loads(encoded)
>>> data
[{u'price': 32.2, u'name': u'AA', u'shares': 100}, {u'price': 91.1, u'name': u'IBM', u'shares': 50}, {u'price': 83.44, u'name': u'CAT', u'shares': 150}, {u'price': 51.23, u'name': u'MSFT', u'shares': 200}, {u'price': 40.37, u'name': u'GE', u'shares': 95}, {u'price': 65.1, u'name': u'MSFT', u'shares': 50}, {u'price': 70.44, u'name': u'IBM', u'shares': 100}]
>>>
Note that when JSON is decoded, all strings are Unicode (e.g., the strings u'name', u'shares', and u'price' above).
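As a rough sketch of the full round trip through a file, the json module also provides dump() and load(), which write to and read from file objects directly. The example below is written for Python 3, uses a shortened two-entry portfolio, and writes to a temporary directory (an assumption made here so that no real data.json is overwritten).

```python
import json
import os
import tempfile

# Shortened portfolio in the same shape as the data above
portfolio = [
    {'name': 'AA', 'shares': 100, 'price': 32.2},
    {'name': 'IBM', 'shares': 50, 'price': 91.1},
]

# Temporary location chosen for this sketch only
path = os.path.join(tempfile.mkdtemp(), 'data.json')

# Encode straight to a file; the with-block closes the file for us
with open(path, 'w') as f:
    json.dump(portfolio, f)

# Decode the file back into Python lists and dictionaries
with open(path) as f:
    restored = json.load(f)

print(restored == portfolio)
```

In Python 3 the decoded strings are ordinary str objects, so no u'' prefix appears when the data is displayed.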
(c) Reading Binary-Encoded Records
The file Data/portfolio.bin contains portfolio data in a packed binary format. Each record is stored as follows:
Bytes    Size       Description
--------------------------------------------------------------
0-7      8 bytes    Name of stock (string)
8-11     4 bytes    Number of shares (32-bit integer, little endian)
12-15    4 bytes    Price (32-bit float, little endian)
Let’s try reading some of this data:
>>> import struct
>>> f = open('Data/portfolio.bin', 'rb')
>>> rawrecord = f.read(16) # Get a raw 16-byte record
>>> rawrecord
'AA\x00\x00\x00\x00\x00\x00d\x00\x00\x00\xcd\xcc\x00B'
>>> name,shares,price = struct.unpack('<8sif', rawrecord)
>>> name
'AA\x00\x00\x00\x00\x00\x00'
>>> shares
100
>>> price
32.20000076293945
>>>
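One way to check that the format string '<8sif' used above matches the record layout is to ask struct for its size and to pack a record by hand. This is a small sketch for Python 3, where packed data is a bytes object; RECORD_FORMAT is a name introduced here for illustration.

```python
import struct

# '<' little endian, '8s' 8-byte string, 'i' 32-bit int, 'f' 32-bit float
RECORD_FORMAT = '<8sif'

size = struct.calcsize(RECORD_FORMAT)
print(size)        # 8 + 4 + 4 bytes

# Packing pads the 8-byte name field with null bytes
raw = struct.pack(RECORD_FORMAT, b'AA', 100, 32.2)
print(raw)

# Unpacking reverses the process; the price comes back as the nearest
# 32-bit float, which is why it is not exactly 32.2
name, shares, price = struct.unpack(RECORD_FORMAT, raw)
print(name, shares, price)
```

The packed bytes are the same 16 bytes that were read from the start of Data/portfolio.bin in the session above.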
Let’s strip the padding off of the name and make a dictionary:
>>> name = name.strip('\x00')
>>> name
'AA'
>>> s = {'name': name, 'shares': shares, 'price': price}
>>> s
{'price': 32.20000076293945, 'name': 'AA', 'shares': 100}
>>>
To read the rest of the file, keep reading 16-byte chunks and decoding each one as shown. For example:
>>> while True:
        rawrecord = f.read(16)
        if not rawrecord:
            break
        name, shares, price = struct.unpack('<8sif', rawrecord)
        name = name.strip('\x00')
        print name, shares, price

IBM 50 91.0999984741
CAT 150 83.4400024414
MSFT 200 51.2299995422
GE 95 40.3699989319
MSFT 50 65.0999984741
IBM 100 70.4400024414
>>>
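The reading loop could be wrapped in a function that returns a list of dictionaries, similar to what parse_csv() produces. The following is a minimal sketch for Python 3, not part of the exercise files: read_records() and RECORD are hypothetical names, the name field is assumed to be ASCII, and the demo reads from an in-memory buffer rather than Data/portfolio.bin.

```python
import io
import struct

# Precompiled record layout: 8-byte name, 32-bit int, 32-bit float, little endian
RECORD = struct.Struct('<8sif')

def read_records(f):
    """Read fixed-size records from a binary file object into a list of dicts."""
    records = []
    while True:
        raw = f.read(RECORD.size)
        if not raw:
            break                      # clean end of file
        if len(raw) < RECORD.size:
            raise ValueError('truncated record')
        name, shares, price = RECORD.unpack(raw)
        records.append({
            # strip the null padding, then decode bytes to text (ASCII assumed)
            'name': name.rstrip(b'\x00').decode('ascii'),
            'shares': shares,
            'price': price,
        })
    return records

# In-memory stand-in for a file opened with mode 'rb'
sample = RECORD.pack(b'AA', 100, 32.2) + RECORD.pack(b'IBM', 50, 91.1)
records = read_records(io.BytesIO(sample))
for r in records:
    print(r['name'], r['shares'], r['price'])
```

With a real file, the call would look like read_records(open('Data/portfolio.bin', 'rb')), ideally inside a with-block so the file is closed afterwards.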