Exercise 4.4

Objectives:

  • Simple XML parsing.

  • Simple JSON encoding.

  • Simple binary data handling.

Files Created: None

Files Modified: None

(a) Simple XML Parsing

The file Data/allroutes.xml is an XML document containing a snapshot of the latitude and longitude of every GPS-equipped city bus in Chicago (the same data is available as a real-time download). The file looks something like this:

<?xml version="1.0"?>
<buses>
  <bus>
    <id>7574</id>
    <route>147</route>
    <color>#3300ff</color>
    <revenue>true</revenue>
    <direction>North Bound</direction>
    <latitude>41.925682067871094</latitude>
    <longitude>-87.63092803955078</longitude>
    <pattern>2499</pattern>
    <patternDirection>North Bound</patternDirection>
    <run>P675</run>
    <finalStop><![CDATA[Paulina & Howard Terminal]]></finalStop>
    <operator>42493</operator>
  </bus>
  <bus>
    <id>6842</id>
    <route>81</route>
    <color>#996633</color>
    <revenue>true</revenue>
    <direction>East Bound</direction>
    <latitude>41.96847915649414</latitude>
    <longitude>-87.71509087085724</longitude>
    <pattern>1649</pattern>
    <patternDirection>East Bound</patternDirection>
    <run>F259</run>
    <finalStop><![CDATA[Marine Drive & Wilson]]></finalStop>
    <operator>40641</operator>
  </bus>
  ...
</buses>

Let’s use the ElementTree library to find out where all of the Route-22 buses are currently located along with their direction.

>>> from xml.etree.ElementTree import parse
>>> buses = parse('Data/allroutes.xml')
>>> for bus in buses.findall('bus'):
        if bus.findtext('route') == '22':
            lat = bus.findtext('latitude')
            lon = bus.findtext('longitude')
            direction = bus.findtext('direction')
            print lat, lon, direction

41.99031664530436 -87.67012786865234 South Bound
41.87956511974335 -87.63079524040222 South Bound
41.880481123924255 -87.62948191165924 North Bound
... more output ...
>>>
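
The same search can be wrapped in a small function. The following is only a sketch: the function name find_route and the inline SAMPLE document are hypothetical stand-ins (a cut-down version of Data/allroutes.xml), and the positions are converted to floats, which the session above does not do.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical two-bus sample in the same shape as Data/allroutes.xml
SAMPLE = """<buses>
  <bus>
    <id>7574</id>
    <route>147</route>
    <direction>North Bound</direction>
    <latitude>41.925682067871094</latitude>
    <longitude>-87.63092803955078</longitude>
  </bus>
  <bus>
    <id>1000</id>
    <route>22</route>
    <direction>South Bound</direction>
    <latitude>41.99031664530436</latitude>
    <longitude>-87.67012786865234</longitude>
  </bus>
</buses>"""

def find_route(root, route):
    """Return (latitude, longitude, direction) tuples for buses on a route."""
    found = []
    for bus in root.findall('bus'):
        # findtext() returns the element text as a string, so compare to a string
        if bus.findtext('route') == route:
            lat = float(bus.findtext('latitude'))
            lon = float(bus.findtext('longitude'))
            found.append((lat, lon, bus.findtext('direction')))
    return found

root = fromstring(SAMPLE)          # parse from a string instead of a file
positions = find_route(root, '22')
print(positions)
```

To run this against the real file, replace fromstring(SAMPLE) with parse('Data/allroutes.xml').getroot().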

(b) Encoding/Decoding JSON

JSON is a common data encoding used in distributed systems and web applications. Working with JSON is straightforward if your Python program uses standard data structures such as lists and dictionaries: the json module performs the necessary encoding and decoding. To illustrate, use your fileparse module to read some data:

>>> import fileparse
>>> portfolio = fileparse.parse_csv('Data/portfolio.csv', types=[str,int,float])
>>> portfolio
[{'price': 32.2, 'name': 'AA', 'shares': 100}, {'price': 91.1, 'name': 'IBM', 'shares': 50}, {'price': 83.44, 'name': 'CAT', 'shares': 150}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}, {'price': 40.37, 'name': 'GE', 'shares': 95}, {'price': 65.1, 'name': 'MSFT', 'shares': 50}, {'price': 70.44, 'name': 'IBM', 'shares': 100}]
>>>

Now, encode the data as JSON:

>>> import json
>>> encoded = json.dumps(portfolio)
>>> encoded
'[{"price": 32.2, "name": "AA", "shares": 100}, {"price": 91.1, "name": "IBM", "shares": 50}, {"price": 83.44, "name": "CAT", "shares": 150}, {"price": 51.23, "name": "MSFT", "shares": 200}, {"price": 40.37, "name": "GE", "shares": 95}, {"price": 65.1, "name": "MSFT", "shares": 50}, {"price": 70.44, "name": "IBM", "shares": 100}]'
>>>
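
Lists and dictionaries map directly onto JSON, but a few other Python values are converted along the way. The sketch below uses a made-up dictionary (not part of the portfolio data) to show that a tuple comes back as a list, and that None and True are written as null and true.

```python
import json

# Hypothetical record chosen only to show the type conversions
record = {'point': (1, 2), 'missing': None, 'active': True}

# sort_keys=True makes the key order of the output predictable
encoded = json.dumps(record, sort_keys=True)
decoded = json.loads(encoded)

print(encoded)
print(decoded['point'])    # a list, not a tuple
```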

Once you have the JSON encoding, you can write it to a file, send it over a network, store it in a database, or perform any number of similar operations. For example:

>>> f = open('data.json', 'w')
>>> f.write(encoded)
>>> f.close()

To decode JSON and turn it back into Python data, just use json.loads(). For example:

>>> data = json.loads(encoded)
>>> data
[{u'price': 32.2, u'name': u'AA', u'shares': 100}, {u'price': 91.1, u'name': u'IBM', u'shares': 50}, {u'price': 83.44, u'name': u'CAT', u'shares': 150}, {u'price': 51.23, u'name': u'MSFT', u'shares': 200}, {u'price': 40.37, u'name': u'GE', u'shares': 95}, {u'price': 65.1, u'name': u'MSFT', u'shares': 50}, {u'price': 70.44, u'name': u'IBM', u'shares': 100}]
>>>

Note that when JSON is decoded, all strings are Unicode (e.g., the strings u'name', u'shares', and u'price' above).
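
The json module also provides json.dump() and json.load(), which work directly with file objects. As a minimal sketch, the round trip through a file might look like this. The two-entry portfolio and the temporary directory are assumptions made so that the example is self-contained.

```python
import json
import os
import tempfile

# Shortened stand-in for the portfolio list shown earlier
portfolio = [
    {'name': 'AA', 'shares': 100, 'price': 32.2},
    {'name': 'IBM', 'shares': 50, 'price': 91.1},
]

# Write into a scratch directory rather than the current one
path = os.path.join(tempfile.mkdtemp(), 'data.json')

with open(path, 'w') as f:
    json.dump(portfolio, f)        # encode straight to the file

with open(path) as f:
    restored = json.load(f)        # decode straight from the file

print(restored == portfolio)
```

The with statement closes each file automatically, so no explicit f.close() is needed.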

(c) Reading Binary-Encoded Records

The file Data/portfolio.bin contains portfolio data in a packed binary format. Each record is stored as follows:

Bytes          Size             Description
-------------------------------------------
0-7            8 bytes          Name of stock (string)
8-11           4 bytes          Number of shares (32-bit integer, little endian)
12-15          4 bytes          Price (32-bit float, little endian)

Let’s try reading some of this data:

>>> import struct
>>> f = open('Data/portfolio.bin', 'rb')
>>> rawrecord = f.read(16)      # Get a raw 16-byte record
>>> rawrecord
'AA\x00\x00\x00\x00\x00\x00d\x00\x00\x00\xcd\xcc\x00B'
>>> name,shares,price = struct.unpack('<8sif', rawrecord)
>>> name
'AA\x00\x00\x00\x00\x00\x00'
>>> shares
100
>>> price
32.20000076293945
>>>
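
The format string '<8sif' mirrors the record layout: '<' selects little-endian byte order with no alignment padding, '8s' is an 8-byte string, 'i' is a 32-bit integer, and 'f' is a 32-bit float. The sketch below checks this by packing a sample record. It also shows why the price comes back as 32.20000076293945 rather than 32.2: a 32-bit float cannot store 32.2 exactly.

```python
import struct

RECORD_FORMAT = '<8sif'   # little endian: 8-byte string, 32-bit int, 32-bit float

# With '<' there is no padding, so the size is exactly 8 + 4 + 4 bytes
record_size = struct.calcsize(RECORD_FORMAT)

# Pack a sample record; '8s' pads a short string with null bytes
raw = struct.pack(RECORD_FORMAT, b'AA', 100, 32.2)
name, shares, price = struct.unpack(RECORD_FORMAT, raw)

print(record_size)
print(repr(raw))
print(price == 32.2)      # the 32-bit float is only close to 32.2
```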

Let’s strip the null-byte padding from the name and make a dictionary:

>>> name = name.strip('\x00')
>>> name
'AA'
>>> s = {'name': name, 'shares': shares, 'price': price}
>>> s
{'price': 32.20000076293945, 'name': 'AA', 'shares': 100}
>>>

To read the rest of the file, keep reading 16-byte chunks and decoding each one as shown until f.read() returns an empty string. For example:

>>> while True:
        rawrecord = f.read(16)
        if not rawrecord:
            break
        name, shares, price = struct.unpack('<8sif', rawrecord)
        name = name.strip('\x00')
        print name, shares, price

IBM 50 91.0999984741
CAT 150 83.4400024414
MSFT 200 51.2299995422
GE 95 40.3699989319
MSFT 50 65.0999984741
IBM 100 70.4400024414
>>>
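
The reading loop can also be packaged as a generator that yields one dictionary per record. This is a sketch rather than part of the exercise: the name read_records is hypothetical, the data comes from an in-memory buffer standing in for Data/portfolio.bin, and the code stops quietly at a short trailing record instead of raising an error.

```python
import io
import struct

RECORD = struct.Struct('<8sif')    # 8-byte name, 32-bit int, 32-bit float

def read_records(f):
    """Yield one dictionary per 16-byte record from a binary file object."""
    while True:
        raw = f.read(RECORD.size)
        if len(raw) < RECORD.size:
            break                  # end of file, or a truncated final record
        name, shares, price = RECORD.unpack(raw)
        yield {
            'name': name.rstrip(b'\x00').decode('ascii'),   # drop null padding
            'shares': shares,
            'price': price,
        }

# Build an in-memory stand-in for Data/portfolio.bin with two records
sample = RECORD.pack(b'AA', 100, 32.2) + RECORD.pack(b'IBM', 50, 91.1)
records = list(read_records(io.BytesIO(sample)))

for r in records:
    print('%s %d %.2f' % (r['name'], r['shares'], r['price']))
```

With the real file, the same function could be called as read_records(open('Data/portfolio.bin', 'rb')).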