Exercise 9.2

Objectives:

  • See how to deal with file encodings with the io module.

  • Learn how to properly produce Unicode characters in HTML output.

Files Created: recipe.py

Files Modified: None

A major problem with file I/O on modern systems is dealing with different text encodings. Much text is encoded in ASCII or in one of the 8-bit character sets, but as soon as international characters are involved, you will encounter files that use multi-byte encodings such as UTF-8 or UTF-16.
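
To make the difference concrete, here is a minimal sketch that encodes one short string three ways and compares the byte counts. The sample string is chosen purely for illustration and is not one of the exercise's data files.

```python
# Illustrative sketch: the same 8-character string in three encodings.
text = u'Jalape\u00f1o'              # 8 characters, one of them non-ASCII

latin1 = text.encode('latin-1')      # 8-bit character set: one byte per character
utf8 = text.encode('utf-8')          # the n-with-tilde takes two bytes here
utf16 = text.encode('utf-16-le')     # two bytes per character (no byte-order mark)

print('characters: %d' % len(text))
print('latin-1 bytes: %d' % len(latin1))
print('utf-8 bytes: %d' % len(utf8))
print('utf-16-le bytes: %d' % len(utf16))
```

The text is identical in all three cases; only the bytes used to store it differ.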

(a) Understanding the encoding problem

Character encoding is a separate problem from the file format. For example, suppose you’re writing a program to parse a CSV file such as Data/portfolio.csv, which looks like this:

"AA",100,32.20
"IBM",50,91.10
"CAT",150,83.44
"MSFT",200,51.23
"GE",95,40.37
"MSFT",50,65.10
"IBM",100,70.44

You can read the file contents with code like this. Try it out:

>>> for line in open('Data/portfolio.csv'):
         fields = line.split(',')
         print fields

... look at the output ...
>>>

The file Data/portfolio3.csv has the same portfolio data as above, but encoded in UTF-16, a multibyte character encoding used for Unicode. What happens if you try typing the same statements as above?

>>> for line in open('Data/portfolio3.csv'):
           fields = line.split(',')
           print fields

... look at the output ...
>>>

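If you’ve done this correctly, you’ll get a garbled mess. The reason: the characters in the file aren’t encoded in ASCII, so reading the file as plain bytes exposes the raw UTF-16 encoding (two bytes per character for this data). Now, try this: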

>>> import io
>>> for line in io.open('Data/portfolio3.csv','r',encoding='utf-16'):
           fields = line.split(u',')
           print fields

... look at the output ...
>>>

Important lesson: Character encoding has nothing to do with the fact that the file is a CSV file. It is simply an extra layer underneath the file format: the bytes must be decoded into characters before the CSV structure can be parsed.
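
The two layers can be sketched as follows. This is only an illustration under stated assumptions: the helper read_fields and the temporary sample files are hypothetical and are not part of the exercise data. The same rows are written once as ASCII and once as UTF-16, and the parsing code is identical in both cases; only the encoding argument changes.

```python
import io
import os
import shutil
import tempfile

# Two sample rows in the same layout as the portfolio file (illustrative data).
rows = u'"AA",100,32.20\n"IBM",50,91.10\n'

def read_fields(path, encoding):
    """Hypothetical helper: decode the file, then split each line on commas."""
    with io.open(path, 'r', encoding=encoding) as f:
        return [line.rstrip(u'\n').split(u',') for line in f]

tmpdir = tempfile.mkdtemp()
results = {}
try:
    for enc in ('ascii', 'utf-16'):
        path = os.path.join(tmpdir, 'sample-%s.csv' % enc)
        # Layer 1: the encoding decides which bytes end up on disk.
        with io.open(path, 'w', encoding=enc) as f:
            f.write(rows)
        # Layer 2: once decoded, the CSV parsing is the same.
        results[enc] = read_fields(path, enc)
finally:
    shutil.rmtree(tmpdir)

print(results['ascii'] == results['utf-16'])
```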

One other point: when working with Unicode, you should make sure you use Unicode strings everywhere (hence the use of u',' in the split operation).
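
A minimal sketch of what can go wrong when byte strings and Unicode strings are mixed. The sample data is illustrative. The exact exception depends on the Python version, so the code catches both.

```python
# A byte string, as you would get from a file read without decoding.
raw = u'Jalape\u00f1o'.encode('utf-8')

try:
    combined = raw + u' pepper'      # mixing bytes and Unicode text
except (TypeError, UnicodeDecodeError) as exc:
    # Python 2 attempts an implicit ASCII decode and raises UnicodeDecodeError;
    # Python 3 refuses to mix the two types and raises TypeError.
    combined = None
    print('mixing failed: %s' % type(exc).__name__)

# Decoding first keeps everything as Unicode text.
safe = raw.decode('utf-8') + u' pepper'
```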

(b) Encoding Unicode Characters in HTML Output

Here is a program that is supposed to create an HTML file with a list of ingredients for something tasty:

# recipe.py

ingredients = [
    u'Avocado',
    u'Tomato',
    u'Red Onion',
    u'Jalape\u00f1o pepper',
    u'Cilantro',
    u'Sea Salt',
]

header = u'''
<html>
<body>
'''

footer = u'''
</body>
</html>
'''

f = open('output.html','w')
f.write(header)
for item in ingredients:
    f.write('<li>%s</li>\n' % item)
f.write(footer)
f.close()

import webbrowser
import os
webbrowser.open('file:///'+os.path.abspath('output.html'))

Copy the above program to a file recipe.py and run it. Chances are, it will crash with an encoding error.

Your challenge: Modify the above program so that it runs without crashing and the resulting HTML renders the characters in the ingredient list correctly.
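
As a hint, here is a sketch of one possible approach, not necessarily the intended solution. The helper name html_escape_non_ascii is hypothetical. It relies on the standard 'xmlcharrefreplace' error handler, which replaces each character that the target encoding cannot represent with a numeric HTML character reference.

```python
def html_escape_non_ascii(text):
    """Hypothetical helper: return ASCII-only bytes, with every other
    character written as a numeric HTML reference such as &#241;."""
    return text.encode('ascii', 'xmlcharrefreplace')

item = u'Jalape\u00f1o pepper'
encoded = html_escape_non_ascii(item)
print(encoded)
```

Note that this only handles non-ASCII characters; it does not escape HTML markup characters such as < or &. Another option is to open the output file with io.open and encoding='utf-8', and to declare that character set in the HTML itself.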
