Exercise 4.3
(a) Simple Regular Expression Pattern Matching
In this exercise, we experiment with regular expression pattern matching. First, define the string:
>>> text = "Guido was out of the office from 12/14/2012 to 1/3/2013."
>>>
Now, find all of the matching dates:
>>> import re
>>> dates = re.findall(r'(\d+)/(\d+)/(\d+)', text)
>>> dates
[('12', '14', '2012'), ('1', '3', '2013')]
>>>
Replace the dates with a different format:
>>> newtext = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
>>> newtext
'Guido was out of the office from 2012-12-14 to 2013-1-3.'
>>>
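Notice that the second date came out as 2013-1-3, without zero padding. If you want padded dates, one option is to pass a function to re.sub() in place of the replacement string. The following is only a sketch: the helper name to_iso is made up for this illustration, and it assumes the dates are always written month/day/year.

```python
import re

text = "Guido was out of the office from 12/14/2012 to 1/3/2013."

def to_iso(match):
    # Hypothetical helper (not part of the exercise): re.sub() calls it
    # once per match. Groups arrive in pattern order: month, day, year.
    month, day, year = match.groups()
    # Zero-pad month and day to two digits.
    return f"{year}-{int(month):02d}-{int(day):02d}"

isotext = re.sub(r'(\d+)/(\d+)/(\d+)', to_iso, text)
print(isotext)
# Guido was out of the office from 2012-12-14 to 2013-01-03.
```

A callback is more work than a template such as r'\3-\1-\2', but it lets you run ordinary Python code on each match.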
(b) Splitting text on multiple delimiters
Consider the following text containing a timestamp.
>>> text = 'Fri Jan 17 12:22:52 CST 2014'
>>>
Suppose you wanted to break the timestamp into its fields by splitting the string on space and colon (:) characters. Try this:
>>> parts = ['weekday', 'month', 'day', 'hour', 'minute', 'second', 'timezone', 'year']
>>> d = dict(zip(parts, re.split(r'[ :]', text)))
>>> d
{'weekday': 'Fri', 'month': 'Jan', 'day': '17', 'hour': '12', 'minute': '22', 'second': '52', 'timezone': 'CST', 'year': '2014'}
>>>
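One thing to watch for: the pattern [ :] splits on every single delimiter character, so two delimiters in a row produce an empty string in the result. The sketch below uses a made-up variation of the timestamp, with the day padded by an extra space, to compare that pattern against [ :]+, which treats a run of delimiters as one.

```python
import re

# Assumed input for illustration: note the two spaces before the day.
text = 'Fri Jan  3 12:22:52 CST 2014'

loose = re.split(r'[ :]', text)    # splits at every delimiter character
tight = re.split(r'[ :]+', text)   # a run of delimiters counts as one

print(loose)
# ['Fri', 'Jan', '', '3', '12', '22', '52', 'CST', '2014']
print(tight)
# ['Fri', 'Jan', '3', '12', '22', '52', 'CST', '2014']
```

With the first pattern, the extra empty string would shift every later field by one position when zipped against the list of part names.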
(c) Number conversion
Consider the following list of strings:
>>> vals = ['1','2','-','3','N/A','4','-5','+6']
>>>
Now, suppose you wanted to convert all of the values into integers using a list comprehension. Sadly, it doesn’t work:
>>> data = [int(val) for val in vals]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '-'
>>>
To avoid this, you might try putting some sort of simple filter on it. For example:
>>> data = [(int(val) if val.isdigit() else None) for val in vals]
>>> data
[1, 2, None, 3, None, 4, None, None]
>>>
Sadly, that didn’t work either because perfectly valid values of -5 and +6 didn’t get converted. Fortunately, you can wield the awesome power of a regex here. Try this:
>>> data = [(int(val) if re.match(r'[+-]?\d+$', val) else None) for val in vals]
>>> data
[1, 2, None, 3, None, 4, -5, 6]
>>>
Ah yes, mixing list comprehensions and regular expressions at the same time. At the very least, you have slightly increased your job security.
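If job security is not the goal, a plainer alternative is to let int() decide what counts as a valid number and catch the failure. This is a sketch of a different technique, not the regex approach from the exercise, and the helper name to_int is invented for the illustration.

```python
def to_int(val):
    # Hypothetical helper: return val as an int, or None if int() rejects it.
    try:
        return int(val)
    except ValueError:
        return None

vals = ['1', '2', '-', '3', 'N/A', '4', '-5', '+6']
data = [to_int(val) for val in vals]
print(data)
# [1, 2, None, 3, None, 4, -5, 6]
```

The two approaches are not quite equivalent. For example, int() accepts surrounding whitespace, so ' 7 ' converts to 7 here but would be rejected by the regex.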