Exercise Data.1

Objectives:

  • Try a few simple examples involving numpy

Files Created: None

Files Modified: None

(a) Creating and manipulating arrays

Start by importing numpy:

>>> import numpy
>>>

Create an array from the contents of a Python list and try the same simple operation on both to see how they differ.

>>> nums = [1, 4.5, 6.25, 8, 10, 15]
>>> nums + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list
>>>
>>> a = numpy.array(nums)
>>> a
array([  1.  ,   4.5 ,   6.25,   8.  ,  10.  ,  15.  ])
>>> a + 1
array([  2.  ,   5.5 ,   7.25,   9.  ,  11.  ,  16.  ])
>>>
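
As a rough sketch of the difference shown above, the script below performs the "add 1 to everything" operation both ways. The variable names plus_one_list and plus_one_array are made up for illustration.

```python
import numpy

nums = [1, 4.5, 6.25, 8, 10, 15]

# With a plain list, adding 1 to every item needs an explicit
# loop or list comprehension.
plus_one_list = [x + 1 for x in nums]

# With an array, the same operation is applied to every element at once.
a = numpy.array(nums)
plus_one_array = a + 1

print(plus_one_list)
print(plus_one_array)

# The mixed ints and floats in the list end up stored as one float type.
print(a.dtype)
```

Both versions should produce the same numbers; only the array version avoids writing the loop yourself.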

Create a few arrays initialized to zeros.

>>> b = numpy.zeros(shape=10,dtype=float)
>>> b
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
>>> grid = numpy.zeros(shape=(10,10),dtype=float)
>>> grid
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
>>>

Try changing some of the grid values using different kinds of slices:

>>> grid[2:4] = 1
>>> grid
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
>>> grid[2:6, 4:8] += 10
>>> grid
array([[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  1.,   1.,   1.,   1.,  11.,  11.,  11.,  11.,   1.,   1.],
       [  1.,   1.,   1.,   1.,  11.,  11.,  11.,  11.,   1.,   1.],
       [  0.,   0.,   0.,   0.,  10.,  10.,  10.,  10.,   0.,   0.],
       [  0.,   0.,   0.,   0.,  10.,  10.,  10.,  10.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])
>>> grid[:4, :3] = [2,3,4]
>>> grid
array([[  2.,   3.,   4.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  2.,   3.,   4.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  2.,   3.,   4.,   1.,  11.,  11.,  11.,  11.,   1.,   1.],
       [  2.,   3.,   4.,   1.,  11.,  11.,  11.,  11.,   1.,   1.],
       [  0.,   0.,   0.,   0.,  10.,  10.,  10.,  10.,   0.,   0.],
       [  0.,   0.,   0.,   0.,  10.,  10.,  10.,  10.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])
>>>
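
If the 10x10 output is hard to follow, the same three kinds of slice assignment can be tried on a smaller grid. This is only a sketch; the 4x5 shape and the slice bounds are arbitrary choices made for illustration.

```python
import numpy

grid = numpy.zeros(shape=(4, 5), dtype=float)

# A single slice selects whole rows: rows 1 and 2, all columns.
grid[1:3] = 1

# Two slices select a rectangular block: rows 1-3, columns 2-3.
grid[1:4, 2:4] += 10

# Assigning a 3-item list to a 2x3 block repeats the list on each row.
grid[:2, :3] = [2, 3, 4]

print(grid)
```

Each assignment modifies grid in place, so the effects accumulate just as they do in the session above.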

Create an array initialized to a range of floating point numbers:

>>> xpts = numpy.arange(0, 1, 0.1)
>>> xpts
array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])
>>>

Evaluate an equation on the array:

>>> ypts = 2*xpts**2 - 3*xpts + 7
>>> ypts
array([ 7.  ,  6.72,  6.48,  6.28,  6.12,  6.  ,  5.92,  5.88,  5.88,  5.92])
>>>

Carefully observe how the operation was applied to every array element.
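
One way to convince yourself of this is to write the same equation as an explicit per-element loop and compare the results. The sketch below assumes nothing beyond numpy itself; ypts_loop is a name made up for illustration.

```python
import numpy

xpts = numpy.arange(0, 1, 0.1)

# Array version: the equation is applied to every element of xpts.
ypts = 2*xpts**2 - 3*xpts + 7

# The same calculation written out one element at a time.
ypts_loop = [2*x**2 - 3*x + 7 for x in xpts]

# allclose() compares element by element, allowing for tiny
# floating point differences.
print(numpy.allclose(ypts, ypts_loop))
```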

(b) Performance of numpy

A key feature of numpy is that it is extremely fast. To illustrate, create a large list of Python numbers:

>>> nums = range(10000000)     # 10 million ints
>>> from math import sin, cos

>>> # Perform a calculation and see how long it takes
>>> vals = [2*sin(0.5*x) + 3*cos(0.75*x) for x in nums]
>>>

Now, try the same calculation using numpy:

>>> from numpy import arange, sin, cos
>>> nums = arange(10000000)
>>> vals = 2*sin(0.5*nums) + 3*cos(0.75*nums)
>>> vals
array([ 3.        ,  3.15391768,  1.89515357, ..., -0.36134788,
        0.82542516,  1.36708392])
>>> len(vals)
10000000
>>>

You should find that the second calculation is significantly faster than the version using list comprehensions. The memory use should be less as well.
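
To put rough numbers on the difference, you could time both versions in a script. The following is a sketch, not a careful benchmark: it assumes Python 3 (for time.perf_counter), uses a smaller N than the session above to keep the run short, and the actual timings will vary from machine to machine.

```python
import math
import time

import numpy

N = 1000000   # 1 million items (an arbitrary, smaller size)

# Time the list comprehension version
start = time.perf_counter()
list_vals = [2*math.sin(0.5*x) + 3*math.cos(0.75*x) for x in range(N)]
list_time = time.perf_counter() - start

# Time the numpy version
start = time.perf_counter()
nums = numpy.arange(N)
array_vals = 2*numpy.sin(0.5*nums) + 3*numpy.cos(0.75*nums)
array_time = time.perf_counter() - start

print('list comprehension:', list_time, 'seconds')
print('numpy             :', array_time, 'seconds')
```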

(c) Using Arrays as an Alternative To Lists

For data processing problems, you might consider the use of a numpy array as an alternative to a list. For example, you could use arrays to represent the different columns in a datafile. To illustrate, try the following example. First, read your portfolio data:

>>> import report
>>> portfolio = report.read_portfolio('Data/portfolio.csv')
>>> portfolio
[{'price': 32.2, 'name': 'AA', 'shares': 100}, {'price': 91.1, 'name': 'IBM', 'shares': 50}, {'price': 83.44, 'name': 'CAT', 'shares': 150}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}, {'price': 40.37, 'name': 'GE', 'shares': 95}, {'price': 65.1, 'name': 'MSFT', 'shares': 50}, {'price': 70.44, 'name': 'IBM', 'shares': 100}]
>>>

Here, the portfolio is represented as a list of dictionaries. However, let’s flip it all around into a dictionary of columns. Pay careful attention to what happens here:

>>> import numpy
>>> columns = { }
>>> columns['name'] = numpy.array([s['name'] for s in portfolio])
>>> columns['shares'] = numpy.array([s['shares'] for s in portfolio])
>>> columns['price'] = numpy.array([s['price'] for s in portfolio])
>>> columns
{'price': array([ 32.2 ,  91.1 ,  83.44,  51.23,  40.37,  65.1 ,  70.44]), 'shares': array([100,  50, 150, 200,  95,  50, 100]), 'name': array(['AA', 'IBM', 'CAT', 'MSFT', 'GE', 'MSFT', 'IBM'],
      dtype='|S4')}
>>>

In this new representation, there is only one dictionary. The dictionary holds three columns of data—each column represented by a numpy array. You can use it to perform calculations similar to what you might do with a spreadsheet. Try it out:

>>> columns['shares']*columns['price']
array([  3220.  ,   4555.  ,  12516.  ,  10246.  ,   3835.15,   3255.  ,
         7044.  ])

>>> # Create a new column of data
>>> columns['cost'] = columns['shares']*columns['price']

>>> # Perform some reductions
>>> columns['cost'].sum()
44671.150000000001
>>> columns['price'].min()
32.200000000000003
>>>
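
For comparison, here is a sketch of the same total-cost calculation done both ways: once on the list of dictionaries and once on columns. The portfolio list is a hand-typed stand-in holding only the first three rows shown earlier, so the script runs without report.py or the data file.

```python
import numpy

# Stand-in for report.read_portfolio() (first three rows only)
portfolio = [
    {'name': 'AA', 'shares': 100, 'price': 32.2},
    {'name': 'IBM', 'shares': 50, 'price': 91.1},
    {'name': 'CAT', 'shares': 150, 'price': 83.44},
]

# Row-oriented: loop over the dictionaries one at a time.
total_cost_rows = sum(s['shares'] * s['price'] for s in portfolio)

# Column-oriented: multiply whole columns, then reduce.
shares = numpy.array([s['shares'] for s in portfolio])
prices = numpy.array([s['price'] for s in portfolio])
total_cost_cols = (shares * prices).sum()

print(total_cost_rows, total_cost_cols)
```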

Perform some more advanced kinds of queries. These are going to mirror some of the things we did using list comprehensions.

>>> # Compare all of the shares
>>> less100 = columns['shares'] < 100
>>> less100
array([False,  True, False, False,  True,  True, False], dtype=bool)

>>> # Find the names of those stocks
>>> columns['name'][less100]
array(['IBM', 'GE', 'MSFT'],
      dtype='|S4')

>>> # Find the shares of those stocks
>>> columns['shares'][less100]
array([50, 95, 50])
>>>

In this example, the boolean array less100 is used to index entries of the other columns. For example, columns['name'][less100] only picks out the names where True appears in less100. That’s interesting, but let’s just chop the whole set of columns up. You might want to hang on to your hat:

>>> result = { name: col[less100] for name, col in columns.items() }
>>> result
{ 'price': array([ 91.1 ,  40.37,  65.1 ]), 'cost': array([ 4555.  ,  3835.15,  3255.  ]), 'name': array(['IBM', 'GE', 'MSFT'],
  dtype='|S4'), 'shares': array([50, 95, 50])}
>>>

This result is simply a new dictionary of columns containing a subset of the data. You could continue to use it to perform other kinds of operations.
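
The whole sequence (build the columns, make a boolean mask, filter every column) can also be sketched as a standalone script. As an assumption made so that it runs on its own, the portfolio below is typed in by hand from the first three rows shown earlier, and the mask keeps holdings of at least 100 shares rather than fewer than 100.

```python
import numpy

# Stand-in for report.read_portfolio() (first three rows only)
portfolio = [
    {'name': 'AA', 'shares': 100, 'price': 32.2},
    {'name': 'IBM', 'shares': 50, 'price': 91.1},
    {'name': 'CAT', 'shares': 150, 'price': 83.44},
]

# Flip the list of dictionaries into a dictionary of columns.
columns = {
    key: numpy.array([s[key] for s in portfolio])
    for key in ('name', 'shares', 'price')
}
columns['cost'] = columns['shares'] * columns['price']

# One boolean per row: True where the row should be kept.
mask = columns['shares'] >= 100

# Apply the same mask to every column.
result = {name: col[mask] for name, col in columns.items()}
print(result)
```

Because the same mask is applied to every column, the rows stay lined up across the filtered columns.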

Discussion

This exercise only scratches the surface of what numpy is about. In the big picture, it provides operations for manipulating arrays of uniform values (e.g., arrays of ints, floats, etc.). If you ever find yourself manipulating large lists of numeric data, numpy will likely use less memory and run significantly faster.

For column-oriented datafiles such as CSV files, you should consider the use of Pandas, which is covered in the next exercise.
