Numerical computing in Python

Mihai Andrei

  1. scientific computing
  2. numpy
  3. simulation
  4. optimisation, machine learning

Programs

  • control hardware: C,asm
  • record keeping: Cobol, Java
  • measurement processing: Fortran, Matlab

Data in record keeping

  • About people
  • data is created by user input: do you wear glasses?
  • data is precise and discrete: yes/no
  • validation is possible in principle: jk4##! is not a person name
  • strings and ints (language and counting)
  • data is a graph of relationships:
{
  'name': 'mihai',
  'presentations': [
     {'where': 'ropython', 'topic': 'numpy'},
     {'where': 'tvb node berlin', 'topic': 'tvb gui'}
   ]
}

Data in measurement

  • About the world
  • data is produced by sensors: batroom scale
  • data is imprecise and continuous: bathroom scale is not microgram precise
  • validation is hard: what is a valid cat picture?
  • data is imprecise, large and flat
  • floats: (nature no speak english)
array([[-1.        , -0.86666667, -0.73333333, -0.6       ],
       [-0.46666667, -0.33333333, -0.2       , -0.06666667],
       [ 0.06666667,  0.2       ,  0.33333333,  0.46666667],
       [ 0.6       ,  0.73333333,  0.86666667,  1.        ]])

procedures

record keeping

  • CRUD : Mihai wears no glasses!
  • search : does Mihai have glasses?
  • presentation: fancy text and media
  • model relationships

measurement

  • finding structure: Is Mihai wearing glasses in this picture?
  • visualization: find structure yourself human

  • model reality: heat transfer

  • simulation: how will the stove heat the room?

science

algorithms and data structures

Directed_graph

linear algebra, calculus, statistics

matrix.png

numpy

  • a new datatype suitable for tables of numbers
  • the foundation of all numerics in Python

  • n-dimensional, homogeneous, fixed-size, contigous array

  • vectorized operations

Installing

$ pip install numpy matplotlib jupyter

like lists but typed and fixed sized

In [1]:
# let's meet the beast
import numpy as np
In [2]:
# ndarrays look like lists
a = np.array([0, 1, 2, 3])
a
Out[2]:
array([0, 1, 2, 3])
In [3]:
# I can index
a[1] + a[-1]
Out[3]:
4
In [4]:
# and slice
a[1:-1]
Out[4]:
array([1, 2])
In [5]:
# mutable
a[0] = 4
a
Out[5]:
array([4, 1, 2, 3])
In [6]:
# ok, so they are lists ... meh
a[0] = 'nope'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-8bf143d5311d> in <module>()
      1 # ok, so they are lists ... meh
----> 2 a[0] = 'nope'

ValueError: invalid literal for int() with base 10: 'nope'
In [7]:
# typed
a.dtype
Out[7]:
dtype('int64')
In [8]:
# fixed size
a.append(43)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-f8af5ece7ee3> in <module>()
      1 # fixed size
----> 2 a.append(43)

AttributeError: 'numpy.ndarray' object has no attribute 'append'

But why ??

Speed:

  • Memory compactness
  • no indirection or allocation
  • optimized typed operations
header 4 1 2 3
In [9]:
# dtypes have defaults
a = np.zeros(4)
print(repr(a.dtype))
print(a)
dtype('float64')
[0. 0. 0. 0.]
In [10]:
# but we can ask for a dtype
a = np.zeros(4, dtype=np.uint8)
a
Out[10]:
array([0, 0, 0, 0], dtype=uint8)

multidimensional

In [11]:
# multidimensional
a = np.array([
    [0, 1, 2],
    [3, 4, 5]
])
In [12]:
a.ndim
Out[12]:
2
In [13]:
# the shape is fixed, no appends
a.shape
Out[13]:
(2, 3)
In [14]:
# indexing in 2d
a[1]
Out[14]:
array([3, 4, 5])
In [15]:
# this works
a[1][1]
Out[15]:
4
In [16]:
# this is idiomatic
a[1, 1]
Out[16]:
4
In [17]:
# a cube
a = np.arange(2*3*2).reshape((2, 3, 2))
a.shape
Out[17]:
(2, 3, 2)
In [18]:
a
Out[18]:
array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]]])
In [19]:
a[1, :, 1]
Out[19]:
array([ 7,  9, 11])

slices

In [20]:
a = np.arange(4*4)
# reshape 
a = a.reshape((4, 4))
a
Out[20]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
In [21]:
a[1]
Out[21]:
array([4, 5, 6, 7])
In [22]:
# all from dim 0, third from dim 1
a[:,2]
Out[22]:
array([ 2,  6, 10, 14])
In [23]:
# the interior
a[1:-1, 1:-1]
Out[23]:
array([[ 5,  6],
       [ 9, 10]])
In [24]:
# a more informative a[1]
a[1, :]
Out[24]:
array([4, 5, 6, 7])
In [25]:
# have defaults
a[:2, 2:]
Out[25]:
array([[2, 3],
       [6, 7]])
In [26]:
# Slices *ARE VIEWS* in numpy!
a = np.zeros((2, 2))
a
Out[26]:
array([[0., 0.],
       [0., 0.]])
In [27]:
b = a[:, 1]
b
Out[27]:
array([0., 0.])
In [28]:
b[0] = 23
b
Out[28]:
array([23.,  0.])
In [29]:
a
Out[29]:
array([[ 0., 23.],
       [ 0.,  0.]])

vectorized operations

In [30]:
a = np.array([[2, 1], [3, 0]])
a
Out[30]:
array([[2, 1],
       [3, 0]])
In [31]:
b = np.array([[-1, 1], [0, 1]])
b
Out[31]:
array([[-1,  1],
       [ 0,  1]])
In [32]:
a + b
Out[32]:
array([[1, 2],
       [3, 1]])
In [33]:
# ufuncs are functions that work in this vectorized fashion
np.sin(a)
Out[33]:
array([[0.90929743, 0.84147098],
       [0.14112001, 0.        ]])

vectorized operations make numpy fast

replace loops

a = range(3) # integers are objects on a heap
b = range(3)
ret = []  # containter of generic objects

for u, v in zip(a, b): # happens in python
    ret.append(u + v)

a = np.arange(3) # native machine integers
b = np.arange(3) # fixed size containter, no indirection
ret = a + b # happens in C

broadcasting

numpy_broadcasting.png

In [34]:
a = np.array([[0], [10], [20], [30]])
b = np.array([0, 1, 2])
print('a', a.shape)
print(repr(a))
print('b', b.shape)
print(repr(b))
a + b
a (4, 1)
array([[ 0],
       [10],
       [20],
       [30]])
b (3,)
array([0, 1, 2])
Out[34]:
array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])
In [35]:
# why not :?
c = np.array([[0, 10, 20, 30]])

print('c', c.shape)

c + b
c (1, 4)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-2645cb6512bc> in <module>()
      4 print('c', c.shape)
      5
----> 6 c + b

ValueError: operands could not be broadcast together with shapes (1,4) (3,) 

broadcasting rule is: trailing dimensions must have same size or be 1

boolean indexing

In [36]:
a = np.random.randint(3, size=(3, 3))
a
Out[36]:
array([[2, 1, 1],
       [1, 1, 2],
       [0, 2, 1]])
In [37]:
a > 1
Out[37]:
array([[ True, False, False],
       [False, False,  True],
       [False,  True, False]])
In [38]:
a[a>1]
Out[38]:
array([2, 2, 2])
In [39]:
a[a>1] = 4
a
Out[39]:
array([[4, 1, 1],
       [1, 1, 4],
       [0, 4, 1]])
In [40]:
np.any(a == 0)
Out[40]:
True
In [41]:
np.all(a < 8)
Out[41]:
True
In [42]:
# cant convert to bool directly
bool(a < 8)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-56527d2ed582> in <module>()
      1 # cant convert to bool directly
----> 2 bool(a < 8)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Integer indexing

In [43]:
a = np.array([[1, 2], [3, 4], [5, 6]])
a
Out[43]:
array([[1, 2],
       [3, 4],
       [5, 6]])
In [44]:
rows = [0, 2]
columns = [0, 1]
# for each row pick the values from columns
a[rows, columns]
Out[44]:
array([1, 6])
In [45]:
rows = np.array([[0, 0], [-1, -1]])
columns = np.array([[0, -1], [0, -1]])
# same logoc repeating for each dimension in the index arrays
corners = a[rows, columns]
corners
Out[45]:
array([[1, 2],
       [5, 6]])
In [46]:
color_palette = np.array([
    [0, 0, 0], [11, 11, 11], [23, 23, 23],
    [3, 33, 3], [4, 4, 14], [51, 51, 51]])
color_palette[2]
Out[46]:
array([23, 23, 23])
In [47]:
a = np.arange(2*3).reshape((2,3))
a
Out[47]:
array([[0, 1, 2],
       [3, 4, 5]])
In [48]:
# transform all values to colors
a_colors = color_palette[a]
a_colors
Out[48]:
array([[[ 0,  0,  0],
        [11, 11, 11],
        [23, 23, 23]],

       [[ 3, 33,  3],
        [ 4,  4, 14],
        [51, 51, 51]]])
In [49]:
a_colors[1, 1]
Out[49]:
array([ 4,  4, 14])

Why this complex indexing ?

  • replaces loops

Hybrid data

  • both structure and measurements
  • excell like tabular data
  • use pandas, like numpy but with excell and database like operations
In [53]:
import pandas as pd
In [54]:
f = pd.read_csv('lala.csv')
f
Out[54]:
Names Births Country
0 Bob 968 US
1 Jessica 155 US
2 Mary 77 US
3 John 578 US
4 Mel 973 US
5 Ion 32 RO
6 Petrica 4 RO
7 Mihai 123 RO
In [55]:
# index by column names
f['Names']
Out[55]:
0        Bob
1    Jessica
2       Mary
3       John
4        Mel
5        Ion
6    Petrica
7      Mihai
Name: Names, dtype: object
In [56]:
f.loc[2:4, 'Names']
Out[56]:
2    Mary
3    John
4     Mel
Name: Names, dtype: object
In [57]:
f.columns
Out[57]:
Index(['Names', 'Births', 'Country'], dtype='object')
In [58]:
group = f.groupby('Country').sum()
group
Out[58]:
Births
Country
RO 159
US 2751

demos & matplotlib