Vectors & Matrices
Explaining the concepts of vectors and matrices, and how they help in Data Science.
Vectors
Vectors are ordered arrays of numbers. The elements of a vector are all the same type. A vector does not, for example, contain both characters and numbers. The number of elements in the array is often referred to as the dimension or rank. The elements of a vector can be referenced with an index. In math settings, indexes typically run from 1 to n. In computer science, indexing will typically run from 0 to n-1.
import numpy as np
import time
NumPy Arrays
NumPy's basic data structure is an indexable, n-dimensional array containing elements of the same type (dtype
). Above, dimension was the number of elements in the vector; here, dimension refers to the number of indexes of an array. A one-dimensional or 1-D array has one index.
- 1-D array, shape (n,): n elements indexed [0] through [n-1]
a = np.zeros(4); print(f"np.zeros(4) : a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.zeros((4,)); print(f"np.zeros(4,) : a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.random.random_sample(4); print(f"np.random.random_sample(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
Some data creation routines do not take a shape tuple.
a = np.arange(4.); print(f"np.arange(4.): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.random.rand(4); print(f"np.random.rand(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
Values can be specified manually as well.
a = np.array([5,4,3,2]); print(f"np.array([5,4,3,2]): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.array([5.,4,3,2]); print(f"np.array([5.,4,3,2]): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
These have all created one-dimensional vector a
with four elements. a.shape
returns the dimensions. Here we see a.shape = (4,)
indicating a 1-D array with 4 elements.
Operations on Vectors
Elements of vectors can be accessed via indexing and slicing.
Indexing means referring to an element of an array by its position within the array.
Slicing means getting a subset of elements from an array based on their indices.
NumPy starts indexing at zero so the 3rd element of a vector a is a[2]
.
a = np.arange(10)
print(a)
# access an element
print(f"a[2].shape: {a[2].shape} a[2] = {a[2]}, Accessing an element returns a scalar")
# access the last element, negative indexes count from the end
print(f"a[-1] = {a[-1]}")
# indexs must be within the range of the vector or they will produce and error
try:
c = a[10]
except Exception as e:
print("The error message you'll see is:")
print(e)
a = np.arange(10)
print(f"a = {a}")
# access 5 consecutive elements (start:stop:step)
c = a[2:7:1]; print("a[2:7:1] = ", c)
# access 3 elements separated by two
c = a[2:7:2]; print("a[2:7:2] = ", c)
# access all elements index 3 and above
c = a[3:]; print("a[3:] = ", c)
# access all elements below index 3
c = a[:3]; print("a[:3] = ", c)
# access all elements
c = a[:]; print("a[:] = ", c)
a = np.array([1,2,3,4])
print(f"a : {a}")
# negate elements of a
b = -a
print(f"b = -a : {b}")
# sum all elements of a, returns a scalar
b = np.sum(a)
print(f"b = np.sum(a) : {b}")
# get the average of elements of a
b = np.mean(a)
print(f"b = np.mean(a): {b}")
# square elements of a
b = a**2
print(f"b = a**2 : {b}")
a = np.array([ 1, 2, 3, 4])
b = np.array([-1,-2, 3, 4])
print(f"Binary operators work element wise: {a + b}")
Of course, for this to work correctly, the vectors must be of the same size.
c = np.array([1, 2])
try:
d = a + c
except Exception as e:
print("The error message you'll see is:")
print(e)
a = np.array([1, 2, 3, 4])
# multiply a by a scalar
b = 5 * a
print(f"b = 5 * a : {b}")
Vector Vector dot product
The dot product multiplies the values in two vectors element-wise and then sums the result. Vector dot product requires the dimensions of the two vectors to be the same.
Let's implement our own version of the dot product below:
Using a for loop, implement a function which returns the dot product of two vectors. The function to return given inputs $a$ and $b$:
$$ x = \sum_{i=0}^{n-1} a_i b_i $$
Assume both a
and b
are the same shape.
def my_dot(a, b):
"""
Compute the dot product of two vectors
Args:
a (ndarray (n,)): input vector
b (ndarray (n,)): input vector with same dimension as a
Returns:
x (scalar):
"""
x = 0
for i in range(a.shape[0]):
x = x + a[i] * b[i]
return x
a = np.array([1, 2, 3, 4])
print(f"a : {a}")
b = np.array([-1, 4, 3, 2])
print(f"b : {b}")
print(f"my_dot(a, b) = {my_dot(a, b)}")
Note, the dot product is expected to return a scalar value.
Let's try the same operations using np.dot
.
a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])
c = np.dot(a, b)
print(f"NumPy 1-D np.dot(a, b) = {c}, np.dot(a, b).shape = {c.shape} ")
c = np.dot(b, a)
print(f"NumPy 1-D np.dot(b, a) = {c}, np.dot(a, b).shape = {c.shape} ")
Above, the results for 1-D matched our implementation.
np.random.seed(1)
a = np.random.rand(10000000) # very large arrays
b = np.random.rand(10000000)
tic = time.time() # capture start time
c = np.dot(a, b)
toc = time.time() # capture end time
print(f"np.dot(a, b) = {c:.4f}")
print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms ")
tic = time.time() # capture start time
c = my_dot(a,b)
toc = time.time() # capture end time
print(f"my_dot(a, b) = {c:.4f}")
print(f"loop version duration: {1000*(toc-tic):.4f} ms ")
del(a); del(b) #remove these big arrays from memory
So, vectorization provides a large speed up in this example. This is because NumPy makes better use of available data parallelism in the underlying hardware. GPU's and modern CPU's implement Single Instruction, Multiple Data (SIMD) pipelines allowing multiple operations to be issued in parallel. This is critical in Machine Learning where the data sets are often very large.
Matrices
Matrices are two-dimensional arrays. The elements of a matrix are all of the same type. In notation, matrices are denoted with a capital, bold letter such as $\mathbf{X}$. In general, m
is often the number of rows and n
the number of columns. The elements of a matrix can be referenced with a two dimensional index. In math settings, numbers in the index typically run from 1 to n. In computer science and these labs, indexing will run from 0 to n-1.
NumPy Arrays
NumPy's basic data structure is an indexable, n-dimensional array containing elements of the same type (dtype
). Matrices have a two-dimensional (2-D) index [m,n].
In machine learning, 2-D matrices are used to hold training data. Training data is $m$ examples by $n$ features creating an (m,n) array.
a = np.zeros((1, 5))
print(f"a shape = {a.shape}, a = {a}")
a = np.zeros((2, 1))
print(f"a shape = {a.shape}, a = {a}")
a = np.random.random_sample((1, 1))
print(f"a shape = {a.shape}, a = {a}")
One can also manually specify data. Dimensions are specified with additional brackets matching the format in the printing above.
a = np.array([[5], [4], [3]]);
print(f" a shape = {a.shape}, np.array: a = {a}")
a = np.array([[5], # One can also
[4], # separate values
[3]]); #into separate rows
print(f" a shape = {a.shape}, np.array: a = {a}")
a = np.arange(6).reshape(-1, 2) #reshape is a convenient way to create matrices
print(f"a.shape: {a.shape}, \na= {a}")
# access an element
print(f"\na[2,0].shape: {a[2, 0].shape}, a[2,0] = {a[2, 0]}, type(a[2,0]) = {type(a[2, 0])} \nAccessing an element returns a scalar\n")
# access a row
print(f"a[2].shape: {a[2].shape}, a[2] = {a[2]}, type(a[2]) = {type(a[2])} \nAccessing a matrix by just specifying the row will return a 1-D vector")
a = np.arange(20).reshape(-1, 10)
print(f"a = \n{a}")
# access 5 consecutive elements (start:stop:step)
print("a[0, 2:7:1] = ", a[0, 2:7:1], ", a[0, 2:7:1].shape =", a[0, 2:7:1].shape, "a 1-D array")
# access 5 consecutive elements (start:stop:step) in two rows
print("a[:, 2:7:1] = \n", a[:, 2:7:1], ", a[:, 2:7:1].shape =", a[:, 2:7:1].shape, "a 2-D array")
# access all elements
print("a[:,:] = \n", a[:,:], ", a[:,:].shape =", a[:,:].shape)
# access all elements in one row (very common usage)
print("a[1,:] = ", a[1,:], ", a[1,:].shape =", a[1,:].shape, "a 1-D array")
# same as
print("a[1] = ", a[1], ", a[1].shape =", a[1].shape, "a 1-D array")