Chapter 2

Linear Algebra

Linear algebra is a branch of mathematics that is widely used throughout science

and engineering. Yet because linear algebra is a form of continuous rather than

discrete mathematics, many computer scientists have little experience with it. A

good understanding of linear algebra is essential for understanding and working

with many machine learning algorithms, especially deep learning algorithms. We

therefore precede our introduction to deep learning with a focused presentation of

the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If

you have previous experience with these concepts but need a detailed reference

sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and

Pedersen, 2006). If you have had no exposure at all to linear algebra, this chapter

will teach you enough to read this book, but we highly recommend that you also

consult another resource focused exclusively on teaching linear algebra, such as

Shilov (1977). This chapter completely omits many important linear algebra topics

that are not essential for understanding deep learning.

2.1 Scalars, Vectors, Matrices and Tensors

The study of linear algebra involves several types of mathematical objects:

• Scalars

: A scalar is just a single number, in contrast to most of the other

objects studied in linear algebra, which are usually arrays of multiple numbers.

We write scalars in italics. We usually give scalars lowercase variable names.

When we introduce them, we specify what kind of number they are. For

29

CHAPTER 2. LINEAR ALGEBRA

example, we might say “Let

s ∈ R

be the slope of the line,” while deﬁning a

real-valued scalar, or “Let

n ∈ N

be the number of units,” while deﬁning a

natural number scalar.

• Vectors

: A vector is an array of numbers. The numbers are arranged in

order. We can identify each individual number by its index in that ordering.

Typically we give vectors lowercase names in bold typeface, such as

x

. The

elements of the vector are identiﬁed by writing its name in italic typeface,

with a subscript. The ﬁrst element of

x

is

x

1

, the second element is

x

2

, and

so on. We also need to say what kind of numbers are stored in the vector. If

each element is in

R

, and the vector has

n

elements, then the vector lies in

the set formed by taking the Cartesian product of

R n

times, denoted as

R

n

.

When we need to explicitly identify the elements of a vector, we write them

as a column enclosed in square brackets:

x =

x

1

x

2

.

.

.

x

n

. (2.1)

We can think of vectors as identifying points in space, with each element

giving the coordinate along a diﬀerent axis.

Sometimes we need to index a set of elements of a vector. In this case, we

deﬁne a set containing the indices and write the set as a subscript. For

example, to access

x

1

,

x

3

and

x

6

, we deﬁne the set

S

=

{

1

,

3

,

6

}

and write

x

S

. We use the

−

sign to index the complement of a set. For example

x

−1

is

the vector containing all elements of

x

except for

x

1

, and

x

−S

is the vector

containing all elements of x except for x

1

, x

3

and x

6

.

• Matrices

: A matrix is a 2-D array of numbers, so each element is identiﬁed

by two indices instead of just one. We usually give matrices uppercase

variable names with bold typeface, such as

A

. If a real-valued matrix

A

has

a height of

m

and a width of

n

, then we say that

A ∈ R

m×n

. We usually

identify the elements of a matrix using its name in italic but not bold font,

and the indices are listed with separating commas. For example,

A

1,1

is the

upper left entry of

A

and

A

m,n

is the bottom right entry. We can identify

all the numbers with vertical coordinate

i

by writing a “:” for the horizontal

coordinate. For example,

A

i,:

denotes the horizontal cross section of

A

with

vertical coordinate

i

. This is known as the

i

-th

row

of

A

. Likewise,

A

:,i

is

30

CHAPTER 2. LINEAR ALGEBRA

the i-th column of A. When we need to explicitly identify the elements of

a matrix, we write them as an array enclosed in square brackets:

A

1,1

A

1,2

A

2,1

A

2,2

. (2.2)

Sometimes we may need to index matrix-valued expressions that are not just

a single letter. In this case, we use subscripts after the expression but do not

convert anything to lowercase. For example,

f

(

A

)

i,j

gives element (

i, j

) of

the matrix computed by applying the function f to A.

• Tensors

: In some cases we will need an array with more than two axes.

In the general case, an array of numbers arranged on a regular grid with a

variable number of axes is known as a tensor. We denote a tensor named “A”

with this typeface:

A

. We identify the element of

A

at coordinates (

i, j, k

)

by writing A

i,j,k

.

One important operation on matrices is the

transpose

. The transpose of a

matrix is the mirror image of the matrix across a diagonal line, called the

main

diagonal

, running down and to the right, starting from its upper left corner. See

ﬁgure 2.1 for a graphical depiction of this operation. We denote the transpose of a

matrix A as A

, and it is deﬁned such that

(A

)

i,j

= A

j,i

. (2.3)

Vectors can be thought of as matrices that contain only one column. The

transpose of a vector is therefore a matrix with only one row. Sometimes we

A =

A

1,1

A

1,2

A

2,1

A

2,2

A

3,1

A

3,2

⇒ A

=

A

1,1

A

2,1

A

3,1

A

1,2

A

2,2

A

3,2

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the

main diagonal.

31

CHAPTER 2. LINEAR ALGEBRA

deﬁne a vector by writing out its elements in the text inline as a row matrix, then

using the transpose operator to turn it into a standard column vector, for example

x = [x

1

, x

2

, x

3

]

.

A scalar can be thought of as a matrix with only a single entry. From this, we

can see that a scalar is its own transpose: a = a

.

We can add matrices to each other, as long as they have the same shape, just

by adding their corresponding elements: C = A + B where C

i,j

= A

i,j

+ B

i,j

.

We can also add a scalar to a matrix or multiply a matrix by a scalar, just

by performing that operation on each element of a matrix:

D

=

a · B

+

c

where

D

i,j

= a · B

i,j

+ c.

In the context of deep learning, we also use some less conventional notation.

We allow the addition of matrix and a vector, yielding another matrix:

C

=

A

+

b

,

where

C

i,j

=

A

i,j

+

b

j

. In other words, the vector

b

is added to each row of the

matrix. This shorthand eliminates the need to deﬁne a matrix with

b

copied into

each row before doing the addition. This implicit copying of

b

to many locations

is called broadcasting.

2.2 Multiplying Matrices and Vectors

One of the most important operations involving matrices is multiplication of two

matrices. The

matrix product

of matrices

A

and

B

is a third matrix

C

. In

order for this product to be deﬁned,

A

must have the same number of columns as

B

has rows. If

A

is of shape

m × n

and

B

is of shape

n × p

, then

C

is of shape

m × p

. We can write the matrix product just by placing two or more matrices

together, for example,

C = AB. (2.4)

The product operation is deﬁned by

C

i,j

=

k

A

i,k

B

k,j

. (2.5)

Note that the standard product of two matrices is not just a matrix containing

the product of the individual elements. Such an operation exists and is called the

element-wise product, or Hadamard product, and is denoted as A B.

The

dot product

between two vectors

x

and

y

of the same dimensionality

is the matrix product

x

y

. We can think of the matrix product

C

=

AB

as

computing C

i,j

as the dot product between row i of A and column j of B.

32

CHAPTER 2. LINEAR ALGEBRA

Matrix product operations have many useful properties that make mathematical

analysis of matrices more convenient. For example, matrix multiplication is

distributive:

A(B + C) = AB + AC. (2.6)

It is also associative:

A(BC) = (AB)C. (2.7)

Matrix multiplication is not commutative (the condition

AB

=

BA

does not

always hold), unlike scalar multiplication. However, the dot product between two

vectors is commutative:

x

y = y

x. (2.8)

The transpose of a matrix product has a simple form:

(AB)

= B

A

. (2.9)

This enables us to demonstrate equation 2.8 by exploiting the fact that the value

of such a product is a scalar and therefore equal to its own transpose:

x

y =

x

y

= y

x. (2.10)

Since the focus of this textbook is not linear algebra, we do not attempt to

develop a comprehensive list of useful properties of the matrix product here, but

the reader should be aware that many more exist.

We now know enough linear algebra notation to write down a system of linear

equations:

Ax = b (2.11)

where

A ∈ R

m×n

is a known matrix,

b ∈ R