Chapter 2
Linear Algebra
Linear algebra is a branch of mathematics that is widely used throughout science
and engineering. However, because linear algebra is a form of continuous rather
than discrete mathematics, many computer scientists have little experience with it.
A good understanding of linear algebra is essential for understanding and working
with many machine learning algorithms, especially deep learning algorithms. We
therefore precede our introduction to deep learning with a focused presentation of
the key linear algebra prerequisites.
If you are already familiar with linear algebra, feel free to skip this chapter. If
you have previous experience with these concepts but need a detailed reference
sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and
Pedersen, 2006). If you have no exposure at all to linear algebra, this chapter
will teach you enough to read this book, but we highly recommend that you also
consult another resource focused exclusively on teaching linear algebra, such as
Shilov (1977). This chapter will completely omit many important linear algebra
topics that are not essential for understanding deep learning.
2.1 Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:
Scalars
: A scalar is just a single number, in contrast to most of the other
objects studied in linear algebra, which are usually arrays of multiple numbers.
We write scalars in italics. We usually give scalars lower-case variable names.
When we introduce them, we specify what kind of number they are. For
31
CHAPTER 2. LINEAR ALGEBRA
example, we might say “Let
s R
be the slope of the line,” while defining a
real-valued scalar, or “Let
n N
be the number of units,” while defining a
natural number scalar.
Vectors
: A vector is an array of numbers. The numbers are arranged in
order. We can identify each individual number by its index in that ordering.
Typically we give vectors lower case names written in bold typeface, such
as
x
. The elements of the vector are identified by writing its name in italic
typeface, with a subscript. The first element of
x
is
x
1
, the second element
is
x
2
and so on. We also need to say what kind of numbers are stored in
the vector. If each element is in
R
, and the vector has
n
elements, then the
vector lies in the set formed by taking the Cartesian product of
R n
times,
denoted as
R
n
. When we need to explicitly identify the elements of a vector,
we write them as a column enclosed in square brackets:
x =
x
1
x
2
.
.
.
x
n
. (2.1)
We can think of vectors as identifying points in space, with each element
giving the coordinate along a different axis.
Sometimes we need to index a set of elements of a vector. In this case, we
define a set containing the indices and write the set as a subscript. For
example, to access
x
1
,
x
3
and
x
6
, we define the set
S
=
{
1
,
3
,
6
}
and write
x
S
. We use the
sign to index the complement of a set. For example
x
1
is
the vector containing all elements of
x
except for
x
1
, and
x
S
is the vector
containing all of the elements of x except for x
1
, x
3
and x
6
.
Matrices
: A matrix is a 2-D array of numbers, so each element is identified
by two indices instead of just one. We usually give matrices upper-case
variable names with bold typeface, such as
A
. If a real-valued matrix
A
has
a height of
m
and a width of
n
, then we say that
A R
m×n
. We usually
identify the elements of a matrix using its name in italic but not bold font,
and the indices are listed with separating commas. For example,
A
1,1
is the
upper left entry of
A
and
A
m,n
is the bottom right entry. We can identify all
of the numbers with vertical coordinate
i
by writing a : for the horizontal
coordinate. For example,
A
i,:
denotes the horizontal cross section of
A
with
vertical coordinate
i
. This is known as the
i
-th
row
of
A
. Likewise,
A
:,i
is
32
CHAPTER 2. LINEAR ALGEBRA
A =
A
1,1
A
1,2
A
2,1
A
2,2
A
3,1
A
3,2
A
=
A
1,1
A
2,1
A
3,1
A
1,2
A
2,2
A
3,2
Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the
main diagonal.
the i-th column of A. When we need to explicitly identify the elements of
a matrix, we write them as an array enclosed in square brackets:
A
1,1
A
1,2
A
2,1
A
2,2
. (2.2)
Sometimes we may need to index matrix-valued expressions that are not just
a single letter. In this case, we use subscripts after the expression, but do
not convert anything to lower case. For example,
f
(
A
)
i,j
gives element (
i, j
)
of the matrix computed by applying the function f to A.
Tensors
: In some cases we will need an array with more than two axes.
In the general case, an array of numbers arranged on a regular grid with a
variable number of axes is known as a tensor. We denote a tensor named “A”
with this typeface:
A
. We identify the element of
A
at coordinates (
i, j, k
)
by writing A
i,j,k
.
One important operation on matrices is the
transpose
. The transpose of a
matrix is the mirror image of the matrix across a diagonal line, called the
main
diagonal
, running down and to the right, starting from its upper left corner. See
figure 2.1 for a graphical depiction of this operation. We denote the transpose of a
matrix A as A
, and it is defined such that
(A
)
i,j
= A
j,i
. (2.3)
Vectors can be thought of as matrices that contain only one column. The
transpose of a vector is therefore a matrix with only one row. Sometimes we
33
CHAPTER 2. LINEAR ALGEBRA
define a vector by writing out its elements in the text inline as a row matrix,
then using the transpose operator to turn it into a standard column vector, e.g.,
x = [x
1
, x
2
, x
3
]
.
A scalar can be thought of as a matrix with only a single entry. From this, we
can see that a scalar is its own transpose: a = a
.
We can add matrices to each other, as long as they have the same shape, just
by adding their corresponding elements: C = A + B where C
i,j
= A
i,j
+ B
i,j
.
We can also add a scalar to a matrix or multiply a matrix by a scalar, just
by performing that operation on each element of a matrix:
D
=
a · B
+
c
where
D
i,j
= a · B
i,j
+ c.
In the context of deep learning, we also use some less conventional notation.
We allow the addition of matrix and a vector, yielding another matrix:
C
=
A
+
b
,
where
C
i,j
=
A
i,j
+
b
j
. In other words, the vector
b
is added to each row of the
matrix. This shorthand eliminates the need to define a matrix with
b
copied into
each row before doing the addition. This implicit copying of
b
to many locations
is called broadcasting.
2.2 Multiplying Matrices and Vectors
One of the most important operations involving matrices is multiplication of two
matrices. The
matrix product
of matrices
A
and
B
is a third matrix
C
. In
order for this product to be defined,
A
must have the same number of columns as
B
has rows. If
A
is of shape
m × n
and
B
is of shape
n × p
, then
C
is of shape
m × p
. We can write the matrix product just by placing two or more matrices
together, e.g.
C = AB. (2.4)
The product operation is defined by
C
i,j
=
k
A
i,k
B
k,j
. (2.5)
Note that the standard product of two matrices is not just a matrix containing
the product of the individual elements. Such an operation exists and is called the
element-wise product or Hadamard product, and is denoted as A B.
The
dot product
between two vectors
x
and
y
of the same dimensionality
is the matrix product
x
y
. We can think of the matrix product
C
=
AB
as
computing C
i,j
as the dot product between row i of A and column j of B.
34
CHAPTER 2. LINEAR ALGEBRA
Matrix product operations have many useful properties that make mathematical
analysis of matrices more convenient. For example, matrix multiplication is
distributive:
A(B + C) = AB + AC. (2.6)
It is also associative:
A(BC) = (AB)C. (2.7)
Matrix multiplication is not commutative (the condition