Chapter 2
Linear Algebra
Linear algebra is a branch of mathematics that is widely used throughout science
and engineering. Yet because linear algebra is a form of continuous rather than
discrete mathematics, many computer scientists have little experience with it. A
good understanding of linear algebra is essential for understanding and working
with many machine learning algorithms, especially deep learning algorithms. We
therefore precede our introduction to deep learning with a focused presentation of
the key linear algebra prerequisites.
If you are already familiar with linear algebra, feel free to skip this chapter. If
you have previous experience with these concepts but need a detailed reference
sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and
Pedersen, 2006). If you have had no exposure at all to linear algebra, this chapter
will teach you enough to read this book, but we highly recommend that you also
consult another resource focused exclusively on teaching linear algebra, such as
Shilov (1977). This chapter completely omits many important linear algebra topics
that are not essential for understanding deep learning.
2.1 Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:
Scalars
: A scalar is just a single number, in contrast to most of the other
objects studied in linear algebra, which are usually arrays of multiple numbers.
We write scalars in italics. We usually give scalars lowercase variable names.
When we introduce them, we specify what kind of number they are. For
29
CHAPTER 2. LINEAR ALGEBRA
example, we might say “Let
s R
be the slope of the line,” while defining a
real-valued scalar, or “Let
n N
be the number of units,” while defining a
natural number scalar.
Vectors
: A vector is an array of numbers. The numbers are arranged in
order. We can identify each individual number by its index in that ordering.
Typically we give vectors lowercase names in bold typeface, such as
x
. The
elements of the vector are identified by writing its name in italic typeface,
with a subscript. The first element of
x
is
x
1
, the second element is
x
2
, and
so on. We also need to say what kind of numbers are stored in the vector. If
each element is in
R
, and the vector has
n
elements, then the vector lies in
the set formed by taking the Cartesian product of
R n
times, denoted as
R
n
.
When we need to explicitly identify the elements of a vector, we write them
as a column enclosed in square brackets:
x =
x
1
x
2
.
.
.
x
n
. (2.1)
We can think of vectors as identifying points in space, with each element
giving the coordinate along a different axis.
Sometimes we need to index a set of elements of a vector. In this case, we
define a set containing the indices and write the set as a subscript. For
example, to access
x
1
,
x
3
and
x
6
, we define the set
S
=
{
1
,
3
,
6
}
and write
x
S
. We use the
sign to index the complement of a set. For example
x
1
is
the vector containing all elements of
x
except for
x
1
, and
x
S
is the vector
containing all elements of x except for x
1
, x
3
and x
6
.
Matrices
: A matrix is a 2-D array of numbers, so each element is identified
by two indices instead of just one. We usually give matrices uppercase
variable names with bold typeface, such as
A
. If a real-valued matrix
A
has
a height of
m
and a width of
n
, then we say that
A R
m×n
. We usually
identify the elements of a matrix using its name in italic but not bold font,
and the indices are listed with separating commas. For example,
A
1,1
is the
upper left entry of
A
and
A
m,n
is the bottom right entry. We can identify
all the numbers with vertical coordinate
i
by writing a : for the horizontal
coordinate. For example,
A
i,:
denotes the horizontal cross section of
A
with
vertical coordinate
i
. This is known as the
i
-th
row
of
A
. Likewise,
A
:,i
is
30
CHAPTER 2. LINEAR ALGEBRA
the i-th column of A. When we need to explicitly identify the elements of
a matrix, we write them as an array enclosed in square brackets:
A
1,1
A
1,2
A
2,1
A
2,2
. (2.2)
Sometimes we may need to index matrix-valued expressions that are not just
a single letter. In this case, we use subscripts after the expression but do not
convert anything to lowercase. For example,
f
(
A
)
i,j
gives element (
i, j
) of
the matrix computed by applying the function f to A.
Tensors
: In some cases we will need an array with more than two axes.
In the general case, an array of numbers arranged on a regular grid with a
variable number of axes is known as a tensor. We denote a tensor named “A”
with this typeface:
A
. We identify the element of
A
at coordinates (
i, j, k
)
by writing A
i,j,k
.
One important operation on matrices is the
transpose
. The transpose of a
matrix is the mirror image of the matrix across a diagonal line, called the
main
diagonal
, running down and to the right, starting from its upper left corner. See
figure 2.1 for a graphical depiction of this operation. We denote the transpose of a
matrix A as A
, and it is defined such that
(A
)
i,j
= A
j,i
. (2.3)
Vectors can be thought of as matrices that contain only one column. The
transpose of a vector is therefore a matrix with only one row. Sometimes we
A =
A
1,1
A
1,2
A
2,1
A
2,2
A
3,1
A
3,2
A
=
A
1,1
A
2,1
A
3,1
A
1,2
A
2,2
A
3,2
Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the
main diagonal.
31
CHAPTER 2. LINEAR ALGEBRA
define a vector by writing out its elements in the text inline as a row matrix, then
using the transpose operator to turn it into a standard column vector, for example
x = [x
1
, x
2
, x
3
]
.
A scalar can be thought of as a matrix with only a single entry. From this, we
can see that a scalar is its own transpose: a = a
.
We can add matrices to each other, as long as they have the same shape, just
by adding their corresponding elements: C = A + B where C
i,j
= A
i,j
+ B
i,j
.
We can also add a scalar to a matrix or multiply a matrix by a scalar, just
by performing that operation on each element of a matrix:
D
=
a · B
+
c
where
D
i,j
= a · B
i,j
+ c.
In the context of deep learning, we also use some less conventional notation.
We allow the addition of matrix and a vector, yielding another matrix:
C
=
A
+
b
,
where
C
i,j
=
A
i,j
+
b
j
. In other words, the vector
b
is added to each row of the
matrix. This shorthand eliminates the need to define a matrix with
b
copied into
each row before doing the addition. This implicit copying of
b
to many locations
is called broadcasting.
2.2 Multiplying Matrices and Vectors
One of the most important operations involving matrices is multiplication of two
matrices. The
matrix product
of matrices
A
and
B
is a third matrix
C
. In
order for this product to be defined,
A
must have the same number of columns as
B
has rows. If
A
is of shape
m × n
and
B
is of shape
n × p
, then
C
is of shape
m × p
. We can write the matrix product just by placing two or more matrices
together, for example,
C = AB. (2.4)
The product operation is defined by
C
i,j
=
k
A
i,k
B
k,j
. (2.5)
Note that the standard product of two matrices is not just a matrix containing
the product of the individual elements. Such an operation exists and is called the
element-wise product, or Hadamard product, and is denoted as A B.
The
dot product
between two vectors
x
and
y
of the same dimensionality
is the matrix product
x
y
. We can think of the matrix product
C
=
AB
as
computing C
i,j
as the dot product between row i of A and column j of B.
32
CHAPTER 2. LINEAR ALGEBRA
Matrix product operations have many useful properties that make mathematical
analysis of matrices more convenient. For example, matrix multiplication is
distributive:
A(B + C) = AB + AC. (2.6)
It is also associative:
A(BC) = (AB)C. (2.7)
Matrix multiplication is not commutative (the condition
AB
=
BA
does not
always hold), unlike scalar multiplication. However, the dot product between two
vectors is commutative:
x
y = y
x. (2.8)
The transpose of a matrix product has a simple form:
(AB)
= B
A
. (2.9)
This enables us to demonstrate equation 2.8 by exploiting the fact that the value
of such a product is a scalar and therefore equal to its own transpose:
x
y =
x
y
= y
x. (2.10)
Since the focus of this textbook is not linear algebra, we do not attempt to
develop a comprehensive list of useful properties of the matrix product here, but
the reader should be aware that many more exist.
We now know enough linear algebra notation to write down a system of linear
equations:
Ax = b (2.11)
where
A R
m×n
is a known matrix,
b R