Chapter 3
Probability and Information Theory
In this chapter, we describe probability theory and information theory.
Probability theory is a mathematical framework for representing uncertain
statements. It provides a means of quantifying uncertainty as well as axioms for
deriving new uncertain statements. In artificial intelligence applications, we use
probability theory in two major ways. First, the laws of probability tell us how AI
systems should reason, so we design our algorithms to compute or approximate
various expressions derived using probability theory. Second, we can use probability
and statistics to theoretically analyze the behavior of proposed AI systems.
Probability theory is a fundamental tool of many disciplines of science and
engineering. We provide this chapter to ensure that readers whose background is
primarily in software engineering, with limited exposure to probability theory, can
understand the material in this book.
While probability theory allows us to make uncertain statements and to reason
in the presence of uncertainty, information theory enables us to quantify the amount
of uncertainty in a probability distribution.
If you are already familiar with probability theory and information theory,
you may wish to skip this chapter except for section 3.14, which describes the
graphs we use to describe structured probabilistic models for machine learning. If
you have absolutely no prior experience with these subjects, this chapter should
be sufficient to successfully carry out deep learning research projects, but we do
suggest that you consult an additional resource, such as Jaynes (2003).
3.1 Why Probability?
Many branches of computer science deal mostly with entities that are entirely
deterministic and certain. A programmer can usually safely assume that a CPU will
execute each machine instruction flawlessly. Errors in hardware do occur but are
rare enough that most software applications do not need to be designed to account
for them. Given that many computer scientists and software engineers work in a
relatively clean and certain environment, it can be surprising that machine learning
makes heavy use of probability theory.
Machine learning must always deal with uncertain quantities and sometimes
stochastic (nondeterministic) quantities. Uncertainty and stochasticity can arise
from many sources. Researchers have made compelling arguments for quantifying
uncertainty using probability since at least the 1980s. Many of the arguments
presented here are summarized from or inspired by Pearl (1988).
Nearly all activities require some ability to reason in the presence of uncertainty.
In fact, beyond mathematical statements that are true by definition, it is difficult
to think of any proposition that is absolutely true or any event that is absolutely
guaranteed to occur.
There are three possible sources of uncertainty:
1. Inherent stochasticity in the system being modeled. For example, most
interpretations of quantum mechanics describe the dynamics of subatomic
particles as being probabilistic. We can also create theoretical scenarios that
we postulate to have random dynamics, such as a hypothetical card game
where we assume that the cards are truly shuffled into a random order.
2. Incomplete observability. Even deterministic systems can appear stochastic
when we cannot observe all the variables that drive the behavior of the
system. For example, in the Monty Hall problem, a game show contestant is
asked to choose between three doors and wins a prize held behind the chosen
door. Two doors lead to a goat while a third leads to a car. The outcome
given the contestant’s choice is deterministic, but from the contestant’s point
of view, the outcome is uncertain.
3. Incomplete modeling. When we use a model that must discard some of
the information we have observed, the discarded information results in
uncertainty in the model’s predictions. For example, suppose we build a
robot that can exactly observe the location of every object around it. If the
robot discretizes space when predicting the future location of these objects,
then the discretization makes the robot immediately become uncertain about
the precise position of objects: each object could be anywhere within the
discrete cell that it was observed to occupy.
In many cases, it is more practical to use a simple but uncertain rule rather
than a complex but certain one, even if the true rule is deterministic and our
modeling system has the fidelity to accommodate a complex rule. For example, the
simple rule “Most birds fly” is cheap to develop and is broadly useful, while a rule
of the form, “Birds fly, except for very young birds that have not yet learned to
fly, sick or injured birds that have lost the ability to fly, flightless species of birds
including the cassowary, ostrich and kiwi . . . ” is expensive to develop, maintain
and communicate and, after all this effort, is still brittle and prone to failure.
While it should be clear that we need a means of representing and reasoning
about uncertainty, it is not immediately obvious that probability theory can provide
all the tools we want for artificial intelligence applications. Probability theory
was originally developed to analyze the frequencies of events. It is easy to see
how probability theory can be used to study events like drawing a certain hand of
cards in a poker game. These kinds of events are often repeatable. When we say
that an outcome has a probability p of occurring, it means that if we repeated the experiment (e.g., drawing a hand of cards) infinitely many times, then a proportion p of the repetitions would result in that outcome. This kind of reasoning does not
seem immediately applicable to propositions that are not repeatable. If a doctor
analyzes a patient and says that the patient has a 40 percent chance of having
the flu, this means something very different—we cannot make infinitely many
replicas of the patient, nor is there any reason to believe that different replicas of
the patient would present with the same symptoms yet have varying underlying
conditions. In the case of the doctor diagnosing the patient, we use probability
to represent a degree of belief, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient does not have the flu. The former kind of probability, related directly to the rates at which events occur, is known as frequentist probability, while the latter, related to qualitative levels of certainty, is known as Bayesian probability.
If we list several properties that we expect common sense reasoning about
uncertainty to have, then the only way to satisfy those properties is to treat
Bayesian probabilities as behaving exactly the same as frequentist probabilities.
For example, if we want to compute the probability that a player will win a poker
game given that she has a certain set of cards, we use exactly the same formulas
as when we compute the probability that a patient has a disease given that she
has certain symptoms. For more details about why a small set of common sense
assumptions implies that the same axioms must control both kinds of probability,
see Ramsey (1926).
Probability can be seen as the extension of logic to deal with uncertainty. Logic
provides a set of formal rules for determining what propositions are implied to
be true or false given the assumption that some other set of propositions is true
or false. Probability theory provides a set of formal rules for determining the
likelihood of a proposition being true given the likelihood of other propositions.
3.2 Random Variables
A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lowercase letter in plain typeface, and the values it can take on with lowercase script letters. For example, x_1 and x_2 are both possible values that the random variable x can take on. For vector-valued variables, we would write the random variable as x and one of its values as x. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states is.
Random variables may be discrete or continuous. A discrete random variable
is one that has a finite or countably infinite number of states. Note that these
states are not necessarily the integers; they can also just be named states that
are not considered to have any numerical value. A continuous random variable is
associated with a real value.
3.3 Probability Distributions
A
probability distribution
is a description of how likely a random variable or
set of random variables is to take on each of its possible states. The way we
describe probability distributions depends on whether the variables are discrete or
continuous.
3.3.1 Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital P. Often we associate each random variable with a different probability mass function and the reader must infer which PMF to use based on the identity of the random variable, rather than on the name of the function; P(x) is usually not the same as P(y).
The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state. The probability that x = x is denoted as P(x), with a probability of 1 indicating that x = x is certain and a probability of 0 indicating that x = x is impossible. Sometimes to disambiguate which PMF to use, we write the name of the random variable explicitly: P(x = x). Sometimes we define a variable first, then use ∼ notation to specify which distribution it follows later: x ∼ P(x).
Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.
To be a PMF on a random variable x, a function P must satisfy the following properties:

• The domain of P must be the set of all possible states of x.

• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.

• ∑_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.
For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x—that is, make each of its states equally likely—by setting its PMF to

P(x = x_i) = 1/k        (3.1)

for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive because k is a positive integer. We also see that

∑_i P(x = x_i) = ∑_i 1/k = k/k = 1,        (3.2)

so the distribution is properly normalized.
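To make this concrete, here is a minimal NumPy sketch of a uniform PMF over k named states; the value of k and the state names are arbitrary choices for illustration, and the assertions check the two PMF requirements above.

```python
import numpy as np

k = 6
states = [f"state_{i}" for i in range(k)]   # named states; no numeric meaning assumed
pmf = np.full(k, 1.0 / k)                   # P(x = x_i) = 1/k for every state

# Every probability lies in [0, 1].
assert np.all((pmf >= 0.0) & (pmf <= 1.0))
# The probabilities sum to 1 (the distribution is normalized).
assert np.isclose(pmf.sum(), 1.0)

print(dict(zip(states, pmf)))
```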
3.3.2 Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:

• The domain of p must be the set of all possible states of x.

• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.

• ∫ p(x) dx = 1.
A probability density function p(x) does not give the probability of a specific state directly; instead the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.

We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set. In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_{[a,b]} p(x) dx.
For an example of a PDF corresponding to a specific probability density over a continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The “;” notation means “parametrized by”; we consider x to be the argument of the function, while a and b are parameters that define the function. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is non-negative everywhere. Additionally, it integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).
3.4 Marginal Probability
Sometimes we know the probability distribution over a set of variables and we want
to know the probability distribution over just a subset of them. The probability
distribution over the subset is known as the marginal probability distribution.
For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find P(x) with the sum rule:

∀x ∈ x,  P(x = x) = ∑_y P(x = x, y = y).        (3.3)
The name “marginal probability” comes from the process of computing marginal probabilities on paper. When the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row.
For continuous variables, we need to use integration instead of summation:
p(x) = ∫ p(x, y) dy.        (3.4)
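As an illustration, the following sketch stores a small, made-up joint PMF P(x, y) as a table and marginalizes out y by summing across rows, following equation 3.3.

```python
import numpy as np

# Joint PMF P(x, y) with 2 states of x (rows) and 3 states of y (columns).
P_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])
assert np.isclose(P_xy.sum(), 1.0)          # a valid joint distribution

# Sum rule: P(x = x) = sum over y of P(x = x, y = y); here, summing over columns.
P_x = P_xy.sum(axis=1)
print(P_x)                                  # [0.5 0.5]
```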
3.5 Conditional Probability
In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = y given x = x as P(y = y | x = x). This conditional probability can be computed with the formula

P(y = y | x = x) = P(y = y, x = x) / P(x = x).        (3.5)
The conditional probability is only defined when P(x = x) > 0. We cannot compute the conditional probability conditioned on an event that never happens.
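Continuing with the same kind of made-up joint table, a short sketch of equation 3.5 looks like this; the guard reflects the requirement that P(x = x) > 0.

```python
import numpy as np

P_xy = np.array([[0.10, 0.25, 0.15],    # joint P(x, y); rows index x, columns index y
                 [0.20, 0.05, 0.25]])
x = 0                                    # condition on the event x = 0
P_x = P_xy[x].sum()                      # marginal P(x = 0)

# Conditional probability is only defined when P(x = x) > 0.
assert P_x > 0
P_y_given_x = P_xy[x] / P_x              # P(y = y | x = 0) for every y
print(P_y_given_x, P_y_given_x.sum())    # [0.2 0.5 0.3], sums to 1
```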
It is important not to confuse conditional probability with computing what
would happen if some action were undertaken. The conditional probability that
a person is from Germany given that they speak German is quite high, but if
a randomly selected person is taught to speak German, their country of origin
does not change. Computing the consequences of an action is called making an intervention query. Intervention queries are the domain of causal modeling, which we do not explore in this book.
3.6 The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed
into conditional distributions over only one variable:
P(x^(1), . . . , x^(n)) = P(x^(1)) ∏_{i=2}^{n} P(x^(i) | x^(1), . . . , x^(i−1)).        (3.6)
This observation is known as the chain rule, or product rule, of probability.
It follows immediately from the definition of conditional probability in equation 3.5.
For example, applying the definition twice, we get
P (a, b, c) = P (a | b, c)P (b, c)
P (b, c) = P (b | c)P (c)
P (a, b, c) = P (a | b, c)P (b | c)P (c).
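A quick numerical sanity check of this decomposition, using a randomly generated joint table over three binary variables (the table values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))        # unnormalized table indexed as P[a, b, c]
P /= P.sum()                     # normalize so it is a valid joint distribution

for a in range(2):
    for b in range(2):
        for c in range(2):
            P_bc = P[:, b, c].sum()                  # P(b, c)
            P_c = P[:, :, c].sum()                   # P(c)
            P_a_given_bc = P[a, b, c] / P_bc         # P(a | b, c)
            P_b_given_c = P_bc / P_c                 # P(b | c)
            # Chain rule: P(a, b, c) = P(a | b, c) P(b | c) P(c)
            assert np.isclose(P[a, b, c], P_a_given_bc * P_b_given_c * P_c)
```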
3.7 Independence and Conditional Independence
Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:

∀x ∈ x, y ∈ y,  p(x = x, y = y) = p(x = x) p(y = y).        (3.7)
Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

∀x ∈ x, y ∈ y, z ∈ z,  p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z).        (3.8)
We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.
3.8 Expectation, Variance and Covariance
The expectation, or expected value, of some function f(x) with respect to a probability distribution P(x) is the average, or mean value, that f takes on when x is drawn from P. For discrete variables this can be computed with a summation:

E_{x∼P}[f(x)] = ∑_x P(x) f(x),        (3.9)

while for continuous variables, it is computed with an integral:

E_{x∼p}[f(x)] = ∫ p(x) f(x) dx.        (3.10)
When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in E_x[f(x)]. If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in E[f(x)]. By default, we can assume that E[·] averages over the values of all the random variables inside the brackets. Likewise, when there is no ambiguity, we may omit the square brackets.
Expectations are linear, for example,

E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)],        (3.11)

when α and β are not dependent on x.
The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

Var(f(x)) = E[(f(x) − E[f(x)])²].        (3.12)

When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation.
The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

Cov(f(x), g(y)) = E[(f(x) − E[f(x)])(g(y) − E[g(y)])].        (3.13)
High absolute values of the covariance mean that the values change very much
and are both far from their respective means at the same time. If the sign of the
covariance is positive, then both variables tend to take on relatively high values
simultaneously. If the sign of the covariance is negative, then one variable tends to
take on a relatively high value at the times that the other takes on a relatively
low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.
The notions of covariance and dependence are related but distinct concepts.
They are related because two variables that are independent have zero covariance,
and two variables that have nonzero covariance are dependent. Independence,
however, is a distinct property from covariance. For two variables to have zero
covariance, there must be no linear dependence between them. Independence is
a stronger requirement than zero covariance, because independence also excludes
nonlinear relationships. It is possible for two variables to be dependent but have
zero covariance. For example, suppose we first sample a real number x from a uniform distribution over the interval [−1, 1]. We next sample a random variable s. With probability 1/2, we choose the value of s to be 1. Otherwise, we choose the value of s to be −1. We can then generate a random variable y by assigning y = sx. Clearly, x and y are not independent, because x completely determines the magnitude of y. However, Cov(x, y) = 0.
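This construction is easy to simulate. The sketch below estimates Cov(x, y) from samples and shows that it is near zero even though the magnitude of y is completely determined by x; the sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1_000_000
x = rng.uniform(-1.0, 1.0, size=m)            # x ~ U(-1, 1)
s = rng.choice([1.0, -1.0], size=m)           # s is +1 or -1, each with probability 1/2
y = s * x                                     # y = sx: dependent on x, but uncorrelated

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)                                 # close to 0
print(np.mean(np.abs(y) == np.abs(x)))        # 1.0: |y| is completely determined by |x|
```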
The covariance matrix of a random vector x ∈ R^n is an n × n matrix, such that

Cov(x)_{i,j} = Cov(x_i, x_j).        (3.14)

The diagonal elements of the covariance give the variance:

Cov(x_i, x_i) = Var(x_i).        (3.15)
3.9 Common Probability Distributions
Several simple probability distributions are useful in many contexts in machine
learning.
3.9.1 Bernoulli Distribution
The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:

P(x = 1) = φ        (3.16)
P(x = 0) = 1 − φ        (3.17)
P(x = x) = φ^x (1 − φ)^{1−x}        (3.18)
E_x[x] = φ        (3.19)
Var_x(x) = φ(1 − φ)        (3.20)
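As a sanity check, the following sketch draws samples from a Bernoulli distribution with an arbitrarily chosen φ and compares the empirical mean and variance against φ and φ(1 − φ).

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.3
samples = (rng.random(100_000) < phi).astype(float)   # 1 with probability phi, else 0

print(samples.mean())          # approximately phi
print(samples.var())           # approximately phi * (1 - phi) = 0.21
```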
3.9.2 Multinoulli Distribution
The multinoulli, or categorical, distribution is a distribution over a single discrete variable with k different states, where k is finite.¹ The multinoulli distribution is parametrized by a vector p ∈ [0, 1]^{k−1}, where p_i gives the probability of the i-th state. The final, k-th state’s probability is given by 1 − 1⊤p. Note that we must constrain 1⊤p ≤ 1. Multinoulli distributions are often used to refer to distributions over categories of objects, so we do not usually assume that state 1 has numerical value 1, and so on. For this reason, we do not usually need to compute the expectation or variance of multinoulli-distributed random variables.

¹ “Multinoulli” is a term that was recently coined by Gustavo Lacerda and popularized by Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution. A multinomial distribution is the distribution over vectors in {0, . . . , n}^k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution. Many texts use the term “multinomial” to refer to multinoulli distributions without clarifying that they are referring only to the n = 1 case.
The Bernoulli and multinoulli distributions are sufficient to describe any distri-
bution over their domain. They are able to describe any distribution over their
domain not so much because they are particularly powerful but rather because their
domain is simple; they model discrete variables for which it is feasible to enumerate
all the states. When dealing with continuous variables, there are uncountably
many states, so any distribution described by a small number of parameters must
impose strict limits on the distribution.
3.9.3 Gaussian Distribution
The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

N(x; µ, σ²) = √(1/(2πσ²)) exp( −(1/(2σ²)) (x − µ)² ).        (3.21)
See figure 3.1 for a plot of the normal distribution density function.
The two parameters µ ∈ R and σ ∈ (0, ∞) control the normal distribution. The parameter µ gives the coordinate of the central peak. This is also the mean of the distribution: E[x] = µ. The standard deviation of the distribution is given by σ, and the variance by σ².
When we evaluate the PDF, we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ∈ (0, ∞) to control the precision, or inverse variance, of the distribution:

N(x; µ, β⁻¹) = √(β/(2π)) exp( −(1/2) β (x − µ)² ).        (3.22)
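The two parametrizations are easy to compare numerically. Below is a sketch of the univariate normal density written once in terms of σ² and once in terms of the precision β = 1/σ²; the parameter values are arbitrary.

```python
import numpy as np

def normal_pdf_variance(x, mu, sigma2):
    # N(x; mu, sigma^2) = sqrt(1 / (2 pi sigma^2)) exp(-(x - mu)^2 / (2 sigma^2))
    return np.sqrt(1.0 / (2.0 * np.pi * sigma2)) * np.exp(-((x - mu) ** 2) / (2.0 * sigma2))

def normal_pdf_precision(x, mu, beta):
    # N(x; mu, beta^{-1}) = sqrt(beta / (2 pi)) exp(-beta (x - mu)^2 / 2)
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

x = np.linspace(-3.0, 3.0, 7)
mu, sigma2 = 0.5, 2.0
assert np.allclose(normal_pdf_variance(x, mu, sigma2),
                   normal_pdf_precision(x, mu, 1.0 / sigma2))
```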
Normal distributions are a sensible choice for many applications. In the absence
of prior knowledge about what form a distribution over the real numbers should
take, the normal distribution is a good default choice for two major reasons.
[Figure 3.1 plots the density p(x) against x, annotated with its maximum at x = µ and inflection points at x = µ ± σ.]

Figure 3.1: The normal distribution. The normal distribution N(x; µ, σ²) exhibits a classic “bell curve” shape, with the x coordinate of its central peak given by µ, and the width of its peak controlled by σ. In this example, we depict the standard normal distribution, with µ = 0 and σ = 1.
First, many distributions we wish to model are truly close to being normal
distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that
in practice, many complicated systems can be modeled successfully as normally
distributed noise, even if the system can be decomposed into parts with more
structured behavior.
Second, out of all possible probability distributions with the same variance,
the normal distribution encodes the maximum amount of uncertainty over the
real numbers. We can thus think of the normal distribution as being the one
that inserts the least amount of prior knowledge into a model. Fully developing
and justifying this idea requires more mathematical tools and is postponed to
section 19.4.2.
The normal distribution generalizes to R^n, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix Σ:

N(x; µ, Σ) = √( 1/((2π)^n det(Σ)) ) exp( −(1/2)(x − µ)⊤ Σ⁻¹ (x − µ) ).        (3.23)
The parameter µ still gives the mean of the distribution, though now it is vector valued. The parameter Σ gives the covariance matrix of the distribution.

As in the univariate case, when we wish to evaluate the PDF several times for many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert Σ to evaluate the PDF. We can instead use a precision matrix β:

N(x; µ, β⁻¹) = √( det(β)/(2π)^n ) exp( −(1/2)(x − µ)⊤ β (x − µ) ).        (3.24)
We often fix the covariance matrix to be a diagonal matrix. An even simpler
version is the
isotropic
Gaussian distribution, whose covariance matrix is a scalar
times the identity matrix.
3.9.4 Exponential and Laplace Distributions
In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

p(x; λ) = λ 1_{x≥0} exp(−λx).        (3.25)

The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the Laplace distribution

Laplace(x; µ, γ) = (1/(2γ)) exp( −|x − µ|/γ ).        (3.26)
3.9.5 The Dirac Distribution and Empirical Distribution
In some cases, we wish to specify that all the mass in a probability distribution
clusters around a single point. This can be accomplished by defining a PDF using
the Dirac delta function, δ(x):
p(x) = δ(x − µ).        (3.27)
The Dirac delta function is defined such that it is zero valued everywhere except
0, yet integrates to 1. The Dirac delta function is not an ordinary function that
associates each value
x
with a real-valued output; instead it is a different kind of
mathematical object called a generalized function that is defined in terms of its properties when integrated. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less density on all points other than zero.

By defining p(x) to be δ shifted by µ we obtain an infinitely narrow and infinitely high peak of probability density where x = µ.
A common use of the Dirac delta distribution is as a component of an empirical distribution,

p̂(x) = (1/m) ∑_{i=1}^{m} δ(x − x^{(i)}),        (3.28)

which puts probability mass 1/m on each of the m points x^{(1)}, . . . , x^{(m)}, forming a given data set or collection of samples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated with each possible input value that is simply equal to the empirical frequency of that value in the training set.
We can view the empirical distribution formed from a dataset of training
examples as specifying the distribution that we sample from when we train a model
on this dataset. Another important perspective on the empirical distribution is
that it is the probability density that maximizes the likelihood of the training data
(see section 5.5).
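Operationally, sampling from the empirical distribution p̂ amounts to drawing training points uniformly at random with replacement, as in this small sketch (the data values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = np.array([[0.1, 2.0], [1.5, -0.3], [0.7, 0.9]])   # m = 3 training points

# Each draw from the empirical distribution selects one of the m points
# with probability 1/m (i.e., sampling the data set with replacement).
idx = rng.integers(len(X_train), size=5)
samples = X_train[idx]
print(samples)
```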
3.9.6 Mixtures of Distributions
It is also common to define probability distributions by combining other simpler
probability distributions. One common way of combining distributions is to
construct a mixture distribution. A mixture distribution is made up of several
component distributions. On each trial, the choice of which component distribution
should generate the sample is determined by sampling a component identity from
a multinoulli distribution:
P(x) = ∑_i P(c = i) P(x | c = i),        (3.29)

where P(c) is the multinoulli distribution over component identities.
We have already seen one example of a mixture distribution: the empirical
distribution over real-valued variables is a mixture distribution with one Dirac
component for each training example.
The mixture model is one simple strategy for combining probability distributions
to create a richer distribution. In chapter 16, we explore the art of building complex
probability distributions from simple ones in more detail.
The mixture model allows us to briefly glimpse a concept that will be of paramount importance later—the latent variable. A latent variable is a random variable that we cannot observe directly. The component identity variable c of the mixture model provides an example. Latent variables may be related to x through the joint distribution, in this case, P(x, c) = P(x | c) P(c). The distribution P(c) over the latent variable and the distribution P(x | c) relating the latent variables to the visible variables determines the shape of the distribution P(x), even though it is possible to describe P(x) without reference to the latent variable. Latent variables are discussed further in section 16.5.
A very powerful and common type of mixture model is the Gaussian mixture model, in which the components p(x | c = i) are Gaussians. Each component has a separately parametrized mean µ^{(i)} and covariance Σ^{(i)}. Some mixtures can have more constraints. For example, the covariances could be shared across components via the constraint Σ^{(i)} = Σ, ∀i. As with a single Gaussian distribution, the mixture of Gaussians might constrain the covariance matrix for each component to be diagonal or isotropic.

In addition to the means and covariances, the parameters of a Gaussian mixture specify the prior probability α_i = P(c = i) given to each component i. The word “prior” indicates that it expresses the model’s beliefs about c before it has observed x. By comparison, P(c | x) is a posterior probability, because it is computed after observation of x. A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components.
Figure 3.2 shows samples from a Gaussian mixture model.
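Ancestral sampling from a Gaussian mixture follows equation 3.29 directly: first sample a component identity c from the multinoulli distribution over components, then sample x from the chosen Gaussian. A minimal sketch, with made-up parameters that are not those used to generate the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.3, 0.5, 0.2])                        # prior P(c = i)
means = [np.array([-2.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, -1.0])]
covs = [np.eye(2) * 0.5,                                  # isotropic covariance
        np.diag([0.2, 1.5]),                              # diagonal covariance
        np.array([[1.0, 0.8], [0.8, 1.0]])]               # full covariance

def sample_gmm(n):
    cs = rng.choice(len(alphas), size=n, p=alphas)        # sample component identities
    return np.array([rng.multivariate_normal(means[c], covs[c]) for c in cs])

print(sample_gmm(5))
```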
3.10 Useful Properties of Common Functions
Certain functions arise often while working with probability distributions,
especially the probability distributions used in deep learning models.
One of these functions is the logistic sigmoid:
σ(x) = 1 / (1 + exp(−x)).        (3.30)
Figure 3.2: Samples from a Gaussian mixture model. In this example, there are three components. From left to right, the first component has an isotropic covariance matrix, meaning it has the same amount of variance in each direction. The second has a diagonal covariance matrix, meaning it can control the variance separately along each axis-aligned direction. This example has more variance along the x_2 axis than along the x_1 axis. The third component has a full-rank covariance matrix, enabling it to control the variance separately along an arbitrary basis of directions.
The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli distribution because its range is (0, 1), which lies within the valid range of values for the φ parameter. See figure 3.3 for a graph of the sigmoid function. The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.
Another commonly encountered function is the softplus function (Dugas et al., 2001):

ζ(x) = log(1 + exp(x)).        (3.31)

The softplus function can be useful for producing the β or σ parameter of a normal distribution because its range is (0, ∞). It also arises commonly when manipulating expressions involving sigmoids. The name of the softplus function comes from the fact that it is a smoothed, or “softened,” version of

x⁺ = max(0, x).        (3.32)
See figure 3.4 for a graph of the softplus function.
Figure 3.3: The logistic sigmoid function.

Figure 3.4: The softplus function.

The following properties are all useful enough that you may wish to memorize them:
σ(x) = exp(x) / (exp(x) + exp(0))        (3.33)
d/dx σ(x) = σ(x)(1 − σ(x))        (3.34)
1 − σ(x) = σ(−x)        (3.35)
log σ(x) = −ζ(−x)        (3.36)
d/dx ζ(x) = σ(x)        (3.37)
∀x ∈ (0, 1),  σ⁻¹(x) = log( x / (1 − x) )        (3.38)
∀x > 0,  ζ⁻¹(x) = log(exp(x) − 1)        (3.39)
ζ(x) = ∫_{−∞}^{x} σ(y) dy        (3.40)
ζ(x) − ζ(−x) = x        (3.41)
The function σ⁻¹(x) is called the logit in statistics, but this term is rarely used in machine learning.
Equation 3.41 provides extra justification for the name “softplus.” The softplus function is intended as a smoothed version of the positive part function, x⁺ = max{0, x}. The positive part function is the counterpart of the negative part function, x⁻ = max{0, −x}. To obtain a smooth function that is analogous to the negative part, one can use ζ(−x). Just as x can be recovered from its positive part and its negative part via the identity x⁺ − x⁻ = x, it is also possible to recover x using the same relationship between ζ(x) and ζ(−x), as shown in equation 3.41.
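Several of these identities are straightforward to verify numerically, as in the sketch below, which checks equations 3.34–3.36 and 3.41 on a grid of points; the grid and tolerance are arbitrary choices.

```python
import numpy as np

def sigma(x):          # logistic sigmoid, equation 3.30
    return 1.0 / (1.0 + np.exp(-x))

def zeta(x):           # softplus, equation 3.31
    return np.log1p(np.exp(x))

x = np.linspace(-5.0, 5.0, 101)
eps = 1e-6

# Equation 3.34: d/dx sigma(x) = sigma(x)(1 - sigma(x)), checked by finite differences.
num_grad = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)
assert np.allclose(num_grad, sigma(x) * (1 - sigma(x)), atol=1e-6)

# Equation 3.35: 1 - sigma(x) = sigma(-x)
assert np.allclose(1 - sigma(x), sigma(-x))

# Equation 3.36: log sigma(x) = -zeta(-x)
assert np.allclose(np.log(sigma(x)), -zeta(-x))

# Equation 3.41: zeta(x) - zeta(-x) = x
assert np.allclose(zeta(x) - zeta(-x), x)
```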
3.11 Bayes’ Rule
We often find ourselves in a situation where we know P(y | x) and need to know P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity using Bayes’ rule:

P(x | y) = P(x) P(y | x) / P(y).        (3.42)
Note that while P(y) appears in the formula, it is usually feasible to compute P(y) = ∑_x P(y | x) P(x), so we do not need to begin with knowledge of P(y).
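As a small worked example with made-up numbers, suppose P(x) is a prior over whether a patient has the flu and P(y = 1 | x) is the probability of a positive test in each case; Bayes’ rule then gives the posterior P(x = 1 | y = 1).

```python
import numpy as np

P_x = np.array([0.9, 0.1])            # prior: P(x = 0) = 0.9, P(x = 1) = 0.1
P_y1_given_x = np.array([0.05, 0.8])  # likelihood of a positive test for each x

# P(y = 1) = sum over x of P(y = 1 | x) P(x)
P_y1 = np.sum(P_y1_given_x * P_x)

# Bayes' rule: P(x = 1 | y = 1) = P(x = 1) P(y = 1 | x = 1) / P(y = 1)
posterior = P_x[1] * P_y1_given_x[1] / P_y1
print(posterior)                      # 0.08 / 0.125 = 0.64
```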
Bayes’ rule is straightforward to derive from the definition of conditional
probability, but it is useful to know the name of this formula since many texts
refer to it by name. It is named after the Reverend Thomas Bayes, who first
discovered a special case of the formula. The general version presented here was
independently discovered by Pierre-Simon Laplace.
3.12 Technical Details of Continuous Variables
A proper formal understanding of continuous random variables and probability
density functions requires developing probability theory in terms of a branch of
mathematics known as
measure theory
. Measure theory is beyond the scope of
this textbook, but we can briefly sketch some of the issues that measure theory is
employed to resolve.
In section 3.3.2, we saw that the probability of a continuous vector-valued x lying in some set S is given by the integral of p(x) over the set S. Some choices of set S can produce paradoxes. For example, it is possible to construct two sets S_1 and S_2 such that p(x ∈ S_1) + p(x ∈ S_2) > 1 but S_1 ∩ S_2 = ∅. These sets are generally constructed making very heavy use of the infinite precision of real numbers, for example by making fractal-shaped sets or sets that are defined by transforming the set of rational numbers.² One of the key contributions of measure theory is to provide a characterization of the set of sets we can compute the probability of without encountering paradoxes. In this book, we integrate only over sets with relatively simple descriptions, so this aspect of measure theory never becomes a relevant concern.

² The Banach-Tarski theorem provides a fun example of such sets.
For our purposes, measure theory is more useful for describing theorems that apply to most points in R^n but do not apply to some corner cases. Measure theory provides a rigorous way of describing that a set of points is negligibly small. Such a set is said to have measure zero. We do not formally define this concept in this textbook. For our purposes, it is sufficient to understand the intuition that a set of measure zero occupies no volume in the space we are measuring. For example, within R², a line has measure zero, while a filled polygon has positive measure. Likewise, an individual point has measure zero. Any union of countably many sets that each have measure zero also has measure zero (so the set of all the rational numbers has measure zero, for instance).
Another useful term from measure theory is almost everywhere. A property that holds almost everywhere holds throughout all space except for on a set of
measure zero. Because the exceptions occupy a negligible amount of space, they
can be safely ignored for many applications. Some important results in probability
theory hold for all discrete values but hold “almost everywhere” only for continuous
values.
Another technical detail of continuous variables relates to handling continuous random variables that are deterministic functions of one another. Suppose we have two random variables, x and y, such that y = g(x), where g is an invertible, continuous, differentiable transformation. One might expect that p_y(y) = p_x(g⁻¹(y)). This is actually not the case.
As a simple example, suppose we have scalar random variables x and y. Suppose y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y), then p_y will be 0 everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

∫ p_y(y) dy = 1/2,        (3.43)
which violates the definition of a probability distribution. This is a common mistake.
The problem with this approach is that it fails to account for the distortion of space introduced by the function g. Recall that the probability of x lying in an infinitesimally small region with volume δx is given by p(x)δx. Since g can expand or contract space, the infinitesimal volume surrounding x in x space may have different volume in y space.
To see how to correct the problem, we return to the scalar case. We need to preserve the property

|p_y(g(x)) dy| = |p_x(x) dx|.        (3.44)

Solving from this, we obtain

p_y(y) = p_x(g⁻¹(y)) |∂x/∂y|        (3.45)

or equivalently

p_x(x) = p_y(g(x)) |∂g(x)/∂x|.        (3.46)
In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix—the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,

p_x(x) = p_y(g(x)) |det( ∂g(x)/∂x )|.        (3.47)
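We can check the scalar case numerically for the y = x/2 example above: the naive rule integrates to 1/2, while equation 3.45, which includes the |dx/dy| = 2 factor, integrates to 1. The grid resolution below is an arbitrary choice.

```python
import numpy as np

def p_x(x):                                   # density of x ~ U(0, 1)
    return ((x >= 0) & (x <= 1)).astype(float)

y = np.linspace(0.0, 1.0, 200_001)            # grid over possible values of y
dy = y[1] - y[0]

naive = p_x(2 * y)                            # incorrect rule: p_y(y) = p_x(g^{-1}(y))
correct = p_x(2 * y) * 2.0                    # equation 3.45 with |dx/dy| = 2

print(naive.sum() * dy)                       # crude Riemann sum: approximately 0.5
print(correct.sum() * dy)                     # approximately 1.0
```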
3.13 Information Theory
Information theory is a branch of applied mathematics that revolves around
quantifying how much information is present in a signal. It was originally invented
to study sending messages from discrete alphabets over a noisy channel, such as
communication via radio transmission. In this context, information theory tells how
to design optimal codes and calculate the expected length of messages sampled from
specific probability distributions using various encoding schemes. In the context of
machine learning, we can also apply information theory to continuous variables
where some of these message length interpretations do not apply. This field is
fundamental to many areas of electrical engineering and computer science. In this
textbook, we mostly use a few key ideas from information theory to characterize
probability distributions or to quantify similarity between probability distributions.
For more detail on information theory, see Cover and Thomas (2006) or MacKay
(2003).
The basic intuition behind information theory is that learning that an unlikely
event has occurred is more informative than learning that a likely event has
occurred. A message saying “the sun rose this morning” is so uninformative as
to be unnecessary to send, but a message saying “there was a solar eclipse this
morning” is very informative.
We would like to quantify information in a way that formalizes this intuition.
• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.

• Less likely events should have higher information content.

• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
To satisfy all three of these properties, we define the self-information of an event x = x to be

I(x) = −log P(x).        (3.48)
In this book, we always use log to mean the natural logarithm, with base e. Our definition of I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.
When
x
is continuous, we use the same definition of information by analogy,
but some of the properties from the discrete case are lost. For example, an event
with unit density still has zero information, despite not being an event that is
guaranteed to occur.
Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy,

H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)],        (3.49)
also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. See figure 3.5 for a demonstration. When x is continuous, the Shannon entropy is known as the differential entropy.
If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

D_KL(P‖Q) = E_{x∼P}[ log( P(x)/Q(x) ) ] = E_{x∼P}[ log P(x) − log Q(x) ].        (3.50)
In the case of discrete variables, it is the extra amount of information (measured
in bits if we use the base-2 logarithm, but in machine learning we usually use nats
and the natural logarithm) needed to send a message containing symbols drawn
from probability distribution
P
, when we use a code that was designed to minimize
the length of messages drawn from probability distribution Q.
The KL divergence has many useful properties, most notably being non-negative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables. Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. It is not a true distance measure because it is not symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P) for some P and Q. This asymmetry means that there are important consequences to the choice of whether to use D_KL(P‖Q) or D_KL(Q‖P). See figure 3.6 for more detail.
[Figure 3.5 plots the Shannon entropy, measured in nats, as a function of p.]

Figure 3.5: Shannon entropy of a binary random variable. This plot shows how distributions that are closer to deterministic have low Shannon entropy while distributions that are close to uniform have high Shannon entropy. On the horizontal axis, we plot p, the probability of a binary random variable being equal to 1. The entropy is given by (p − 1) log(1 − p) − p log p. When p is near 0, the distribution is nearly deterministic, because the random variable is nearly always 0. When p is near 1, the distribution is nearly deterministic, because the random variable is nearly always 1. When p = 0.5, the entropy is maximal, because the distribution is uniform over the two outcomes.
A quantity that is closely related to the KL divergence is the cross-entropy H(P, Q) = H(P) + D_KL(P‖Q), which is similar to the KL divergence but lacking the term on the left:

H(P, Q) = −E_{x∼P} log Q(x).        (3.51)
Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.
When computing many of these quantities, it is common to encounter expres-
sions of the form 0 log 0. By convention, in the context of information theory, we treat these expressions as lim_{x→0} x log x = 0.
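The quantities above are simple to compute for small discrete distributions. The sketch below evaluates Shannon entropy, KL divergence, and cross-entropy in nats for two made-up distributions (chosen without zero-probability states), checks the identity H(P, Q) = H(P) + D_KL(P‖Q), and illustrates the asymmetry of the KL divergence.

```python
import numpy as np

P = np.array([0.5, 0.4, 0.1])
Q = np.array([0.2, 0.3, 0.5])

def entropy(p):                      # H(P) = -sum_x P(x) log P(x), in nats
    return -np.sum(p * np.log(p))

def kl(p, q):                        # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
    return np.sum(p * np.log(p / q))

def cross_entropy(p, q):             # H(P, Q) = -sum_x P(x) log Q(x)
    return -np.sum(p * np.log(q))

print(entropy(P))
print(kl(P, Q), kl(Q, P))            # the two directions generally differ
assert np.isclose(cross_entropy(P, Q), entropy(P) + kl(P, Q))
```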
3.14 Structured Probabilistic Models
Machine learning algorithms often involve probability distributions over a very
large number of random variables. Often, these probability distributions involve
direct interactions between relatively few variables. Using a single function to
describe the entire joint probability distribution can be very inefficient (both
computationally and statistically).
Figure 3.6: The KL divergence is asymmetric. Suppose we have a distribution p(x) and wish to approximate it with another distribution q(x). We have the choice of minimizing either D_KL(p‖q) or D_KL(q‖p). We illustrate the effect of this choice using a mixture of two Gaussians for p, and a single Gaussian for q. The choice of which direction of the KL divergence to use is problem dependent. Some applications require an approximation that usually places high probability anywhere that the true distribution places high probability, while other applications require an approximation that rarely places high probability anywhere that the true distribution places low probability. The choice of the direction of the KL divergence reflects which of these considerations takes priority for each application. (Left) The effect of minimizing D_KL(p‖q), with q* = argmin_q D_KL(p‖q). In this case, we select a q that has high probability where p has high probability. When p has multiple modes, q chooses to blur the modes together, in order to put high probability mass on all of them. (Right) The effect of minimizing D_KL(q‖p), with q* = argmin_q D_KL(q‖p). In this case, we select a q that has low probability where p has low probability. When p has multiple modes that are sufficiently widely separated, as in this figure, the KL divergence is minimized by choosing a single mode, to avoid putting probability mass in the low-probability areas between modes of p. Here, we illustrate the outcome when q is chosen to emphasize the left mode. We could also have achieved an equal value of the KL divergence by choosing the right mode. If the modes are not separated by a sufficiently strong low-probability region, then this direction of the KL divergence can still choose to blur the modes.
Instead of using a single function to represent a probability distribution, we can split a probability distribution into many factors that we multiply together. For example, suppose we have three random variables: a, b and c. Suppose that a influences the value of b, and b influences the value of c, but that a and c are independent given b. We can represent the probability distribution over all three variables as a product of probability distributions over two variables:

p(a, b, c) = p(a) p(b | a) p(c | b).        (3.52)
These factorizations can greatly reduce the number of parameters needed
to describe the distribution. Each factor uses a number of parameters that is
exponential in the number of variables in the factor. This means that we can greatly
reduce the cost of representing a distribution if we are able to find a factorization
into distributions over fewer variables.
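To see the savings concretely, consider n binary variables arranged in a chain x_1 → x_2 → · · · → x_n, generalizing the example above: the full joint table needs 2^n − 1 free parameters, while the chain factorization needs only 1 + 2(n − 1). A small counting sketch:

```python
# Number of free parameters for n binary variables.
def full_joint_params(n):
    # One probability per joint configuration, minus one for normalization.
    return 2 ** n - 1

def chain_factorization_params(n):
    # p(x1) needs 1 parameter; each p(x_i | x_{i-1}) needs 2 (one per parent value).
    return 1 + 2 * (n - 1)

for n in (3, 10, 30):
    print(n, full_joint_params(n), chain_factorization_params(n))
```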
We can describe these kinds of factorizations using graphs. Here, we use the
word “graph” in the sense of graph theory: a set of vertices that may be connected
to each other with edges. When we represent the factorization of a probability distribution with a graph, we call it a structured probabilistic model, or graphical model.
There are two main kinds of structured probabilistic models: directed and undirected. Both kinds of graphical models use a graph G in which each node in the graph corresponds to a random variable, and an edge connecting two random variables means that the probability distribution is able to represent direct interactions between those two random variables.
Directed models use graphs with directed edges, and they represent factorizations into conditional probability distributions, as in the example above. Specifically, a directed model contains one factor for every random variable x_i in the distribution, and that factor consists of the conditional distribution over x_i given the parents of x_i, denoted Pa_G(x_i):

p(x) = ∏_i p( x_i | Pa_G(x_i) ).        (3.53)
See figure 3.7 for an example of a directed graph and the factorization of probability
distributions it represents.
Figure 3.7: A directed graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

p(a, b, c, d, e) = p(a) p(b | a) p(c | a, b) p(d | b) p(e | c).        (3.54)

This graphical model enables us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.

Undirected models use graphs with undirected edges, and they represent factorizations into a set of functions; unlike in the directed case, these functions are usually not probability distributions of any kind. Any set of nodes that are all connected to each other in G is called a clique. Each clique C^{(i)} in an undirected model is associated with a factor φ^{(i)}(C^{(i)}). These factors are just functions, not
probability distributions. The output of each factor must be non-negative, but there is no constraint that the factor must sum or integrate to 1 like a probability distribution.

The probability of a configuration of random variables is proportional to the product of all these factors—assignments that result in larger factor values are more likely. Of course, there is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant Z, defined to be the sum or integral over all states of the product of the φ functions, in order to obtain a normalized probability distribution:

p(x) = (1/Z) ∏_i φ^{(i)}( C^{(i)} ).        (3.55)
See figure 3.8 for an example of an undirected graph and the factorization of
probability distributions it represents.
Figure 3.8: An undirected graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

p(a, b, c, d, e) = (1/Z) φ^{(1)}(a, b, c) φ^{(2)}(b, d) φ^{(3)}(c, e).        (3.56)

This graphical model enables us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.

Keep in mind that these graphical representations of factorizations are a language for describing probability distributions. They are not mutually exclusive families of probability distributions. Being directed or undirected is not a property of a probability distribution; it is a property of a particular description of a probability distribution, but any probability distribution may be described in both ways.

Throughout parts I and II of this book, we use structured probabilistic models
merely as a language to describe which direct probabilistic relationships different
machine learning algorithms choose to represent. No further understanding of
structured probabilistic models is needed until the discussion of research topics, in
part III, where we explore structured probabilistic models in much greater detail.
This chapter has reviewed the basic concepts of probability theory that are
most relevant to deep learning. One more set of fundamental mathematical tools
remains: numerical methods.