Chapter 20
Deep Generative Models
In this chapter, we present several of the specific kinds of generative models that
can be built and trained using the techniques presented in chapters 16 through 19. All
these models represent probability distributions over multiple variables in some
way. Some allow the probability distribution function to be evaluated explicitly.
Others do not allow the evaluation of the probability distribution function but
support operations that implicitly require knowledge of it, such as drawing samples
from the distribution. Some of these models are structured probabilistic models
described in terms of graphs and factors, using the language of graphical models
presented in chapter 16. Others cannot be easily described in terms of factors but
represent probability distributions nonetheless.
20.1 Boltzmann Machines
Boltzmann machines were originally introduced as a general “connectionist” ap-
proach to learning arbitrary probability distributions over binary vectors (Fahlman
et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986).
Variants of the Boltzmann machine that include other kinds of variables have long
ago surpassed the popularity of the original. In this section we briefly introduce
the binary Boltzmann machine and discuss the issues that come up when trying to
train and perform inference in the model.
We define the Boltzmann machine over a d-dimensional binary random vector
x ∈ {0, 1}^d. The Boltzmann machine is an energy-based model (section 16.2.4),
meaning we define the joint probability distribution using an energy function:
P(x) = exp(−E(x)) / Z,      (20.1)
where E(x) is the energy function, and Z is the partition function that ensures
that ∑_x P(x) = 1. The energy function of the Boltzmann machine is given by
E(x) = −x^⊤ U x − b^⊤ x,      (20.2)
where U is the “weight” matrix of model parameters and b is the vector of bias
parameters.
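As a concrete illustration, the energy of equation 20.2 and the probability of equation 20.1 can be computed directly for a very small model. The following NumPy sketch assumes a symmetric weight matrix U with zero diagonal and evaluates the partition function Z by brute force, which is feasible only when d is tiny.

import numpy as np
from itertools import product

def boltzmann_energy(x, U, b):
    # E(x) = -x^T U x - b^T x  (equation 20.2)
    return -x @ U @ x - b @ x

def unnormalized_prob(x, U, b):
    # exp(-E(x)); dividing by the partition function Z gives P(x) (equation 20.1)
    return np.exp(-boltzmann_energy(x, U, b))

# Hypothetical tiny model with d = 3 units, so Z can be summed exhaustively.
rng = np.random.default_rng(0)
d = 3
U = rng.normal(scale=0.1, size=(d, d))
U = np.triu(U, 1) + np.triu(U, 1).T   # symmetric weights, zero diagonal
b = rng.normal(scale=0.1, size=d)

Z = sum(unnormalized_prob(np.array(s, dtype=float), U, b)
        for s in product([0, 1], repeat=d))
x = np.array([1.0, 0.0, 1.0])
print(unnormalized_prob(x, U, b) / Z)   # P(x); this exhaustive sum is intractable for large d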
In the general setting of the Boltzmann machine, we are given a set of training
examples, each of which is d-dimensional. Equation 20.1 describes the joint
probability distribution over the observed variables. While this scenario is certainly
viable, it does limit the kinds of interactions between the observed variables to
those described by the weight matrix. Specifically, it means that the probability of
one unit being on is given by a linear model (logistic regression) from the values of
the other units.
The Boltzmann machine becomes more powerful when not all the variables are
observed. In this case, the latent variables can act similarly to hidden units in a
multilayer perceptron and model higher-order interactions among the visible units.
Just as the addition of hidden units to convert logistic regression into an MLP results
in the MLP being a universal approximator of functions, a Boltzmann machine
with hidden units is no longer limited to modeling linear relationships between
variables. Instead, the Boltzmann machine becomes a universal approximator of
probability mass functions over discrete variables (Le Roux and Bengio, 2008).
Formally, we decompose the units x into two subsets: the visible units v and
the latent (or hidden) units h. The energy function becomes

E(v, h) = −v^⊤ R v − v^⊤ W h − h^⊤ S h − b^⊤ v − c^⊤ h.      (20.3)
Boltzmann Machine Learning
Learning algorithms for Boltzmann machines
are usually based on maximum likelihood. All Boltzmann machines have an
intractable partition function, so the maximum likelihood gradient must be ap-
proximated using the techniques described in chapter 18.
One interesting property of Boltzmann machines when trained with learning
rules based on maximum likelihood is that the update for a particular weight
connecting two units depends only on the statistics of those two units, collected
under different distributions:
P_model(v) and P̃_data(v)P_model(h | v). The rest of the
network participates in shaping those statistics, but the weight can be updated
without knowing anything about the rest of the network or how those statistics were
produced. This means that the learning rule is “local,” which makes Boltzmann
machine learning somewhat biologically plausible. It is conceivable that if each
neuron were a random variable in a Boltzmann machine, then the axons and
dendrites connecting two random variables could learn only by observing the firing
pattern of the cells that they actually physically touch. In particular, in the
positive phase, two units that frequently activate together have their connection
strengthened. This is an example of a Hebbian learning rule (Hebb, 1949) often
summarized with the mnemonic “fire together, wire together.” Hebbian learning
rules are among the oldest hypothesized explanations for learning in biological
systems and remain relevant today (Giudice et al., 2009).
Other learning algorithms that use more information than local statistics seem
to require us to hypothesize the existence of more machinery than this. For
example, for the brain to implement back-propagation in a multilayer perceptron,
it seems necessary for the brain to maintain a secondary communication network
for transmitting gradient information backward through the network. Proposals for
biologically plausible implementations (and approximations) of back-propagation
have been made (Hinton, 2007a; Bengio, 2015) but remain to be validated, and
Bengio (2015) links back-propagation of gradients to inference in energy-based
models similar to the Boltzmann machine (but with continuous latent variables).
The negative phase of Boltzmann machine learning is somewhat harder to
explain from a biological point of view. As argued in section 18.2, dream sleep
may be a form of negative phase sampling. This idea is more speculative though.
20.2 Restricted Boltzmann Machines
Invented under the name harmonium (Smolensky, 1986), restricted Boltzmann
machines are some of the most common building blocks of deep probabilistic
models. We briefly describe RBMs in section 16.7.1. Here we review the previous
information and go into more detail. RBMs are undirected probabilistic graphical
models containing a layer of observable variables and a single layer of latent
variables. RBMs may be stacked (one on top of the other) to form deeper models.
See figure 20.1 for some examples. In particular, figure 20.1a shows the graph
structure of the RBM itself. It is a bipartite graph, with no connections permitted
between any variables in the observed layer or between any units in the latent
layer.
We begin with the binary version of the restricted Boltzmann machine, but as
Figure 20.1: Examples of models that may be built with restricted Boltzmann machines.
(a) The restricted Boltzmann machine itself is an undirected graphical model based on
a bipartite graph, with visible units in one part of the graph and hidden units in the
other part. There are no connections among the visible units, nor any connections among
the hidden units. Typically every visible unit is connected to every hidden unit, but it
is possible to construct sparsely connected RBMs such as convolutional RBMs. (b) A
deep belief network is a hybrid graphical model involving both directed and undirected
connections. Like an RBM, it has no intralayer connections. However, a DBN has multiple
hidden layers, and thus connections between hidden units that are in separate layers.
All the local conditional probability distributions needed by the deep belief network are
copied directly from the local conditional probability distributions of its constituent RBMs.
Alternatively, we could also represent the deep belief network with a completely undirected
graph, but it would need intralayer connections to capture the dependencies between
parents. (c) A deep Boltzmann machine is an undirected graphical model with several
layers of latent variables. Like RBMs and DBNs, DBMs lack intralayer connections.
DBMs are less closely tied to RBMs than DBNs are. When initializing a DBM from a
stack of RBMs, it is necessary to modify the RBM parameters slightly. Some kinds of
DBMs may be trained without first training a set of RBMs.
we see later, there are extensions to other types of visible and hidden units.
More formally, let the observed layer consist of a set of n_v binary random
variables, which we refer to collectively with the vector v. We refer to the latent,
or hidden, layer of n_h binary random variables as h.
Like the general Boltzmann machine, the restricted Boltzmann machine is an
energy-based model with the joint probability distribution specified by its energy
function:
P(v = v, h = h) = (1/Z) exp(−E(v, h)).      (20.4)
The energy function for an RBM is given by
E(v, h) = −b^⊤ v − c^⊤ h − v^⊤ W h,      (20.5)
and Z is the normalizing constant known as the partition function:
Z = ∑_v ∑_h exp{−E(v, h)}.      (20.6)
It is apparent from the definition of the partition function Z that the naive method
of computing Z (exhaustively summing over all states) could be computationally
intractable, unless a cleverly designed algorithm could exploit regularities in the
probability distribution to compute Z faster. In the case of restricted Boltzmann
machines, Long and Servedio (2010) formally proved that the partition function Z
is intractable. The intractable partition function Z implies that the normalized
joint probability distribution P(v) is also intractable to evaluate.
20.2.1 Conditional Distributions
Though P(v) is intractable, the bipartite graph structure of the RBM has the
special property that its conditional distributions P(h | v) and P(v | h) are
factorial and relatively simple to compute and sample from.
Deriving the conditional distributions from the joint distribution is straightforward:

P(h | v) = P(h, v) / P(v)      (20.7)
         = (1/P(v)) (1/Z) exp( b^⊤ v + c^⊤ h + v^⊤ W h )      (20.8)
         = (1/Z′) exp( c^⊤ h + v^⊤ W h )      (20.9)
         = (1/Z′) exp( ∑_{j=1}^{n_h} c_j h_j + ∑_{j=1}^{n_h} v^⊤ W_{:,j} h_j )      (20.10)
         = (1/Z′) ∏_{j=1}^{n_h} exp( c_j h_j + v^⊤ W_{:,j} h_j ).      (20.11)
Since we are conditioning on the visible units v, we can treat these as constant
with respect to the distribution P(h | v). The factorial nature of the conditional
P(h | v) follows immediately from our ability to write the joint probability over
the vector h as the product of (unnormalized) distributions over the individual
elements, h_j. It is now a simple matter of normalizing the distributions over the
individual binary h_j.
P(h_j = 1 | v) = P̃(h_j = 1 | v) / ( P̃(h_j = 0 | v) + P̃(h_j = 1 | v) )      (20.12)
              = exp{ c_j + v^⊤ W_{:,j} } / ( exp{0} + exp{ c_j + v^⊤ W_{:,j} } )      (20.13)
              = σ( c_j + v^⊤ W_{:,j} ).      (20.14)
We can now express the full conditional over the hidden layer as the factorial
distribution:

P(h | v) = ∏_{j=1}^{n_h} σ( (2h − 1) ⊙ (c + W^⊤ v) )_j.      (20.15)

A similar derivation will show that the other conditional of interest to us, P(v | h),
is also a factorial distribution:

P(v | h) = ∏_{i=1}^{n_v} σ( (2v − 1) ⊙ (b + W h) )_i.      (20.16)
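These conditionals translate directly into block sampling code. The following NumPy sketch assumes a weight matrix W of shape n_v × n_h, a visible bias b, and a hidden bias c, all hypothetical placeholders used only for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, c, rng):
    # P(h_j = 1 | v) = sigmoid(c_j + v^T W_{:,j})  (equations 20.14-20.15)
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b, rng):
    # P(v_i = 1 | h) = sigmoid(b_i + W_{i,:} h)  (equation 20.16)
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float), p

# Hypothetical sizes for illustration.
rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.normal(scale=0.1, size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
v = rng.integers(0, 2, size=n_v).astype(float)
h, _ = sample_h_given_v(v, W, c, rng)        # one half of a block Gibbs step
v_next, _ = sample_v_given_h(h, W, b, rng)   # the other half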
20.2.2 Training Restricted Boltzmann Machines
Because the RBM admits efficient evaluation and differentiation of P̃(v) and
efficient MCMC sampling in the form of block Gibbs sampling, it can readily be
trained with any of the techniques described in chapter 18 for training models
that have intractable partition functions. This includes CD, SML (PCD), ratio
matching, and so on. Compared to other undirected models used in deep learning,
the RBM is relatively straightforward to train because we can compute P(h | v)
exactly in closed form. Some other deep models, such as the deep Boltzmann
machine, combine both the difficulty of an intractable partition function and the
difficulty of intractable inference.
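As a rough sketch of how such training might look in practice, the following NumPy code performs a single CD-1 update on a minibatch, reusing the conditionals of section 20.2.1; the learning rate and the use of probabilities rather than samples in the update are conventional choices rather than requirements.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(V, W, b, c, rng, lr=0.01):
    # One contrastive divergence (CD-1) update on a minibatch V, one example per row.
    ph_data = sigmoid(c + V @ W)                        # positive phase: P(h | v) on the data
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    pv_model = sigmoid(b + h_sample @ W.T)              # negative phase: one block Gibbs step
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(c + v_model @ W)
    m = V.shape[0]
    W += lr * (V.T @ ph_data - v_model.T @ ph_model) / m
    b += lr * (V - v_model).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c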
20.3 Deep Belief Networks
Deep belief networks (DBNs) were one of the first nonconvolutional models
to successfully admit training of deep architectures (Hinton et al., 2006; Hinton,
2007b). The introduction of deep belief networks in 2006 began the current deep
learning renaissance. Prior to the introduction of deep belief networks, deep models
were considered too difficult to optimize. Kernel machines with convex objective
functions dominated the research landscape. Deep belief networks demonstrated
that deep architectures can be successful by outperforming kernelized support
vector machines on the MNIST dataset (Hinton et al., 2006). Today, deep belief
networks have mostly fallen out of favor and are rarely used, even compared to
other unsupervised or generative learning algorithms, but they are still deservedly
recognized for their important role in deep learning history.
Deep belief networks are generative models with several layers of latent variables.
The latent variables are typically binary, while the visible units may be binary
or real. There are no intralayer connections. Usually, every unit in each layer is
connected to every unit in each neighboring layer, though it is possible to construct
more sparsely connected DBNs. The connections between the top two layers are
undirected. The connections between all other layers are directed, with the arrows
pointed toward the layer that is closest to the data. See figure 20.1b for an example.
A DBN with l hidden layers contains l weight matrices: W^{(1)}, . . . , W^{(l)}. It
also contains l + 1 bias vectors, b^{(0)}, . . . , b^{(l)}, with b^{(0)} providing the biases for the
visible layer. The probability distribution represented by the DBN is given by

P(h^{(l)}, h^{(l−1)}) ∝ exp( b^{(l)⊤} h^{(l)} + b^{(l−1)⊤} h^{(l−1)} + h^{(l−1)⊤} W^{(l)} h^{(l)} ),      (20.17)

P(h^{(k)}_i = 1 | h^{(k+1)}) = σ( b^{(k)}_i + W^{(k+1)⊤}_{:,i} h^{(k+1)} )   ∀i, ∀k ∈ 1, . . . , l − 2,      (20.18)

P(v_i = 1 | h^{(1)}) = σ( b^{(0)}_i + W^{(1)⊤}_{:,i} h^{(1)} )   ∀i.      (20.19)

In the case of real-valued visible units, substitute

v ∼ N( v; b^{(0)} + W^{(1)⊤} h^{(1)}, β^{−1} ),      (20.20)

with β diagonal for tractability. Generalizations to other exponential family visible
units are straightforward, at least in theory. A DBN with only one hidden layer is
just an RBM.
To generate a sample from a DBN, we first run several steps of Gibbs sampling
on the top two hidden layers. This stage is essentially drawing a sample from
the RBM defined by the top two hidden layers. We can then use a single pass of
ancestral sampling through the rest of the model to draw a sample from the visible
units.
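A minimal sketch of this sampling procedure, assuming binary units, weight matrices Ws = [W^{(1)}, ..., W^{(l)}] with W^{(k)} of shape n_{k−1} × n_k, and biases bs = [b^{(0)}, ..., b^{(l)}], might look as follows.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def sample_dbn(Ws, bs, n_gibbs, rng):
    # Ws = [W1, ..., Wl]; bs = [b0, ..., bl], with b0 biasing the visible layer.
    W_top = Ws[-1]
    h_below = bernoulli(np.full(W_top.shape[0], 0.5), rng)
    # Gibbs sampling in the undirected RBM formed by the top two hidden layers.
    for _ in range(n_gibbs):
        h_top = bernoulli(sigmoid(bs[-1] + h_below @ W_top), rng)
        h_below = bernoulli(sigmoid(bs[-2] + W_top @ h_top), rng)
    # Single ancestral pass down through the directed layers (equations 20.18-20.19).
    h = h_below
    for k in range(len(Ws) - 2, -1, -1):
        h = bernoulli(sigmoid(bs[k] + Ws[k] @ h), rng)
    return h   # a sample of the visible units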
Deep belief networks incur many of the problems associated with both directed
models and undirected models.
Inference in a deep belief network is intractable because of the explaining away
effect within each directed layer and the interaction between the two hidden layers
that have undirected connections. Evaluating or maximizing the standard evidence
lower bound on the log-likelihood is also intractable, because the evidence lower
bound takes the expectation of cliques whose size is equal to the network width.
Evaluating or maximizing the log-likelihood requires confronting not just the
problem of intractable inference to marginalize out the latent variables, but also
the problem of an intractable partition function within the undirected model of
the top two layers.
To train a deep belief network, one begins by training an RBM to maximize
E_{v∼p_data} log p(v) using contrastive divergence or stochastic maximum likelihood.
The parameters of the RBM then define the parameters of the first layer of the
DBN. Next, a second RBM is trained to approximately maximize

E_{v∼p_data} E_{h^{(1)}∼p^{(1)}(h^{(1)} | v)} log p^{(2)}( h^{(1)} ),      (20.21)

where p^{(1)} is the probability distribution represented by the first RBM, and p^{(2)}
is the probability distribution represented by the second RBM. In other words,
the second RBM is trained to model the distribution defined by sampling the
hidden units of the first RBM, when the first RBM is driven by the data. This
procedure can be repeated indefinitely, to add as many layers to the DBN as
desired, with each new RBM modeling the samples of the previous one. Each RBM
defines another layer of the DBN. This procedure can be justified as increasing a
variational lower bound on the log-likelihood of the data under the DBN (Hinton
et al., 2006).
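The greedy procedure can be sketched as follows in NumPy, where train_rbm is a hypothetical helper that fits a single RBM (for example with CD or SML) and returns its parameters.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_pretrain_dbn(data, hidden_sizes, train_rbm, rng):
    # train_rbm(X, n_hidden) is a hypothetical helper returning (W, b, c) for an
    # RBM fit to the rows of X.
    params, X = [], data
    for n_hidden in hidden_sizes:
        W, b, c = train_rbm(X, n_hidden)
        params.append((W, b, c))
        # Sample h ~ p(h | v) with the RBM driven by its training data; these
        # samples become the "data" for the next RBM (equation 20.21).
        p_h = sigmoid(c + X @ W)
        X = (rng.random(p_h.shape) < p_h).astype(float)
    return params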
In most applications, no effort is made to jointly train the DBN after the greedy
layer-wise procedure is complete. However, it is possible to perform generative
fine-tuning using the wake-sleep algorithm.
The trained DBN may be used directly as a generative model, but most of the
interest in DBNs arose from their ability to improve classification models. We can
take the weights from the DBN and use them to define an MLP:
h^{(1)} = σ( b^{(1)} + v^⊤ W^{(1)} ),      (20.22)

h^{(l)} = σ( b^{(l)} + h^{(l−1)⊤} W^{(l)} )   ∀l ∈ 2, . . . , m.      (20.23)
After initializing this MLP with the weights and biases learned via generative
training of the DBN, we can train the MLP to perform a classification task. This
additional training of the MLP is an example of discriminative fine-tuning.
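A sketch of the resulting deterministic upward pass, assuming the weights and biases have been copied from the DBN, is shown below; the output features would then be fed to a classifier and fine-tuned discriminatively.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dbn_mlp_features(v, weights, biases):
    # Upward pass of equations 20.22-20.23; weights[k] is W^{(k+1)}, biases[k] is b^{(k+1)}.
    h = v
    for W, b in zip(weights, biases):
        h = sigmoid(b + h @ W)
    return h   # deepest hidden features, to be fed into a classifier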
This specific choice of MLP is somewhat arbitrary, compared to many of the
inference equations in chapter 19 that are derived from first principles. This MLP
is a heuristic choice that seems to work well in practice and is used consistently
in the literature. Many approximate inference techniques are motivated by their
ability to find a maximally tight variational lower bound on the log-likelihood
under some set of constraints. One can construct a variational lower bound on the
log-likelihood using the hidden unit expectations defined by the DBN’s MLP, but
this is true of any probability distribution over the hidden units, and there is no
reason to believe that this MLP provides a particularly tight bound. In particular,
the MLP ignores many important interactions in the DBN graphical model. The
MLP propagates information upward from the visible units to the deepest hidden
units, but it does not propagate any information downward or sideways. The DBN
graphical model has explaining away interactions between all the hidden units
within the same layer, as well as top-down interactions between layers.
While the log-likelihood of a DBN is intractable, it may be approximated with
AIS (Salakhutdinov and Murray, 2008). This permits evaluating its quality as a
generative model.
The term “deep belief network” is commonly used incorrectly to refer to any
kind of deep neural network, even networks without latent variable semantics.
The term should refer specifically to models with undirected connections in the
deepest layer and directed connections pointing downward between all other pairs
of consecutive layers.
The term may also cause some confusion because “belief network” is sometimes
used to refer to purely directed models, while deep belief networks contain an
undirected layer. Deep belief networks also share the acronym DBN with dynamic
Bayesian networks (Dean and Kanazawa, 1989), which are Bayesian networks for
representing Markov chains.
20.4 Deep Boltzmann Machines
A deep Boltzmann machine, or DBM (Salakhutdinov and Hinton, 2009a), is
another kind of deep generative model. Unlike the deep belief network (DBN),
it is an entirely undirected model. Unlike the RBM, the DBM has several layers
of latent variables (RBMs have just one). But like the RBM, within each layer,
the variables are mutually independent, conditioned on the variables in
the neighboring layers. See figure 20.2 for the graph structure. Deep Boltzmann
machines have been applied to a variety of tasks, including document modeling
(Srivastava et al., 2013).
Like RBMs and DBNs, DBMs typically contain only binary units—as we
assume for simplicity of our presentation of the model—but it is straightforward
to include real-valued visible units.
A DBM is an energy-based model, meaning that the joint probability distribution
over the model variables is parametrized by an energy function E. In the case
of a deep Boltzmann machine with one visible layer, v, and three hidden layers,
h^{(1)}, h^{(2)}, and h^{(3)}, the joint probability is given by
P( v, h^{(1)}, h^{(2)}, h^{(3)} ) = (1/Z(θ)) exp( −E(v, h^{(1)}, h^{(2)}, h^{(3)}; θ) ).      (20.24)
To simplify our presentation, we omit the bias parameters below. The DBM energy
function is then defined as follows:
E(v, h^{(1)}, h^{(2)}, h^{(3)}; θ) = −v^⊤ W^{(1)} h^{(1)} − h^{(1)⊤} W^{(2)} h^{(2)} − h^{(2)⊤} W^{(3)} h^{(3)}.      (20.25)
Figure 20.2: The graphical model for a deep Boltzmann machine with one visible layer
(bottom) and two hidden layers. Connections are only between units in neighboring layers.
There are no intralayer connections.
In comparison to the RBM energy function (equation 20.5), the DBM energy
function includes connections between the hidden units (latent variables) in the
form of the weight matrices (W^{(2)} and W^{(3)}). As we will see, these connections
have significant consequences for the model behavior as well as how we go about
performing inference in the model.
In comparison to fully connected Boltzmann machines (with every unit con-
nected to every other unit), the DBM offers some advantages that are similar
to those offered by the RBM. Specifically, as illustrated in figure 20.3, the DBM
layers can be organized into a bipartite graph, with odd layers on one side and
even layers on the other. This immediately implies that when we condition on the
variables in the even layer, the variables in the odd layers become conditionally
independent. Of course, when we condition on the variables in the odd layers, the
variables in the even layers also become conditionally independent.
The bipartite structure of the DBM means that we can apply the same equa-
tions we have previously used for the conditional distributions of an RBM to
determine the conditional distributions in a DBM. The units within a layer are
conditionally independent from each other given the values of the neighboring
layers, so the distributions over binary variables can be fully described by the
Bernoulli parameters, giving the probability of each unit being active. In our
example with two hidden layers, the activation probabilities are given by

P( v_i = 1 | h^{(1)} ) = σ( W^{(1)}_{i,:} h^{(1)} ),      (20.26)
Figure 20.3: A deep Boltzmann machine, rearranged to reveal its bipartite graph structure.
P( h^{(1)}_i = 1 | v, h^{(2)} ) = σ( v^⊤ W^{(1)}_{:,i} + W^{(2)}_{i,:} h^{(2)} ),      (20.27)

and

P( h^{(2)}_k = 1 | h^{(1)} ) = σ( h^{(1)⊤} W^{(2)}_{:,k} ).      (20.28)
The bipartite structure makes Gibbs sampling in a deep Boltzmann machine
efficient. The naive approach to Gibbs sampling is to update only one variable
at a time. RBMs allow all the visible units to be updated in one block and all
the hidden units to be updated in a second block. One might naively assume
that a DBM with l layers requires l + 1 updates, with each iteration updating
a block consisting of one layer of units. Instead, it is possible to update all the
units in only two iterations. Gibbs sampling can be divided into two blocks of
updates, one including all even layers (including the visible layer) and the other
including all odd layers. Because of the bipartite DBM connection pattern, given
the even layers, the distribution over the odd layers is factorial and thus can be
sampled simultaneously and independently as a block. Likewise, given the odd
layers, the even layers can be sampled simultaneously and independently as a
block. Efficient sampling is especially important for training with the stochastic
maximum likelihood algorithm.
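A sketch of this two-block Gibbs sweep for the two-hidden-layer case (with bias terms omitted, as in equation 20.25) follows; W1 and W2 are the weight matrices between the visible layer and h^{(1)} and between h^{(1)} and h^{(2)}.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    # Odd block: h1 given the even block {v, h2} (equation 20.27).
    h1 = bernoulli(sigmoid(v @ W1 + W2 @ h2), rng)
    # Even block: v given h1 (equation 20.26) and h2 given h1 (equation 20.28).
    v = bernoulli(sigmoid(W1 @ h1), rng)
    h2 = bernoulli(sigmoid(h1 @ W2), rng)
    return v, h1, h2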
20.4.1 Interesting Properties
Deep Boltzmann machines have many interesting properties.
DBMs were developed after DBNs. Compared to DBNs, the posterior distribution
P(h | v) is simpler for DBMs. Somewhat counterintuitively, the simplicity
of this posterior distribution allows richer approximations of the posterior. In
the case of the DBN, we perform classification using a heuristically motivated
approximate inference procedure, in which we guess that a reasonable value for
the mean field expectation of the hidden units can be provided by an upward pass
through the network in an MLP that uses sigmoid activation functions and the
same weights as the original DBN. Any distribution Q(h) can be used to obtain a
variational lower bound on the log-likelihood. This heuristic procedure therefore
enables us to obtain such a bound. Yet the bound is not explicitly optimized in
any way, so it may be far from tight. In particular, the heuristic estimate of Q
ignores interactions between hidden units within the same layer as well as the
top-down feedback influence of hidden units in deeper layers on hidden units that
are closer to the input. Because the heuristic MLP-based inference procedure in
the DBN is not able to account for these interactions, the resulting Q is presumably
far from optimal. In DBMs, all the hidden units within a layer are conditionally
independent given the other layers. This lack of intralayer interaction makes it
possible to use fixed-point equations to optimize the variational lower bound and
find the true optimal mean field expectations (to within some numerical tolerance).
The use of proper mean field allows the approximate inference procedure for
DBMs to capture the influence of top-down feedback interactions. This makes
DBMs interesting from the point of view of neuroscience, because the human brain
is known to use many top-down feedback connections. Because of this property,
DBMs have been used as computational models of real neuroscientific phenomena
(Series et al., 2010; Reichert et al., 2011).
One unfortunate property of DBMs is that sampling from them is relatively
difficult. DBNs only need to use MCMC sampling in their top pair of layers. The
other layers are used only at the end of the sampling process, in one efficient
ancestral sampling pass. To generate a sample from a DBM, it is necessary to
use MCMC across all layers, with every layer of the model participating in every
Markov chain transition.
20.4.2 DBM Mean Field Inference
The conditional distribution over one DBM layer given the neighboring layers is
factorial. In the example of the DBM with two hidden layers, these distributions
are P(v | h^{(1)}), P(h^{(1)} | v, h^{(2)}), and P(h^{(2)} | h^{(1)}). The distribution over all
hidden layers generally does not factorize because of interactions between layers. In
the example with two hidden layers, P(h^{(1)}, h^{(2)} | v) does not factorize because of
the interaction weights W^{(2)} between h^{(1)} and h^{(2)}, which render these variables
mutually dependent.
As was the case with the DBN, we are left to seek out methods to approximate
the DBM posterior distribution. Unlike the DBN, however, the DBM posterior
distribution over its hidden units—while complicated—is easy to approximate
with a variational approximation (as discussed in section 19.4), specifically a
mean field approximation. The mean field approximation is a simple form of
variational inference, where we restrict the approximating distribution to fully
factorial distributions. In the context of DBMs, the mean field equations capture
the bidirectional interactions between layers. In this section we derive the iterative
approximate inference procedure originally introduced in Salakhutdinov and Hinton
(2009a).
In variational approximations to inference, we approach the task of approxi-
mating a particular target distribution—in our case, the posterior distribution over
the hidden units given the visible units—by some reasonably simple family of dis-
tributions. In the case of the mean field approximation, the approximating family
is the set of distributions where the hidden units are conditionally independent.
We now develop the mean field approach for the example with two hidden
layers. Let Q(h^{(1)}, h^{(2)} | v) be the approximation of P(h^{(1)}, h^{(2)} | v). The mean
field assumption implies that

Q(h^{(1)}, h^{(2)} | v) = ∏_j Q(h^{(1)}_j | v) ∏_k Q(h^{(2)}_k | v).      (20.29)
The mean field approximation attempts to find a member of this family of
distributions that best fits the true posterior P(h^{(1)}, h^{(2)} | v). Importantly, the
inference process must be run again to find a different distribution Q every time
we use a new value of v.
One can conceive of many ways of measuring how well Q(h | v) fits P(h | v).
The mean field approach is to minimize

KL(Q ‖ P) = ∑_h Q(h^{(1)}, h^{(2)} | v) log ( Q(h^{(1)}, h^{(2)} | v) / P(h^{(1)}, h^{(2)} | v) ).      (20.30)
In general, we do not have to provide a parametric form of the approximating
distribution beyond enforcing the independence assumptions. The variational
approximation procedure is generally able to recover a functional form of the
approximate distribution. However, in the case of a mean field assumption on
binary hidden units (the case we are developing here) there is no loss of generality
resulting from fixing a parametrization of the model in advance.
We parametrize Q as a product of Bernoulli distributions; that is, we associate
the probability of each element of h^{(1)} with a parameter. Specifically, for each j,
ĥ^{(1)}_j = Q(h^{(1)}_j = 1 | v), where ĥ^{(1)}_j ∈ [0, 1], and for each k, ĥ^{(2)}_k = Q(h^{(2)}_k = 1 | v),
where ĥ^{(2)}_k ∈ [0, 1]. Thus we have the following approximation to the posterior:

Q(h^{(1)}, h^{(2)} | v) = ∏_j Q(h^{(1)}_j | v) ∏_k Q(h^{(2)}_k | v)      (20.31)

= ∏_j ( ĥ^{(1)}_j )^{h^{(1)}_j} ( 1 − ĥ^{(1)}_j )^{(1 − h^{(1)}_j)} × ∏_k ( ĥ^{(2)}_k )^{h^{(2)}_k} ( 1 − ĥ^{(2)}_k )^{(1 − h^{(2)}_k)}.      (20.32)
Of course, for DBMs with more layers, the approximate posterior parametrization
can be extended in the obvious way, exploiting the bipartite structure of the graph
to update all the even layers simultaneously and then to update all the odd layers
simultaneously, following the same schedule as Gibbs sampling.
Now that we have specified our family of approximating distributions Q, it
remains to specify a procedure for choosing the member of this family that best
fits P. The most straightforward way to do this is to use the mean field equations
specified by equation 19.56. These equations were derived by solving for where the
derivatives of the variational lower bound are zero. They describe in an abstract
manner how to optimize the variational lower bound for any model, simply by
taking expectations with respect to Q.
Applying these general equations, we obtain the update rules (again, ignoring
bias terms):

ĥ^{(1)}_j = σ( ∑_i v_i W^{(1)}_{i,j} + ∑_k W^{(2)}_{j,k} ĥ^{(2)}_k ),   ∀j,      (20.33)

ĥ^{(2)}_k = σ( ∑_j W^{(2)}_{j,k} ĥ^{(1)}_j ),   ∀k.      (20.34)
At a fixed point of this system of equations, we have a local maximum of the
variational lower bound L(Q). Thus these fixed-point update equations define an
iterative algorithm where we alternate updates of ĥ^{(1)}_j (using equation 20.33) and
updates of ĥ^{(2)}_k (using equation 20.34). On small problems such as MNIST, as few
as ten iterations can be sufficient to find an approximate positive phase gradient
for learning, and fifty usually suffice to obtain a high-quality representation of
a single specific example to be used for high-accuracy classification. Extending
approximate variational inference to deeper DBMs is straightforward.
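These fixed-point updates are straightforward to implement. The following NumPy sketch, again ignoring bias terms, runs a fixed number of mean field iterations for a single visible vector v.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dbm_mean_field(v, W1, W2, n_iters=10):
    # Fixed-point iteration of equations 20.33-20.34 for a single visible vector v.
    hhat1 = np.full(W1.shape[1], 0.5)
    hhat2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        hhat1 = sigmoid(v @ W1 + W2 @ hhat2)   # equation 20.33
        hhat2 = sigmoid(hhat1 @ W2)            # equation 20.34
    return hhat1, hhat2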
20.4.3 DBM Parameter Learning
Learning in the DBM must confront both the challenge of an intractable partition
function, using the techniques from chapter 18, and the challenge of an intractable
posterior distribution, using the techniques from chapter 19.
As described in section 20.4.2, variational inference allows the construction of
a distribution Q(h | v) that approximates the intractable P(h | v). Learning then
proceeds by maximizing L(v, Q, θ), the variational lower bound on the intractable
log-likelihood, log P(v; θ).
For a deep Boltzmann machine with two hidden layers, L is given by

L(Q, θ) = ∑_i ∑_j v_i W^{(1)}_{i,j} ĥ^{(1)}_j + ∑_j ∑_k ĥ^{(1)}_j W^{(2)}_{j,k} ĥ^{(2)}_k − log Z(θ) + H(Q).      (20.35)
This expression still contains the log partition function, log Z(θ). Because a deep
Boltzmann machine contains restricted Boltzmann machines as components, the
hardness results for computing the partition function and sampling that apply to
restricted Boltzmann machines also apply to deep Boltzmann machines. This means
that evaluating the probability mass function of a Boltzmann machine requires
approximate methods such as annealed importance sampling. Likewise, training
the model requires approximations to the gradient of the log partition function. See
chapter 18 for a general description of these methods. DBMs are typically trained
using stochastic maximum likelihood. Many of the other techniques described in
chapter 18 are not applicable. Techniques such as pseudolikelihood require the
ability to evaluate the unnormalized probabilities, rather than merely obtain a
variational lower bound on them. Contrastive divergence is slow for deep Boltzmann
machines because they do not allow efficient sampling of the hidden units given the
visible units—instead, contrastive divergence would require burning in a Markov
chain every time a new negative phase sample is needed.
The nonvariational version of the stochastic maximum likelihood algorithm is
discussed in section 18.2. Variational stochastic maximum likelihood as applied to
the DBM is given in algorithm 20.1. Recall that we describe a simplified variant
of the DBM that lacks bias parameters; including them is trivial.
20.4.4 Layer-Wise Pretraining
Unfortunately, training a DBM using stochastic maximum likelihood (as described
above) from a random initialization usually results in failure. In some cases, the
model fails to learn to represent the distribution adequately. In other cases, the
DBM may represent the distribution well, but with no higher likelihood than could
be obtained with just an RBM. A DBM with very small weights in all but the first
layer represents approximately the same distribution as an RBM.
Various techniques that permit joint training have been developed and are
described in section 20.4.5. However, the original and most popular method for
overcoming the joint training problem of DBMs is greedy layer-wise pretraining.
In this method, each layer of the DBM is trained in isolation as an RBM. The
first layer is trained to model the input data. Each subsequent RBM is trained
to model samples from the previous RBM’s posterior distribution. After all the
Algorithm 20.1  The variational stochastic maximum likelihood algorithm for
training a DBM with two hidden layers

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain of
p(v, h^{(1)}, h^{(2)}; θ + εΔθ) to burn in, starting from samples from p(v, h^{(1)}, h^{(2)}; θ).
Initialize three matrices, Ṽ, H̃^{(1)}, and H̃^{(2)}, each with m rows set to random
values (e.g., from Bernoulli distributions, possibly with marginals matched to
the model's marginals).
while not converged (learning loop) do
  Sample a minibatch of m examples from the training data and arrange them
  as the rows of a design matrix V.
  Initialize matrices Ĥ^{(1)} and Ĥ^{(2)}, possibly to the model's marginals.
  while not converged (mean field inference loop) do
    Ĥ^{(1)} ← σ( V W^{(1)} + Ĥ^{(2)} W^{(2)⊤} ).
    Ĥ^{(2)} ← σ( Ĥ^{(1)} W^{(2)} ).
  end while
  ΔW^{(1)} ← (1/m) V^⊤ Ĥ^{(1)}
  ΔW^{(2)} ← (1/m) Ĥ^{(1)⊤} Ĥ^{(2)}
  for l = 1 to k (Gibbs sampling) do
    Gibbs block 1:
      ∀i, j, Ṽ_{i,j} sampled from P(Ṽ_{i,j} = 1) = σ( W^{(1)}_{j,:} (H̃^{(1)}_{i,:})^⊤ ).
      ∀i, j, H̃^{(2)}_{i,j} sampled from P(H̃^{(2)}_{i,j} = 1) = σ( H̃^{(1)}_{i,:} W^{(2)}_{:,j} ).
    Gibbs block 2:
      ∀i, j, H̃^{(1)}_{i,j} sampled from P(H̃^{(1)}_{i,j} = 1) = σ( Ṽ_{i,:} W^{(1)}_{:,j} + H̃^{(2)}_{i,:} (W^{(2)}_{j,:})^⊤ ).
  end for
  ΔW^{(1)} ← ΔW^{(1)} − (1/m) Ṽ^⊤ H̃^{(1)}
  ΔW^{(2)} ← ΔW^{(2)} − (1/m) H̃^{(1)⊤} H̃^{(2)}
  W^{(1)} ← W^{(1)} + ε ΔW^{(1)} (this is a cartoon illustration; in practice use a more
  effective algorithm, such as momentum with a decaying learning rate)
  W^{(2)} ← W^{(2)} + ε ΔW^{(2)}
end while
RBMs have been trained in this way, they can be combined to form a DBM. The
DBM may then be trained with PCD. Typically PCD training will make only a
small change in the model’s parameters and in its performance as measured by the
log-likelihood it assigns to the data, or its ability to classify inputs. See figure 20.4
for an illustration of the training procedure.
This greedy layer-wise training procedure is not just coordinate ascent. It bears
some passing resemblance to coordinate ascent because we optimize one subset of
the parameters at each step. The two methods differ because the greedy layer-wise
training procedure uses a different objective function at each step.
Greedy layer-wise pretraining of a DBM differs from greedy layer-wise pre-
training of a DBN. The parameters of each individual RBM may be copied to
the corresponding DBN directly. In the case of the DBM, the RBM parameters
must be modified before inclusion in the DBM. A layer in the middle of the stack
of RBMs is trained with only bottom-up input, but after the stack is combined
to form the DBM, the layer will have both bottom-up and top-down input. To
account for this effect, Salakhutdinov and Hinton (2009a) advocate dividing the
weights of all but the top and bottom RBM in half before inserting them into the
DBM. Additionally, the bottom RBM must be trained using two “copies” of each
visible unit and the weights tied to be equal between the two copies. This means
that the weights are effectively doubled during the upward pass. Similarly, the top
RBM should be trained with two copies of the topmost layer.
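A sketch of the rescaling step, under the assumptions stated above (the bottom and top RBMs have already been trained with their duplicated layers, so only the middle RBM weights need to be halved), might look like this.

def halve_middle_rbm_weights(rbm_weights):
    # rbm_weights = [W1, ..., Wl] from the greedily trained stack of RBMs.
    # Middle RBMs are halved because their layers will receive both bottom-up
    # and top-down input once combined into the DBM; the bottom and top RBMs
    # are assumed to have been trained with duplicated visible/top layers.
    dbm_weights = []
    for k, W in enumerate(rbm_weights):
        if 0 < k < len(rbm_weights) - 1:
            dbm_weights.append(W / 2.0)
        else:
            dbm_weights.append(W.copy())
    return dbm_weights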
Obtaining state-of-the-art results with the deep Boltzmann machine requires
a modification of the standard SML algorithm, which is to use a small amount of
mean field during the negative phase of the joint PCD training step (Salakhutdinov
and Hinton, 2009a). Specifically, the expectation of the energy gradient should
be computed with respect to the mean field distribution in which all the units
are independent from each other. The parameters of this mean field distribution
should be obtained by running the mean field fixed-point equations for just one
step. See Goodfellow et al. (2013b) for a comparison of the performance of centered
DBMs with and without the use of partial mean field in the negative phase.
20.4.5 Jointly Training Deep Boltzmann Machines
Classic DBMs require greedy unsupervised pretraining and, to perform classification
well, require a separate MLP-based classifier on top of the hidden features they
extract. This has some undesirable properties. It is hard to track performance
during training because we cannot evaluate properties of the full DBM while
training the first RBM. Thus, it is hard to tell how well our hyperparameters
are working until quite late in the training process. Software implementations
of DBMs need to have many different components for CD training of individual
RBMs, PCD training of the full DBM, and training based on back-propagation
through the MLP. Finally, the MLP on top of the Boltzmann machine loses many
of the advantages of the Boltzmann machine probabilistic model, such as being
able to perform inference when some input values are missing.
There are two main ways to resolve the joint training problem of the deep
Figure 20.4: The deep Boltzmann machine training procedure used to classify the MNIST
dataset (Salakhutdinov and Hinton, 2009a; Srivastava et al., 2014). (a) Train an RBM by
using CD to approximately maximize log P(v). (b) Train a second RBM that models h^{(1)}
and target class y by using CD-k to approximately maximize log P(h^{(1)}, y), where h^{(1)}
is drawn from the first RBM's posterior conditioned on the data. Increase k from 1 to
20 during learning. (c) Combine the two RBMs into a DBM. Train it to approximately
maximize log P(v, y) using stochastic maximum likelihood with k = 5. (d) Delete y from
the model. Define a new set of features h^{(1)} and h^{(2)} that are obtained by running mean
field inference in the model lacking y. Use these features as input to an MLP whose
structure is the same as an additional pass of mean field, with an additional output layer
for the estimate of y. Initialize the MLP's weights to be the same as the DBM's weights.
Train the MLP to approximately maximize log P(y | v) using stochastic gradient descent
and dropout. Figure reprinted from Goodfellow et al. (2013b).
Boltzmann machine. The first is the centered deep Boltzmann machine
(Montavon and Muller, 2012), which reparametrizes the model in order to make
the Hessian of the cost function better conditioned at the beginning of the learning
process. This yields a model that can be trained without a greedy layer-wise
pretraining stage. The resulting model obtains excellent test set log-likelihood
and produces high-quality samples. Unfortunately, it remains unable to compete
with appropriately regularized MLPs as a classifier. The second way to jointly
train a deep Boltzmann machine is to use a multi-prediction deep Boltzmann
machine (Goodfellow et al., 2013b). This model uses an alternative training
criterion that allows the use of the back-propagation algorithm to avoid the
problems with MCMC estimates of the gradient. Unfortunately, the new criterion
does not lead to good likelihood or samples, but, compared to the MCMC approach,
it does lead to superior classification performance and ability to reason well about
missing inputs.
The centering trick for the Boltzmann machine is easiest to describe if we
return to the general view of a Boltzmann machine as consisting of a set of units x
with a weight matrix U and biases b. Recall from equation 20.2 that the energy
function is given by

E(x) = −x^⊤ U x − b^⊤ x.      (20.36)
Using different sparsity patterns in the weight matrix U, we can implement
structures of Boltzmann machines, such as RBMs or DBMs with different numbers
of layers. This is accomplished by partitioning x into visible and hidden units and
zeroing out elements of U for units that do not interact. The centered Boltzmann
machine introduces a vector μ that is subtracted from all the states:

E′(x; U, b) = −(x − μ)^⊤ U (x − μ) − (x − μ)^⊤ b.      (20.37)
Typically μ is a hyperparameter fixed at the beginning of training. It is usually
chosen to make sure that x − μ ≈ 0 when the model is initialized. This
reparametrization does not change the set of probability distributions that the
model can represent, but it does change the dynamics of stochastic gradient descent
applied to the likelihood. Specifically, in many cases, this reparametrization results
in a Hessian matrix that is better conditioned. Melchior et al. (2013) experimentally
confirmed that the conditioning of the Hessian matrix improves, and observed that
the centering trick is equivalent to another Boltzmann machine learning technique,
the
enhanced gradient
(Cho et al., 2011). The improved conditioning of the
Hessian matrix enables learning to succeed, even in difficult cases like training a
deep Boltzmann machine with multiple layers.
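A direct implementation of the centered energy in equation 20.37 is a one-liner; here mu is treated as a fixed vector, for example the mean of the training data.

def centered_energy(x, U, b, mu):
    # E'(x; U, b) = -(x - mu)^T U (x - mu) - (x - mu)^T b  (equation 20.37)
    # x, U, b, mu are assumed to be NumPy arrays of compatible shapes.
    d = x - mu
    return -d @ U @ d - d @ b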
The other approach to jointly training deep Boltzmann machines is the multi-
prediction deep Boltzmann machine (MP-DBM), which works by viewing the mean
field equations as defining a family of recurrent networks for approximately solving
every possible inference problem (Goodfellow et al., 2013b). Rather than training
the model to maximize the likelihood, the model is trained to make each recurrent
network obtain an accurate answer to the corresponding inference problem. The
training process is illustrated in figure 20.5. It consists of randomly sampling a
training example, randomly sampling a subset of inputs to the inference network,
and then training the inference network to predict the values of the remaining
units.
This general principle of back-propagating through the computational graph
for approximate inference has been applied to other models (Stoyanov et al., 2011;
Brakel et al., 2013). In these models and in the MP-DBM, the final loss is not
the lower bound on the likelihood. Instead, the final loss is typically based on
the approximate conditional distribution that the approximate inference network
imposes over the missing values. This means that the training of these models
is somewhat heuristically motivated. If we inspect the p(v) represented by the
Boltzmann machine learned by the MP-DBM, it tends to be somewhat defective,
in the sense that Gibbs sampling yields poor samples.
Back-propagation through the inference graph has two main advantages. First,
it trains the model as it is really used—with approximate inference. This means
that approximate inference, for example, to fill in missing inputs or to perform
classification despite the presence of missing inputs, is more accurate in the MP-
DBM than in the original DBM. The original DBM does not make an accurate
classifier on its own; the best classification results with the original DBM were
based on training a separate classifier to use features extracted by the DBM,
rather than by using inference in the DBM to compute the distribution over the
class labels. Mean field inference in the MP-DBM performs well as a classifier
without special modifications. The other advantage of back-propagating through
approximate inference is that back-propagation computes the exact gradient of
the loss. This is better for optimization than the approximate gradients of SML
training, which suffer from both bias and variance. This probably explains why MP-
DBMs may be trained jointly while DBMs require a greedy layer-wise pretraining.
The disadvantage of back-propagating through the approximate inference graph is
that it does not provide a way to optimize the log-likelihood, but rather gives a
heuristic approximation of the generalized pseudolikelihood.
The MP-DBM inspired the NADE-
k
(Raiko et al., 2014) extension to the
NADE framework, which is described in section 20.10.10.
The MP-DBM has some connections to dropout. Dropout shares the same pa-
rameters among many different computational graphs, with the difference between
Figure 20.5: An illustration of the multiprediction training process for a deep Boltzmann
machine. Each row indicates a different example within a minibatch for the same training
step. Each column represents a time step within the mean field inference process. For
each example, we sample a subset of the data variables to serve as inputs to the inference
process. These variables are shaded black to indicate conditioning. We then run the
mean field inference process, with arrows indicating which variables influence which other
variables in the process. In practical applications, we unroll mean field for several steps.
In this illustration, we unroll for only two steps. Dashed arrows indicate how the process
could be unrolled for more steps. The data variables that were not used as inputs to the
inference process become targets, shaded in gray. We can view the inference process for
each example as a recurrent network. We use gradient descent and back-propagation to
train these recurrent networks to produce the correct targets given their inputs. This
trains the mean field process for the MP-DBM to produce accurate estimates. Figure
adapted from Goodfellow et al. (2013b).
each graph being whether it includes or excludes each unit. The MP-DBM also
shares parameters across many computational graphs. In the case of the MP-DBM,
the difference between the graphs is whether each input unit is observed or not.
When a unit is not observed, the MP-DBM does not delete it entirely as dropout
does. Instead, the MP-DBM treats it as a latent variable to be inferred. One could
imagine applying dropout to the MP-DBM by additionally removing some units
rather than making them latent.
20.5 Boltzmann Machines for Real-Valued Data
While Boltzmann machines were originally developed for use with binary data,
many applications such as image and audio modeling seem to require the ability
to represent probability distributions over real values. In some cases, it is possible
to treat real-valued data in the interval [0, 1] as representing the expectation of a
binary variable. For example, Hinton (2000) treats grayscale images in the training
set as defining [0,1] probability values. Each pixel defines the probability of a
binary value being 1, and the binary pixels are all sampled independently from each
other. This is a common procedure for evaluating binary models on grayscale image
datasets. Nonetheless, it is not a particularly theoretically satisfying approach,
and binary images sampled independently in this way have a noisy appearance. In
this section, we present Boltzmann machines that define a probability density over
real-valued data.
20.5.1 Gaussian-Bernoulli RBMs
Restricted Boltzmann machines may be developed for many exponential family
conditional distributions (Welling et al., 2005). Of these, the most common is the
RBM with binary hidden units and real-valued visible units, with the conditional
distribution over the visible units being a Gaussian distribution whose mean is a
function of the hidden units.
There are many ways of parametrizing Gaussian-Bernoulli RBMs. One choice
is whether to use a covariance matrix or a precision matrix for the Gaussian
distribution. Here we present the precision formulation. The modification to obtain
the covariance formulation is straightforward. We wish to have the conditional
distribution
p(v | h) = N(v; W h, β
1
). (20.38)
We can find the terms we need to add to the energy function by expanding the
unnormalized log conditional distribution:
log N( v; W h, β^{−1} ) = −(1/2) (v − W h)^⊤ β (v − W h) + f(β).      (20.39)
Here f encapsulates all the terms that are a function only of the parameters
and not the random variables in the model. We can discard f because its only
role is to normalize the distribution, and the partition function of whatever energy
function we choose will carry out that role.
If we include all the terms (with their sign flipped) involving v from equation
20.39 in our energy function and do not add any other terms involving v, then
our energy function will represent the desired conditional p(v | h).
We have some freedom regarding the other conditional distribution, p(h | v).
Note that equation 20.39 contains a term

(1/2) h^⊤ W^⊤ β W h.      (20.40)
This term cannot be included in its entirety because it includes h_i h_j terms. These
correspond to edges between the hidden units. If we included these terms, we
would have a linear factor model instead of a restricted Boltzmann machine. When
designing our Boltzmann machine, we simply omit these h_i h_j cross terms. Omitting
them does not change the conditional p(v | h), so equation 20.39 is still respected.
We still have a choice, however, about whether to include the terms involving only
a single h_i. If we assume a diagonal precision matrix, we find that for each hidden
unit h_i, we have a term

(1/2) h_i ∑_j β_j W^2_{j,i}.      (20.41)
In the above, we used the fact that h_i^2 = h_i because h_i ∈ {0, 1}. If we include this
term (with its sign flipped) in the energy function, then it will naturally bias h_i
to be turned off when the weights for that unit are large and connected to visible
units with high precision. The choice of whether to include this bias term does
not affect the family of distributions that the model can represent (assuming that
we include bias parameters for the hidden units), but it does affect the learning
dynamics of the model. Including the term may help the hidden unit activations
remain reasonable even when the weights rapidly increase in magnitude.
One way to define the energy function on a Gaussian-Bernoulli RBM is thus:

E(v, h) = (1/2) v^⊤ (β ⊙ v) − (v ⊙ β)^⊤ W h − b^⊤ h,      (20.42)
but we may also add extra terms or parametrize the energy in terms of the variance
rather than precision if we choose.
In this derivation, we have not included a bias term on the visible units, but one
could easily be added. One final source of variability in the parametrization of a
Gaussian-Bernoulli RBM is the choice of how to treat the precision matrix. It may
be either fixed to a constant (perhaps estimated based on the marginal precision
of the data) or learned. It may also be a scalar times the identity matrix, or it
may be a diagonal matrix. Typically we do not allow the precision matrix to be
nondiagonal in this context, because some operations on the Gaussian distribution
require inverting the matrix, and a diagonal matrix can be inverted trivially. In
the sections ahead, we will see that other forms of Boltzmann machines permit
modeling the covariance structure, using various techniques to avoid inverting the
precision matrix.
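The energy in equation 20.42 can be written down directly; the sketch below assumes a diagonal precision represented by a vector beta and omits the visible bias, matching the derivation above.

def gaussian_bernoulli_energy(v, h, W, b, beta):
    # E(v, h) = (1/2) v^T (beta * v) - (v * beta)^T W h - b^T h  (equation 20.42),
    # with a diagonal precision given by the vector beta and no visible bias term.
    # v, h, W, b, beta are assumed to be NumPy arrays of compatible shapes.
    return 0.5 * v @ (beta * v) - (v * beta) @ (W @ h) - b @ h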
20.5.2 Undirected Models of Conditional Covariance
While the Gaussian RBM has been the canonical energy model for real-valued
data, Ranzato et al. (2010a) argue that the Gaussian RBM inductive bias is not
well suited to the statistical variations present in some types of real-valued data,
especially natural images. The problem is that much of the information content
present in natural images is embedded in the covariance between pixels rather than
in the raw pixel values. In other words, it is the relationships between pixels and
not their absolute values where most of the useful information in images resides.
Since the Gaussian RBM only models the conditional mean of the input given the
hidden units, it cannot capture conditional covariance information. In response
to these criticisms, alternative models have been proposed that attempt to better
account for the covariance of real-valued data. These models include the mean and
covariance RBM (mcRBM^1), the mean product of Student t-distribution (mPoT)
model, and the spike and slab RBM (ssRBM).
Mean and Covariance RBM
The mcRBM uses its hidden units to independently encode the conditional mean
and covariance of all observed units. The mcRBM hidden layer is divided into two
groups of units: mean units and covariance units. The group that models the
conditional mean is simply a Gaussian RBM. The other half is a covariance RBM
(Ranzato et al., 2010a), also called a cRBM, whose components model the
conditional covariance structure, as described below.
^1 The term “mcRBM” is pronounced by saying the name of the letters M-C-R-B-M; the “mc”
is not pronounced like the “Mc” in “McDonald’s.”
Specifically, with binary mean units h^{(m)} and binary covariance units h^{(c)}, the
mcRBM model is defined as the combination of two energy functions:

E_mc( x, h^{(m)}, h^{(c)} ) = E_m( x, h^{(m)} ) + E_c( x, h^{(c)} ),      (20.43)

where E_m is the standard Gaussian-Bernoulli RBM energy function,^2

E_m( x, h^{(m)} ) = (1/2) x^⊤ x − ∑_j x^⊤ W_{:,j} h^{(m)}_j − ∑_j b^{(m)}_j h^{(m)}_j,      (20.44)
and E_c is the cRBM energy function that models the conditional covariance
information:

E_c( x, h^{(c)} ) = (1/2) ∑_j h^{(c)}_j ( x^⊤ r^{(j)} )^2 − ∑_j b^{(c)}_j h^{(c)}_j.      (20.45)
The parameter r^{(j)} corresponds to the covariance weight vector associated with
h^{(c)}_j, and b^{(c)} is a vector of covariance offsets. The combined energy function defines
a joint distribution,

p_mc( x, h^{(m)}, h^{(c)} ) = (1/Z) exp( −E_mc( x, h^{(m)}, h^{(c)} ) ),      (20.46)

and a corresponding conditional distribution over the observations given h^{(m)} and
h^{(c)} as a multivariate Gaussian distribution:

p_mc( x | h^{(m)}, h^{(c)} ) = N( x; C^{mc}_{x|h} ∑_j W_{:,j} h^{(m)}_j , C^{mc}_{x|h} ).      (20.47)
Note that the covariance matrix $C^{\mathrm{mc}}_{x \mid h} = \left( \sum_j h_j^{(c)} r^{(j)} r^{(j)\top} + I \right)^{-1}$ is nondiagonal and that $W$ is the weight matrix associated with the Gaussian RBM modeling the conditional means. It is difficult to train the mcRBM via contrastive divergence or persistent contrastive divergence because of its nondiagonal conditional covariance structure. CD and PCD require sampling from the joint distribution of $x, h^{(m)}, h^{(c)}$, which, in a standard RBM, is accomplished by Gibbs sampling over the conditionals. However, in the mcRBM, sampling from $p_{\mathrm{mc}}(x \mid h^{(m)}, h^{(c)})$ requires computing $(C^{\mathrm{mc}})^{-1}$ at every iteration of learning. This can be an impractical computational burden for larger observations. Ranzato and Hinton (2010) avoid direct sampling from the conditional $p_{\mathrm{mc}}(x \mid h^{(m)}, h^{(c)})$ by sampling directly from the marginal $p(x)$ using Hamiltonian (hybrid) Monte Carlo (Neal, 1993) on the mcRBM free energy.
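The burden described above is easy to see in a sketch. Using the same assumed parameter names as before, the function below forms the conditional precision $\sum_j h_j^{(c)} r^{(j)} r^{(j)\top} + I$, inverts it, and draws one sample from the Gaussian of equation 20.47; the $d \times d$ inversion is the step that becomes impractical when it must be repeated at every learning iteration:

```python
import numpy as np

def sample_x_given_h(W, R, h_m, h_c, rng):
    """Draw one x ~ p_mc(x | h^(m), h^(c)) as in equation 20.47 (illustrative sketch)."""
    d = W.shape[0]
    # Conditional precision: sum_j h_j^(c) r^(j) r^(j)^T + I.
    precision = (R * h_c) @ R.T + np.eye(d)
    cov = np.linalg.inv(precision)          # O(d^3); the costly, repeated step
    mean = cov @ (W @ h_m)                  # C_{x|h}^{mc} * sum_j W_{:,j} h_j^(m)
    return rng.multivariate_normal(mean, cov)
```

The spike and slab RBM discussed later in this section is designed to sidestep this inversion altogether.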
^2 This version of the Gaussian-Bernoulli RBM energy function assumes the image data have zero mean per pixel. Pixel offsets can easily be added to the model to account for nonzero pixel means.
Mean Product of Student t-distributions
The mean product of Student t-distribution (mPoT) model (Ranzato et al., 2010b) extends the PoT model (Welling et al., 2003a) in a manner similar to how the mcRBM extends the cRBM. This is achieved by including nonzero Gaussian means by the addition of Gaussian RBM-like hidden units. Like the mcRBM, the PoT conditional distribution over the observation is a multivariate Gaussian (with nondiagonal covariance) distribution; however, unlike the mcRBM, the complementary conditional distribution over the hidden variables is given by conditionally independent Gamma distributions. The Gamma distribution $\mathcal{G}(k, \theta)$ is a probability distribution over positive real numbers, with mean $k\theta$. It is not necessary to have a more detailed understanding of the Gamma distribution to understand the basic ideas underlying the mPoT model.
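For readers who want a quick numerical reminder of the mean-$k\theta$ property, NumPy's Gamma sampler is parametrized directly by the shape $k$ and scale $\theta$ (the particular values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k, theta = 3.0, 2.0
samples = rng.gamma(shape=k, scale=theta, size=1_000_000)
print(samples.mean(), k * theta)   # both approximately 6.0
```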
The mPoT energy function is

$$ E_{\mathrm{mPoT}}(x, h^{(m)}, h^{(c)}) \qquad (20.48) $$
$$ = E_{\mathrm{m}}(x, h^{(m)}) + \sum_j \left( h_j^{(c)} \left( 1 + \frac{1}{2} \left( r^{(j)\top} x \right)^2 \right) + (1 - \gamma_j) \log h_j^{(c)} \right), \qquad (20.49) $$

where $r^{(j)}$ is the covariance weight vector associated with unit $h_j^{(c)}$, and $E_{\mathrm{m}}(x, h^{(m)})$ is as defined in equation 20.44.
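A minimal NumPy sketch of this energy function, again under assumed parameter names (W and b_m for the mean term of equation 20.44, the columns of R for the $r^{(j)}$, and gamma for the $\gamma_j$), might look as follows; note that the covariance units $h^{(c)}$ are positive reals here rather than binary:

```python
import numpy as np

def energy_mpot(x, W, b_m, R, gamma, h_m, h_c):
    """mPoT energy of equations 20.48-20.49 (illustrative sketch, assumed names)."""
    e_mean = 0.5 * x @ x - x @ W @ h_m - b_m @ h_m     # E_m(x, h^(m)), equation 20.44
    cov_terms = h_c * (1.0 + 0.5 * (x @ R) ** 2) + (1.0 - gamma) * np.log(h_c)
    return e_mean + cov_terms.sum()
```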
Just as with the mcRBM, the mPoT model energy function specifies a multivariate Gaussian conditional distribution over $x$ with nondiagonal covariance. Learning in the mPoT model, again like the mcRBM, is complicated by the inability to sample from this nondiagonal Gaussian conditional $p_{\mathrm{mPoT}}(x \mid h^{(m)}, h^{(c)})$, so Ranzato et al. (2010b) also advocate direct sampling of $p(x)$ via Hamiltonian (hybrid) Monte Carlo.
Spike and Slab Restricted Boltzmann Machines
Spike and slab restricted Boltzmann machines (Courville et al., 2011), or ssRBMs, provide another means of modeling the covariance structure of real-valued data. Compared to mcRBMs, ssRBMs have the advantage of requiring neither matrix inversion nor Hamiltonian Monte Carlo methods. Like the mcRBM and the mPoT model, the ssRBM’s binary hidden units encode the conditional covariance across pixels through the use of auxiliary real-valued variables.

The spike and slab RBM has two sets of hidden units: binary spike units $h$ and real-valued slab units $s$. The mean of the visible units conditioned on the hidden units is given by $(h \odot s) W^\top$. In other words, each column $W_{:,i}$ defines a component that can appear in the input when $h_i = 1$. The corresponding spike
variable $h_i$ determines whether that component is present at all. The corresponding slab variable $s_i$ determines the intensity of that component, if it is present. When a spike variable is active, the corresponding slab variable adds variance to the input along the axis defined by $W_{:,i}$. This allows us to model the covariance of the inputs. Fortunately, contrastive divergence and persistent contrastive divergence with Gibbs sampling are still applicable. There is no need to invert any matrix.
Formally, the ssRBM model is defined via its energy function:

$$ E_{\mathrm{ss}}(x, s, h) = - \sum_i x^\top W_{:,i} s_i h_i + \frac{1}{2} x^\top \left( \Lambda + \sum_i \Phi_i h_i \right) x \qquad (20.50) $$
$$ \quad + \frac{1}{2} \sum_i \alpha_i s_i^2 - \sum_i \alpha_i \mu_i s_i h_i - \sum_i b_i h_i + \sum_i \alpha_i \mu_i^2 h_i, \qquad (20.51) $$
where $b_i$ is the offset of the spike $h_i$, and $\Lambda$ is a diagonal precision matrix on the observations $x$. The parameter $\alpha_i > 0$ is a scalar precision parameter for the real-valued slab variable $s_i$. The parameter $\Phi_i$ is a nonnegative diagonal matrix that defines an $h$-modulated quadratic penalty on $x$. Each $\mu_i$ is a mean parameter for the slab variable $s_i$.
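The energy function above translates almost term by term into code. The following is an illustrative sketch with assumed parameter names: Lam holds the diagonal of $\Lambda$, and row $i$ of Phi holds the diagonal of $\Phi_i$:

```python
import numpy as np

def energy_ssrbm(x, s, h, W, Lam, Phi, alpha, mu, b):
    """ssRBM energy of equations 20.50-20.51 (illustrative sketch, assumed names)."""
    interaction = -np.sum((x @ W) * s * h)              # -sum_i x^T W_{:,i} s_i h_i
    quad = 0.5 * x @ ((Lam + Phi.T @ h) * x)            # (1/2) x^T (Lambda + sum_i Phi_i h_i) x
    slab = 0.5 * np.sum(alpha * s ** 2) - np.sum(alpha * mu * s * h)
    spike = -np.sum(b * h) + np.sum(alpha * mu ** 2 * h)
    return interaction + quad + slab + spike
```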
With the joint distribution defined via the energy function, deriving the ssRBM conditional distributions is relatively straightforward. For example, by marginalizing out the slab variables $s$, the conditional distribution over the observations given the binary spike variables $h$ is given by

$$ p_{\mathrm{ss}}(x \mid h) = \frac{1}{P(h)} \frac{1}{Z} \int \exp\left\{ -E(x, s, h) \right\} ds \qquad (20.52) $$
$$ = \mathcal{N}\!\left( x;\; C^{\mathrm{ss}}_{x \mid h} \sum_i W_{:,i} \mu_i h_i,\; C^{\mathrm{ss}}_{x \mid h} \right), \qquad (20.53) $$

where $C^{\mathrm{ss}}_{x \mid h} = \left( \Lambda + \sum_i \Phi_i h_i - \sum_i \alpha_i^{-1} h_i W_{:,i} W_{:,i}^\top \right)^{-1}$. The last equality holds only if the covariance matrix $C^{\mathrm{ss}}_{x \mid h}$ is positive definite.
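A sketch of this conditional, using the same assumed parameter names as the energy sketch above, assembles the conditional precision, checks that it is positive definite (the failure mode discussed at the end of this section), and returns the conditional mean and covariance:

```python
import numpy as np

def ssrbm_conditional(h, W, Lam, Phi, alpha, mu):
    """Mean and covariance of p_ss(x | h) as in equation 20.53 (illustrative sketch)."""
    precision = np.diag(Lam + Phi.T @ h) - (W * (h / alpha)) @ W.T
    if np.any(np.linalg.eigvalsh(precision) <= 0):
        return None                         # covariance not positive definite for this h
    cov = np.linalg.inv(precision)
    mean = cov @ (W @ (mu * h))             # C_{x|h}^{ss} * sum_i W_{:,i} mu_i h_i
    return mean, cov
```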
Gating by the spike variables means that the true marginal distribution over $h \odot s$ is sparse. This is different from sparse coding, where samples from the model “almost never” (in the measure theoretic sense) contain zeros in the code, and MAP inference is required to impose sparsity.
Comparing the ssRBM to the mcRBM and the mPoT models, the ssRBM parametrizes the conditional covariance of the observation in a significantly different way. The mcRBM and mPoT both model the covariance structure of the observation as $\left( \sum_j h_j^{(c)} r^{(j)} r^{(j)\top} + I \right)^{-1}$, using the activation of the hidden units $h_j > 0$ to
enforce constraints on the conditional covariance in the direction $r^{(j)}$. In contrast, the ssRBM specifies the conditional covariance of the observations by using the hidden spike activations $h_i = 1$ to pinch the precision matrix along the direction specified by the corresponding weight vector. The ssRBM conditional covariance is similar to that given by a different model: the product of probabilistic principal components analysis (PoPPCA) (Williams and Agakov, 2002). In the overcomplete setting, sparse activations with the ssRBM parametrization permit significant variance (above the nominal variance given by $\Lambda^{-1}$) only in the directions selected by the sparsely activated $h_i$. In the mcRBM or mPoT models, an overcomplete representation would mean that capturing variation in a particular direction of observation space requires removing potentially all constraints with positive projection in that direction. This suggests that these models are less well suited to the overcomplete setting.
The primary disadvantage of the spike and slab restricted Boltzmann machine
is that some settings of the parameters can correspond to a covariance matrix
that is not positive definite. Such a covariance matrix places more unnormalized
probability on values that are farther from the mean, causing the integral over
all possible outcomes to diverge. Generally this issue can be avoided with simple
heuristic tricks. There is not yet any theoretically satisfying solution. Using
constrained optimization to explicitly avoid the regions where the probability is
undefined is difficult to do without being overly conservative and also preventing
the model from accessing high-performing regions of parameter space.
Qualitatively, convolutional variants of the ssRBM produce excellent samples
of natural images. Some examples are shown in figure 16.1.
The ssRBM allows for several extensions. Including higher-order interactions
and average-pooling of the slab variables (Courville et al., 2014) enables the model
to learn excellent features for a classifier when labeled data is scarce. Adding a
term to the energy function that prevents the partition function from becoming
undefined results in a sparse coding model, spike and slab sparse coding (Goodfellow
et al., 2013d), also known as S3C.
20.6 Convolutional Boltzmann Machines
As we discuss in chapter 9, extremely high-dimensional inputs such as images place
great strain on the computation, memory and