Chapter 20
Deep Generative Models
In this chapter, we present several of the specific kinds of generative models that
can be built and trained using the techniques presented in chapters 16 through 19. All
these models represent probability distributions over multiple variables in some
way. Some allow the probability distribution function to be evaluated explicitly.
Others do not allow the evaluation of the probability distribution function but
support operations that implicitly require knowledge of it, such as drawing samples
from the distribution. Some of these models are structured probabilistic models
described in terms of graphs and factors, using the language of graphical models
presented in chapter 16. Others cannot be easily described in terms of factors but
represent probability distributions nonetheless.
20.1 Boltzmann Machines
Boltzmann machines were originally introduced as a general “connectionist” ap-
proach to learning arbitrary probability distributions over binary vectors (Fahlman
et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986).
Variants of the Boltzmann machine that include other kinds of variables have long
ago surpassed the popularity of the original. In this section we briefly introduce
the binary Boltzmann machine and discuss the issues that come up when trying to
train and perform inference in the model.
We define the Boltzmann machine over a d-dimensional binary random vector
x ∈ {0, 1}^d. The Boltzmann machine is an energy-based model (section 16.2.4),
meaning we define the joint probability distribution using an energy function:
P(x) = exp(−E(x)) / Z,      (20.1)
where E(x) is the energy function, and Z is the partition function that ensures
that ∑_x P(x) = 1. The energy function of the Boltzmann machine is given by
E(x) = −x^⊤ U x − b^⊤ x,      (20.2)
where U is the “weight” matrix of model parameters and b is the vector of bias
parameters.
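As a concrete illustration, the energy of equation 20.2 and the probability of equation 20.1 can be computed directly for a very small model. The following NumPy sketch assumes a symmetric weight matrix U with zero diagonal and evaluates the partition function Z by brute force, which is feasible only when d is tiny.

import numpy as np
from itertools import product

def boltzmann_energy(x, U, b):
    # E(x) = -x^T U x - b^T x  (equation 20.2)
    return -x @ U @ x - b @ x

def unnormalized_prob(x, U, b):
    # exp(-E(x)); dividing by the partition function Z gives P(x) (equation 20.1)
    return np.exp(-boltzmann_energy(x, U, b))

# Hypothetical tiny model with d = 3 units, so Z can be summed exhaustively.
rng = np.random.default_rng(0)
d = 3
U = rng.normal(scale=0.1, size=(d, d))
U = np.triu(U, 1) + np.triu(U, 1).T   # symmetric weights, zero diagonal
b = rng.normal(scale=0.1, size=d)

Z = sum(unnormalized_prob(np.array(s, dtype=float), U, b)
        for s in product([0, 1], repeat=d))
x = np.array([1.0, 0.0, 1.0])
print(unnormalized_prob(x, U, b) / Z)   # P(x); this exhaustive sum is intractable for large d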
In the general setting of the Boltzmann machine, we are given a set of training
examples, each of which is d-dimensional. Equation 20.1 describes the joint
probability distribution over the observed variables. While this scenario is certainly
viable, it does limit the kinds of interactions between the observed variables to
those described by the weight matrix. Specifically, it means that the probability of
one unit being on is given by a linear model (logistic regression) from the values of
the other units.
The Boltzmann machine becomes more powerful when not all the variables are
observed. In this case, the latent variables can act similarly to hidden units in a
multilayer perceptron and model higher-order interactions among the visible units.
Just as the addition of hidden units to convert logistic regression into an MLP results
in the MLP being a universal approximator of functions, a Boltzmann machine
with hidden units is no longer limited to modeling linear relationships between
variables. Instead, the Boltzmann machine becomes a universal approximator of
probability mass functions over discrete variables (Le Roux and Bengio, 2008).
Formally, we decompose the units x into two subsets: the visible units v and
the latent (or hidden) units h. The energy function becomes

E(v, h) = −v^⊤ R v − v^⊤ W h − h^⊤ S h − b^⊤ v − c^⊤ h.      (20.3)
Boltzmann Machine Learning
Learning algorithms for Boltzmann machines
are usually based on maximum likelihood. All Boltzmann machines have an
intractable partition function, so the maximum likelihood gradient must be ap-
proximated using the techniques described in chapter 18.
One interesting property of Boltzmann machines when trained with learning
rules based on maximum likelihood is that the update for a particular weight
connecting two units depends only on the statistics of those two units, collected
under different distributions:
P_model(v) and P̃_data(v)P_model(h | v). The rest of the
network participates in shaping those statistics, but the weight can be updated
without knowing anything about the rest of the network or how those statistics were
produced. This means that the learning rule is “local,” which makes Boltzmann
machine learning somewhat biologically plausible. It is conceivable that if each
neuron were a random variable in a Boltzmann machine, then the axons and
dendrites connecting two random variables could learn only by observing the firing
pattern of the cells that they actually physically touch. In particular, in the
positive phase, two units that frequently activate together have their connection
strengthened. This is an example of a Hebbian learning rule (Hebb, 1949) often
summarized with the mnemonic “fire together, wire together.” Hebbian learning
rules are among the oldest hypothesized explanations for learning in biological
systems and remain relevant today (Giudice et al., 2009).
Other learning algorithms that use more information than local statistics seem
to require us to hypothesize the existence of more machinery than this. For
example, for the brain to implement back-propagation in a multilayer perceptron,
it seems necessary for the brain to maintain a secondary communication network
for transmitting gradient information backward through the network. Proposals for
biologically plausible implementations (and approximations) of back-propagation
have been made (Hinton, 2007a; Bengio, 2015) but remain to be validated, and
Bengio (2015) links back-propagation of gradients to inference in energy-based
models similar to the Boltzmann machine (but with continuous latent variables).
The negative phase of Boltzmann machine learning is somewhat harder to
explain from a biological point of view. As argued in section 18.2, dream sleep
may be a form of negative phase sampling. This idea is more speculative though.
20.2 Restricted Boltzmann Machines
Invented under the name harmonium (Smolensky, 1986), restricted Boltzmann
machines are some of the most common building blocks of deep probabilistic
models. We briefly describe RBMs in section 16.7.1. Here we review the previous
information and go into more detail. RBMs are undirected probabilistic graphical
models containing a layer of observable variables and a single layer of latent
variables. RBMs may be stacked (one on top of the other) to form deeper models.
See figure 20.1 for some examples. In particular, figure 20.1a shows the graph
structure of the RBM itself. It is a bipartite graph, with no connections permitted
between any variables in the observed layer or between any units in the latent
layer.
We begin with the binary version of the restricted Boltzmann machine, but as
Figure 20.1: Examples of models that may be built with restricted Boltzmann machines.
(a) The restricted Boltzmann machine itself is an undirected graphical model based on
a bipartite graph, with visible units in one part of the graph and hidden units in the
other part. There are no connections among the visible units, nor any connections among
the hidden units. Typically every visible unit is connected to every hidden unit, but it
is possible to construct sparsely connected RBMs such as convolutional RBMs. (b) A
deep belief network is a hybrid graphical model involving both directed and undirected
connections. Like an RBM, it has no intralayer connections. However, a DBN has multiple
hidden layers, and thus connections between hidden units that are in separate layers.
All the local conditional probability distributions needed by the deep belief network are
copied directly from the local conditional probability distributions of its constituent RBMs.
Alternatively, we could also represent the deep belief network with a completely undirected
graph, but it would need intralayer connections to capture the dependencies between
parents. (c) A deep Boltzmann machine is an undirected graphical model with several
layers of latent variables. Like RBMs and DBNs, DBMs lack intralayer connections.
DBMs are less closely tied to RBMs than DBNs are. When initializing a DBM from a
stack of RBMs, it is necessary to modify the RBM parameters slightly. Some kinds of
DBMs may be trained without first training a set of RBMs.
we see later, there are extensions to other types of visible and hidden units.
More formally, let the observed layer consist of a set of n_v binary random
variables, which we refer to collectively with the vector v. We refer to the latent,
or hidden, layer of n_h binary random variables as h.
Like the general Boltzmann machine, the restricted Boltzmann machine is an
energy-based model with the joint probability distribution specified by its energy
function:
P(v = v, h = h) = (1/Z) exp(−E(v, h)).      (20.4)
The energy function for an RBM is given by
E(v, h) = −b^⊤ v − c^⊤ h − v^⊤ W h,      (20.5)
and Z is the normalizing constant known as the partition function:
Z = ∑_v ∑_h exp{−E(v, h)}.      (20.6)
It is apparent from the definition of the partition function Z that the naive method
of computing Z (exhaustively summing over all states) could be computationally
intractable, unless a cleverly designed algorithm could exploit regularities in the
probability distribution to compute Z faster. In the case of restricted Boltzmann
machines, Long and Servedio (2010) formally proved that the partition function Z
is intractable. The intractable partition function Z implies that the normalized
joint probability distribution P(v) is also intractable to evaluate.
20.2.1 Conditional Distributions
Though P(v) is intractable, the bipartite graph structure of the RBM has the
special property that its conditional distributions P(h | v) and P(v | h) are
factorial and relatively simple to compute and sample from.
Deriving the conditional distributions from the joint distribution is straightforward:

P(h | v) = P(h, v) / P(v)      (20.7)
         = (1/P(v)) (1/Z) exp( b^⊤ v + c^⊤ h + v^⊤ W h )      (20.8)
         = (1/Z′) exp( c^⊤ h + v^⊤ W h )      (20.9)
         = (1/Z′) exp( ∑_{j=1}^{n_h} c_j h_j + ∑_{j=1}^{n_h} v^⊤ W_{:,j} h_j )      (20.10)
         = (1/Z′) ∏_{j=1}^{n_h} exp( c_j h_j + v^⊤ W_{:,j} h_j ).      (20.11)
Since we are conditioning on the visible units v, we can treat these as constant
with respect to the distribution P(h | v). The factorial nature of the conditional
P(h | v) follows immediately from our ability to write the joint probability over
the vector h as the product of (unnormalized) distributions over the individual
elements, h_j. It is now a simple matter of normalizing the distributions over the
individual binary h_j.
P(h_j = 1 | v) = P̃(h_j = 1 | v) / ( P̃(h_j = 0 | v) + P̃(h_j = 1 | v) )      (20.12)
              = exp{ c_j + v^⊤ W_{:,j} } / ( exp{0} + exp{ c_j + v^⊤ W_{:,j} } )      (20.13)
              = σ( c_j + v^⊤ W_{:,j} ).      (20.14)
We can now express the full conditional over the hidden layer as the factorial
distribution:

P(h | v) = ∏_{j=1}^{n_h} σ( (2h − 1) ⊙ (c + W^⊤ v) )_j.      (20.15)

A similar derivation will show that the other conditional of interest to us, P(v | h),
is also a factorial distribution:

P(v | h) = ∏_{i=1}^{n_v} σ( (2v − 1) ⊙ (b + W h) )_i.      (20.16)
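These conditionals translate directly into block sampling code. The following NumPy sketch assumes a weight matrix W of shape n_v × n_h, a visible bias b, and a hidden bias c, all hypothetical placeholders used only for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, c, rng):
    # P(h_j = 1 | v) = sigmoid(c_j + v^T W_{:,j})  (equations 20.14-20.15)
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b, rng):
    # P(v_i = 1 | h) = sigmoid(b_i + W_{i,:} h)  (equation 20.16)
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float), p

# Hypothetical sizes for illustration.
rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.normal(scale=0.1, size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
v = rng.integers(0, 2, size=n_v).astype(float)
h, _ = sample_h_given_v(v, W, c, rng)        # one half of a block Gibbs step
v_next, _ = sample_v_given_h(h, W, b, rng)   # the other half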
20.2.2 Training Restricted Boltzmann Machines
Because the RBM admits efficient evaluation and differentiation of P̃(v) and
efficient MCMC sampling in the form of block Gibbs sampling, it can readily be
trained with any of the techniques described in chapter 18 for training models
that have intractable partition functions. This includes CD, SML (PCD), ratio
matching, and so on. Compared to other undirected models used in deep learning,
the RBM is relatively straightforward to train because we can compute P(h | v)
exactly in closed form. Some other deep models, such as the deep Boltzmann
machine, combine both the difficulty of an intractable partition function and the
difficulty of intractable inference.
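As a rough sketch of how such training might look in practice, the following NumPy code performs a single CD-1 update on a minibatch, reusing the conditionals of section 20.2.1; the learning rate and the use of probabilities rather than samples in the update are conventional choices rather than requirements.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(V, W, b, c, rng, lr=0.01):
    # One contrastive divergence (CD-1) update on a minibatch V, one example per row.
    ph_data = sigmoid(c + V @ W)                        # positive phase: P(h | v) on the data
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    pv_model = sigmoid(b + h_sample @ W.T)              # negative phase: one block Gibbs step
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(c + v_model @ W)
    m = V.shape[0]
    W += lr * (V.T @ ph_data - v_model.T @ ph_model) / m
    b += lr * (V - v_model).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c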
20.3 Deep Belief Networks
Deep belief networks (DBNs) were one of the first nonconvolutional models
to successfully admit training of deep architectures (Hinton et al., 2006; Hinton,
2007b). The introduction of deep belief networks in 2006 began the current deep
learning renaissance. Prior to the introduction of deep belief networks, deep models
were considered too difficult to optimize. Kernel machines with convex objective
functions dominated the research landscape. Deep belief networks demonstrated
that deep architectures can be successful by outperforming kernelized support
vector machines on the MNIST dataset (Hinton et al., 2006). Today, deep belief
networks have mostly fallen out of favor and are rarely used, even compared to
other unsupervised or generative learning algorithms, but they are still deservedly
recognized for their important role in deep learning history.
Deep belief networks are generative models with several layers of latent variables.
The latent variables are typically binary, while the visible units may be binary
or real. There are no intralayer connections. Usually, every unit in each layer is
connected to every unit in each neighboring layer, though it is possible to construct
more sparsely connected DBNs. The connections between the top two layers are
undirected. The connections between all other layers are directed, with the arrows
pointed toward the layer that is closest to the data. See figure 20.1b for an example.
A DBN with l hidden layers contains l weight matrices: W^{(1)}, . . . , W^{(l)}. It
also contains l + 1 bias vectors, b^{(0)}, . . . , b^{(l)}, with b^{(0)} providing the biases for the
visible layer. The probability distribution represented by the DBN is given by

P(h^{(l)}, h^{(l−1)}) ∝ exp( b^{(l)⊤} h^{(l)} + b^{(l−1)⊤} h^{(l−1)} + h^{(l−1)⊤} W^{(l)} h^{(l)} ),      (20.17)

P(h^{(k)}_i = 1 | h^{(k+1)}) = σ( b^{(k)}_i + W^{(k+1)⊤}_{:,i} h^{(k+1)} )   ∀i, ∀k ∈ 1, . . . , l − 2,      (20.18)

P(v_i = 1 | h^{(1)}) = σ( b^{(0)}_i + W^{(1)⊤}_{:,i} h^{(1)} )   ∀i.      (20.19)

In the case of real-valued visible units, substitute

v ∼ N( v; b^{(0)} + W^{(1)⊤} h^{(1)}, β^{−1} ),      (20.20)

with β diagonal for tractability. Generalizations to other exponential family visible
units are straightforward, at least in theory. A DBN with only one hidden layer is
just an RBM.
To generate a sample from a DBN, we first run several steps of Gibbs sampling
on the top two hidden layers. This stage is essentially drawing a sample from
the RBM defined by the top two hidden layers. We can then use a single pass of
ancestral sampling through the rest of the model to draw a sample from the visible
units.
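A minimal sketch of this sampling procedure, assuming binary units, weight matrices Ws = [W^{(1)}, ..., W^{(l)}] with W^{(k)} of shape n_{k−1} × n_k, and biases bs = [b^{(0)}, ..., b^{(l)}], might look as follows.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def sample_dbn(Ws, bs, n_gibbs, rng):
    # Ws = [W1, ..., Wl]; bs = [b0, ..., bl], with b0 biasing the visible layer.
    W_top = Ws[-1]
    h_below = bernoulli(np.full(W_top.shape[0], 0.5), rng)
    # Gibbs sampling in the undirected RBM formed by the top two hidden layers.
    for _ in range(n_gibbs):
        h_top = bernoulli(sigmoid(bs[-1] + h_below @ W_top), rng)
        h_below = bernoulli(sigmoid(bs[-2] + W_top @ h_top), rng)
    # Single ancestral pass down through the directed layers (equations 20.18-20.19).
    h = h_below
    for k in range(len(Ws) - 2, -1, -1):
        h = bernoulli(sigmoid(bs[k] + Ws[k] @ h), rng)
    return h   # a sample of the visible units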
Deep belief networks incur many of the problems associated with both directed
models and undirected models.
Inference in a deep belief network is intractable because of the explaining away
effect within each directed layer and the interaction between the two hidden layers
that have undirected connections. Evaluating or maximizing the standard evidence
lower bound on the log-likelihood is also intractable, because the evidence lower
bound takes the expectation of cliques whose size is equal to the network width.
Evaluating or maximizing the log-likelihood requires confronting not just the
problem of intractable inference to marginalize out the latent variables, but also
the problem of an intractable partition function within the undirected model of
the top two layers.
To train a deep belief network, one begins by training an RBM to maximize
E_{v∼p_data} log p(v) using contrastive divergence or stochastic maximum likelihood.
The parameters of the RBM then define the parameters of the first layer of the
DBN. Next, a second RBM is trained to approximately maximize

E_{v∼p_data} E_{h^{(1)}∼p^{(1)}(h^{(1)} | v)} log p^{(2)}( h^{(1)} ),      (20.21)

where p^{(1)} is the probability distribution represented by the first RBM, and p^{(2)}
is the probability distribution represented by the second RBM. In other words,
the second RBM is trained to model the distribution defined by sampling the
hidden units of the first RBM, when the first RBM is driven by the data. This
procedure can be repeated indefinitely, to add as many layers to the DBN as
desired, with each new RBM modeling the samples of the previous one. Each RBM
defines another layer of the DBN. This procedure can be justified as increasing a
variational lower bound on the log-likelihood of the data under the DBN (Hinton
et al., 2006).
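The greedy procedure can be sketched as follows in NumPy, where train_rbm is a hypothetical helper that fits a single RBM (for example with CD or SML) and returns its parameters.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_pretrain_dbn(data, hidden_sizes, train_rbm, rng):
    # train_rbm(X, n_hidden) is a hypothetical helper returning (W, b, c) for an
    # RBM fit to the rows of X.
    params, X = [], data
    for n_hidden in hidden_sizes:
        W, b, c = train_rbm(X, n_hidden)
        params.append((W, b, c))
        # Sample h ~ p(h | v) with the RBM driven by its training data; these
        # samples become the "data" for the next RBM (equation 20.21).
        p_h = sigmoid(c + X @ W)
        X = (rng.random(p_h.shape) < p_h).astype(float)
    return params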
In most applications, no effort is made to jointly train the DBN after the greedy
layer-wise procedure is complete. However, it is possible to perform generative
fine-tuning using the wake-sleep algorithm.
The trained DBN may be used directly as a generative model, but most of the
interest in DBNs arose from their ability to improve classification models. We can
take the weights from the DBN and use them to define an MLP:
h^{(1)} = σ( b^{(1)} + v^⊤ W^{(1)} ),      (20.22)

h^{(l)} = σ( b^{(l)} + h^{(l−1)⊤} W^{(l)} )   ∀l ∈ 2, . . . , m.      (20.23)
After initializing this MLP with the weights and biases learned via generative
training of the DBN, we can train the MLP to perform a classification task. This
additional training of the MLP is an example of discriminative fine-tuning.
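A sketch of the resulting deterministic upward pass, assuming the weights and biases have been copied from the DBN, is shown below; the output features would then be fed to a classifier and fine-tuned discriminatively.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dbn_mlp_features(v, weights, biases):
    # Upward pass of equations 20.22-20.23; weights[k] is W^{(k+1)}, biases[k] is b^{(k+1)}.
    h = v
    for W, b in zip(weights, biases):
        h = sigmoid(b + h @ W)
    return h   # deepest hidden features, to be fed into a classifier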
This specific choice of MLP is somewhat arbitrary, compared to many of the
inference equations in chapter 19 that are derived from first principles. This MLP
is a heuristic choice that seems to work well in practice and is used consistently
in the literature. Many approximate inference techniques are motivated by their
ability to find a maximally tight variational lower bound on the log-likelihood
under some set of constraints. One can construct a variational lower bound on the
log-likelihood using the hidden unit expectations defined by the DBN’s MLP, but
this is true of any probability distribution over the hidden units, and there is no
reason to believe that this MLP provides a particularly tight bound. In particular,
the MLP ignores many important interactions in the DBN graphical model. The
MLP propagates information upward from the visible units to the deepest hidden
units, but it does not propagate any information downward or sideways. The DBN
graphical model has explaining away interactions between all the hidden units
within the same layer, as well as top-down interactions between layers.
While the log-likelihood of a DBN is intractable, it may be approximated with
AIS (Salakhutdinov and Murray, 2008). This permits evaluating its quality as a
generative model.
The term “deep belief network” is commonly used incorrectly to refer to any
kind of deep neural network, even networks without latent variable semantics.
The term should refer specifically to models with undirected connections in the
deepest layer and directed connections pointing downward between all other pairs
of consecutive layers.
The term may also cause some confusion because “belief network” is sometimes
used to refer to purely directed models, while deep belief networks contain an
undirected layer. Deep belief networks also share the acronym DBN with dynamic
Bayesian networks (Dean and Kanazawa, 1989), which are Bayesian networks for
representing Markov chains.
20.4 Deep Boltzmann Machines
A deep Boltzmann machine, or DBM (Salakhutdinov and Hinton, 2009a), is
another kind of deep generative model. Unlike the deep belief network (DBN),
it is an entirely undirected model. Unlike the RBM, the DBM has several layers
of latent variables (RBMs have just one). But like the RBM, within each layer,
the variables are mutually independent, conditioned on the variables in
the neighboring layers. See figure 20.2 for the graph structure. Deep Boltzmann
machines have been applied to a variety of tasks, including document modeling
(Srivastava et al., 2013).
Like RBMs and DBNs, DBMs typically contain only binary units—as we
assume for simplicity of our presentation of the model—but it is straightforward
to include real-valued visible units.
A DBM is an energy-based model, meaning that the joint probability distribution
over the model variables is parametrized by an energy function E. In the case
of a deep Boltzmann machine with one visible layer, v, and three hidden layers,
h^{(1)}, h^{(2)}, and h^{(3)}, the joint probability is given by
P( v, h^{(1)}, h^{(2)}, h^{(3)} ) = (1/Z(θ)) exp( −E(v, h^{(1)}, h^{(2)}, h^{(3)}; θ) ).      (20.24)
To simplify our presentation, we omit the bias parameters below. The DBM energy
function is then defined as follows:
E(v, h^{(1)}, h^{(2)}, h^{(3)}; θ) = −v^⊤ W^{(1)} h^{(1)} − h^{(1)⊤} W^{(2)} h^{(2)} − h^{(2)⊤} W^{(3)} h^{(3)}.      (20.25)
Figure 20.2: The graphical model for a deep Boltzmann machine with one visible layer
(bottom) and two hidden layers. Connections are only between units in neighboring layers.
There are no intralayer connections.
In comparison to the RBM energy function (equation 20.5), the DBM energy
function includes connections between the hidden units (latent variables) in the
form of the weight matrices (W^{(2)} and W^{(3)}). As we will see, these connections
have significant consequences for the model behavior as well as how we go about
performing inference in the model.
In comparison to fully connected Boltzmann machines (with every unit con-
nected to every other unit), the DBM offers some advantages that are similar
to those offered by the RBM. Specifically, as illustrated in figure 20.3, the DBM
layers can be organized into a bipartite graph, with odd layers on one side and
even layers on the other. This immediately implies that when we condition on the
variables in the even layer, the variables in the odd layers become conditionally
independent. Of course, when we condition on the variables in the odd layers, the
variables in the even layers also become conditionally independent.
The bipartite structure of the DBM means that we can apply the same equa-
tions we have previously used for the conditional distributions of an RBM to
determine the conditional distributions in a DBM. The units within a layer are
conditionally independent from each other given the values of the neighboring
layers, so the distributions over binary variables can be fully described by the
Bernoulli parameters, giving the probability of each unit being active. In our
example with two hidden layers, the activation probabilities are given by

P( v_i = 1 | h^{(1)} ) = σ( W^{(1)}_{i,:} h^{(1)} ),      (20.26)
Figure 20.3: A deep Boltzmann machine, rearranged to reveal its bipartite graph structure.
P( h^{(1)}_i = 1 | v, h^{(2)} ) = σ( v^⊤ W^{(1)}_{:,i} + W^{(2)}_{i,:} h^{(2)} ),      (20.27)

and

P( h^{(2)}_k = 1 | h^{(1)} ) = σ( h^{(1)⊤} W^{(2)}_{:,k} ).      (20.28)
The bipartite structure makes Gibbs sampling in a deep Boltzmann machine
efficient. The naive approach to Gibbs sampling is to update only one variable
at a time. RBMs allow all the visible units to be updated in one block and all
the hidden units to be updated in a second block. One might naively assume
that a DBM with l layers requires l + 1 updates, with each iteration updating
a block consisting of one layer of units. Instead, it is possible to update all the
units in only two iterations. Gibbs sampling can be divided into two blocks of
updates, one including all even layers (including the visible layer) and the other
including all odd layers. Because of the bipartite DBM connection pattern, given
the even layers, the distribution over the odd layers is factorial and thus can be
sampled simultaneously and independently as a block. Likewise, given the odd
layers, the even layers can be sampled simultaneously and independently as a
block. Efficient sampling is especially important for training with the stochastic
maximum likelihood algorithm.
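A sketch of this two-block Gibbs sweep for the two-hidden-layer case (with bias terms omitted, as in equation 20.25) follows; W1 and W2 are the weight matrices between the visible layer and h^{(1)} and between h^{(1)} and h^{(2)}.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    # Odd block: h1 given the even block {v, h2} (equation 20.27).
    h1 = bernoulli(sigmoid(v @ W1 + W2 @ h2), rng)
    # Even block: v given h1 (equation 20.26) and h2 given h1 (equation 20.28).
    v = bernoulli(sigmoid(W1 @ h1), rng)
    h2 = bernoulli(sigmoid(h1 @ W2), rng)
    return v, h1, h2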
20.4.1 Interesting Properties
Deep Boltzmann machines have many interesting properties.
DBMs were developed after DBNs. Compared to DBNs, the posterior distribution
P(h | v) is simpler for DBMs. Somewhat counterintuitively, the simplicity
of this posterior distribution allows richer approximations of the posterior. In
the case of the DBN, we perform classification using a heuristically motivated
approximate inference procedure, in which we guess that a reasonable value for
the mean field expectation of the hidden units can be provided by an upward pass
through the network in an MLP that uses sigmoid activation functions and the
same weights as the original DBN. Any distribution Q(h) can be used to obtain a
variational lower bound on the log-likelihood. This heuristic procedure therefore
enables us to obtain such a bound. Yet the bound is not explicitly optimized in
any way, so it may be far from tight. In particular, the heuristic estimate of Q
ignores interactions between hidden units within the same layer as well as the
top-down feedback influence of hidden units in deeper layers on hidden units that
are closer to the input. Because the heuristic MLP-based inference procedure in
the DBN is not able to account for these interactions, the resulting Q is presumably
far from optimal. In DBMs, all the hidden units within a layer are conditionally
independent given the other layers. This lack of intralayer interaction makes it
possible to use fixed-point equations to optimize the variational lower bound and
find the true optimal mean field expectations (to within some numerical tolerance).
The use of proper mean field allows the approximate inference procedure for
DBMs to capture the influence of top-down feedback interactions. This makes
DBMs interesting from the point of view of neuroscience, because the human brain
is known to use many top-down feedback connections. Because of this property,
DBMs have been used as computational models of real neuroscientific phenomena
(Series et al., 2010; Reichert et al., 2011).
One unfortunate property of DBMs is that sampling from them is relatively
difficult. DBNs only need to use MCMC sampling in their top pair of layers. The
other layers are used only at the end of the sampling process, in one efficient
ancestral sampling pass. To generate a sample from a DBM, it is necessary to
use MCMC across all layers, with every layer of the model participating in every
Markov chain transition.
20.4.2 DBM Mean Field Inference
The conditional distribution over one DBM layer given the neighboring layers is
factorial. In the example of the DBM with two hidden layers, these distributions
are P(v | h^{(1)}), P(h^{(1)} | v, h^{(2)}), and P(h^{(2)} | h^{(1)}). The distribution over all
hidden layers generally does not factorize because of interactions between layers. In
the example with two hidden layers, P(h^{(1)}, h^{(2)} | v) does not factorize because of
the interaction weights W^{(2)} between h^{(1)} and h^{(2)}, which render these variables
mutually dependent.
As was the case with the DBN, we are left to seek out methods to approximate
the DBM posterior distribution. Unlike the DBN, however, the DBM posterior
distribution over its hidden units—while complicated—is easy to approximate
with a variational approximation (as discussed in section 19.4), specifically a
mean field approximation. The mean field approximation is a simple form of
variational inference, where we restrict the approximating distribution to fully
factorial distributions. In the context of DBMs, the mean field equations capture
the bidirectional interactions between layers. In this section we derive the iterative
approximate inference procedure originally introduced in Salakhutdinov and Hinton
(2009a).
In variational approximations to inference, we approach the task of approxi-
mating a particular target distribution—in our case, the posterior distribution over
the hidden units given the visible units—by some reasonably simple family of dis-
tributions. In the case of the mean field approximation, the approximating family
is the set of distributions where the hidden units are conditionally independent.
We now develop the mean field approach for the example with two hidden
layers. Let Q(h^{(1)}, h^{(2)} | v) be the approximation of P(h^{(1)}, h^{(2)} | v). The mean
field assumption implies that

Q(h^{(1)}, h^{(2)} | v) = ∏_j Q(h^{(1)}_j | v) ∏_k Q(h^{(2)}_k | v).      (20.29)
The mean field approximation attempts to find a member of this family of
distributions that best fits the true posterior P(h^{(1)}, h^{(2)} | v). Importantly, the
inference process must be run again to find a different distribution Q every time
we use a new value of v.
One can conceive of many ways of measuring how well Q(h | v) fits P(h | v).
The mean field approach is to minimize

KL(Q ‖ P) = ∑_h Q(h^{(1)}, h^{(2)} | v) log ( Q(h^{(1)}, h^{(2)} | v) / P(h^{(1)}, h^{(2)} | v) ).      (20.30)
In general, we do not have to provide a parametric form of the approximating
distribution beyond enforcing the independence assumptions. The variational
approximation procedure is generally able to recover a functional form of the
approximate distribution. However, in the case of a mean field assumption on
binary hidden units (the case we are developing here) there is no loss of generality
resulting from fixing a parametrization of the model in advance.
We parametrize Q as a product of Bernoulli distributions; that is, we associate
the probability of each element of h^{(1)} with a parameter. Specifically, for each j,
ĥ^{(1)}_j = Q(h^{(1)}_j = 1 | v), where ĥ^{(1)}_j ∈ [0, 1], and for each k, ĥ^{(2)}_k = Q(h^{(2)}_k = 1 | v),
where ĥ^{(2)}_k ∈ [0, 1]. Thus we have the following approximation to the posterior:

Q(h^{(1)}, h^{(2)} | v) = ∏_j Q(h^{(1)}_j | v) ∏_k Q(h^{(2)}_k | v)      (20.31)

= ∏_j ( ĥ^{(1)}_j )^{h^{(1)}_j} ( 1 − ĥ^{(1)}_j )^{(1 − h^{(1)}_j)} × ∏_k ( ĥ^{(2)}_k )^{h^{(2)}_k} ( 1 − ĥ^{(2)}_k )^{(1 − h^{(2)}_k)}.      (20.32)
Of course, for DBMs with more layers, the approximate posterior parametrization
can be extended in the obvious way, exploiting the bipartite structure of the graph
to update all the even layers simultaneously and then to update all the odd layers
simultaneously, following the same schedule as Gibbs sampling.
Now that we have specified our family of approximating distributions Q, it
remains to specify a procedure for choosing the member of this family that best
fits P. The most straightforward way to do this is to use the mean field equations
specified by equation 19.56. These equations were derived by solving for where the
derivatives of the variational lower bound are zero. They describe in an abstract
manner how to optimize the variational lower bound for any model, simply by
taking expectations with respect to Q.
Applying these general equations, we obtain the update rules (again, ignoring
bias terms):

ĥ^{(1)}_j = σ( ∑_i v_i W^{(1)}_{i,j} + ∑_k W^{(2)}_{j,k} ĥ^{(2)}_k ),   ∀j,      (20.33)

ĥ^{(2)}_k = σ( ∑_j W^{(2)}_{j,k} ĥ^{(1)}_j ),   ∀k.      (20.34)
At a fixed point of this system of equations, we have a local maximum of the
variational lower bound L(Q). Thus these fixed-point update equations define an
iterative algorithm where we alternate updates of ĥ^{(1)}_j (using equation 20.33) and
updates of ĥ^{(2)}_k (using equation 20.34). On small problems such as MNIST, as few
as ten iterations can be sufficient to find an approximate positive phase gradient
for learning, and fifty usually suffice to obtain a high-quality representation of
a single specific example to be used for high-accuracy classification. Extending
approximate variational inference to deeper DBMs is straightforward.
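These fixed-point updates are straightforward to implement. The following NumPy sketch, again ignoring bias terms, runs a fixed number of mean field iterations for a single visible vector v.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dbm_mean_field(v, W1, W2, n_iters=10):
    # Fixed-point iteration of equations 20.33-20.34 for a single visible vector v.
    hhat1 = np.full(W1.shape[1], 0.5)
    hhat2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        hhat1 = sigmoid(v @ W1 + W2 @ hhat2)   # equation 20.33
        hhat2 = sigmoid(hhat1 @ W2)            # equation 20.34
    return hhat1, hhat2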
20.4.3 DBM Parameter Learning
Learning in the DBM must confront both the challenge of an intractable partition
function, using the techniques from chapter 18, and the challenge of an intractable
posterior distribution, using the techniques from chapter 19.
As described in section 20.4.2, variational inference allows the construction of
a distribution Q(h | v) that approximates the intractable P(h | v). Learning then
proceeds by maximizing L(v, Q, θ), the variational lower bound on the intractable
log-likelihood, log P(v; θ).
For a deep Boltzmann machine with two hidden layers, L is given by

L(Q, θ) = ∑_i ∑_j v_i W^{(1)}_{i,j} ĥ^{(1)}_j + ∑_j ∑_k ĥ^{(1)}_j W^{(2)}_{j,k} ĥ^{(2)}_k − log Z(θ) + H(Q).      (20.35)
This expression still contains the log partition function, log Z(θ). Because a deep
Boltzmann machine contains restricted Boltzmann machines as components, the
hardness results for computing the partition function and sampling that apply to
restricted Boltzmann machines also apply to deep Boltzmann machines. This means
that evaluating the probability mass function of a Boltzmann machine requires
approximate methods such as annealed importance sampling. Likewise, training
the model requires approximations to the gradient of the log partition function. See
chapter 18 for a general description of these methods. DBMs are typically trained
using stochastic maximum likelihood. Many of the other techniques described in
chapter 18 are not applicable. Techniques such as pseudolikelihood require the
ability to evaluate the unnormalized probabilities, rather than merely obtain a
variational lower bound on them. Contrastive divergence is slow for deep Boltzmann
machines because they do not allow efficient sampling of the hidden units given the
visible units—instead, contrastive divergence would require burning in a Markov
chain every time a new negative phase sample is needed.
The nonvariational version of the stochastic maximum likelihood algorithm is
discussed in section 18.2. Variational stochastic maximum likelihood as applied to
the DBM is given in algorithm 20.1. Recall that we describe a simplified variant
of the DBM that lacks bias parameters; including them is trivial.
20.4.4 Layer-Wise Pretraining
Unfortunately, training a DBM using stochastic maximum likelihood (as described
above) from a random initialization usually results in failure. In some cases, the
model fails to learn to represent the distribution adequately. In other cases, the
DBM may represent the distribution well, but with no higher likelihood than could
be obtained with just an RBM. A DBM with very small weights in all but the first
layer represents approximately the same distribution as an RBM.
Various techniques that permit joint training have been developed and are
described in section 20.4.5. However, the original and most popular method for
overcoming the joint training problem of DBMs is greedy layer-wise pretraining.
In this method, each layer of the DBM is trained in isolation as an RBM. The
first layer is trained to model the input data. Each subsequent RBM is trained
to model samples from the previous RBM’s posterior distribution. After all the
Algorithm 20.1  The variational stochastic maximum likelihood algorithm for
training a DBM with two hidden layers

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain of
p(v, h^{(1)}, h^{(2)}; θ + εΔθ) to burn in, starting from samples from p(v, h^{(1)}, h^{(2)}; θ).
Initialize three matrices, Ṽ, H̃^{(1)}, and H̃^{(2)}, each with m rows set to random
values (e.g., from Bernoulli distributions, possibly with marginals matched to
the model's marginals).
while not converged (learning loop) do
  Sample a minibatch of m examples from the training data and arrange them
  as the rows of a design matrix V.
  Initialize matrices Ĥ^{(1)} and Ĥ^{(2)}, possibly to the model's marginals.
  while not converged (mean field inference loop) do
    Ĥ^{(1)} ← σ( V W^{(1)} + Ĥ^{(2)} W^{(2)⊤} ).
    Ĥ^{(2)} ← σ( Ĥ^{(1)} W^{(2)} ).
  end while
  ΔW^{(1)} ← (1/m) V^⊤ Ĥ^{(1)}
  ΔW^{(2)} ← (1/m) Ĥ^{(1)⊤} Ĥ^{(2)}
  for l = 1 to k (Gibbs sampling) do
    Gibbs block 1:
      ∀i, j, Ṽ_{i,j} sampled from P(Ṽ_{i,j} = 1) = σ( W^{(1)}_{j,:} (H̃^{(1)}_{i,:})^⊤ ).
      ∀i, j, H̃^{(2)}_{i,j} sampled from P(H̃^{(2)}_{i,j} = 1) = σ( H̃^{(1)}_{i,:} W^{(2)}_{:,j} ).
    Gibbs block 2:
      ∀i, j, H̃^{(1)}_{i,j} sampled from P(H̃^{(1)}_{i,j} = 1) = σ( Ṽ_{i,:} W^{(1)}_{:,j} + H̃^{(2)}_{i,:} (W^{(2)}_{j,:})^⊤ ).
  end for
  ΔW^{(1)} ← ΔW^{(1)} − (1/m) Ṽ^⊤ H̃^{(1)}
  ΔW^{(2)} ← ΔW^{(2)} − (1/m) H̃^{(1)⊤} H̃^{(2)}
  W^{(1)} ← W^{(1)} + ε ΔW^{(1)} (this is a cartoon illustration; in practice use a more
  effective algorithm, such as momentum with a decaying learning rate)
  W^{(2)} ← W^{(2)} + ε ΔW^{(2)}
end while
RBMs have been trained in this way, they can be combined to form a DBM. The
DBM may then be trained with PCD. Typically PCD training will make only a
small change in the model’s parameters and in its performance as measured by the
log-likelihood it assigns to the data, or its ability to classify inputs. See figure 20.4
for an illustration of the training procedure.
This greedy layer-wise training procedure is not just coordinate ascent. It bears
some passing resemblance to coordinate ascent because we optimize one subset of
the parameters at each step. The two methods differ because the greedy layer-wise
training procedure uses a different objective function at each step.
Greedy layer-wise pretraining of a DBM differs from greedy layer-wise pre-
training of a DBN. The parameters of each individual RBM may be copied to
the corresponding DBN directly. In the case of the DBM, the RBM parameters
must be modified before inclusion in the DBM. A layer in the middle of the stack
of RBMs is trained with only bottom-up input, but after the stack is combined
to form the DBM, the layer will have both bottom-up and top-down input. To
account for this effect, Salakhutdinov and Hinton (2009a) advocate dividing the
weights of all but the top and bottom RBM in half before inserting them into the
DBM. Additionally, the bottom RBM must be trained using two “copies” of each
visible unit and the weights tied to be equal between the two copies. This means
that the weights are effectively doubled during the upward pass. Similarly, the top
RBM should be trained with two copies of the topmost layer.
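A sketch of the rescaling step, under the assumptions stated above (the bottom and top RBMs have already been trained with their duplicated layers, so only the middle RBM weights need to be halved), might look like this.

def halve_middle_rbm_weights(rbm_weights):
    # rbm_weights = [W1, ..., Wl] from the greedily trained stack of RBMs.
    # Middle RBMs are halved because their layers will receive both bottom-up
    # and top-down input once combined into the DBM; the bottom and top RBMs
    # are assumed to have been trained with duplicated visible/top layers.
    dbm_weights = []
    for k, W in enumerate(rbm_weights):
        if 0 < k < len(rbm_weights) - 1:
            dbm_weights.append(W / 2.0)
        else:
            dbm_weights.append(W.copy())
    return dbm_weights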
Obtaining state-of-the-art results with the deep Boltzmann machine requires
a modification of the standard SML algorithm, which is to use a small amount of
mean field during the negative phase of the joint PCD training step (Salakhutdinov
and Hinton, 2009a). Specifically, the expectation of the energy gradient should
be computed with respect to the mean field distribution in which all the units
are independent from each other. The parameters of this mean field distribution
should be obtained by running the mean field fixed-point equations for just one
step. See Goodfellow et al. (2013b) for a comparison of the performance of centered
DBMs with and without the use of partial mean field in the negative phase.
20.4.5 Jointly Training Deep Boltzmann Machines
Classic DBMs require greedy unsupervised pretraining and, to perform classification
well, require a separate MLP-based classifier on top of the hidden features they
extract. This has some undesirable properties. It is hard to track performance
during training because we cannot evaluate properties of the full DBM while
training the first RBM. Thus, it is hard to tell how well our hyperparameters
are working until quite late in the training process. Software implementations
of DBMs need to have many different components for CD training of individual
RBMs, PCD training of the full DBM, and training based on back-propagation
through the MLP. Finally, the MLP on top of the Boltzmann machine loses many
of the advantages of the Boltzmann machine probabilistic model, such as being
able to perform inference when some input values are missing.
There are two main ways to resolve the joint training problem of the deep
Figure 20.4: The deep Boltzmann machine training procedure used to classify the MNIST
dataset (Salakhutdinov and Hinton, 2009a; Srivastava et al., 2014). (a) Train an RBM by
using CD to approximately maximize log P(v). (b) Train a second RBM that models h^{(1)}
and target class y by using CD-k to approximately maximize log P(h^{(1)}, y), where h^{(1)}
is drawn from the first RBM's posterior conditioned on the data. Increase k from 1 to
20 during learning. (c) Combine the two RBMs into a DBM. Train it to approximately
maximize log P(v, y) using stochastic maximum likelihood with k = 5. (d) Delete y from
the model. Define a new set of features h^{(1)} and h^{(2)} that are obtained by running mean
field inference in the model lacking y. Use these features as input to an MLP whose
structure is the same as an additional pass of mean field, with an additional output layer
for the estimate of y. Initialize the MLP's weights to be the same as the DBM's weights.
Train the MLP to approximately maximize log P(y | v) using stochastic gradient descent
and dropout. Figure reprinted from Goodfellow et al. (2013b).
Boltzmann machine. The first is the centered deep Boltzmann machine
(Montavon and Muller, 2012), which reparametrizes the model in order to make
the Hessian of the cost function better conditioned at the beginning of the learning
process. This yields a model that can be trained without a greedy layer-wise
pretraining stage. The resulting model obtains excellent test set log-likelihood
and produces high-quality samples. Unfortunately, it remains unable to compete
with appropriately regularized MLPs as a classifier. The second way to jointly
train a deep Boltzmann machine is to use a multi-prediction deep Boltzmann
machine (Goodfellow et al., 2013b). This model uses an alternative training
criterion that allows the use of the back-propagation algorithm to avoid the
problems with MCMC estimates of the gradient. Unfortunately, the new criterion
does not lead to good likelihood or samples, but, compared to the MCMC approach,
it does lead to superior classification performance and ability to reason well about
missing inputs.
The centering trick for the Boltzmann machine is easiest to describe if we
return to the general view of a Boltzmann machine as consisting of a set of units x
with a weight matrix U and biases b. Recall from equation 20.2 that the energy
function is given by

E(x) = −x^⊤ U x − b^⊤ x.      (20.36)
Using different sparsity patterns in the weight matrix U, we can implement
structures of Boltzmann machines, such as RBMs or DBMs with different numbers
of layers. This is accomplished by partitioning x into visible and hidden units and
zeroing out elements of U for units that do not interact. The centered Boltzmann
machine introduces a vector μ that is subtracted from all the states:

E′(x; U, b) = −(x − μ)^⊤ U (x − μ) − (x − μ)^⊤ b.      (20.37)
Typically μ is a hyperparameter fixed at the beginning of training. It is usually
chosen to make sure that x − μ ≈ 0 when the model is initialized. This
reparametrization does not change the set of probability distributions that the
model can represent, but it does change the dynamics of stochastic gradient descent
applied to the likelihood. Specifically, in many cases, this reparametrization results
in a Hessian matrix that is better conditioned. Melchior et al. (2013) experimentally
confirmed that the conditioning of the Hessian matrix improves, and observed that
the centering trick is equivalent to another Boltzmann machine learning technique,
the
enhanced gradient
(Cho et al., 2011). The improved conditioning of the
Hessian matrix enables learning to succeed, even in difficult cases like training a
deep Boltzmann machine with multiple layers.
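A direct implementation of the centered energy in equation 20.37 is a one-liner; here mu is treated as a fixed vector, for example the mean of the training data.

def centered_energy(x, U, b, mu):
    # E'(x; U, b) = -(x - mu)^T U (x - mu) - (x - mu)^T b  (equation 20.37)
    # x, U, b, mu are assumed to be NumPy arrays of compatible shapes.
    d = x - mu
    return -d @ U @ d - d @ b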
The other approach to jointly training deep Boltzmann machines is the multi-
prediction deep Boltzmann machine (MP-DBM), which works by viewing the mean
field equations as defining a family of recurrent networks for approximately solving
every possible inference problem (Goodfellow et al., 2013b). Rather than training
the model to maximize the likelihood, the model is trained to make each recurrent
network obtain an accurate answer to the corresponding inference problem. The
training process is illustrated in figure 20.5. It consists of randomly sampling a
training example, randomly sampling a subset of inputs to the inference network,
and then training the inference network to predict the values of the remaining
units.
This general principle of back-propagating through the computational graph
for approximate inference has been applied to other models (Stoyanov et al., 2011;
Brakel et al., 2013). In these models and in the MP-DBM, the final loss is not
the lower bound on the likelihood. Instead, the final loss is typically based on
the approximate conditional distribution that the approximate inference network
imposes over the missing values. This means that the training of these models
is somewhat heuristically motivated. If we inspect the p(v) represented by the
Boltzmann machine learned by the MP-DBM, it tends to be somewhat defective,
in the sense that Gibbs sampling yields poor samples.
Back-propagation through the inference graph has two main advantages. First,
it trains the model as it is really used—with approximate inference. This means
that approximate inference, for example, to fill in missing inputs or to perform
classification despite the presence of missing inputs, is more accurate in the MP-
DBM than in the original DBM. The original DBM does not make an accurate
classifier on its own; the best classification results with the original DBM were
based on training a separate classifier to use features extracted by the DBM,
rather than by using inference in the DBM to compute the distribution over the
class labels. Mean field inference in the MP-DBM performs well as a classifier
without special modifications. The other advantage of back-propagating through
approximate inference is that back-propagation computes the exact gradient of
the loss. This is better for optimization than the approximate gradients of SML
training, which suffer from both bias and variance. This probably explains why MP-
DBMs may be trained jointly while DBMs require a greedy layer-wise pretraining.
The disadvantage of back-propagating through the approximate inference graph is
that it does not provide a way to optimize the log-likelihood, but rather gives a
heuristic approximation of the generalized pseudolikelihood.
The MP-DBM inspired the NADE-
k
(Raiko et al., 2014) extension to the
NADE framework, which is described in section 20.10.10.
The MP-DBM has some connections to dropout. Dropout shares the same pa-
rameters among many different computational graphs, with the difference between
Figure 20.5: An illustration of the multiprediction training process for a deep Boltzmann
machine. Each row indicates a different example within a minibatch for the same training
step. Each column represents a time step within the mean field inference process. For
each example, we sample a subset of the data variables to serve as inputs to the inference
process. These variables are shaded black to indicate conditioning. We then run the
mean field inference process, with arrows indicating which variables influence which other
variables in the process. In practical applications, we unroll mean field for several steps.
In this illustration, we unroll for only two steps. Dashed arrows indicate how the process
could be unrolled for more steps. The data variables that were not used as inputs to the
inference process become targets, shaded in gray. We can view the inference process for
each example as a recurrent network. We use gradient descent and back-propagation to
train these recurrent networks to produce the correct targets given their inputs. This
trains the mean field process for the MP-DBM to produce accurate estimates. Figure
adapted from Goodfellow et al. (2013b).
each graph being whether it includes or excludes each unit. The MP-DBM also
shares parameters across many computational graphs. In the case of the MP-DBM,
the difference between the graphs is whether each input unit is observed or not.
When a unit is not observed, the MP-DBM does not delete it entirely as dropout
does. Instead, the MP-DBM treats it as a latent variable to be inferred. One could
imagine applying dropout to the MP-DBM by additionally removing some units
rather than making them latent.
20.5 Boltzmann Machines for Real-Valued Data
While Boltzmann machines were originally developed for use with binary data,
many applications such as image and audio modeling seem to require the ability
to represent probability distributions over real values. In some cases, it is possible
to treat real-valued data in the interval [0, 1] as representing the expectation of a
binary variable. For example, Hinton (2000) treats grayscale images in the training
set as defining [0,1] probability values. Each pixel defines the probability of a
binary value being 1, and the binary pixels are all sampled independently from each
other. This is a common procedure for evaluating binary models on grayscale image
datasets. Nonetheless, it is not a particularly theoretically satisfying approach,
and binary images sampled independently in this way have a noisy appearance. In
this section, we present Boltzmann machines that define a probability density over
real-valued data.
20.5.1 Gaussian-Bernoulli RBMs
Restricted Boltzmann machines may be developed for many exponential family
conditional distributions (Welling et al., 2005). Of these, the most common is the
RBM with binary hidden units and real-valued visible units, with the conditional
distribution over the visible units being a Gaussian distribution whose mean is a
function of the hidden units.
There are many ways of parametrizing Gaussian-Bernoulli RBMs. One choice
is whether to use a covariance matrix or a precision matrix for the Gaussian
distribution. Here we present the precision formulation. The modification to obtain
the covariance formulation is straightforward. We wish to have the conditional
distribution
p(v | h) = N(v; W h, β
1
). (20.38)
We can find the terms we need to add to the energy function by expanding the
unnormalized log conditional distribution:
log N( v; W h, β^{−1} ) = −(1/2) (v − W h)^⊤ β (v − W h) + f(β).      (20.39)
Here f encapsulates all the terms that are a function only of the parameters
and not the random variables in the model. We can discard f because its only
role is to normalize the distribution, and the partition function of whatever energy
function we choose will carry out that role.
If we include all the terms (with their sign flipped) involving v from equation
20.39 in our energy function and do not add any other terms involving v, then
our energy function will represent the desired conditional p(v | h).
We have some freedom regarding the other conditional distribution, p(h | v).
Note that equation 20.39 contains a term

(1/2) h^⊤ W^⊤ β W h.      (20.40)
This term cannot be included in its entirety because it includes h_i h_j terms. These
correspond to edges between the hidden units. If we included these terms, we
would have a linear factor model instead of a restricted Boltzmann machine. When
designing our Boltzmann machine, we simply omit these h_i h_j cross terms. Omitting
them does not change the conditional p(v | h), so equation 20.39 is still respected.
We still have a choice, however, about whether to include the terms involving only
a single h_i. If we assume a diagonal precision matrix, we find that for each hidden
unit h_i, we have a term

(1/2) h_i ∑_j β_j W^2_{j,i}.      (20.41)
In the above, we used the fact that h_i^2 = h_i because h_i ∈ {0, 1}. If we include this
term (with its sign flipped) in the energy function, then it will naturally bias h_i
to be turned off when the weights for that unit are large and connected to visible
units with high precision. The choice of whether to include this bias term does
not affect the family of distributions that the model can represent (assuming that
we include bias parameters for the hidden units), but it does affect the learning
dynamics of the model. Including the term may help the hidden unit activations
remain reasonable even when the weights rapidly increase in magnitude.
One way to define the energy function on a Gaussian-Bernoulli RBM is thus:

E(v, h) = (1/2) v^⊤ (β ⊙ v) − (v ⊙ β)^⊤ W h − b^⊤ h,      (20.42)
but we may also add extra terms or parametrize the energy in terms of the variance
rather than precision if we choose.
In this derivation, we have not included a bias term on the visible units, but one
could easily be added. One final source of variability in the parametrization of a
Gaussian-Bernoulli RBM is the choice of how to treat the precision matrix. It may
be either fixed to a constant (perhaps estimated based on the marginal precision
of the data) or learned. It may also be a scalar times the identity matrix, or it
may be a diagonal matrix. Typically we do not allow the precision matrix to be
nondiagonal in this context, because some operations on the Gaussian distribution
require inverting the matrix, and a diagonal matrix can be inverted trivially. In
the sections ahead, we will see that other forms of Boltzmann machines permit
modeling the covariance structure, using various techniques to avoid inverting the
precision matrix.
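The energy in equation 20.42 can be written down directly; the sketch below assumes a diagonal precision represented by a vector beta and omits the visible bias, matching the derivation above.

def gaussian_bernoulli_energy(v, h, W, b, beta):
    # E(v, h) = (1/2) v^T (beta * v) - (v * beta)^T W h - b^T h  (equation 20.42),
    # with a diagonal precision given by the vector beta and no visible bias term.
    # v, h, W, b, beta are assumed to be NumPy arrays of compatible shapes.
    return 0.5 * v @ (beta * v) - (v * beta) @ (W @ h) - b @ h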
20.5.2 Undirected Models of Conditional Covariance
While the Gaussian RBM has been the canonical energy model for real-valued
data, Ranzato et al. (2010a) argue that the Gaussian RBM inductive bias is not
well suited to the statistical variations present in some types of real-valued data,
especially natural images. The problem is that much of the information content
present in natural images is embedded in the covariance between pixels rather than
in the raw pixel values. In other words, it is the relationships between pixels and
not their absolute values where most of the useful information in images resides.
Since the Gaussian RBM only models the conditional mean of the input given the
hidden units, it cannot capture conditional covariance information. In response
to these criticisms, alternative models have been proposed that attempt to better
account for the covariance of real-valued data. These models include the mean and
covariance RBM (mcRBM^1), the mean product of Student t-distribution (mPoT)
model, and the spike and slab RBM (ssRBM).
Mean and Covariance RBM
The mcRBM uses its hidden units to independently encode the conditional mean
and covariance of all observed units. The mcRBM hidden layer is divided into two
groups of units: mean units and covariance units. The group that models the
conditional mean is simply a Gaussian RBM. The other half is a covariance RBM
(Ranzato et al., 2010a), also called a cRBM, whose components model the
conditional covariance structure, as described below.
^1 The term “mcRBM” is pronounced by saying the name of the letters M-C-R-B-M; the “mc”
is not pronounced like the “Mc” in “McDonald’s.”
Specifically, with binary mean units h^{(m)} and binary covariance units h^{(c)}, the
mcRBM model is defined as the combination of two energy functions:

E_mc( x, h^{(m)}, h^{(c)} ) = E_m( x, h^{(m)} ) + E_c( x, h^{(c)} ),      (20.43)

where E_m is the standard Gaussian-Bernoulli RBM energy function,^2

E_m( x, h^{(m)} ) = (1/2) x^⊤ x − ∑_j x^⊤ W_{:,j} h^{(m)}_j − ∑_j b^{(m)}_j h^{(m)}_j,      (20.44)
and E_c is the cRBM energy function that models the conditional covariance
information:

E_c( x, h^{(c)} ) = (1/2) ∑_j h^{(c)}_j ( x^⊤ r^{(j)} )^2 − ∑_j b^{(c)}_j h^{(c)}_j.      (20.45)
The parameter r^{(j)} corresponds to the covariance weight vector associated with
h^{(c)}_j, and b^{(c)} is a vector of covariance offsets. The combined energy function defines
a joint distribution,

p_mc( x, h^{(m)}, h^{(c)} ) = (1/Z) exp( −E_mc( x, h^{(m)}, h^{(c)} ) ),      (20.46)

and a corresponding conditional distribution over the observations given h^{(m)} and
h^{(c)} as a multivariate Gaussian distribution:

p_mc( x | h^{(m)}, h^{(c)} ) = N( x; C^{mc}_{x|h} ∑_j W_{:,j} h^{(m)}_j , C^{mc}_{x|h} ).      (20.47)
Note that the covariance matrix $C^{\mathrm{mc}}_{x \mid h} = \left( \sum_j h_j^{(c)} r^{(j)} r^{(j)\top} + I \right)^{-1}$ is nondiagonal and that $W$ is the weight matrix associated with the Gaussian RBM modeling the conditional means. It is difficult to train the mcRBM via contrastive divergence or persistent contrastive divergence because of its nondiagonal conditional covariance structure. CD and PCD require sampling from the joint distribution of $x, h^{(m)}, h^{(c)}$, which, in a standard RBM, is accomplished by Gibbs sampling over the conditionals. However, in the mcRBM, sampling from $p_{\mathrm{mc}}(x \mid h^{(m)}, h^{(c)})$ requires computing $(C^{\mathrm{mc}})^{-1}$ at every iteration of learning. This can be an impractical computational burden for larger observations. Ranzato and Hinton (2010) avoid direct sampling from the conditional $p_{\mathrm{mc}}(x \mid h^{(m)}, h^{(c)})$ by sampling directly from the marginal $p(x)$ using Hamiltonian (hybrid) Monte Carlo (Neal, 1993) on the mcRBM free energy.
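The burden described above is easy to see in a sketch. Using the same assumed parameter names as before, the function below forms the conditional precision $\sum_j h_j^{(c)} r^{(j)} r^{(j)\top} + I$, inverts it, and draws one sample from the Gaussian of equation 20.47; the $d \times d$ inversion is the step that becomes impractical when it must be repeated at every learning iteration:

```python
import numpy as np

def sample_x_given_h(W, R, h_m, h_c, rng):
    """Draw one x ~ p_mc(x | h^(m), h^(c)) as in equation 20.47 (illustrative sketch)."""
    d = W.shape[0]
    # Conditional precision: sum_j h_j^(c) r^(j) r^(j)^T + I.
    precision = (R * h_c) @ R.T + np.eye(d)
    cov = np.linalg.inv(precision)          # O(d^3); the costly, repeated step
    mean = cov @ (W @ h_m)                  # C_{x|h}^{mc} * sum_j W_{:,j} h_j^(m)
    return rng.multivariate_normal(mean, cov)
```

The spike and slab RBM discussed later in this section is designed to sidestep this inversion altogether.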
^2 This version of the Gaussian-Bernoulli RBM energy function assumes the image data have zero mean per pixel. Pixel offsets can easily be added to the model to account for nonzero pixel means.
Mean Product of Student t-distributions
The mean product of Student t-distribution (mPoT) model (Ranzato et al., 2010b) extends the PoT model (Welling et al., 2003a) in a manner similar to how the mcRBM extends the cRBM. This is achieved by including nonzero Gaussian means by the addition of Gaussian RBM-like hidden units. Like the mcRBM, the PoT conditional distribution over the observation is a multivariate Gaussian (with nondiagonal covariance) distribution; however, unlike the mcRBM, the complementary conditional distribution over the hidden variables is given by conditionally independent Gamma distributions. The Gamma distribution $\mathcal{G}(k, \theta)$ is a probability distribution over positive real numbers, with mean $k\theta$. It is not necessary to have a more detailed understanding of the Gamma distribution to understand the basic ideas underlying the mPoT model.
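For readers who want a quick numerical reminder of the mean-$k\theta$ property, NumPy's Gamma sampler is parametrized directly by the shape $k$ and scale $\theta$ (the particular values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k, theta = 3.0, 2.0
samples = rng.gamma(shape=k, scale=theta, size=1_000_000)
print(samples.mean(), k * theta)   # both approximately 6.0
```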
The mPoT energy function is

$$ E_{\mathrm{mPoT}}(x, h^{(m)}, h^{(c)}) \qquad (20.48) $$
$$ = E_{\mathrm{m}}(x, h^{(m)}) + \sum_j \left( h_j^{(c)} \left( 1 + \frac{1}{2} \left( r^{(j)\top} x \right)^2 \right) + (1 - \gamma_j) \log h_j^{(c)} \right), \qquad (20.49) $$

where $r^{(j)}$ is the covariance weight vector associated with unit $h_j^{(c)}$, and $E_{\mathrm{m}}(x, h^{(m)})$ is as defined in equation 20.44.
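A minimal NumPy sketch of this energy function, again under assumed parameter names (W and b_m for the mean term of equation 20.44, the columns of R for the $r^{(j)}$, and gamma for the $\gamma_j$), might look as follows; note that the covariance units $h^{(c)}$ are positive reals here rather than binary:

```python
import numpy as np

def energy_mpot(x, W, b_m, R, gamma, h_m, h_c):
    """mPoT energy of equations 20.48-20.49 (illustrative sketch, assumed names)."""
    e_mean = 0.5 * x @ x - x @ W @ h_m - b_m @ h_m     # E_m(x, h^(m)), equation 20.44
    cov_terms = h_c * (1.0 + 0.5 * (x @ R) ** 2) + (1.0 - gamma) * np.log(h_c)
    return e_mean + cov_terms.sum()
```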
Just as with the mcRBM, the mPoT model energy function specifies a multivariate Gaussian conditional distribution over $x$ with nondiagonal covariance. Learning in the mPoT model, again like the mcRBM, is complicated by the inability to sample from this nondiagonal Gaussian conditional $p_{\mathrm{mPoT}}(x \mid h^{(m)}, h^{(c)})$, so Ranzato et al. (2010b) also advocate direct sampling of $p(x)$ via Hamiltonian (hybrid) Monte Carlo.
Spike and Slab Restricted Boltzmann Machines
Spike and slab restricted Boltzmann machines (Courville et al., 2011), or ssRBMs, provide another means of modeling the covariance structure of real-valued data. Compared to mcRBMs, ssRBMs have the advantage of requiring neither matrix inversion nor Hamiltonian Monte Carlo methods. Like the mcRBM and the mPoT model, the ssRBM’s binary hidden units encode the conditional covariance across pixels through the use of auxiliary real-valued variables.

The spike and slab RBM has two sets of hidden units: binary spike units $h$ and real-valued slab units $s$. The mean of the visible units conditioned on the hidden units is given by $(h \odot s) W^\top$. In other words, each column $W_{:,i}$ defines a component that can appear in the input when $h_i = 1$. The corresponding spike
variable $h_i$ determines whether that component is present at all. The corresponding slab variable $s_i$ determines the intensity of that component, if it is present. When a spike variable is active, the corresponding slab variable adds variance to the input along the axis defined by $W_{:,i}$. This allows us to model the covariance of the inputs. Fortunately, contrastive divergence and persistent contrastive divergence with Gibbs sampling are still applicable. There is no need to invert any matrix.
Formally, the ssRBM model is defined via its energy function:

$$ E_{\mathrm{ss}}(x, s, h) = - \sum_i x^\top W_{:,i} s_i h_i + \frac{1}{2} x^\top \left( \Lambda + \sum_i \Phi_i h_i \right) x \qquad (20.50) $$
$$ \quad + \frac{1}{2} \sum_i \alpha_i s_i^2 - \sum_i \alpha_i \mu_i s_i h_i - \sum_i b_i h_i + \sum_i \alpha_i \mu_i^2 h_i, \qquad (20.51) $$
where $b_i$ is the offset of the spike $h_i$, and $\Lambda$ is a diagonal precision matrix on the observations $x$. The parameter $\alpha_i > 0$ is a scalar precision parameter for the real-valued slab variable $s_i$. The parameter $\Phi_i$ is a nonnegative diagonal matrix that defines an $h$-modulated quadratic penalty on $x$. Each $\mu_i$ is a mean parameter for the slab variable $s_i$.
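The energy function above translates almost term by term into code. The following is an illustrative sketch with assumed parameter names: Lam holds the diagonal of $\Lambda$, and row $i$ of Phi holds the diagonal of $\Phi_i$:

```python
import numpy as np

def energy_ssrbm(x, s, h, W, Lam, Phi, alpha, mu, b):
    """ssRBM energy of equations 20.50-20.51 (illustrative sketch, assumed names)."""
    interaction = -np.sum((x @ W) * s * h)              # -sum_i x^T W_{:,i} s_i h_i
    quad = 0.5 * x @ ((Lam + Phi.T @ h) * x)            # (1/2) x^T (Lambda + sum_i Phi_i h_i) x
    slab = 0.5 * np.sum(alpha * s ** 2) - np.sum(alpha * mu * s * h)
    spike = -np.sum(b * h) + np.sum(alpha * mu ** 2 * h)
    return interaction + quad + slab + spike
```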
With the joint distribution defined via the energy function, deriving the ssRBM conditional distributions is relatively straightforward. For example, by marginalizing out the slab variables $s$, the conditional distribution over the observations given the binary spike variables $h$ is given by

$$ p_{\mathrm{ss}}(x \mid h) = \frac{1}{P(h)} \frac{1}{Z} \int \exp\left\{ -E(x, s, h) \right\} ds \qquad (20.52) $$
$$ = \mathcal{N}\!\left( x;\; C^{\mathrm{ss}}_{x \mid h} \sum_i W_{:,i} \mu_i h_i,\; C^{\mathrm{ss}}_{x \mid h} \right), \qquad (20.53) $$

where $C^{\mathrm{ss}}_{x \mid h} = \left( \Lambda + \sum_i \Phi_i h_i - \sum_i \alpha_i^{-1} h_i W_{:,i} W_{:,i}^\top \right)^{-1}$. The last equality holds only if the covariance matrix $C^{\mathrm{ss}}_{x \mid h}$ is positive definite.
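A sketch of this conditional, using the same assumed parameter names as the energy sketch above, assembles the conditional precision, checks that it is positive definite (the failure mode discussed at the end of this section), and returns the conditional mean and covariance:

```python
import numpy as np

def ssrbm_conditional(h, W, Lam, Phi, alpha, mu):
    """Mean and covariance of p_ss(x | h) as in equation 20.53 (illustrative sketch)."""
    precision = np.diag(Lam + Phi.T @ h) - (W * (h / alpha)) @ W.T
    if np.any(np.linalg.eigvalsh(precision) <= 0):
        return None                         # covariance not positive definite for this h
    cov = np.linalg.inv(precision)
    mean = cov @ (W @ (mu * h))             # C_{x|h}^{ss} * sum_i W_{:,i} mu_i h_i
    return mean, cov
```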
Gating by the spike variables means that the true marginal distribution over $h \odot s$ is sparse. This is different from sparse coding, where samples from the model “almost never” (in the measure theoretic sense) contain zeros in the code, and MAP inference is required to impose sparsity.
Comparing the ssRBM to the mcRBM and the mPoT models, the ssRBM parametrizes the conditional covariance of the observation in a significantly different way. The mcRBM and mPoT both model the covariance structure of the observation as $\left( \sum_j h_j^{(c)} r^{(j)} r^{(j)\top} + I \right)^{-1}$, using the activation of the hidden units $h_j > 0$ to
enforce constraints on the conditional covariance in the direction $r^{(j)}$. In contrast, the ssRBM specifies the conditional covariance of the observations by using the hidden spike activations $h_i = 1$ to pinch the precision matrix along the direction specified by the corresponding weight vector. The ssRBM conditional covariance is similar to that given by a different model: the product of probabilistic principal components analysis (PoPPCA) (Williams and Agakov, 2002). In the overcomplete setting, sparse activations with the ssRBM parametrization permit significant variance (above the nominal variance given by $\Lambda^{-1}$) only in the directions selected by the sparsely activated $h_i$. In the mcRBM or mPoT models, an overcomplete representation would mean that capturing variation in a particular direction of observation space requires removing potentially all constraints with positive projection in that direction. This suggests that these models are less well suited to the overcomplete setting.
The primary disadvantage of the spike and slab restricted Boltzmann machine
is that some settings of the parameters can correspond to a covariance matrix
that is not positive definite. Such a covariance matrix places more unnormalized
probability on values that are farther from the mean, causing the integral over
all possible outcomes to diverge. Generally this issue can be avoided with simple
heuristic tricks. There is not yet any theoretically satisfying solution. Using
constrained optimization to explicitly avoid the regions where the probability is
undefined is difficult to do without being overly conservative and also preventing
the model from accessing high-performing regions of parameter space.
Qualitatively, convolutional variants of the ssRBM produce excellent samples
of natural images. Some examples are shown in figure 16.1.
The ssRBM allows for several extensions. Including higher-order interactions
and average-pooling of the slab variables (Courville et al., 2014) enables the model
to learn excellent features for a classifier when labeled data is scarce. Adding a
term to the energy function that prevents the partition function from becoming
undefined results in a sparse coding model, spike and slab sparse coding (Goodfellow
et al., 2013d), also known as S3C.
20.6 Convolutional Boltzmann Machines
As we discuss in chapter 9, extremely high-dimensional inputs such as images place
great strain on the computation, memory and