Chapter 19
Approximate Inference
Many probabilistic models are difficult to train because it is difficult to perform inference in them. In the context of deep learning, we usually have a set of visible variables v and a set of latent variables h. The challenge of inference usually refers to the difficult problem of computing p(h | v) or taking expectations with respect to it. Such operations are often necessary for tasks like maximum likelihood learning.
Many simple graphical models with only one hidden layer, such as restricted
Boltzmann machines and probabilistic PCA, are defined in a way that makes
inference operations like computing p(h | v), or taking expectations with respect to it, simple. Unfortunately, most graphical models with multiple layers of hidden
variables have intractable posterior distributions. Exact inference requires an
exponential amount of time in these models. Even some models with only a single
layer, such as sparse coding, have this problem.
In this chapter, we introduce several of the techniques for confronting these
intractable inference problems. In chapter 20, we describe how to use these
techniques to train probabilistic models that would otherwise be intractable, such
as deep belief networks and deep Boltzmann machines.
Intractable inference problems in deep learning usually arise from interactions
between latent variables in a structured graphical model. See figure 19.1 for some
examples. These interactions may be due to direct interactions in undirected
models or “explaining away” interactions between mutual ancestors of the same
visible unit in directed models.
Figure 19.1: Intractable inference problems in deep learning are usually the result of interactions between latent variables in a structured graphical model. These interactions can be due to edges directly connecting one latent variable to another or longer paths that are activated when the child of a V-structure is observed. (Left) A semi-restricted Boltzmann machine (Osindero and Hinton, 2008) with connections between hidden units. These direct connections between latent variables make the posterior distribution intractable because of large cliques of latent variables. (Center) A deep Boltzmann machine, organized into layers of variables without intralayer connections, still has an intractable posterior distribution because of the connections between layers. (Right) This directed model has interactions between latent variables when the visible variables are observed, because every two latent variables are coparents. Some probabilistic models are able to provide tractable inference over the latent variables despite having one of the graph structures depicted above. This is possible if the conditional probability distributions are chosen to introduce additional independences beyond those described by the graph. For example, probabilistic PCA has the graph structure shown on the right yet still has simple inference because of special properties of the specific conditional distributions it uses (linear-Gaussian conditionals with mutually orthogonal basis vectors).
19.1 Inference as Optimization
Many approaches to confronting the problem of difficult inference make use of
the observation that exact inference can be described as an optimization problem.
Approximate inference algorithms may then be derived by approximating the
underlying optimization problem.
To construct the optimization problem, assume we have a probabilistic model consisting of observed variables v and latent variables h. We would like to compute the log-probability of the observed data, log p(v; θ). Sometimes it is too difficult to compute log p(v; θ) if it is costly to marginalize out h. Instead, we can compute a lower bound L(v, θ, q) on log p(v; θ). This bound is called the evidence lower bound (ELBO). Another commonly used name for this lower bound is the negative variational free energy. Specifically, the evidence lower bound is defined to be

\[
L(v, \theta, q) = \log p(v; \theta) - D_{\mathrm{KL}}\big(q(h \mid v) \,\Vert\, p(h \mid v; \theta)\big), \tag{19.1}
\]
where q is an arbitrary probability distribution over h.
Because the difference between log p(v) and L(v, θ, q) is given by the KL divergence, and because the KL divergence is always nonnegative, we can see that L always has at most the same value as the desired log-probability. The two are equal if and only if q is the same distribution as p(h | v).
Surprisingly, L can be considerably easier to compute for some distributions q. Simple algebra shows that we can rearrange L into a much more convenient form:

\begin{align*}
L(v, \theta, q) &= \log p(v; \theta) - D_{\mathrm{KL}}\big(q(h \mid v) \,\Vert\, p(h \mid v; \theta)\big) \tag{19.2} \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q} \log \frac{q(h \mid v)}{p(h \mid v)} \tag{19.3} \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q} \log \frac{q(h \mid v)}{\dfrac{p(h, v; \theta)}{p(v; \theta)}} \tag{19.4} \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q} \big[ \log q(h \mid v) - \log p(h, v; \theta) + \log p(v; \theta) \big] \tag{19.5} \\
&= -\mathbb{E}_{h \sim q} \big[ \log q(h \mid v) - \log p(h, v; \theta) \big]. \tag{19.6}
\end{align*}
This yields the more canonical definition of the evidence lower bound,

\[
L(v, \theta, q) = \mathbb{E}_{h \sim q}\big[\log p(h, v)\big] + H(q). \tag{19.7}
\]
For an appropriate choice of q, L is tractable to compute. For any choice of q, L provides a lower bound on the likelihood. For q(h | v) that are better
approximations of p(h | v), the lower bound L will be tighter, in other words, closer to log p(v). When q(h | v) = p(h | v), the approximation is perfect, and L(v, θ, q) = log p(v; θ).
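To make this concrete, the following minimal sketch (hypothetical NumPy code, not from the text) enumerates a model with a single binary latent variable, evaluates the ELBO of equation 19.7 for several choices of q, and checks that the bound never exceeds log p(v) and becomes tight when q equals the true posterior p(h | v).

```python
import numpy as np

# A tiny discrete model: h in {0, 1}, v in {0, 1}.
# The joint distribution p(h, v) is given by a 2x2 table (rows: h, columns: v).
p_joint = np.array([[0.30, 0.20],
                    [0.10, 0.40]])

v = 1                                    # observed value of v
p_v = p_joint[:, v].sum()                # marginal p(v), obtained by summing out h
p_h_given_v = p_joint[:, v] / p_v        # exact posterior p(h | v)

def elbo(q):
    """L(v, theta, q) = E_{h~q}[log p(h, v)] + H(q), as in equation 19.7."""
    q = np.asarray(q, dtype=float)
    expected_log_joint = np.sum(q * np.log(p_joint[:, v]))
    entropy = -np.sum(q * np.log(q + 1e-12))     # small constant guards against log(0)
    return expected_log_joint + entropy

# Any distribution q gives a lower bound on log p(v) ...
for q in ([0.5, 0.5], [0.9, 0.1], p_h_given_v):
    assert elbo(q) <= np.log(p_v) + 1e-9

# ... and the bound is tight exactly when q is the true posterior.
print(np.log(p_v), elbo(p_h_given_v))    # the two values agree
```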
We can thus think of inference as the procedure for finding the q that maximizes L. Exact inference maximizes L perfectly by searching over a family of functions q that includes p(h | v). Throughout this chapter, we show how to derive different forms of approximate inference by using approximate optimization to find q. We can make the optimization procedure less expensive but approximate by restricting the family of distributions q that the optimization is allowed to search over or by using an imperfect optimization procedure that may not completely maximize L but may merely increase it by a significant amount.
No matter what choice of q we use, L is a lower bound. We can get tighter or looser bounds that are cheaper or more expensive to compute depending on how we choose to approach this optimization problem. We can obtain a poorly matched q but reduce the computational cost by using an imperfect optimization procedure, or by using a perfect optimization procedure over a restricted family of q distributions.
19.2 Expectation Maximization
The first algorithm we introduce based on maximizing a lower bound L is the expectation maximization (EM) algorithm, a popular training algorithm for models with latent variables. We describe here a view on the EM algorithm developed by Neal and Hinton (1999). Unlike most of the other algorithms we describe in this chapter, EM is not an approach to approximate inference, but rather an approach to learning with an approximate posterior.
The EM algorithm consists of alternating between two steps until convergence:
The E-step (expectation step): Let θ^(0) denote the value of the parameters at the beginning of the step. Set q(h^(i) | v) = p(h^(i) | v^(i); θ^(0)) for all indices i of the training examples v^(i) we want to train on (both batch and minibatch variants are valid). By this we mean q is defined in terms of the current parameter value of θ^(0); if we vary θ, then p(h | v; θ) will change, but q(h | v) will remain equal to p(h | v; θ^(0)).
The M-step (maximization step): Completely or partially maximize

\[
\sum_i L\big(v^{(i)}, \theta, q\big) \tag{19.8}
\]
with respect to θ using your optimization algorithm of choice.
This can be viewed as a coordinate ascent algorithm to maximize L. On one step, we maximize L with respect to q, and on the other, we maximize L with respect to θ.
Stochastic gradient ascent on latent variable models can be seen as a special
case of the EM algorithm where the M-step consists of taking a single gradient
step. Other variants of the EM algorithm can make much larger steps. For some
model families, the M-step can even be performed analytically, jumping all the
way to the optimal solution for θ given the current q.
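For concreteness, here is a minimal sketch (hypothetical NumPy code, not from the text) of EM for a mixture of two unit-variance Gaussians, a model family whose M-step has a closed form. The E-step sets q to the exact posterior over the mixture component for each example, and the M-step maximizes Σ_i L(v^(i), θ, q) analytically with respect to the means and mixing proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic one-dimensional data drawn from two unit-variance Gaussians.
v = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])

# Parameters theta = (component means mu, mixing proportions weights); variances fixed at 1.
mu = np.array([-1.0, 1.0])
weights = np.array([0.5, 0.5])

for step in range(50):
    # E-step: q(h = k | v_i) is the exact posterior responsibility under the current theta.
    log_joint = (np.log(weights)[None, :]
                 - 0.5 * (v[:, None] - mu[None, :]) ** 2
                 - 0.5 * np.log(2.0 * np.pi))
    log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    r = np.exp(log_post)                       # responsibilities, shape (n, 2)

    # M-step: closed-form maximizer of sum_i L(v_i, theta, q) with q held fixed.
    n_k = r.sum(axis=0)
    mu = (r * v[:, None]).sum(axis=0) / n_k    # responsibility-weighted means
    weights = n_k / len(v)                     # updated mixing proportions

print(mu, weights)   # the means approach roughly -2 and 3
```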
Even though the E-step involves exact inference, we can think of the EM algorithm as using approximate inference in some sense. Specifically, the M-step assumes that the same value of q can be used for all values of θ. This will introduce a gap between L and the true log p(v) as the M-step moves further and further away from the value θ^(0) used in the E-step. Fortunately, the E-step reduces the gap to zero again as we enter the loop for the next time.
The EM algorithm contains a few different insights. First, there is the basic
structure of the learning process, in which we update the model parameters to
improve the likelihood of a completed dataset, where all missing variables have
their values provided by an estimate of the posterior distribution. This particular
insight is not unique to the EM algorithm. For example, using gradient descent to
maximize the log-likelihood also has this same property; the log-likelihood gradient
computations require taking expectations with respect to the posterior distribution
over the hidden units. Another key insight in the EM algorithm is that we can continue to use one value of q even after we have moved to a different value of θ. This particular insight is used throughout classical machine learning to derive large M-step updates. In the context of deep learning, most models are too complex to admit a tractable solution for an optimal large M-step update, so this second insight, which is more specific to the EM algorithm, is rarely used.
19.3 MAP Inference and Sparse Coding
We usually use the term inference to refer to computing the probability distribution over one set of variables given another. When training probabilistic models with latent variables, we are usually interested in computing p(h | v). An alternative form of inference is to compute the single most likely value of the missing variables, rather than to infer the entire distribution over their possible values. In the context
of latent variable models, this means computing

\[
h^* = \operatorname*{arg\,max}_h \, p(h \mid v). \tag{19.9}
\]
This is known as maximum a posteriori inference, abbreviated as MAP inference.

MAP inference is usually not thought of as approximate inference; it does compute the exact most likely value of h. However, if we wish to develop a learning process based on maximizing L(v, θ, q), then it is helpful to think of MAP inference as a procedure that provides a value of q. In this sense, we can think of MAP inference as approximate inference, because it does not provide the optimal q.
Recall from section 19.1 that exact inference consists of maximizing

\[
L(v, \theta, q) = \mathbb{E}_{h \sim q}\big[\log p(h, v)\big] + H(q) \tag{19.10}
\]

with respect to q over an unrestricted family of probability distributions, using an exact optimization algorithm. We can derive MAP inference as a form of approximate inference by restricting the family of distributions q may be drawn from. Specifically, we require q to take on a Dirac distribution:

\[
q(h \mid v) = \delta(h - \mu). \tag{19.11}
\]
This means that we can now control q entirely via µ. Dropping terms of L that do not vary with µ, we are left with the optimization problem

\[
\mu^* = \operatorname*{arg\,max}_\mu \, \log p(h = \mu, v), \tag{19.12}
\]

which is equivalent to the MAP inference problem

\[
h^* = \operatorname*{arg\,max}_h \, p(h \mid v). \tag{19.13}
\]
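The equivalence of equations 19.12 and 19.13 follows because p(h | v) = p(h, v)/p(v) for the observed v, and the constant log p(v) does not affect the arg max over h. The short check below (a hypothetical NumPy snippet, not from the text) verifies this on a small discrete model by enumerating every latent configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
# A random joint distribution over 8 latent configurations and 4 visible configurations.
p_joint = rng.random((8, 4))
p_joint /= p_joint.sum()

v = 2                                                 # observed visible configuration
p_h_given_v = p_joint[:, v] / p_joint[:, v].sum()     # exact posterior p(h | v)

map_from_joint = np.argmax(np.log(p_joint[:, v]))     # arg max_mu log p(h = mu, v), eq. 19.12
map_from_posterior = np.argmax(p_h_given_v)           # arg max_h p(h | v), eq. 19.13
assert map_from_joint == map_from_posterior
```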
We can thus justify a learning procedure similar to EM, in which we alternate between performing MAP inference to infer h^* and then updating θ to increase log p(h^*, v). As with EM, this is a form of coordinate ascent on L, where we alternate between using inference to optimize L with respect to q and using parameter updates to optimize L with respect to θ. The procedure as a whole can be justified by the fact that L is a lower bound on log p(v). In the case of MAP inference, this justification is rather vacuous, because the bound is infinitely loose, due to the Dirac distribution's differential entropy of negative infinity. Adding noise to µ would make the bound meaningful again.
MAP inference is commonly used in deep learning as both a feature extractor
and a learning mechanism. It is primarily used for sparse coding models.
Recall from section 13.4 that sparse coding is a linear factor model that imposes a sparsity-inducing prior on its hidden units. A common choice is a factorial Laplace prior, with

\[
p(h_i) = \frac{\lambda}{2} e^{-\lambda |h_i|}. \tag{19.14}
\]

The visible units are then generated by performing a linear transformation and adding noise:

\[
p(v \mid h) = \mathcal{N}\big(v; W h + b, \beta^{-1} I\big). \tag{19.15}
\]
Computing or even representing p(h | v) is difficult. Every pair of variables h_i and h_j are both parents of v. This means that when v is observed, the graphical model contains an active path connecting h_i and h_j. All the hidden units thus participate in one massive clique in p(h | v). If the model were Gaussian, then these interactions could be modeled efficiently via the covariance matrix, but the sparse prior makes these interactions non-Gaussian.
Because p(h | v) is intractable, so is the computation of the log-likelihood and its gradient. We thus cannot use exact maximum likelihood learning. Instead, we use MAP inference and learn the parameters by maximizing the ELBO defined by the Dirac distribution around the MAP estimate of h.
If we concatenate all the h vectors in the training set into a matrix H, and concatenate all the v vectors into a matrix V, then the sparse coding learning process consists of minimizing

\[
J(H, W) = \sum_{i,j} \big|H_{i,j}\big| + \sum_{i,j} \Big(V - H W^\top\Big)^2_{i,j}. \tag{19.16}
\]
Most applications of sparse coding also involve weight decay or a constraint on the norms of the columns of W, to prevent the pathological solution with extremely small H and large W.
We can minimize J by alternating between minimization with respect to H and minimization with respect to W. Both subproblems are convex. In fact, the minimization with respect to W is just a linear regression problem. Minimization of J with respect to both arguments, however, is usually not a convex problem. Minimization with respect to H requires specialized algorithms such as the feature-sign search algorithm (Lee et al., 2007).
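A rough sketch of this alternating scheme appears below (hypothetical NumPy code, not from the text). For simplicity the H-step uses a few iterations of ISTA, a proximal gradient method, in place of feature-sign search, and the W-step solves a ridge-regularized least squares problem followed by column normalization, which plays the role of the norm constraint mentioned above.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (elementwise soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_coding(V, k, n_outer=30, n_ista=50, lam=1.0, ridge=1e-3, seed=0):
    """Minimize J(H, W) = lam * sum_ij |H_ij| + sum_ij (V - H W^T)^2_ij by alternating minimization."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    W = rng.normal(size=(d, k))
    W /= np.linalg.norm(W, axis=0, keepdims=True)      # unit-norm columns
    H = np.zeros((n, k))

    for _ in range(n_outer):
        # H-step (MAP inference): ISTA on the convex subproblem in H with W fixed.
        step = 1.0 / (2.0 * np.linalg.norm(W.T @ W, 2) + 1e-8)
        for _ in range(n_ista):
            grad = -2.0 * (V - H @ W.T) @ W            # gradient of the squared-error term
            H = soft_threshold(H - step * grad, step * lam)

        # W-step: ridge-regularized least squares, then renormalize the columns of W.
        W = np.linalg.solve(H.T @ H + ridge * np.eye(k), H.T @ V).T
        W /= np.linalg.norm(W, axis=0, keepdims=True) + 1e-8
    return H, W

# Toy usage: learn a sparse code for random data.
V = np.random.default_rng(1).normal(size=(100, 20))
H, W = sparse_coding(V, k=30)
print(np.mean(H != 0))    # fraction of nonzero code entries
```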
19.4 Variational Inference and Learning
We have seen how the evidence lower bound L(v, θ, q) is a lower bound on log p(v; θ), how inference can be viewed as maximizing L with respect to q, and how learning can be viewed as maximizing L with respect to θ. We have seen that the EM algorithm enables us to make large learning steps with a fixed q and that learning algorithms based on MAP inference enable us to learn using a point estimate of p(h | v) rather than inferring the entire distribution. Now we develop the more general approach to variational learning.
The core idea behind variational learning is that we can maximize L over a restricted family of distributions q. This family should be chosen so that it is easy to compute E_q log p(h, v). A typical way to do this is to introduce assumptions about how q factorizes.
A common approach to variational learning is to impose the restriction that q is a factorial distribution:

\[
q(h \mid v) = \prod_i q(h_i \mid v). \tag{19.17}
\]
This is called the mean field approach. More generally, we can impose any graphical model structure we choose on q, to flexibly determine how many interactions we want our approximation to capture. This fully general graphical model approach is called structured variational inference (Saul and Jordan, 1996).
The beauty of the variational approach is that we do not need to specify a specific parametric form for q. We specify how it should factorize, but then the optimization problem determines the optimal probability distribution within those factorization constraints. For discrete latent variables, this just means that we use traditional optimization techniques to optimize a finite number of variables describing the q distribution. For continuous latent variables, this means that we use a branch of mathematics called calculus of variations to perform optimization over a space of functions and actually determine which function should be used to represent q. Calculus of variations is the origin of the names "variational learning" and "variational inference," though these names apply even when the latent variables are discrete and calculus of variations is not needed. With continuous latent variables, calculus of variations is a powerful technique that removes much of the responsibility from the human designer of the model, who now must specify only how q factorizes, rather than needing to guess how to design a specific q that can accurately approximate the posterior.
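As a minimal illustration of the mean field restriction in equation 19.17, the sketch below (hypothetical NumPy code, not from the text) fits a factorial Bernoulli q(h | v) to the posterior of a model with three binary latent variables, using the classic coordinate ascent update in which each factor q(h_i) is set proportional to exp(E_{q(h_-i)}[log p(h, v)]). Exact enumeration of the 2^3 configurations is used only because the example is tiny; in realistic models these expectations are computed from the model's structure or estimated stochastically.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# A tiny model with three binary latent variables and a fixed observed v.
# log_joint[h1, h2, h3] stores log p(h, v) for that v; the eight joint
# probabilities sum to p(v) < 1.
p_hv = rng.random((2, 2, 2))
p_hv *= 0.3 / p_hv.sum()
log_joint = np.log(p_hv)
log_p_v = np.log(p_hv.sum())

# Mean field approximation q(h | v) = prod_i q(h_i | v), parameterized by q[i] = q(h_i = 1 | v).
q = np.full(3, 0.5)

def elbo():
    """L = E_{h~q}[log p(h, v)] + H(q), evaluated by enumerating all configurations."""
    expected = sum(np.prod(np.where(np.array(h) == 1, q, 1.0 - q)) * log_joint[h]
                   for h in product((0, 1), repeat=3))
    entropy = -np.sum(q * np.log(q) + (1 - q) * np.log(1 - q))
    return expected + entropy

# Coordinate ascent on the ELBO: holding the other factors fixed, the optimal
# update sets q(h_i) proportional to exp(E_{q(h_-i)}[log p(h, v)]).
for sweep in range(20):
    for i in range(3):
        expected = np.zeros(2)               # expected log joint for h_i = 0 and h_i = 1
        for h in product((0, 1), repeat=3):
            w = np.prod([q[j] if h[j] == 1 else 1.0 - q[j] for j in range(3) if j != i])
            expected[h[i]] += w * log_joint[h]
        q[i] = np.exp(expected[1]) / (np.exp(expected[0]) + np.exp(expected[1]))

print(elbo(), "<=", log_p_v)   # the bound holds; it is tight only if p(h | v) happens to factorize
```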
Because L(v, θ, q) is defined to be log p(v; θ) − D_KL(q(h | v) ‖ p(h | v; θ)), we can think of maximizing L with respect to q as minimizing D_KL(q(h | v) ‖ p(h | v)).
In this sense, we are fitting q to p. However, we are doing so with the opposite direction of the KL divergence than we are used to using for fitting an approximation.
When we use maximum likelihood learning to fit a model to data, we minimize