CHAPTER 19. APPROXIMATE INFERENCE
with respect to θ using your optimization algorithm of choice.
This can be viewed as a coordinate ascent algorithm to maximize
L
. On one
step, we maximize
L
with respect to
q
, and on the other, we maximize
L
with
respect to θ.
Stochastic gradient ascent on latent variable models can be seen as a special
case of the EM algorithm where the M step consists of taking a single gradient
step. Other variants of the EM algorithm can make much larger steps. For some
model families, the M step can even be performed analytically, jumping all the
way to the optimal solution for θ given the current q.
Even though the E-step involves exact inference, we can think of the EM
algorithm as using approximate inference in some sense. Speciﬁcally, the M-step
assumes that the same value of
q
can be used for all values of
θ
. This will introduce
a gap between
L
and the true
log p
(
v
) as the M-step moves further and further
away from the value
θ
(0)
used in the E-step. Fortunately, the E-step reduces the
gap to zero again as we enter the loop for the next time.
The EM algorithm contains a few diﬀerent insights. First, there is the basic
structure of the learning process, in which we update the model parameters to
improve the likelihood of a completed dataset, where all missing variables have
their values provided by an estimate of the posterior distribution. This particular
insight is not unique to the EM algorithm. For example, using gradient descent to
maximize the log-likelihood also has this same property; the log-likelihood gradient
computations require taking expectations with respect to the posterior distribution
over the hidden units. Another key insight in the EM algorithm is that we can
continue to use one value of
q
even after we have moved to a diﬀerent value of
θ
.
This particular insight is used throughout classical machine learning to derive large
M-step updates. In the context of deep learning, most models are too complex
to admit a tractable solution for an optimal large M-step update, so this second
insight which is more unique to the EM algorithm is rarely used.
19.3 MAP Inference and Sparse Coding
We usually use the term inference to refer to computing the probability distribution
over one set of variables given another. When training probabilistic models with
latent variables, we are usually interested in computing
p
(
h | v
). An alternative
form of inference is to compute the single most likely value of the missing variables,
rather than to infer the entire distribution over their possible values. In the context
635