
field equations as defining a family of recurrent networks for approximately solving
every possible inference problem (Goodfellow et al., 2013b). Rather than training
the model to maximize the likelihood, the model is trained to make each recurrent
network obtain an accurate answer to the corresponding inference problem. The
training process is illustrated in figure 20.5. It consists of randomly sampling a
training example, randomly sampling a subset of inputs to the inference network,
and then training the inference network to predict the values of the remaining
units.
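To make the procedure concrete, here is a minimal sketch in JAX of this objective for a single-layer binary model (the actual MP-DBM is a deep model; the parameter names W, b, c, the mask convention, and the fixed number of mean field steps are illustrative assumptions, not the original specification). A random binary mask marks which inputs the inference network observes, the mean field fixed-point updates are unrolled for a fixed number of steps, and the loss scores only the held-out units:

import jax
import jax.numpy as jnp

def mean_field_predict(params, v, mask, n_steps=10):
    # v: binary input vector; mask: 1 = observed, 0 = to be predicted.
    W, b, c = params["W"], params["b"], params["c"]
    v_hat = mask * v + (1.0 - mask) * 0.5          # initialize missing units at 0.5
    for _ in range(n_steps):                       # unrolled mean field iterations
        h_hat = jax.nn.sigmoid(v_hat @ W + c)      # hidden-unit marginals
        v_new = jax.nn.sigmoid(h_hat @ W.T + b)    # re-estimate the visible units
        v_hat = mask * v + (1.0 - mask) * v_new    # keep observed units clamped
    return v_hat

def multi_prediction_loss(params, v, mask):
    # Cross-entropy the inference network assigns to the held-out units only.
    v_hat = mean_field_predict(params, v, mask)
    eps = 1e-7
    ce = -(v * jnp.log(v_hat + eps) + (1.0 - v) * jnp.log(1.0 - v_hat + eps))
    return jnp.sum((1.0 - mask) * ce)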
This general principle of back-propagating through the computational graph
for approximate inference has been applied to other models (Stoyanov et al., 2011;
Brakel et al., 2013). In these models and in the MP-DBM, the final loss is not
the lower bound on the likelihood. Instead, the final loss is typically based on
the approximate conditional distribution that the approximate inference network
imposes over the missing values. This means that the training of these models
is somewhat heuristically motivated. If we inspect the p(v) represented by the
Boltzmann machine learned by the MP-DBM, it tends to be somewhat defective,
in the sense that Gibbs sampling yields poor samples.
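In schematic notation (S and q are introduced here for illustration, not taken from the original text), with S the randomly sampled set of observed indices and q the distribution produced by the unrolled inference network, the per-example loss is the negative log-probability assigned to the held-out units,

J(\boldsymbol{v}) = -\sum_{i \notin S} \log q(v_i \mid \boldsymbol{v}_S),

which resembles a generalized pseudolikelihood term rather than the log-likelihood log p(v).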
Back-propagation through the inference graph has two main advantages. First,
it trains the model as it is really used—with approximate inference. This means
that approximate inference, for example to fill in missing inputs or to perform
classification despite missing inputs, is more accurate in the MP-
DBM than in the original DBM. The original DBM does not make an accurate
classifier on its own; the best classification results with the original DBM were
based on training a separate classifier to use features extracted by the DBM,
rather than on using inference in the DBM to compute the distribution over the
class labels. Mean field inference in the MP-DBM performs well as a classifier
without special modifications. The other advantage of back-propagating through
approximate inference is that back-propagation computes the exact gradient of
the loss. This is better for optimization than the approximate gradients of SML
training, which suffer from both bias and variance. This probably explains why
MP-DBMs may be trained jointly, while DBMs require greedy layer-wise pretraining.
The disadvantage of back-propagating through the approximate inference graph is
that it does not provide a way to optimize the log-likelihood, but rather gives a
heuristic approximation of the generalized pseudolikelihood.
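Continuing the earlier JAX sketch, the exact gradient is obtained simply by differentiating through the unrolled inference loop with automatic differentiation (the mask probability of 0.5 and the learning rate below are arbitrary illustrative choices):

# jax.grad differentiates through the unrolled mean field loop, yielding the
# exact gradient of the loss above (unlike the biased, high-variance
# stochastic estimates used by SML).
grad_fn = jax.grad(multi_prediction_loss)

def train_step(params, v, key, lr=0.01):
    # Randomly choose the subset of inputs the inference network observes.
    mask = jax.random.bernoulli(key, p=0.5, shape=v.shape).astype(jnp.float32)
    grads = grad_fn(params, v, mask)
    # Plain gradient descent step on all parameters.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)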
The MP-DBM inspired the NADE-k (Raiko et al., 2014) extension to the
NADE framework, which is described in section 20.10.10.
The MP-DBM has some connections to dropout. Dropout shares the same pa-
rameters among many different computational graphs, with the difference between