CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
much help for training deeper models directly. This is because it is difficult to
obtain samples of the hidden units given samples of the visible units. Since the
hidden units are not included in the data, initializing from training points cannot
solve the problem. Even if we initialize the visible units from the data, we will still
need to burn in a Markov chain sampling from the distribution over the hidden
units conditioned on those visible samples.
The CD algorithm can be thought of as penalizing the model for having a
Markov chain that changes the input rapidly when the input comes from the data.
This means training with CD somewhat resembles autoencoder training. Even
though CD is more biased than some of the other training methods, it can be
useful for pretraining shallow models that will later be stacked. This is because
the earliest models in the stack are encouraged to copy more information up to
their latent variables, thereby making it available to the later models. This should
be thought of more as an often-exploitable side effect of CD training than as
a principled design advantage.
Sutskever and Tieleman (2010) showed that the CD update direction is not the
gradient of any function. This allows for situations where CD could cycle forever,
but in practice this is not a serious problem.
A different strategy that resolves many of the problems with CD is to initialize
the Markov chains at each gradient step with their states from the previous
gradient step. This approach was first discovered under the name stochastic
maximum likelihood (SML) in the applied mathematics and statistics community
(Younes, 1998) and later independently rediscovered under the name persistent
contrastive divergence (PCD, or PCD-k to indicate the use of k Gibbs steps
per update) in the deep learning community (Tieleman, 2008). See algorithm 18.3.
The basic idea of this approach is that, as long as the steps taken by the stochastic
gradient algorithm are small, the model from the previous step will be similar to
the model from the current step. It follows that the samples from the previous
model’s distribution will be very close to being fair samples from the current
model’s distribution, so a Markov chain initialized with these samples will not
require much time to mix.
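The idea can be illustrated with a minimal NumPy sketch for a binary RBM. This is
not a transcription of algorithm 18.3; the function names (gibbs_step, pcd_update),
the parameter names (W, b, c), the learning rate, and the toy data are all
assumptions made for illustration. The point to notice is that the chain states
are passed in and returned, so each update resumes the chains where the previous
update left them instead of restarting them from the data as CD would.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One block Gibbs sweep of a binary RBM: sample h given v, then v given h."""
    p_h = sigmoid(v @ W + c)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b)
    return (rng.random(p_v.shape) < p_v).astype(float)

def pcd_update(data_batch, chains, W, b, c, k, lr, rng):
    """One PCD-k parameter update (hypothetical helper, illustration only).

    chains holds the visible states of the persistent Markov chains. For an
    RBM the hidden states are resampled from the visible states, so only the
    visible states are stored in this sketch.
    """
    # Positive phase: expected hidden activations given the data.
    p_h_data = sigmoid(data_batch @ W + c)

    # Negative phase: continue the persistent chains for k Gibbs steps.
    # CD would instead reset `chains` to `data_batch` here.
    for _ in range(k):
        chains = gibbs_step(chains, W, b, c, rng)
    p_h_model = sigmoid(chains @ W + c)

    # Stochastic estimate of the log-likelihood gradient.
    m, n = data_batch.shape[0], chains.shape[0]
    grad_W = data_batch.T @ p_h_data / m - chains.T @ p_h_model / n
    grad_b = data_batch.mean(axis=0) - chains.mean(axis=0)
    grad_c = p_h_data.mean(axis=0) - p_h_model.mean(axis=0)

    # Small gradient-ascent step, so the model changes little between updates.
    return W + lr * grad_W, b + lr * grad_b, c + lr * grad_c, chains

# Toy usage: random binary "data", 6 visible units, 4 hidden units.
rng = np.random.default_rng(0)
data = (rng.random((32, 6)) < 0.5).astype(float)
W = 0.01 * rng.standard_normal((6, 4))
b, c = np.zeros(6), np.zeros(4)
chains = data[:8].copy()  # initialize the chains once, before training
for step in range(100):
    batch = data[rng.choice(len(data), size=8, replace=False)]
    W, b, c, chains = pcd_update(batch, chains, W, b, c, k=1, lr=0.05, rng=rng)
```

The small learning rate in this sketch reflects the argument above: the chains are
reused across updates on the assumption that each step changes the model only
slightly.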
Because each Markov chain is continually updated throughout the learning
process, rather than restarted at each gradient step, the chains are free to wander
far enough to find all the model’s modes. SML is thus considerably more resistant
to forming models with spurious modes than CD is. Moreover, because it is possible
to store the state of all the sampled variables, whether visible or latent, SML
provides an initialization point for both the hidden and the visible units. CD is
only able to provide an initialization for the visible units, and therefore requires