CHAPTER 17. MONTE CARLO METHODS
deterministic. When the temperature rises to infinity, and β falls to zero, the distribution (for discrete x) becomes uniform.
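The effect of β can be sketched numerically. Below is a minimal example of a tempered distribution p_β(x) ∝ exp(−βE(x)) over a handful of discrete states; the energy values are invented purely for illustration.

```python
import math

def tempered_distribution(energies, beta):
    """Return the probabilities p_beta(x) ∝ exp(-beta * E(x)) over discrete states."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)                  # partition function at this temperature
    return [w / z for w in weights]

# Hypothetical energies for four discrete states.
E = [0.0, 1.0, 2.0, 5.0]

cold = tempered_distribution(E, beta=1.0)   # unit temperature: low-energy states dominate
hot = tempered_distribution(E, beta=0.01)   # high temperature: nearly uniform
```

As β falls toward zero, the weights exp(−βE(x)) all approach 1, so the normalized distribution approaches the uniform distribution, matching the limit described above.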
Typically, a model is trained to be evaluated at β = 1. However, we can make use of other temperatures, particularly those where β < 1. Tempering is a general strategy of mixing rapidly between the modes of p_1 by drawing samples with β < 1.
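As a toy illustration of why β < 1 helps, the sketch below runs the same random-walk Metropolis chain on a hypothetical discrete double-well energy at two temperatures and counts how often the chain switches between the two modes; the landscape and parameters are invented for illustration, not taken from any model discussed here.

```python
import math
import random

# Hypothetical double-well energy over discrete states 0..6:
# modes at states 0 and 6, a barrier of height 8 at state 3.
ENERGY = [0.0, 3.0, 6.0, 8.0, 6.0, 3.0, 0.0]

def count_mode_switches(beta, steps, seed=0):
    """Random-walk Metropolis at inverse temperature beta; count how
    often the chain moves from one mode's basin to the other's."""
    rng = random.Random(seed)
    x, side, switches = 0, "L", 0
    for _ in range(steps):
        y = x + rng.choice([-1, 1])        # propose a neighboring state
        if 0 <= y < len(ENERGY):
            if rng.random() < math.exp(-beta * (ENERGY[y] - ENERGY[x])):
                x = y                      # Metropolis acceptance rule
        if x <= 2 and side == "R":
            side, switches = "L", switches + 1
        elif x >= 4 and side == "L":
            side, switches = "R", switches + 1
    return switches

hot = count_mode_switches(beta=0.2, steps=20000)   # tempered chain
cold = count_mode_switches(beta=1.0, steps=20000)  # unit temperature
```

At β = 0.2 the effective barrier is only 1.6 nats, so the chain crosses between the modes far more often than at β = 1, where the full 8-nat barrier makes crossings rare.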
Markov chains based on tempered transitions (Neal, 1994) temporarily sample from higher-temperature distributions in order to mix to different modes, then resume sampling from the unit temperature distribution. These techniques have been applied to models such as RBMs (Salakhutdinov, 2010). Another approach is to use parallel tempering (Iba, 2001), in which the Markov chain simulates many different states in parallel, at different temperatures. The highest temperature states mix easily, while the lowest temperature states, at temperature 1, provide accurate samples from the model. The transition operator includes stochastically swapping states between two different temperature levels, so that a sufficiently high-probability sample from a high-temperature slot can jump into a lower temperature slot. This approach has also been applied to RBMs (Desjardins et al., 2010; Cho et al., 2010). Although tempering is a promising approach, at this point it has not allowed researchers to make a strong advance in solving the challenge of sampling from complex EBMs. One possible reason is that there are critical temperatures around which the temperature transition must be very slow (as the temperature is gradually reduced) for tempering to be effective.
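Parallel tempering can be sketched in a few lines. The following toy example (with an invented discrete double-well energy, not the RBM setting of the cited papers) runs one Metropolis chain per temperature and stochastically swaps states between adjacent temperature levels using the standard exchange acceptance rule min(1, exp((β_i − β_j)(E(x_i) − E(x_j)))).

```python
import math
import random

# Hypothetical double-well energy over discrete states 0..6:
# modes at states 0 and 6, separated by a barrier at state 3.
ENERGY = [0.0, 3.0, 6.0, 8.0, 6.0, 3.0, 0.0]

def metropolis_step(x, beta, rng):
    """One random-walk Metropolis move at inverse temperature beta."""
    y = x + rng.choice([-1, 1])
    if not 0 <= y < len(ENERGY):
        return x                     # proposal left the state space: reject
    if rng.random() < math.exp(-beta * (ENERGY[y] - ENERGY[x])):
        return y
    return x

def parallel_tempering(betas, steps, seed=0):
    """Simulate one chain per temperature, with stochastic swaps between
    adjacent levels; return the states recorded at beta = betas[0] = 1."""
    rng = random.Random(seed)
    states = [0] * len(betas)        # every chain starts in the left mode
    samples = []
    for _ in range(steps):
        states = [metropolis_step(x, b, rng) for x, b in zip(states, betas)]
        # Propose swapping the states of a random adjacent temperature pair.
        i = rng.randrange(len(betas) - 1)
        log_ratio = (betas[i] - betas[i + 1]) * (
            ENERGY[states[i]] - ENERGY[states[i + 1]])
        if rng.random() < math.exp(min(0.0, log_ratio)):
            states[i], states[i + 1] = states[i + 1], states[i]
        samples.append(states[0])    # record the unit-temperature chain's state
    return samples

samples = parallel_tempering(betas=[1.0, 0.5, 0.2, 0.05], steps=20000)
```

The high-temperature chains cross the barrier easily, and accepted swaps let those crossings propagate down to the β = 1 chain, which on its own would tend to stay stuck in its initial mode.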
17.5.2 Depth May Help Mixing
When drawing samples from a latent variable model p(h, x), we have seen that if p(h | x) encodes x too well, then sampling from p(x | h) will not change x very much, and mixing will be poor. One way to resolve this problem is to make h a deep representation, encoding x into h in such a way that a Markov chain in the space of h can mix more easily. Many representation learning algorithms, such as autoencoders and RBMs, tend to yield a marginal distribution over h that is more uniform and more unimodal than the original data distribution over x. It can be argued that this arises from trying to minimize reconstruction error while using all the available representation space, because minimizing reconstruction error over the training examples will be better achieved when different training examples are easily distinguishable from each other in h-space, and thus well separated. Bengio et al. (2013a) observed that deeper stacks of regularized autoencoders or RBMs yield marginal distributions in the top-level h-space that appeared more spread out and more uniform, with less of a gap between the regions corresponding to different modes (categories, in the experiments). Training an RBM in that higher-level