Chapter 13
Linear Factor Models
Many of the research frontiers in deep learning involve building a probabilistic
model of the input, p_model(x). Such a model can, in principle, use probabilistic
inference to predict any of the variables in its environment given any of
the other variables. Many of these models also have latent variables h, with
p_model(x) = E_h p_model(x | h). These latent variables provide another means of
representing the data. Distributed representations based on latent variables can
obtain all the advantages of representation learning that we have seen with deep
feedforward and recurrent networks.
In this chapter, we describe some of the simplest probabilistic models with
latent variables: linear factor models. These models are sometimes used as building
blocks of mixture models (Hinton et al., 1995a; Ghahramani and Hinton, 1996;
Roweis et al., 2002) or of larger, deep probabilistic models (Tang et al., 2012).
They also show many of the basic approaches necessary to building generative
models that the more advanced deep models will extend further.
A linear factor model is defined by the use of a stochastic linear decoder function
that generates x by adding noise to a linear transformation of h.
These models are interesting because they allow us to discover explanatory
factors that have a simple joint distribution. The simplicity of using a linear decoder
made these models some of the first latent variable models to be extensively studied.
A linear factor model describes the data-generation process as follows. First,
we sample the explanatory factors h from a distribution
h ∼ p(h),   (13.1)

where p(h) is a factorial distribution, with p(h) = ∏_i p(h_i), so that it is easy
to sample from. Next we sample the real-valued observable variables given the
factors:

x = Wh + b + noise,   (13.2)

where the noise is typically Gaussian and diagonal (independent across dimensions).
This is illustrated in figure 13.1.

Figure 13.1: The directed graphical model describing the linear factor model family, in
which we assume that an observed data vector x is obtained by a linear combination of
independent latent factors h, plus some noise. Different models, such as probabilistic
PCA, factor analysis or ICA, make different choices about the form of the noise and of
the prior p(h).
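To make the generative process concrete, the following is a minimal NumPy sketch of
ancestral sampling from a linear factor model with a factorial standard Gaussian prior
and diagonal Gaussian noise. The dimensions, seed and parameter values are illustrative
assumptions, not values taken from the text.

import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 5                          # latent and observed dimensionalities (illustrative)
W = rng.standard_normal((D, d))      # factor loading matrix
b = rng.standard_normal(D)           # offset vector
sigma = 0.1 * np.ones(D)             # per-dimension noise standard deviations

def sample_linear_factor_model(n):
    # Sample the factors from the factorial prior p(h), here a standard Gaussian.
    h = rng.standard_normal((n, d))
    # Generate observations with x = W h + b + noise (equation 13.2).
    noise = sigma * rng.standard_normal((n, D))
    return h @ W.T + b + noise, h

x, h = sample_linear_factor_model(1000)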
13.1 Probabilistic PCA and Factor Analysis
Probabilistic PCA (principal components analysis), factor analysis and other linear
factor models are special cases of the above equations (13.1 and 13.2) and only
differ in the choices made for the noise distribution and the model’s prior over
latent variables h before observing x.
In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable
prior is just the unit variance Gaussian

h ∼ N(h; 0, I),   (13.3)

while the observed variables x_i are assumed to be conditionally independent,
given h. Specifically, the noise is assumed to be drawn from a diagonal covariance
Gaussian distribution, with covariance matrix ψ = diag(σ²), with
σ² = [σ²_1, σ²_2, . . . , σ²_n]^⊤ a vector of per-variable variances.
The role of the latent variables is thus to capture the dependencies between
the different observed variables x_i. Indeed, it can easily be shown that x is just a
multivariate normal random variable, with

x ∼ N(x; b, WW^⊤ + ψ).   (13.4)
To cast PCA in a probabilistic framework, we can make a slight modification
to the factor analysis model, making the conditional variances σ²_i equal to each
other. In that case the covariance of x is just WW^⊤ + σ²I, where σ² is now a
scalar. This yields the conditional distribution

x ∼ N(x; b, WW^⊤ + σ²I),   (13.5)

or equivalently

x = Wh + b + σz,   (13.6)

where z ∼ N(z; 0, I) is Gaussian noise. Then, as Tipping and Bishop (1999) show,
we can use an iterative EM algorithm for estimating the parameters W and σ².
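As a rough illustration of that iterative procedure, the sketch below implements the
standard EM updates for probabilistic PCA in NumPy. The initialization, iteration count
and variable names are our own assumptions rather than details given in the text.

import numpy as np

def ppca_em(X, d, n_iter=100):
    # X: (N, D) data matrix; d: number of latent dimensions.
    N, D = X.shape
    b = X.mean(axis=0)
    Xc = X - b
    rng = np.random.default_rng(0)
    W = rng.standard_normal((D, d))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of h given each x.
        M = W.T @ W + sigma2 * np.eye(d)
        Minv = np.linalg.inv(M)
        Eh = Xc @ W @ Minv                          # rows are E[h | x]
        S_hh = N * sigma2 * Minv + Eh.T @ Eh        # sum over examples of E[h h^T | x]
        # M-step: update W and sigma^2.
        W_new = (Xc.T @ Eh) @ np.linalg.inv(S_hh)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Eh * (Xc @ W_new))
                  + np.trace(S_hh @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, b, sigma2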
This probabilistic PCA model takes advantage of the observation that most
variations in the data can be captured by the latent variables h, up to some small
residual reconstruction error σ². As shown by Tipping and Bishop (1999),
probabilistic PCA becomes PCA as σ → 0. In that case, the conditional expected
value of h given x becomes an orthogonal projection of x − b onto the space
spanned by the d columns of W, as in PCA.

As σ → 0, the density model defined by probabilistic PCA becomes very sharp
around these d dimensions spanned by the columns of W. This can make the
model assign very low likelihood to the data if the data do not actually cluster
near a hyperplane.
13.2 Independent Component Analysis (ICA)
Independent component analysis (ICA) is among the oldest representation learning
algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994;
Hyvärinen, 1999; Hyvärinen et al., 2001a; Hinton et al., 2001; Teh et al., 2003).
It is an approach to modeling linear factors that seeks to separate an observed
signal into many underlying signals that are scaled and added together to form
the observed data. These signals are intended to be fully independent, rather than
merely decorrelated from each other (see section 3.8 for a discussion of the difference
between uncorrelated variables and independent variables).
Many different specific methodologies are referred to as ICA. The variant
that is most similar to the other generative models we have described here is a
variant (Pham et al., 1992) that trains a fully parametric generative model. The
prior distribution over the underlying factors, p(h), must be fixed ahead of time by
the user. The model then deterministically generates x = Wh. We can perform a
nonlinear change of variables (using equation 3.47) to determine p(x). Learning
the model then proceeds as usual, using maximum likelihood.
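A hedged sketch of such maximum likelihood learning is given below: it performs gradient
ascent on the average of log p(x) = ∑_i log p(h_i) + log |det W^−1| with h = W^−1 x,
parametrizing the model directly by the unmixing matrix V = W^−1 and using the logistic
prior p(h_i) = (d/dh_i) σ(h_i) mentioned later in this section. The learning rate, step
count and initialization are illustrative assumptions, and the data are assumed to be
zero-mean.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ica_mle(X, n_steps=2000, lr=0.01):
    # X: (N, D) zero-mean observations.  We learn the unmixing matrix V = W^{-1}
    # so that h = V x has independent components.
    N, D = X.shape
    rng = np.random.default_rng(0)
    V = np.eye(D) + 0.1 * rng.standard_normal((D, D))
    for _ in range(n_steps):
        H = X @ V.T                        # rows are h = V x
        # Score of the logistic prior p(h_i) = d sigma(h_i) / d h_i:
        # d/dh log p(h) = 1 - 2 sigma(h).
        G = 1.0 - 2.0 * sigmoid(H)
        # Gradient of the average of log |det V| + sum_i log p(h_i) with respect to V.
        grad = np.linalg.inv(V).T + (G.T @ X) / N
        V = V + lr * grad
    return V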
The motivation for this approach is that by choosing p(h) to be independent,
we can recover underlying factors that are as close as possible to independent. This
is commonly used, not to capture high-level abstract causal factors, but to recover
low-level signals that have been mixed together. In this setting, each training
example is one moment in time, each x_i is one sensor's observation of the mixed
signals, and each h_i is one estimate of one of the original signals. For example, we
might have n people speaking simultaneously. If we have n different microphones
placed in different locations, ICA can detect the changes in the volume between
each speaker as heard by each microphone and separate the signals so that each
h_i contains only one person speaking clearly. This is commonly used in neuroscience
for electroencephalography, a technology for recording electrical signals originating
in the brain. Multiple electrode sensors placed on the subject's head are used
to measure many electrical signals coming from the body. The experimenter is
typically only interested in signals from the brain, but signals from the subject's
heart and eyes are strong enough to confound measurements taken at the subject's
scalp. The signals arrive at the electrodes mixed together, so ICA is necessary to
separate the electrical signature of the heart from the signals originating in the
brain, and to separate signals in different brain regions from each other.
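The source separation use case just described can be reproduced with an off-the-shelf ICA
implementation. The sketch below uses FastICA from scikit-learn (a widely used variant
that maximizes non-Gaussianity rather than an explicit likelihood; scikit-learn is an
assumed dependency) on two synthetic "speakers" mixed by a random matrix. Up to
permutation and rescaling of the recovered components, the estimated sources should
match the originals.

import numpy as np
from sklearn.decomposition import FastICA

# Two toy sources: a sine wave and a square wave, mixed by an unknown matrix.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # (time, sources)
A = rng.standard_normal((2, 2))                    # unknown mixing matrix
X = S @ A.T                                        # "microphone" recordings

ica = FastICA(n_components=2, random_state=0)
H = ica.fit_transform(X)          # estimated sources, one column per h_i
W_hat = ica.mixing_               # estimated mixing matrix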
As mentioned before, many variants of ICA are possible. Some add some noise
in the generation of x rather than using a deterministic decoder. Most do not
use the maximum likelihood criterion, but instead aim to make the elements of
h = W^−1 x independent from each other. Many criteria that accomplish this goal
are possible. Equation 3.47 requires taking the determinant of W, which can be
an expensive and numerically unstable operation. Some variants of ICA avoid this
problematic operation by constraining W to be orthogonal.
All variants of ICA require that p(h) be non-Gaussian. This is because if p(h)
is an independent prior with Gaussian components, then W is not identifiable.
We can obtain the same distribution p(x) for many values of W. This is very
different from other linear factor models like probabilistic PCA and factor analysis,
which often require p(h) to be Gaussian in order to make many operations on the
model have closed form solutions. In the maximum likelihood approach, where the
user explicitly specifies the distribution, a typical choice is to use p(h_i) = (d/dh_i) σ(h_i).
Typical choices of these non-Gaussian distributions have larger peaks near 0 than
does the Gaussian distribution, so we can also see most implementations of ICA
as learning sparse features.
Many variants of ICA are not generative models in the sense that we use the
phrase. In this book, a generative model either represents p(x) or can draw samples
from it. Many variants of ICA only know how to transform between x and h but
do not have any way of representing p(h), and thus do not impose a distribution
over x. For example, many ICA variants aim to increase the sample kurtosis of
h = W^−1 x, because high kurtosis indicates that p(h) is non-Gaussian, but this is
accomplished without explicitly representing p(h). This is because ICA is more
often used as an analysis tool for separating signals, rather than for generating
data or estimating its density.
Just as PCA can be generalized to the nonlinear autoencoders described in
chapter 14, ICA can be generalized to a nonlinear generative model, in which
we use a nonlinear function f to generate the observed data. See Hyvärinen and
Pajunen (1999) for the initial work on nonlinear ICA and its successful use with
ensemble learning by Roberts and Everson (2001) and Lappalainen et al. (2000).
Another nonlinear extension of ICA is the approach of nonlinear independent
components estimation, or NICE (Dinh et al., 2014), which stacks a series of
invertible transformations (encoder stages) with the property that the determinant
of the Jacobian of each transformation can be computed efficiently. This makes
it possible to compute the likelihood exactly, and like ICA, NICE attempts to
transform the data into a space where it has a factorized marginal distribution, but
it is more likely to succeed thanks to the nonlinear encoder. Because the encoder
is associated with a decoder that is its perfect inverse, generating samples from
the model is straightforward (by first sampling from p(h) and then applying the
decoder).
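As an illustrative sketch (not the reference NICE implementation), the additive coupling
layer below splits its input in two, leaves the first half unchanged, and shifts the
second half by a function of the first. The layer is trivially invertible and its Jacobian
is triangular with unit diagonal, so its determinant is 1 and the exact change-of-variables
likelihood is cheap to evaluate; the small tanh network used as the coupling function is
an arbitrary choice for the example.

import numpy as np

class AdditiveCoupling:
    # Forward: (x1, x2) -> (x1, x2 + m(x1)); inverse: (h1, h2) -> (h1, h2 - m(h1)).
    # The Jacobian is triangular with ones on the diagonal, so det = 1 and the
    # log-likelihood reduces to sum_i log p(h_i) (plus any final scaling layer).
    def __init__(self, d1, d2, hidden=64, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        self.W1 = 0.1 * rng.standard_normal((d1, hidden))
        self.W2 = 0.1 * rng.standard_normal((hidden, d2))

    def m(self, x1):
        return np.tanh(x1 @ self.W1) @ self.W2

    def forward(self, x1, x2):
        return x1, x2 + self.m(x1)

    def inverse(self, h1, h2):
        return h1, h2 - self.m(h1)

Stacking several such layers, alternating which half is left unchanged, gives the
invertible encoder; sampling applies the inverses in reverse order to a draw from the
factorized p(h).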
Another generalization of ICA is to learn groups of features, with statistical
dependence allowed within a group but discouraged between groups (Hyvärinen
and Hoyer, 1999; Hyvärinen et al., 2001b). When the groups of related units are
chosen to be nonoverlapping, this is called independent subspace analysis. It is
also possible to assign spatial coordinates to each hidden unit and form overlapping
groups of spatially neighboring units. This encourages nearby units to learn similar
features. When applied to natural images, this topographic ICA approach learns
Gabor filters, such that neighboring features have similar orientation, location or
frequency. Many different phase offsets of similar Gabor functions occur within
each region, so that pooling over small regions yields translation invariance.
13.3 Slow Feature Analysis
Slow feature analysis
(SFA) is a linear factor model that uses information from
time signals to learn invariant features (Wiskott and Sejnowski, 2002).
Slow feature analysis is motivated by a general principle called the slowness
principle. The idea is that the important characteristics of scenes change very
slowly compared to the individual measurements that make up a description of a
scene. For example, in computer vision, individual pixel values can change very
rapidly. If a zebra moves from left to right across the image, an individual pixel
will rapidly change from black to white and back again as the zebra’s stripes pass
over the pixel. By comparison, the feature indicating whether a zebra is in the
image will not change at all, and the feature describing the zebra’s position will
change slowly. We therefore may wish to regularize our model to learn features
that change slowly over time.
The slowness principle predates slow feature analysis and has been applied
to a wide variety of models (Hinton, 1989; Földiák, 1989; Mobahi et al., 2009;
Bergstra and Bengio, 2009). In general, we can apply the slowness principle to any
differentiable model trained with gradient descent. The slowness principle may be
introduced by adding a term to the cost function of the form

λ ∑_t L(f(x^(t+1)), f(x^(t))),   (13.7)

where λ is a hyperparameter determining the strength of the slowness regularization
term, t is the index into a time sequence of examples, f is the feature extractor
to be regularized, and L is a loss function measuring the distance between f(x^(t))
and f(x^(t+1)). A common choice for L is the mean squared difference.
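A minimal sketch of how such a penalty might be computed, assuming the examples arrive as
an ordered sequence and using the mean squared difference for L; the linear feature
extractor and the value of λ are illustrative.

import numpy as np

def slowness_penalty(features, lam=0.1):
    # features: (T, d) array of f(x^(t)) for consecutive time steps t = 1..T.
    # Penalize the mean squared difference between consecutive feature vectors.
    diffs = features[1:] - features[:-1]
    return lam * np.mean(np.sum(diffs ** 2, axis=1))

# Example: add the slowness term to a task loss for a linear extractor f(x) = W^T x.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))     # a sequence of 100 observations
W = rng.standard_normal((10, 4))       # parameters of the feature extractor
penalty = slowness_penalty(X @ W, lam=0.1)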
Slow feature analysis is a particularly efficient application of the slowness
principle. It is efficient because it is applied to a linear feature extractor and can
thus be trained in closed form. Like some variants of ICA, SFA is not quite a
generative model per se, in the sense that it defines a linear map between input
space and feature space but does not define a prior over feature space and thus
does not impose a distribution p(x) on input space.
The SFA algorithm (Wiskott and Sejnowski, 2002) consists of defining f(x; θ)
to be a linear transformation, then solving the optimization problem

min_θ E_t (f(x^(t+1))_i − f(x^(t))_i)²   (13.8)

subject to the constraints

E_t f(x^(t))_i = 0   (13.9)

and

E_t [f(x^(t))_i²] = 1.   (13.10)
The constraint that the learned feature have zero mean is necessary to make the
problem have a unique solution; otherwise we could add a constant to all feature
values and obtain a different solution with equal value of the slowness objective.
The constraint that the features have unit variance is necessary to prevent the
pathological solution where all features collapse to 0. Like PCA, the SFA features
are ordered, with the first feature being the slowest. To learn multiple features, we
must also add the constraint
∀ i < j,  E_t [f(x^(t))_i f(x^(t))_j] = 0.   (13.11)
This specifies that the learned features must be linearly decorrelated from each
other. Without this constraint, all the learned features would simply capture the
one slowest signal. One could imagine using other mechanisms, such as minimizing
reconstruction error, to force the features to diversify, but this decorrelation
mechanism admits a simple solution due to the linearity of SFA features. The SFA
problem may be solved in closed form by a linear algebra package.
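One common closed-form recipe (a sketch under our own formulation, not necessarily the
exact procedure of Wiskott and Sejnowski, 2002) is to center and whiten the inputs, which
enforces the zero-mean, unit-variance and decorrelation constraints, and then take the
eigenvectors of the covariance of the temporal differences with the smallest eigenvalues:

import numpy as np

def linear_sfa(X, n_features):
    # X: (T, D) time-ordered inputs; returns a projection onto the slowest features.
    mean = X.mean(axis=0)
    Xc = X - mean
    # Whiten so that the unit-variance and decorrelation constraints are satisfied.
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    whiten = evecs / np.sqrt(evals + 1e-10)      # small constant for numerical stability
    Z = Xc @ whiten
    # The slowest directions minimize the variance of the temporal differences.
    dZ = Z[1:] - Z[:-1]
    devals, devecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    W = whiten @ devecs[:, :n_features]          # eigh sorts ascending: slowest first
    return W, mean

# Features f(x^(t)) = (x^(t) - mean) @ W are ordered from slowest to fastest.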
SFA is typically used to learn nonlinear features by applying a nonlinear basis
expansion to x before running SFA. For example, it is common to replace x
with the quadratic basis expansion, a vector containing elements x_i x_j for all i
and j. Linear SFA modules may then be composed to learn deep nonlinear slow
feature extractors by repeatedly learning a linear SFA feature extractor, applying
a nonlinear basis expansion to its output, and then learning another linear SFA
feature extractor on top of that expansion.
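A sketch of one layer of this expand-then-SFA pattern, reusing the hypothetical
linear_sfa helper above:

import numpy as np

def quadratic_expansion(X):
    # Append all products x_i x_j (for i <= j) to the original coordinates.
    i, j = np.triu_indices(X.shape[1])
    return np.concatenate([X, X[:, i] * X[:, j]], axis=1)

# One layer of "quadratic SFA"; deeper extractors repeat this on the layer's output.
# W, mean = linear_sfa(quadratic_expansion(X), n_features=16)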
When trained on small spatial patches of videos of natural scenes, SFA with
quadratic basis expansions learns features that share many characteristics with
those of complex cells in V1 cortex (Berkes and Wiskott, 2005). When trained
on videos of random motion within 3-D computer-rendered environments, deep
SFA learns features that share many characteristics with the features represented
by neurons in rat brains that are used for navigation (Franzius et al., 2007). SFA
thus seems to be a reasonably biologically plausible model.
A major advantage of SFA is that it is possible to theoretically predict which
features SFA will learn, even in the deep nonlinear setting. To make such theoretical
predictions, one must know about the dynamics of the environment in terms of
configuration space (e.g., in the case of random motion in the 3-D rendered
environment, the theoretical analysis proceeds from knowledge of the probability
distribution over position and velocity of the camera). Given the knowledge of how
the underlying factors actually change, it is possible to analytically solve for the
optimal functions expressing these factors. In practice, experiments with deep SFA
applied to simulated data seem to recover the theoretically predicted functions.
This is in comparison to other learning algorithms, where the cost function depends
highly on specific pixel values, making it much more difficult to determine what
features the model will learn.
Deep SFA has also been used to learn features for object recognition and pose
estimation (Franzius et al., 2008). So far, the slowness principle has not become
the basis for any state-of-the-art applications. It is unclear what factor has limited
its performance. We speculate that perhaps the slowness prior is too strong, and
that, rather than imposing a prior that features should be approximately constant,
it would be better to impose a prior that features should be easy to predict from
one time step to the next. The position of an object is a useful feature regardless of
whether the object’s velocity is high or low, but the slowness principle encourages
the model to ignore the position of objects that have high velocity.
13.4 Sparse Coding
Sparse coding (Olshausen and Field, 1996) is a linear factor model that has
been heavily studied as an unsupervised feature learning and feature extraction
mechanism. Strictly speaking, the term “sparse coding” refers to the process of
inferring the value of h in this model, while “sparse modeling” refers to the process
of designing and learning the model, but the term “sparse coding” is often used to
refer to both.

Like most other linear factor models, it uses a linear decoder plus noise to
obtain reconstructions of x, as specified in equation 13.2. More specifically, sparse
coding models typically assume that the linear factors have Gaussian noise with
isotropic precision β:

p(x | h) = N(x; Wh + b, (1/β) I).   (13.12)
The distribution p(h) is chosen to be one with sharp peaks near 0 (Olshausen
and Field, 1996). Common choices include factorized Laplace, Cauchy or factorized
Student t-distributions. For example, the Laplace prior parametrized in terms of
the sparsity penalty coefficient λ is given by

p(h_i) = Laplace(h_i; 0, 2/λ) = (λ/4) e^(−(1/2)λ|h_i|),   (13.13)

and the Student t prior by

p(h_i) ∝ 1 / (1 + h_i²/ν)^((ν+1)/2).   (13.14)
Training sparse coding with maximum likelihood is intractable. Instead, the
training alternates between encoding the data and training the decoder to better
reconstruct the data given the encoding. This approach will be justified further as
a principled approximation to maximum likelihood later, in section 19.3.
For models such as PCA, we have seen the use of a parametric encoder function
that predicts h and consists only of multiplication by a weight matrix. The encoder
that we use with sparse coding is not a parametric encoder. Instead, the encoder
is an optimization algorithm, which solves an optimization problem in which we
seek the single most likely code value:

h* = f(x) = arg max_h p(h | x).   (13.15)

When combined with equation 13.13 and equation 13.12, this yields the following
optimization problem:

arg max_h p(h | x)   (13.16)
= arg max_h log p(h | x)   (13.17)
= arg min_h λ||h||_1 + β||x − Wh||_2^2,   (13.18)

where we have dropped terms not depending on h and divided by positive scaling
factors to simplify the equation.

Due to the imposition of an L^1 norm on h, this procedure will yield a sparse
h* (see section 7.1.2).
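One standard way to solve the minimization in equation 13.18 (a choice of ours, not
prescribed by the text) is the iterative shrinkage-thresholding algorithm, sketched below
in NumPy with a step size derived from the Lipschitz constant of the squared-error term.

import numpy as np

def soft_threshold(a, t):
    # Proximal operator of the L1 norm: shrink toward zero by t.
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def sparse_code(x, W, lam=0.1, beta=1.0, n_steps=200):
    # Minimize lam * ||h||_1 + beta * ||x - W h||_2^2 over h (equation 13.18).
    h = np.zeros(W.shape[1])
    # Step size from the Lipschitz constant of the smooth (squared-error) term.
    step = 1.0 / (2.0 * beta * np.linalg.norm(W, ord=2) ** 2)
    for _ in range(n_steps):
        grad = -2.0 * beta * W.T @ (x - W @ h)    # gradient of the squared error
        h = soft_threshold(h - step * grad, step * lam)
    return h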
To train the model rather than just perform inference, we alternate between
minimization with respect to h and minimization with respect to W. In this
presentation, we treat β as a hyperparameter. Typically it is set to 1 because its
role in this optimization problem is shared with λ, and there is no need for both
hyperparameters. In principle, we could also treat β as a parameter of the model
and learn it. Our presentation here has discarded some terms that do not depend
on h but do depend on β. To learn β, these terms must be included, or β will
collapse to 0.
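A sketch of that alternation, reusing the hypothetical sparse_code helper above and
updating W by least squares with the codes held fixed (one reasonable choice among
several; the column normalization is a common trick to remove the scale ambiguity
between W and h, not something the text mandates):

import numpy as np

def train_sparse_coding(X, d, lam=0.1, n_epochs=20):
    # X: (N, D) data matrix; d: number of dictionary elements (columns of W).
    N, D = X.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((D, d))
    W /= np.linalg.norm(W, axis=0)                            # normalize dictionary columns
    for _ in range(n_epochs):
        # Encode: minimize over h for each example (equation 13.18).
        H = np.stack([sparse_code(x, W, lam=lam) for x in X])  # (N, d)
        # Decode: minimize reconstruction error over W with the codes fixed.
        W = np.linalg.lstsq(H, X, rcond=None)[0].T             # (D, d)
        W /= np.linalg.norm(W, axis=0) + 1e-12
    return W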
Not all approaches to sparse coding explicitly build a p(h) and a p(x | h).
Often we are just interested in learning a dictionary of features with activation
values that will often be zero when extracted using this inference procedure.

If we sample h from a Laplace prior, it is in fact a zero probability event for
an element of h to actually be zero. The generative model itself is not especially
sparse; only the feature extractor is. Goodfellow et al. (2013d) describe approximate
inference in a different model family, the spike and slab sparse coding model, for
which samples from the prior usually contain true zeros.
The sparse coding approach combined with the use of the nonparametric
encoder can in principle minimize the combination of reconstruction error and
log-prior better than any specific parametric encoder. Another advantage is that
there is no generalization error to the encoder. A parametric encoder must learn
how to map x to h in a way that generalizes. For unusual x that do not resemble
the training data, a learned parametric encoder may fail to find an h that results in
accurate reconstruction or in a sparse code. For the vast majority of formulations
of sparse coding models, where the inference problem is convex, the optimization
procedure will always find the optimal code (unless degenerate cases such as
replicated weight vectors occur). Obviously, the sparsity and reconstruction costs
can still rise on unfamiliar points, but this is due to generalization error in the
decoder weights, rather than to generalization error in the encoder. The lack of
generalization error in sparse coding’s optimization-based encoding process may
result in better generalization when sparse coding is used as a feature extractor for
a classifier than when a parametric function is used to predict the code. Coates
and Ng (2011) demonstrated that sparse coding features generalize better for
object recognition tasks than the features of a related model based on a parametric
encoder, the linear-sigmoid autoencoder. Inspired by their work, Goodfellow et al.
(2013d) showed that a variant of sparse coding generalizes better than other feature
extractors in the regime where extremely few labels are available (twenty or fewer
labels per class).
The primary disadvantage of the nonparametric encoder is that it requires
greater time to compute h given x because the nonparametric approach requires
because the nonparametric approach requires
running an iterative algorithm. The parametric autoencoder approach, developed in
chapter 14, uses only a fixed number of layers, often only one. Another disadvantage
is that it is not straightforward to back-propagate through the nonparametric
encoder, which makes it difficult to pretrain a sparse coding model with an
unsupervised criterion and then fine-tune it using a supervised criterion. Modified
versions of sparse coding that permit approximate derivatives do exist but are not
widely used (Bagnell and Bradley, 2009).
Sparse coding, like other linear factor models, often produces poor samples, as
shown in figure 13.2. This happens even when the model is able to reconstruct
the data well and provide useful features for a classifier. The reason is that each
individual feature may be learned well, but the factorial prior on the hidden code
results in the model including random subsets of all the features in each generated
sample. This motivates the development of deeper models that can impose a
nonfactorial distribution on the deepest code layer, as well as the development of
more sophisticated shallow models.
Figure 13.2: Example samples and weights from a spike and slab sparse coding model
trained on the MNIST dataset. (Left) The samples from the model do not resemble the
training examples. At first glance, one might assume the model is poorly fit. (Right) The
weight vectors of the model have learned to represent penstrokes and sometimes complete
digits. The model has thus learned useful features. The problem is that the factorial prior
over features results in random subsets of features being combined. Few such subsets
are appropriate to form a recognizable MNIST digit. This motivates the development of
generative models that have more powerful distributions over their latent codes. Figure
reproduced with permission from Goodfellow et al. (2013d).
13.5 Manifold Interpretation of PCA
Linear factor models including PCA and factor analysis can be interpreted as
learning a manifold (Hinton et al., 1997). We can view probabilistic PCA as
defining a thin pancake-shaped region of high probability—a Gaussian distribution
that is very narrow along some axes, just as a pancake is very flat along its vertical
axis, but is elongated along other axes, just as a pancake is wide along its horizontal
axes. This is illustrated in figure 13.3. PCA can be interpreted as aligning this
pancake with a linear manifold in a higher-dimensional space. This interpretation
applies not just to traditional PCA but also to any linear autoencoder that learns
matrices W and V with the goal of making the reconstruction of x lie as close to
x as possible.
Let the encoder be

h = f(x) = W^⊤(x − µ).   (13.19)

The encoder computes a low-dimensional representation of h. With the autoencoder
view, we have a decoder computing the reconstruction

x̂ = g(h) = b + V h.   (13.20)

The choices of linear encoder and decoder that minimize reconstruction error

E[||x − x̂||²]   (13.21)
correspond to V = W, µ = b = E[x] and the columns of W form an orthonormal
basis, which spans the same subspace as the principal eigenvectors of the covariance
matrix

C = E[(x − µ)(x − µ)^⊤].   (13.22)

In the case of PCA, the columns of W are these eigenvectors, ordered by the
magnitude of the corresponding eigenvalues (which are all real and non-negative).
One can also show that eigenvalue λ_i of C corresponds to the variance of x
in the direction of eigenvector v^(i). If x ∈ R^D and h ∈ R^d with d < D, then the
optimal reconstruction error (choosing µ, b, V and W as above) is

min E[||x − x̂||²] = ∑_{i=d+1}^{D} λ_i.   (13.23)

Hence, if the covariance has rank d, the eigenvalues λ_{d+1} to λ_D are 0 and
reconstruction error is 0.
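The identity in equation 13.23 is easy to check numerically. The sketch below estimates µ
and C from samples, builds W from the top d eigenvectors, and compares the mean squared
reconstruction error with the sum of the discarded eigenvalues; the data-generating
matrix is an arbitrary illustration.

import numpy as np

rng = np.random.default_rng(0)
D, d, N = 6, 2, 5000
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))   # correlated data

mu = X.mean(axis=0)
C = np.cov(X - mu, rowvar=False)
evals, evecs = np.linalg.eigh(C)          # eigenvalues in ascending order
W = evecs[:, -d:]                         # top-d principal eigenvectors

H = (X - mu) @ W                          # encoder: h = W^T (x - mu)
X_hat = mu + H @ W.T                      # decoder with V = W, b = mu
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(err, evals[:-d].sum())              # approximately equal (equation 13.23)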
Figure 13.3: Flat Gaussian capturing probability concentration near a low-dimensional
manifold. The figure shows the upper half of the “pancake” above the “manifold plane,”
which goes through its middle. The variance in the direction orthogonal to the manifold
is very small (arrow pointing out of plane) and can be considered “noise,” while the other
variances are large (arrows in the plane) and correspond to “signal” and to a coordinate
system for the reduced-dimension data.
Furthermore, one can also show that the above solution can be obtained by
maximizing the variances of the elements of h, under orthogonal W, instead of
minimizing reconstruction error.
Linear factor models are some of the simplest generative models and some of the
simplest models that learn a representation of data. Much as linear classifiers and
linear regression models may be extended to deep feedforward networks, these linear
factor models may be extended to autoencoder networks and deep probabilistic
models that perform the same tasks but with a much more powerful and flexible
model family.