Chapter 13
Linear Factor Models
Many of the research frontiers in deep learning involve building a probabilistic
model of the input, p_model(x). Such a model can, in principle, use probabilistic
inference to predict any of the variables in its environment given any of
the other variables. Many of these models also have latent variables h, with
p_model(x) = E_h p_model(x | h). These latent variables provide another means of
representing the data. Distributed representations based on latent variables can
obtain all the advantages of representation learning that we have seen with deep
feedforward and recurrent networks.
In this chapter, we describe some of the simplest probabilistic models with
latent variables: linear factor models. These models are sometimes used as building
blocks of mixture models (Hinton et al., 1995a; Ghahramani and Hinton, 1996;
Roweis et al., 2002) or of larger, deep probabilistic models (Tang et al., 2012).
They also show many of the basic approaches necessary to building generative
models that the more advanced deep models will extend further.
A linear factor model is defined by the use of a stochastic linear decoder function
that generates x by adding noise to a linear transformation of h.
These models are interesting because they allow us to discover explanatory
factors that have a simple joint distribution. The simplicity of using a linear decoder
made these models some of the first latent variable models to be extensively studied.
A linear factor model describes the data-generation process as follows. First,
we sample the explanatory factors h from a distribution
h ∼ p(h),   (13.1)

where p(h) is a factorial distribution, with p(h) = ∏_i p(h_i), so that it is easy
to sample from. Next we sample the real-valued observable variables given the
factors:

x = Wh + b + noise,   (13.2)

where the noise is typically Gaussian and diagonal (independent across dimensions).
This is illustrated in figure 13.1.

Figure 13.1: The directed graphical model describing the linear factor model family, in
which we assume that an observed data vector x is obtained by a linear combination of
independent latent factors h, plus some noise. Different models, such as probabilistic
PCA, factor analysis or ICA, make different choices about the form of the noise and of
the prior p(h).
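To make the generative process concrete, the following is a minimal NumPy sketch of
ancestral sampling from a linear factor model with a factorial standard Gaussian prior
and diagonal Gaussian noise. The dimensions, seed and parameter values are illustrative
assumptions, not values taken from the text.

import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 5                          # latent and observed dimensionalities (illustrative)
W = rng.standard_normal((D, d))      # factor loading matrix
b = rng.standard_normal(D)           # offset vector
sigma = 0.1 * np.ones(D)             # per-dimension noise standard deviations

def sample_linear_factor_model(n):
    # Sample the factors from the factorial prior p(h), here a standard Gaussian.
    h = rng.standard_normal((n, d))
    # Generate observations with x = W h + b + noise (equation 13.2).
    noise = sigma * rng.standard_normal((n, D))
    return h @ W.T + b + noise, h

x, h = sample_linear_factor_model(1000)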
13.1 Probabilistic PCA and Factor Analysis
Probabilistic PCA (principal components analysis), factor analysis and other linear
factor models are special cases of the above equations (13.1 and 13.2) and only
differ in the choices made for the noise distribution and the model’s prior over
latent variables h before observing x.
In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable
prior is just the unit variance Gaussian

h ∼ N(h; 0, I),   (13.3)

while the observed variables x_i are assumed to be conditionally independent,
given h. Specifically, the noise is assumed to be drawn from a diagonal covariance
Gaussian distribution, with covariance matrix ψ = diag(σ²), with
σ² = [σ²_1, σ²_2, . . . , σ²_n]^⊤ a vector of per-variable variances.
The role of the latent variables is thus to capture the dependencies between
the different observed variables x_i. Indeed, it can easily be shown that x is just a
multivariate normal random variable, with

x ∼ N(x; b, WW^⊤ + ψ).   (13.4)
To cast PCA in a probabilistic framework, we can make a slight modification
to the factor analysis model, making the conditional variances σ²_i equal to each
other. In that case the covariance of x is just WW^⊤ + σ²I, where σ² is now a
scalar. This yields the conditional distribution

x ∼ N(x; b, WW^⊤ + σ²I),   (13.5)

or equivalently

x = Wh + b + σz,   (13.6)

where z ∼ N(z; 0, I) is Gaussian noise. Then, as Tipping and Bishop (1999) show,
we can use an iterative EM algorithm for estimating the parameters W and σ².
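As a rough illustration of that iterative procedure, the sketch below implements the
standard EM updates for probabilistic PCA in NumPy. The initialization, iteration count
and variable names are our own assumptions rather than details given in the text.

import numpy as np

def ppca_em(X, d, n_iter=100):
    # X: (N, D) data matrix; d: number of latent dimensions.
    N, D = X.shape
    b = X.mean(axis=0)
    Xc = X - b
    rng = np.random.default_rng(0)
    W = rng.standard_normal((D, d))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of h given each x.
        M = W.T @ W + sigma2 * np.eye(d)
        Minv = np.linalg.inv(M)
        Eh = Xc @ W @ Minv                          # rows are E[h | x]
        S_hh = N * sigma2 * Minv + Eh.T @ Eh        # sum over examples of E[h h^T | x]
        # M-step: update W and sigma^2.
        W_new = (Xc.T @ Eh) @ np.linalg.inv(S_hh)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Eh * (Xc @ W_new))
                  + np.trace(S_hh @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, b, sigma2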
This probabilistic PCA model takes advantage of the observation that most
variations in the data can be captured by the latent variables h, up to some small
residual reconstruction error σ². As shown by Tipping and Bishop (1999),
probabilistic PCA becomes PCA as σ → 0. In that case, the conditional expected
value of h given x becomes an orthogonal projection of x − b onto the space
spanned by the d columns of W, as in PCA.

As σ → 0, the density model defined by probabilistic PCA becomes very sharp
around these d dimensions spanned by the columns of W. This can make the
model assign very low likelihood to the data if the data do not actually cluster
near a hyperplane.
13.2 Independent Component Analysis (ICA)
Independent component analysis (ICA) is among the oldest representation learning
algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994;
Hyvärinen, 1999; Hyvärinen et al., 2001a; Hinton et al., 2001; Teh et al., 2003).
It is an approach to modeling linear factors that seeks to separate an observed
signal into many underlying signals that are scaled and added together to form
the observed data. These signals are intended to be fully independent, rather than
merely decorrelated from each other (see section 3.8 for a discussion of the difference
between uncorrelated variables and independent variables).
Many different specific methodologies are referred to as ICA. The variant
that is most similar to the other generative models we have described here is a
variant (Pham et al., 1992) that trains a fully parametric generative model. The
prior distribution over the underlying factors, p(h), must be fixed ahead of time by
the user. The model then deterministically generates x = Wh. We can perform a
nonlinear change of variables (using equation 3.47) to determine p(x). Learning
the model then proceeds as usual, using maximum likelihood.
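A hedged sketch of such maximum likelihood learning is given below: it performs gradient
ascent on the average of log p(x) = ∑_i log p(h_i) + log |det W^−1| with h = W^−1 x,
parametrizing the model directly by the unmixing matrix V = W^−1 and using the logistic
prior p(h_i) = (d/dh_i) σ(h_i) mentioned later in this section. The learning rate, step
count and initialization are illustrative assumptions, and the data are assumed to be
zero-mean.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ica_mle(X, n_steps=2000, lr=0.01):
    # X: (N, D) zero-mean observations.  We learn the unmixing matrix V = W^{-1}
    # so that h = V x has independent components.
    N, D = X.shape
    rng = np.random.default_rng(0)
    V = np.eye(D) + 0.1 * rng.standard_normal((D, D))
    for _ in range(n_steps):
        H = X @ V.T                        # rows are h = V x
        # Score of the logistic prior p(h_i) = d sigma(h_i) / d h_i:
        # d/dh log p(h) = 1 - 2 sigma(h).
        G = 1.0 - 2.0 * sigmoid(H)
        # Gradient of the average of log |det V| + sum_i log p(h_i) with respect to V.
        grad = np.linalg.inv(V).T + (G.T @ X) / N
        V = V + lr * grad
    return V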
The motivation for this approach is that by choosing p(h) to be independent,
we can recover underlying factors that are as close as possible to independent. This
is commonly used, not to capture high-level abstract causal factors, but to recover
low-level signals that have been mixed together. In this setting, each training
example is one moment in time, each x_i is one sensor's observation of the mixed
signals, and each h_i is one estimate of one of the original signals. For example, we
might have n people speaking simultaneously. If we have n different microphones
placed in different locations, ICA can detect the changes in the volume between
each speaker as heard by each microphone and separate the signals so that each
h_i contains only one person speaking clearly. This is commonly used in neuroscience
for electroencephalography, a technology for recording electrical signals originating
in the brain. Multiple electrode sensors placed on the subject's head are used
to measure many electrical signals coming from the body. The experimenter is
typically only interested in signals from the brain, but signals from the subject's
heart and eyes are strong enough to confound measurements taken at the subject's
scalp. The signals arrive at the electrodes mixed together, so ICA is necessary to
separate the electrical signature of the heart from the signals originating in the
brain, and to separate signals in different brain regions from each other.
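The source separation use case just described can be reproduced with an off-the-shelf ICA
implementation. The sketch below uses FastICA from scikit-learn (a widely used variant
that maximizes non-Gaussianity rather than an explicit likelihood; scikit-learn is an
assumed dependency) on two synthetic "speakers" mixed by a random matrix. Up to
permutation and rescaling of the recovered components, the estimated sources should
match the originals.

import numpy as np
from sklearn.decomposition import FastICA

# Two toy sources: a sine wave and a square wave, mixed by an unknown matrix.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # (time, sources)
A = rng.standard_normal((2, 2))                    # unknown mixing matrix
X = S @ A.T                                        # "microphone" recordings

ica = FastICA(n_components=2, random_state=0)
H = ica.fit_transform(X)          # estimated sources, one column per h_i
W_hat = ica.mixing_               # estimated mixing matrix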
As mentioned before, many variants of ICA are possible. Some add some noise
in the generation of x rather than using a deterministic decoder. Most do not
use the maximum likelihood criterion, but instead aim to make the elements of
h = W^−1 x independent from each other. Many criteria that accomplish this goal
are possible. Equation 3.47 requires taking the determinant of W, which can be
an expensive and numerically unstable operation. Some variants of ICA avoid this
problematic operation by constraining W to be orthogonal.
All variants of ICA require that p(h) be non-Gaussian. This is because if p(h)
is an independent prior with Gaussian components, then W is not identifiable.
We can obtain the same distribution p(x) for many values of W. This is very
different from other linear factor models like probabilistic PCA and factor analysis,
which often require p(h) to be Gaussian in order to make many operations on the
model have closed form solutions. In the maximum likelihood approach, where the
user explicitly specifies the distribution, a typical choice is to use p(h_i) = (d/dh_i) σ(h_i).
Typical choices of these non-Gaussian distributions have larger peaks near 0 than
does the Gaussian distribution, so we can also see most implementations of ICA
as learning sparse features.
Many variants of ICA are not generative models in the sense that we use the
phrase. In this book, a generative model either represents p(x) or can draw samples
from it. Many variants of ICA only know how to transform between x and h but
do not have any way of representing p(h), and thus do not impose a distribution
over x. For example, many ICA variants aim to increase the sample kurtosis of
h = W^−1 x, because high kurtosis indicates that p(h) is non-Gaussian, but this is
accomplished without explicitly representing p(h). This is because ICA is more
often used as an analysis tool for separating signals, rather than for generating
data or estimating its density.
Just as PCA can be generalized to the nonlinear autoencoders described in
chapter 14, ICA can be generalized to a nonlinear generative model, in which
we use a nonlinear function f to generate the observed data. See Hyvärinen and
Pajunen (1999) for the initial work on nonlinear ICA and its successful use with
ensemble learning by Roberts and Everson (2001) and Lappalainen et al. (2000).
Another nonlinear extension of ICA is the approach of nonlinear independent
components estimation, or NICE (Dinh et al., 2014), which stacks a series of
invertible transformations (encoder stages) with the property that the determinant
of the Jacobian of each transformation can be computed efficiently. This makes
it possible to compute the likelihood exactly, and like ICA, NICE attempts to
transform the data into a space where it has a factorized marginal distribution, but
it is more likely to succeed thanks to the nonlinear encoder. Because the encoder
is associated with a decoder that is its perfect inverse, generating samples from
the model is straightforward (by first sampling from p(h) and then applying the
decoder).
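As an illustrative sketch (not the reference NICE implementation), the additive coupling
layer below splits its input in two, leaves the first half unchanged, and shifts the
second half by a function of the first. The layer is trivially invertible and its Jacobian
is triangular with unit diagonal, so its determinant is 1 and the exact change-of-variables
likelihood is cheap to evaluate; the small tanh network used as the coupling function is
an arbitrary choice for the example.

import numpy as np

class AdditiveCoupling:
    # Forward: (x1, x2) -> (x1, x2 + m(x1)); inverse: (h1, h2) -> (h1, h2 - m(h1)).
    # The Jacobian is triangular with ones on the diagonal, so det = 1 and the
    # log-likelihood reduces to sum_i log p(h_i) (plus any final scaling layer).
    def __init__(self, d1, d2, hidden=64, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        self.W1 = 0.1 * rng.standard_normal((d1, hidden))
        self.W2 = 0.1 * rng.standard_normal((hidden, d2))

    def m(self, x1):
        return np.tanh(x1 @ self.W1) @ self.W2

    def forward(self, x1, x2):
        return x1, x2 + self.m(x1)

    def inverse(self, h1, h2):
        return h1, h2 - self.m(h1)

Stacking several such layers, alternating which half is left unchanged, gives the
invertible encoder; sampling applies the inverses in reverse order to a draw from the
factorized p(h).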
Another generalization of ICA is to learn groups of features, with statistical
dependence allowed within a group but discouraged between groups (Hyvärinen
and Hoyer, 1999; Hyvärinen et al., 2001b). When the groups of related units are
chosen to be nonoverlapping, this is called independent subspace analysis. It is
also possible to assign spatial coordinates to each hidden unit and form overlapping
groups of spatially neighboring units. This encourages nearby units to learn similar
features. When applied to natural images, this topographic ICA approach learns
Gabor filters, such that neighboring features have similar orientation, location or
frequency. Many different phase offsets of similar Gabor functions occur within
each region, so that pooling over small regions yields translation invariance.
13.3 Slow Feature Analysis
Slow feature analysis
(SFA) is a linear factor model that uses information from
time signals to learn invariant features (Wiskott and Sejnowski, 2002).
Slow feature analysis is motivated by a general principle called the slowness
principle. The idea is that the important characteristics of scenes change very
slowly compared to the individual measurements that make up a description of a
scene. For example, in computer vision, individual pixel values can change very
rapidly. If a zebra moves from left to right across the image, an individual pixel
will rapidly change from black to white and back again as the zebra’s stripes pass
over the pixel. By comparison, the feature indicating whether a zebra is in the
image will not change at all, and the feature describing the zebra’s position will
change slowly. We therefore may wish to regularize our model to learn features
that change slowly over time.
The slowness principle predates slow feature analysis and has been applied
to a wide variety of models (Hinton, 1989; Földiák, 1989; Mobahi et al., 2009;
Bergstra and Bengio, 2009). In general, we can apply the slowness principle to any
differentiable model trained with gradient descent. The slowness principle may be
introduced by adding a term to the cost function of the form

λ ∑_t L(f(x^(t+1)), f(x^(t))),   (13.7)

where λ is a hyperparameter determining the strength of the slowness regularization
term, t is the index into a time sequence of examples, f is the feature extractor
to be regularized, and L is a loss function measuring the distance between f(x^(t))
and f(x^(t+1)). A common choice for L is the mean squared difference.
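A minimal sketch of how such a penalty might be computed, assuming the examples arrive as
an ordered sequence and using the mean squared difference for L; the linear feature
extractor and the value of λ are illustrative.

import numpy as np

def slowness_penalty(features, lam=0.1):
    # features: (T, d) array of f(x^(t)) for consecutive time steps t = 1..T.
    # Penalize the mean squared difference between consecutive feature vectors.
    diffs = features[1:] - features[:-1]
    return lam * np.mean(np.sum(diffs ** 2, axis=1))

# Example: add the slowness term to a task loss for a linear extractor f(x) = W^T x.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))     # a sequence of 100 observations
W = rng.standard_normal((10, 4))       # parameters of the feature extractor
penalty = slowness_penalty(X @ W, lam=0.1)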
Slow feature analysis is a particularly efficient application of the slowness
principle. It is efficient because it is applied to a linear feature extractor and can
thus be trained in closed form. Like some variants of ICA, SFA is not quite a
generative model per se, in the sense that it defines a linear map between input
space and feature space but does not define a prior over feature space and thus
does not impose a distribution p(x) on input space.
The SFA algorithm (Wiskott and Sejnowski, 2002) consists of defining f(x; θ)
to be a linear transformation, then solving the optimization problem

min_θ E_t (f(x^(t+1))_i − f(x^(t))_i)²   (13.8)

subject to the constraints

E_t f(x^(t))_i = 0   (13.9)

and

E_t [f(x^(t))_i²] = 1.   (13.10)
The constraint that the learned feature have zero mean is necessary to make the
problem have a unique solution; otherwise we could add a constant to all feature
values and obtain a different solution with equal value of the slowness objective.
The constraint that the features have unit variance is necessary to prevent the
pathological solution where all features collapse to 0. Like PCA, the SFA features
are ordered, with the first feature being the slowest. To learn multiple features, we
must also add the constraint
∀ i < j,  E_t [f(x^(t))_i f(x^(t))_j] = 0.   (13.11)
This specifies that the learned features must be linearly decorrelated from each
other. Without this constraint, all the learned features would simply capture the
one slowest signal. One could imagine using other mechanisms, such as minimizing
reconstruction error, to force the features to diversify, but this decorrelation
mechanism admits a simple solution due to the linearity of SFA features. The SFA
problem may be solved in closed form by a linear algebra package.
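One common closed-form recipe (a sketch under our own formulation, not necessarily the
exact procedure of Wiskott and Sejnowski, 2002) is to center and whiten the inputs, which
enforces the zero-mean, unit-variance and decorrelation constraints, and then take the
eigenvectors of the covariance of the temporal differences with the smallest eigenvalues:

import numpy as np

def linear_sfa(X, n_features):
    # X: (T, D) time-ordered inputs; returns a projection onto the slowest features.
    mean = X.mean(axis=0)
    Xc = X - mean
    # Whiten so that the unit-variance and decorrelation constraints are satisfied.
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    whiten = evecs / np.sqrt(evals + 1e-10)      # small constant for numerical stability
    Z = Xc @ whiten
    # The slowest directions minimize the variance of the temporal differences.
    dZ = Z[1:] - Z[:-1]
    devals, devecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    W = whiten @ devecs[:, :n_features]          # eigh sorts ascending: slowest first
    return W, mean

# Features f(x^(t)) = (x^(t) - mean) @ W are ordered from slowest to fastest.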
SFA is typically used to learn nonlinear features by applying a nonlinear basis
expansion to x before running SFA. For example, it is common to replace x
with the quadratic basis expansion, a vector containing elements x_i x_j for all i
and j. Linear SFA modules may then be composed to learn deep nonlinear slow
feature extractors by repeatedly learning a linear SFA feature extractor, applying
a nonlinear basis expansion to its output, and then learning another linear SFA
feature extractor on top of that expansion.
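A sketch of one layer of this expand-then-SFA pattern, reusing the hypothetical
linear_sfa helper above:

import numpy as np

def quadratic_expansion(X):
    # Append all products x_i x_j (for i <= j) to the original coordinates.
    i, j = np.triu_indices(X.shape[1])
    return np.concatenate([X, X[:, i] * X[:, j]], axis=1)

# One layer of "quadratic SFA"; deeper extractors repeat this on the layer's output.
# W, mean = linear_sfa(quadratic_expansion(X), n_features=16)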
When trained on small spatial patches of videos of natural scenes, SFA with
quadratic basis expansions learns features that share many characteristics with
those of complex cells in V1 cortex (Berkes and Wiskott, 2005). When trained
on videos of random motion within 3-D computer-rendered environments, deep
SFA learns features that share many characteristics with the features represented
by neurons in rat brains that are used for navigation (Franzius et al., 2007). SFA
thus seems to be a reasonably biologically plausible model.
A major advantage of SFA is that it is possible to theoretically predict which
features SFA will learn, even in the deep nonlinear setting. To make such theoretical
predictions, one must know about the dynamics of the environment in terms of
configuration space (e.g., in the case of random motion in the 3-D rendered
environment, the theoretical analysis proceeds from knowledge of the probability
distribution over position and velocity of the camera). Given the knowledge of how
the underlying factors actually change, it is possible to analytically solve for the
optimal functions expressing these factors. In practice, experiments with deep SFA
applied to simulated data seem to recover the theoretically predicted functions.
This is in comparison to other learning algorithms, where the cost function depends
highly on specific pixel values, making it much more difficult to determine what
features the model will learn.
Deep SFA has also been used to learn features for object recognition and pose
estimation (Franzius et al., 2008). So far, the slowness principle has not become
the basis for any state-of-the-art applications. It is unclear what factor has limited
its performance. We speculate that perhaps the slowness prior is too strong, and
that, rather than imposing a prior that features should be approximately constant,
it would be better to impose a prior that features should be easy to predict from
one time step to the next. The position of an object is a useful feature regardless of
whether the object’s velocity is high or low, but the slowness principle encourages
the model to ignore the position of objects that have high velocity.
13.4 Sparse Coding
Sparse coding (Olshausen and Field, 1996) is a linear factor model that has
been heavily studied as an unsupervised feature learning and feature extraction
mechanism. Strictly speaking, the term “sparse coding” refers to the process of
inferring the value of h in this model, while “sparse modeling” refers to the process
of designing and learning the model, but the term “sparse coding” is often used to
refer to both.

Like most other linear factor models, it uses a linear decoder plus noise to
obtain reconstructions of x, as specified in equation 13.2. More specifically, sparse
coding models typically assume that the linear factors have Gaussian noise with
isotropic precision β:

p(x | h) = N(x; Wh + b, (1/β) I).   (13.12)
The distribution p(h) is chosen to be one with sharp peaks near 0 (Olshausen
and Field, 1996). Common choices include factorized Laplace, Cauchy or factorized
Student t-distributions. For example, the Laplace prior parametrized in terms of
the sparsity penalty coefficient λ is given by

p(h_i) = Laplace(h_i; 0, 2/λ) = (λ/4) e^(−(1/2)λ|h_i|),   (13.13)

and the Student t prior by

p(h_i) ∝ 1 / (1 + h_i²/ν)^((ν+1)/2).   (13.14)
Training sparse coding with maximum likelihood is intractable. Instead, the
training alternates between encoding the data and training the decoder to better
reconstruct the data given the encoding. This approach will be justified further as
a principled approximation to maximum likelihood later, in section 19.3.
For models such as PCA, we have seen the use of a parametric encoder function
that predicts h and consists only of multiplication by a weight matrix. The encoder
that we use with sparse coding is not a parametric encoder. Instead, the encoder
is an optimization algorithm, which solves an optimization problem in which we
seek the single most likely code value:

h* = f(x) = arg max_h p(h | x).   (13.15)

When combined with equation 13.13 and equation 13.12, this yields the following
optimization problem:

arg max_h p(h | x)   (13.16)
= arg max_h log p(h | x)   (13.17)
= arg min_h λ||h||_1 + β||x − Wh||_2^2,   (13.18)

where we have dropped terms not depending on h and divided by positive scaling
factors to simplify the equation.

Due to the imposition of an L^1 norm on h, this procedure will yield a sparse
h* (see section 7.1.2).
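One standard way to solve the minimization in equation 13.18 (a choice of ours, not
prescribed by the text) is the iterative shrinkage-thresholding algorithm, sketched below
in NumPy with a step size derived from the Lipschitz constant of the squared-error term.

import numpy as np

def soft_threshold(a, t):
    # Proximal operator of the L1 norm: shrink toward zero by t.
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def sparse_code(x, W, lam=0.1, beta=1.0, n_steps=200):
    # Minimize lam * ||h||_1 + beta * ||x - W h||_2^2 over h (equation 13.18).
    h = np.zeros(W.shape[1])
    # Step size from the Lipschitz constant of the smooth (squared-error) term.
    step = 1.0 / (2.0 * beta * np.linalg.norm(W, ord=2) ** 2)
    for _ in range(n_steps):
        grad = -2.0 * beta * W.T @ (x - W @ h)    # gradient of the squared error
        h = soft_threshold(h - step * grad, step * lam)
    return h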
To train the model rather than just perform inference, we alternate between
minimization with respect to h and minimization with respect to W. In this
presentation, we treat β as a hyperparameter. Typically it is set to 1 because its
role in this optimization problem is shared with λ, and there is no need for both
hyperparameters. In principle, we could also treat β as a parameter of the model
and learn it. Our presentation here has discarded some terms that do not depend
on h but do depend on β. To learn β, these terms must be included, or β will
collapse to 0.
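A sketch of that alternation, reusing the hypothetical sparse_code helper above and
updating W by least squares with the codes held fixed (one reasonable choice among
several; the column normalization is a common trick to remove the scale ambiguity
between W and h, not something the text mandates):

import numpy as np

def train_sparse_coding(X, d, lam=0.1, n_epochs=20):
    # X: (N, D) data matrix; d: number of dictionary elements (columns of W).
    N, D = X.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((D, d))
    W /= np.linalg.norm(W, axis=0)                            # normalize dictionary columns
    for _ in range(n_epochs):
        # Encode: minimize over h for each example (equation 13.18).
        H = np.stack([sparse_code(x, W, lam=lam) for x in X])  # (N, d)
        # Decode: minimize reconstruction error over W with the codes fixed.
        W = np.linalg.lstsq(H, X, rcond=None)[0].T             # (D, d)
        W /= np.linalg.norm(W, axis=0) + 1e-12
    return W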
Not all approaches to sparse coding explicitly build a p(h) and a p(x | h).
Often we are just interested in learning a dictionary of features with activation
values that will often be zero when extracted using this inference procedure.

If we sample h from a Laplace prior, it is in fact a zero probability event for
an element of h to actually be zero. The generative model itself is not especially
sparse; only the feature extractor is. Goodfellow et al. (2013d) describe approximate
inference in a different model family, the spike and slab sparse coding model, for
which samples from the prior usually contain true zeros.
The sparse coding approach combined with the use of the nonparametric
encoder can in principle minimize the combination of reconstruction error and
log-prior better than any specific parametric encoder. Another advantage is that
there is no generalization error to the encoder. A parametric encoder must learn
how to map x to h in a way that generalizes. For unusual x that do not resemble
the training data, a learned parametric encoder may fail to find an h that results in
accurate reconstruction or in a sparse code. For the vast majority of formulations
of sparse coding models, where the inference problem is convex, the optimization
procedure will always find the optimal code (unless degenerate cases such as
replicated weight vectors occur). Obviously, the sparsity and reconstruction costs
can still rise on unfamiliar points, but this is due to generalization error in the
decoder weights, rather than to generalization error in the encoder. The lack of
generalization error in sparse coding’s optimization-based encoding process may
result in better generalization when sparse coding is used as a feature extractor for
a classifier than when a parametric function is used to predict the code. Coates
and Ng (2011) demonstrated that sparse coding features generalize better for
object recognition tasks than the features of a related model based on a parametric
encoder, the linear-sigmoid autoencoder. Inspired by their work, Goodfellow et al.
(2013d) showed that a variant of sparse coding generalizes better than other feature
extractors in the regime where extremely few labels are available (twenty or fewer
labels per class).
The primary disadvantage of the nonparametric encoder is that it requires
greater time to compute h given x because the nonparametric approach requires
because the nonparametric approach requires
running an iterative algorithm. The parametric autoencoder approach, developed in
chapter 14, uses only a fixed number of layers, often only one. Another disadvantage
is that it is not straightforward to back-propagate through the nonparametric
encoder, which makes it difficult to pretrain a sparse coding model with an
unsupervised criterion and then fine-tune it using a supervised criterion. Modified
versions of sparse coding that permit approximate derivatives do exist but are not
widely used (Bagnell and Bradley, 2009).
Sparse coding, like other linear factor models, often produces poor samples, as
shown in figure 13.2. This happens even when the model is able to reconstruct
the data well and provide useful features for a classifier. The reason is that each
individual feature may be learned well, but the factorial prior on the hidden code
results in the model including random subsets of all the features in each generated
sample. This motivates the development of deeper models that can impose a
nonfactorial distribution on the deepest code layer, as well as the development of
more sophisticated shallow models.
Figure 13.2: Example samples and weights from a spike and slab sparse coding model
trained on the MNIST dataset. (Left) The samples from the model do not resemble the
training examples. At first glance, one might assume the model is poorly fit. (Right) The
weight vectors of the model have learned to represent penstrokes and sometimes complete
digits. The model has thus learned useful features. The problem is that the factorial prior
over features results in random subsets of features being combined. Few such subsets
are appropriate to form a recognizable MNIST digit. This motivates the development of
generative models that have more powerful distributions over their latent codes. Figure
reproduced with permission from Goodfellow et al. (2013d).
13.5 Manifold Interpretation of PCA
Linear factor models including PCA and factor analysis can be interpreted as
learning a manifold (Hinton et al., 1997). We can view probabilistic PCA as
defining a thin pancake-shaped region of high probability—a Gaussian distribution
that is very narrow along some axes, just as a pancake is very flat along its vertical
axis, but is elongated along other axes, just as a pancake is wide along its horizontal
axes. This is illustrated in figure 13.3. PCA can be interpreted as aligning this
pancake with a linear manifold in a higher-dimensional space. This interpretation
applies not just to traditional PCA but also to any linear autoencoder that learns
matrices W and V with the goal of making the reconstruction of x lie as close to
x as possible.
Let the encoder be

h = f(x) = W^⊤(x − µ).   (13.19)

The encoder computes a low-dimensional representation of h. With the autoencoder
view, we have a decoder computing the reconstruction

x̂ = g(h) = b + V h.   (13.20)

The choices of linear encoder and decoder that minimize reconstruction error

E[||x − x̂||²]   (13.21)
correspond to V = W, µ = b = E[x] and the columns of W form an orthonormal
basis, which spans the same subspace as the principal eigenvectors of the covariance
matrix

C = E[(x − µ)(x − µ)^⊤].   (13.22)

In the case of PCA, the columns of W are these eigenvectors, ordered by the
magnitude of the corresponding eigenvalues (which are all real and non-negative).
One can also show that eigenvalue λ_i of C corresponds to the variance of x
in the direction of eigenvector v^(i). If x ∈ R^D and h ∈ R^d with d < D, then the
optimal reconstruction error (choosing µ, b, V and W as above) is

min E[||x − x̂||²] = ∑_{i=d+1}^{D} λ_i.   (13.23)

Hence, if the covariance has rank d, the eigenvalues λ_{d+1} to λ_D are 0 and
reconstruction error is 0.
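The identity in equation 13.23 is easy to check numerically. The sketch below estimates µ
and C from samples, builds W from the top d eigenvectors, and compares the mean squared
reconstruction error with the sum of the discarded eigenvalues; the data-generating
matrix is an arbitrary illustration.

import numpy as np

rng = np.random.default_rng(0)
D, d, N = 6, 2, 5000
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))   # correlated data

mu = X.mean(axis=0)
C = np.cov(X - mu, rowvar=False)
evals, evecs = np.linalg.eigh(C)          # eigenvalues in ascending order
W = evecs[:, -d:]                         # top-d principal eigenvectors

H = (X - mu) @ W                          # encoder: h = W^T (x - mu)
X_hat = mu + H @ W.T                      # decoder with V = W, b = mu
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(err, evals[:-d].sum())              # approximately equal (equation 13.23)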
Figure 13.3: Flat Gaussian capturing probability concentration near a low-dimensional
manifold. The figure shows the upper half of the “pancake” above the “manifold plane,”
which goes through its middle. The variance in the direction orthogonal to the manifold
is very small (arrow pointing out of plane) and can be considered “noise,” while the other
variances are large (arrows in the plane) and correspond to “signal” and to a coordinate
system for the reduced-dimension data.
Furthermore, one can also show that the above solution can be obtained by
maximizing the variances of the elements of h, under orthogonal W, instead of
minimizing reconstruction error.
Linear factor models are some of the simplest generative models and some of the
simplest models that learn a representation of data. Much as linear classifiers and
linear regression models may be extended to deep feedforward networks, these linear
factor models may be extended to autoencoder networks and deep probabilistic
models that perform the same tasks but with a much more powerful and flexible
model family.