Chapter 17

The Manifold Perspective on

Representation Learning

Manifold learning is an approach to machine learning that capitalizes on the

manifold hypothesis (Cayton, 2005; Narayanan and Mitter, 2010): the data gen-

erating distribution is assumed to concentrate near regions of low dimensionality.

The notion of manifold in mathematics refers to continuous spaces that locally

resemble Euclidean space, and the term we should be using is really submanifold,

which corresponds to a subset that has a manifold structure. The use of the

term manifold in machine learning is much looser than its use in mathematics,

though:

• the data may not be strictly on the manifold, but only near it,

• the dimensionality may not be the same everywhere,

• the notion actually referred to in machine learning naturally extends to

discrete spaces.

Indeed, although the very notions of a manifold or submanifold are deﬁned

for continuous spaces, the more general notion of probability concentration ap-

plies equally well to discrete data. It is a kind of informal prior assumption about

the data generating distribution that seems particularly well-suited for AI tasks

such as those involving images, video, speech, music, text, etc. In all of these

cases the natural data has the property that randomly choosing conﬁgurations of

the observed variables according to a factored distribution (e.g., uniformly) is

very unlikely to generate the kind of observations we want to model. What is the

probability of generating a natural looking image by choosing pixel intensities

independently of each other? What is the probability of generating a meaning-

ful natural language paragraph by independently choosing each character in a


Figure 17.1: Top: data sampled from a distribution in a high-dimensional space (only two

dimensions shown for illustration) that is actually concentrated near a one-dimensional

manifold, which here is like a twisted string. Bottom: the underlying manifold that the

learner should infer.

string? Doing a thought experiment should give a clear answer: an exponentially

tiny probability. This is because the probability distribution of interest concen-

trates in a tiny volume of the total space of conﬁgurations. That means that to

the ﬁrst degree, the problem of characterizing the data generating distribution

can be reduced to a binary classiﬁcation problem: is this conﬁguration probable

or not? Is this a grammatically and semantically plausible sentence in English?

Is this a natural-looking image? Answering these questions tells us much more

about the nature of natural language or of natural images than the additional information one

would have by being able to assign a precise probability to each possible sequence

of characters or set of pixels. Hence, simply characterizing where probability

concentrates is of fundamental importance, and this is what manifold learning

algorithms attempt to do. Because it is a where question, it is more about ge-

ometry than about probability distributions, although we ﬁnd both views useful


when designing learning algorithms for AI tasks.


Figure 17.2: A two-dimensional manifold near which training examples are concentrated,

along with a tangent plane and its associated tangent directions, forming a basis that

specifies the directions of small moves one can make to stay on the manifold.

Figure 17.3: Illustration of tangent vectors of the manifold estimated by a contractive

auto-encoder (CAE), at some input point (top left, image of a zero). Each image on the

top right corresponds to a tangent vector. They are obtained by picking the dominant

singular vectors (with largest singular value) of the Jacobian ∂f(x)/∂x (see Section 15.10).

Taking the original image plus a small quantity of any of these tangent vectors yields

another plausible image, as illustrated in the bottom. The leading tangent vectors seem

to correspond to small deformations, such as translation, or shifting ink around locally in

the original image. Reproduced with permission from the authors of Rifai et al. (2011a).
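The construction in this caption can be sketched in a few lines of Python. The snippet below is only an illustrative approximation, not the procedure of Rifai et al. (2011a): it assumes some encoder f is available as a Python callable, estimates its Jacobian by finite differences, and returns the dominant right singular vectors as estimated tangent directions.

import numpy as np

def tangent_vectors(encoder, x, n_tangents=4, eps=1e-4):
    # Illustrative sketch: estimate leading tangent directions at input x
    # from the Jacobian of an encoder f, as in Figure 17.3. `encoder` maps a
    # flat input vector to a code vector h = f(x).
    x = np.asarray(x, dtype=float)
    h0 = encoder(x)
    # Numerical Jacobian J[i, j] = d f_i / d x_j (finite differences for
    # illustration; an automatic-differentiation tool would normally be used).
    J = np.zeros((h0.size, x.size))
    for j in range(x.size):
        x_pert = x.copy()
        x_pert[j] += eps
        J[:, j] = (encoder(x_pert) - h0) / eps
    # The dominant right singular vectors span the input directions to which
    # the code is most sensitive, i.e. the estimated tangent directions.
    _, s, Vt = np.linalg.svd(J, full_matrices=False)
    return Vt[:n_tangents], s[:n_tangents]

Adding a small multiple of one of the returned directions to x should then yield another plausible input if the encoder has captured the local manifold structure, as the bottom of Figure 17.3 illustrates.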

In addition to the property of probability concentration, there is another one


that characterizes the manifold hypothesis: when a conﬁguration is probable it

is generally surrounded (at least in some directions) by other probable conﬁgura-

tions. If a conﬁguration of pixels looks like a natural image, then there are tiny

changes one can make to the image (like translating everything by 0.1 pixel to the

left) which yield another natural-looking image. The number of independent ways

(each characterized by a number indicating how much or whether we do it) by

which a probable conﬁguration can be locally transformed into another probable

conﬁguration indicates the local dimension of the manifold. Whereas maximum

likelihood procedures tend to concentrate probability mass on the training ex-

amples (which can each become a local maximum of probability when the model

overﬁts), the manifold hypothesis suggests that good solutions instead concen-

trate probability along ridges of high probability (or their high-dimensional gen-

eralization) that connect nearby examples to each other. This is illustrated in

Figure 17.1.

What is most commonly learned to characterize a manifold is a representation

of the data points on (or near, i.e. projected on) the manifold. Such a representa-

tion for a particular example is also called its embedding. It is typically given by a

low-dimensional vector, with fewer dimensions than the “ambient” space of which

the manifold is a low-dimensional subset. Some algorithms (non-parametric man-

ifold learning algorithms, discussed below) directly learn an embedding for each

training example, while others learn a more general mapping, sometimes called

an encoder, or representation function, that maps any point in the ambient space

(the input space) to its embedding.

Another important characterization of a manifold is the set of its tangent

planes. At a point x on a d-dimensional manifold, the tangent plane is given by d

basis vectors that span the local directions of variation allowed on the manifold.

As illustrated in Figure 17.2, these local directions specify how one can change x

inﬁnitesimally while staying on the manifold.
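As a concrete illustration (under the assumption that the data are simply rows of a NumPy array), a tangent plane can be approximated by local PCA: the difference vectors between a point and its nearest neighbors lie approximately in the tangent plane, so their leading principal directions give a d-dimensional basis.

import numpy as np

def local_tangent_basis(X, x, k=10, d=2):
    # Sketch: estimate a d-dimensional tangent basis at point x from its k
    # nearest neighbors among the rows of X (local PCA).
    dists = np.linalg.norm(X - x, axis=1)
    neighbors = X[np.argsort(dists)[:k]]
    centered = neighbors - neighbors.mean(axis=0)
    # The top right singular vectors span the local directions of variation.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:d]   # rows are orthonormal basis vectors of the tangent plane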

Manifold learning has mostly focused on unsupervised learning procedures

that attempt to capture these manifolds. Most of the initial machine learning re-

search on learning non-linear manifolds has focused on non-parametric methods

based on the nearest-neighbor graph. This graph has one node per training ex-

ample and edges connecting near neighbors. Basically, these methods (Sch¨olkopf

et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin

and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004; Hinton

and Roweis, 2003; van der Maaten and Hinton, 2008a) associate each of these

nodes with a tangent plane that spans the directions of variation associated with

the diﬀerence vectors between the example and its neighbors, as illustrated in

Figure 17.4.

A global coordinate system can then be obtained through an optimization or solving a linear system.


Figure 17.4: Non-parametric manifold learning procedures build a nearest neighbor graph

whose nodes are training examples and arcs connect nearest neighbors. Various proce-

dures can thus obtain the tangent plane associated with a neighborhood of the graph,

and a coordinate system that associates each training example with a real-valued vector

position, or embedding. It is possible to generalize such a representation to new examples

by a form of interpolation. So long as the number of examples is large enough to cover

the curvature and twists of the manifold, these approaches work well. Images from the

QMUL Multiview Face Dataset (Gong et al., 2000).

Figure 17.5 illustrates how a manifold can be tiled by a

large number of locally linear Gaussian-like patches (or “pancakes”, because the

Gaussians are ﬂat in the tangent directions).
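For readers who want to experiment with graph-based procedures of this kind, implementations of several of them are available in standard libraries. The following minimal sketch (which assumes scikit-learn is installed, a dependency not otherwise used in this chapter) recovers a two-dimensional coordinate system for a synthetic “swiss roll” manifold embedded in three dimensions.

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# A 3-D point cloud concentrated near a rolled-up 2-D manifold.
X, _ = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)

# Both procedures build a nearest-neighbor graph over the training examples
# and then solve for a low-dimensional embedding (a global coordinate system).
embedding_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
embedding_lle = LocallyLinearEmbedding(n_neighbors=10,
                                       n_components=2).fit_transform(X)
# Each row of the embeddings gives the coordinates of one training example
# on the learned manifold.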


Figure 17.5: If the tangent planes at all locations are known, they can be tiled to

form a global coordinate system or a density function. In the ﬁgure, each local patch

can be thought of as a local Euclidean coordinate system or as a locally ﬂat Gaussian,

or “pancake”, with a very small variance in the directions orthogonal to the pancake and

a very large variance in the directions deﬁning the coordinate system on the pancake.

The average of all these Gaussians would provide an estimated density function, as in the

Manifold Parzen algorithm (Vincent and Bengio, 2003) or its non-local neural-net based

variant (Bengio et al., 2006c).
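The density estimate evoked in this caption can be sketched roughly as follows. This is only a loose illustration of the “mixture of flat Gaussians” idea, not a faithful reimplementation of Vincent and Bengio (2003); the bandwidths sigma_tangent and sigma_noise are hypothetical knobs rather than quantities estimated as in that work.

import numpy as np

def manifold_parzen_logpdf(X, x_query, k=10, d=2,
                           sigma_tangent=1.0, sigma_noise=0.01):
    # Average of one flattened Gaussian per training example: wide along d
    # locally estimated tangent directions, narrow in all other directions.
    n, D = X.shape
    log_probs = np.empty(n)
    for i in range(n):
        # Tangent directions at X[i] from its k nearest neighbors.
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = X[np.argsort(dists)[1:k + 1]] - X[i]
        _, _, Vt = np.linalg.svd(neighbors, full_matrices=False)
        V = Vt[:d]                              # (d, D) tangent basis
        diff = x_query - X[i]
        along = V @ diff                        # components on the "pancake"
        ortho_sq = diff @ diff - along @ along  # squared orthogonal residual
        # Gaussian with variance sigma_tangent**2 along the pancake and
        # sigma_noise**2 orthogonal to it.
        quad = along @ along / sigma_tangent**2 + ortho_sq / sigma_noise**2
        logdet = 2 * d * np.log(sigma_tangent) + 2 * (D - d) * np.log(sigma_noise)
        log_probs[i] = -0.5 * (quad + logdet + D * np.log(2 * np.pi))
    # Uniform-weight mixture over the n training examples.
    return np.logaddexp.reduce(log_probs) - np.log(n)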


Figure 17.6: When the data are images, the tangent vectors can also be visualized like

images. Here we show the tangent vector associated with translation: it corresponds to

the diﬀerence between an image and a slightly translated version. This basically extracts

edges, with the sign (shown as red or green color) indicating whether ink is being added

or subtracted when moving to the right. The figure illustrates the fact that the manifolds
associated with translation, rotation and other such apparently benign transformations
of images can be highly curved: the tangent vector at a position

x is very diﬀerent from the tangent vector at a nearby position (slightly translated image).

Note how the two tangent vectors shown have almost no overlap, i.e., their dot product

is nearly 0, and they are as far apart as two such images can be. If the tangent vectors of a

manifold change quickly, it means that the manifold is highly curved. This is going to

make it very difficult to capture such manifolds (with a huge number of twists, ups and

downs) with local non-parametric methods such as graph-based non-parametric manifold

learning. This point was made in Bengio and Monperrus (2005).
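The translation tangent vector described here is straightforward to compute for any image array: a crude finite-difference stand-in for an infinitesimal shift is simply the difference between a one-pixel shifted copy and the original image. A minimal sketch:

import numpy as np

def translation_tangent(image):
    # Finite-difference approximation of the tangent vector for horizontal
    # translation: shifted image minus original (wrap-around at the border
    # is ignored in this sketch). Positive values indicate ink being added
    # and negative values ink being removed when moving to the right.
    shifted = np.roll(image, shift=1, axis=1)
    return shifted - image

Computing this vector at an image and again at a slightly translated copy of it yields two nearly orthogonal tangent vectors on a high-contrast image, which is exactly the source of the difficulty the figure illustrates.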


However, there is a fundamental diﬃculty with such non-parametric neighborhood-

based approaches to manifold learning, raised in Bengio and Monperrus (2005):

if the manifolds are not very smooth (they have many ups and downs and twists),

one may need a very large number of training examples to cover each one of

these variations, with no chance to generalize to unseen variations. Indeed, these

methods can only generalize the shape of the manifold by interpolating between

neighboring examples. Unfortunately, the manifolds of interest in AI have many

ups and downs and twists and strong curvature, as illustrated in Figure 17.6. This

motivates the use of distributed representations and deep learning for capturing

manifold structure, which is the subject of this chapter.

Figure 17.7: Training examples of a face dataset – the QMUL Multiview Face

Dataset (Gong et al., 2000) – for which the subjects were asked to move in such a way as

to cover the two-dimensional manifold corresponding to two angles of rotation. We would

like learning algorithms to be able to discover and disentangle such factors. Figure 17.8

illustrates such a feat.

The hope of many manifold learning algorithms, including those based on deep

learning and auto-encoders, is that one learns an explicit or implicit coordinate

system for the leading factors of variation that explain most of the structure in

the unknown data generating distribution. An example of an explicit coordinate

system is one where the dimensions of the representation (e.g., the outputs of the

encoder, i.e., of the hidden units that compute the “code” associated with the

input) are directly the coordinates that map the unknown manifold. Training

examples of a face dataset in which the images have been arranged visually on a

2-D manifold are shown in Figure 17.7, with the images laid down so that each

of the two axes corresponds to one of the two angles of rotation of the face.

However, the objective is to discover such manifolds, and Figure 17.8 illus-

trates the images generated by a variational auto-encoder (Kingma and Welling,

2014a) when the two-dimensional auto-encoder code (representation) is varied
