Chapter 17
The Manifold Perspective on
Representation Learning
Manifold learning is an approach to machine learning that capitalizes on the
manifold hypothesis (Cayton, 2005; Narayanan and Mitter, 2010): the data gen-
erating distribution is assumed to concentrate near regions of low dimensionality.
The notion of a manifold in mathematics refers to continuous spaces that locally
resemble Euclidean space; strictly speaking, the term we should be using is really
submanifold, which corresponds to a subset that has a manifold structure. The
use of the term manifold in machine learning is much looser than its use in
mathematics: the data may not lie strictly on the manifold, but only near it;
the dimensionality may not be the same everywhere; and the notion actually
referred to in machine learning naturally extends to discrete spaces.
Indeed, although the very notions of a manifold or submanifold are defined
for continuous spaces, the more general notion of probability concentration ap-
plies equally well to discrete data. It is a kind of informal prior assumption about
the data generating distribution that seems particularly well-suited for AI tasks
such as those involving images, video, speech, music, text, etc. In all of these
cases the natural data has the property that randomly choosing configurations of
the observed variables according to a factored distribution (e.g. uniformly) are
very unlikely to generate the kind of observations we want to model. What is the
probability of generating a natural looking image by choosing pixel intensities
independently of each other? What is the probability of generating a meaning-
ful natural language paragraph by independently choosing each character in a
Figure 17.1: Top: data sampled from a distribution in a high-dimensional space (one 2
dimensions shown for illustration) that is actually concentrated near a one-dimensional
manifold, which here is like a twisted string. Bottom: the underlying manifold that the
learner should infer.
string? Doing a thought experiment should give a clear answer: an exponentially
tiny probability. This is because the probability distribution of interest concen-
trates in a tiny volume of the total space of configurations. That means that to
a first approximation, the problem of characterizing the data generating distribution
can be reduced to a binary classification problem: is this configuration probable
or not? Is this a grammatically and semantically plausible sentence in English?
Is this a natural-looking image? Answering these questions tells us much more
about the nature of natural language or text than the additional information one
would have by being able to assign a precise probability to each possible sequence
of characters or set of pixels. Hence, simply characterizing where probability
concentrates is of fundamental importance, and this is what manifold learning
algorithms attempt to do. Because it is a where question, it is more about
geometry than about probability distributions, although we find both views useful
when designing learning algorithms for AI tasks.
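The "exponentially tiny" claim can be made concrete with a back-of-the-envelope calculation; the 27-symbol alphabet and 50-character length below are arbitrary choices for illustration:

```python
# Thought experiment: probability of producing one specific 50-character
# English sentence by choosing each character uniformly at random
# from a 27-symbol alphabet (26 letters plus space).
alphabet_size = 27
length = 50
p = (1 / alphabet_size) ** length
print(f"{p:.3e}")  # on the order of 10^-72: exponentially tiny
```

The probability shrinks exponentially in the sequence length, which is exactly why independently sampled configurations essentially never land in the region where natural data concentrates.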
Figure 17.2: A two-dimensional manifold near which training examples are concentrated,
along with a tangent plane and its associated tangent directions, forming a basis that
specifies the directions of small moves one can make to stay on the manifold.
Figure 17.3: Illustration of tangent vectors of the manifold estimated by a contractive
auto-encoder (CAE), at some input point (top left, image of a zero). Each image on the
top right corresponds to a tangent vector. They are obtained by picking the dominant
singular vectors (those with the largest singular values) of the Jacobian matrix
∂f(x)/∂x of the encoder (see Section 15.10).
Taking the original image plus a small quantity of any of these tangent vectors yields
another plausible image, as illustrated in the bottom. The leading tangent vectors seem
to correspond to small deformations, such as translation, or shifting ink around locally in
the original image. Reproduced with permission from the authors of Rifai et al. (2011a).
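The extraction procedure the caption describes can be sketched as follows; the encoder here is a toy stand-in (a random smooth map), not a trained contractive auto-encoder, and the finite-difference Jacobian is only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 10))

def f(x):
    return np.tanh(A @ x)        # toy encoder: R^10 -> R^3

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - fx) / eps
    return J

x = rng.normal(size=10)
J = jacobian(f, x)
# The right singular vectors of J with the largest singular values span
# the input directions the encoder is most sensitive to -- the estimated
# tangent directions at x.
_, s, vt = np.linalg.svd(J)
tangents = vt[:2]                # two leading tangent vectors
print(tangents.shape)            # (2, 10)
```

With a trained CAE, one would replace the toy `f` with the learned encoder and use its analytic Jacobian rather than finite differences.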
In addition to the property of probability concentration, there is another one
that characterizes the manifold hypothesis: when a configuration is probable it
is generally surrounded (at least in some directions) by other probable configura-
tions. If a configuration of pixels looks like a natural image, then there are tiny
changes one can make to the image (like translating everything by 0.1 pixel to the
left) which yield another natural-looking image. The number of independent ways
(each characterized by a number indicating how much or whether we do it) by
which a probable configuration can be locally transformed into another probable
configuration indicates the local dimension of the manifold. Whereas maximum
likelihood procedures tend to concentrate probability mass on the training ex-
amples (which can each become a local maximum of probability when the model
overfits), the manifold hypothesis suggests that good solutions instead concen-
trate probability along ridges of high probability (or their high-dimensional gen-
eralization) that connect nearby examples to each other. This is illustrated in
Figure 17.1.
What is most commonly learned to characterize a manifold is a representation
of the data points on (or near, i.e. projected on) the manifold. Such a representa-
tion for a particular example is also called its embedding. It is typically given by a
low-dimensional vector, with fewer dimensions than the “ambient” space of which
the manifold is a low-dimensional subset. Some algorithms (non-parametric man-
ifold learning algorithms, discussed below) directly learn an embedding for each
training example, while others learn a more general mapping, sometimes called
an encoder, or representation function, that maps any point in the ambient space
(the input space) to its embedding.
Another important characterization of a manifold is the set of its tangent
planes. At a point x on a d-dimensional manifold, the tangent plane is given by d
basis vectors that span the local directions of variation allowed on the manifold.
As illustrated in Figure 17.2, these local directions specify how one can change x
infinitesimally while staying on the manifold.
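As a concrete sketch of a tangent plane, consider the unit circle, a one-dimensional manifold embedded in two dimensions (the function names below are illustrative):

```python
import numpy as np

# The unit circle is a 1-D manifold embedded in 2-D. Its points are
# g(t) = (cos t, sin t); the tangent "plane" (here a line) at g(t) is
# spanned by the derivative g'(t) = (-sin t, cos t).
def point(t):
    return np.array([np.cos(t), np.sin(t)])

def tangent(t):
    return np.array([-np.sin(t), np.cos(t)])

t = 0.3
x = point(t)
u = tangent(t)
# A small step along the tangent direction stays close to the manifold:
eps = 1e-3
x_moved = x + eps * u
print(abs(np.linalg.norm(x_moved) - 1.0))  # O(eps^2), i.e. ~5e-7
```

The deviation from the manifold is second order in the step size, which is the sense in which tangent directions allow one to change x "infinitesimally" while staying on the manifold.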
Manifold learning has mostly focused on unsupervised learning procedures
that attempt to capture these manifolds. Most of the initial machine learning re-
search on learning non-linear manifolds has focused on non-parametric methods
based on the nearest-neighbor graph. This graph has one node per training ex-
ample and edges connecting near neighbors. Basically, these methods (Schölkopf
et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin
and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004; Hinton
and Roweis, 2003; van der Maaten and Hinton, 2008a) associate each of these
nodes with a tangent plane that spans the directions of variations associated with
the difference vectors between the example and its neighbors, as illustrated in
Figure 17.4.
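The core idea shared by these methods can be sketched with local PCA on the difference vectors to each point's nearest neighbors; this is a simplified illustration, not an implementation of any particular algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples near a 1-D manifold (a circle) embedded in 2-D.
t = rng.uniform(0, 2 * np.pi, size=200)
X = np.stack([np.cos(t), np.sin(t)], axis=1) + 0.01 * rng.normal(size=(200, 2))

def local_tangent(X, i, k=10, d=1):
    """Tangent basis at X[i]: top-d principal directions of the
    difference vectors to the k nearest neighbors."""
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]      # skip the point itself
    diffs = X[neighbors] - X[i]
    # Principal directions via SVD of the centered difference vectors.
    _, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0))
    return vt[:d]                               # shape (d, ambient_dim)

T = local_tangent(X, i=0)
# For a circle, the true tangent at x is orthogonal to x itself,
# so this dot product should be close to 0.
print(abs(T[0] @ X[0]))
```

The estimate is only as good as the neighborhood: if the manifold curves appreciably within the span of the k nearest neighbors, the recovered tangent plane is biased, which foreshadows the difficulty discussed below.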
A global coordinate system can then be obtained through an optimization or by
solving a linear system. Figure 17.5 illustrates how a manifold can be tiled by a
large number of locally linear Gaussian-like patches (or “pancakes”, because the
Gaussians are flat in the tangent directions).

Figure 17.4: Non-parametric manifold learning procedures build a nearest-neighbor graph
whose nodes are training examples and whose arcs connect nearest neighbors. Various
procedures can thus obtain the tangent plane associated with a neighborhood of the graph,
as well as a coordinate system that associates each training example with a real-valued
vector position, or embedding. It is possible to generalize such a representation to new
examples by a form of interpolation. So long as the number of examples is large enough
to cover the curvature and twists of the manifold, these approaches work well. Images
from the QMUL Multiview Face Dataset (Gong et al., 2000).
Figure 17.5: If the tangent plane at each location is known, then these planes can be tiled
to form a global coordinate system or a density function. In the figure, each local patch
can be thought of as a local Euclidean coordinate system or as a locally flat Gaussian,
or “pancake”, with a very small variance in the directions orthogonal to the pancake and
a very large variance in the directions defining the coordinate system on the pancake.
The average of all these Gaussians would provide an estimated density function, as in the
Manifold Parzen algorithm (Vincent and Bengio, 2003) or its non-local neural-net based
variant (Bengio et al., 2006c).
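The pancake picture can be sketched as follows; this is a toy two-dimensional illustration in the spirit of Manifold Parzen windows, with a single Gaussian whose tangent direction is assumed known rather than estimated, and all names and variances are illustrative:

```python
import numpy as np

def pancake_logpdf(x, mu, tangent, var_along=0.1, var_across=1e-3):
    """Log-density of one flat ("pancake") Gaussian centered at mu:
    large variance along the tangent, tiny variance across it."""
    u = tangent / np.linalg.norm(tangent)
    diff = x - mu
    a = diff @ u                  # component along the tangent
    b = diff - a * u              # component orthogonal to the manifold
    return (-0.5 * a**2 / var_along
            - 0.5 * (b @ b) / var_across
            - 0.5 * np.log(2 * np.pi * var_along)
            - 0.5 * np.log(2 * np.pi * var_across))

mu = np.array([1.0, 0.0])
tan = np.array([0.0, 1.0])            # tangent of the unit circle at mu
on_manifold = np.array([1.0, 0.05])   # small move along the tangent
off_manifold = np.array([1.05, 0.0])  # same distance, but off the manifold
print(pancake_logpdf(on_manifold, mu, tan) >
      pancake_logpdf(off_manifold, mu, tan))  # True
```

Averaging many such pancakes, one per training example, yields the kind of density estimate the caption describes: high along the manifold, dropping off sharply in the orthogonal directions.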
Figure 17.6: When the data are images, the tangent vectors can also be visualized like
images. Here we show the tangent vector associated with translation: it corresponds to
the difference between an image and a slightly translated version. This basically extracts
edges, with the sign (shown as red or green color) indicating whether ink is being added
or subtracted when moving to the right. The figure illustrates the fact that the tangent
vectors along the manifolds associated with translation, rotation and such apparently
benign transformations of images, can be highly curved: the tangent vector at a position
x is very different from the tangent vector at a nearby position (slightly translated image).
Note how the two tangent vectors shown have almost no overlap, i.e., their dot product
is nearly 0, and they are as far as two such images can be. If the tangent vectors of a
manifold change quickly, it means that the manifold is highly curved. This is going to
make it very difficult to capture such manifolds (with a huge numbers of twists, ups and
downs) with local non-parametric methods such as graph-based non-parametric manifold
learning. This point was made in Bengio and Monperrus (2005).
However, there is a fundamental difficulty with such non-parametric neighborhood-
based approaches to manifold learning, raised in Bengio and Monperrus (2005):
if the manifolds are not very smooth (they have many ups and downs and twists),
one may need a very large number of training examples to cover each one of
these variations, with no chance to generalize to unseen variations. Indeed, these
methods can only generalize the shape of the manifold by interpolating between
neighboring examples. Unfortunately, the manifolds of interest in AI have many
ups and downs and twists and strong curvature, as illustrated in Figure 17.6. This
motivates the use of distributed representations and deep learning for capturing
manifold structure, which is the subject of this chapter.
Figure 17.7: Training examples from a face dataset (the QMUL Multiview Face
Dataset; Gong et al., 2000) for which the subjects were asked to move in such a way as
to cover the two-dimensional manifold corresponding to two angles of rotation. We would
like learning algorithms to be able to discover and disentangle such factors. Figure 17.8
illustrates such a feat.
The hope of many manifold learning algorithms, including those based on deep
learning and auto-encoders, is that one learns an explicit or implicit coordinate
system for the leading factors of variation that explain most of the structure in
the unknown data generating distribution. An example of an explicit coordinate
system is one where the dimensions of the representation (e.g., the outputs of the
encoder, i.e., of the hidden units that compute the “code” associated with the
input) are directly the coordinates that map the unknown manifold. Training
examples of a face dataset in which the images have been arranged visually on a
2-D manifold are shown in Figure 17.7, with the images laid down so that each
of the two axes corresponds to one of the two angles of rotation of the face.
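As a minimal sketch of such an explicit coordinate system, consider a linear auto-encoder with a two-dimensional code trained on synthetic data near a two-dimensional plane; real systems use deep non-linear encoders and decoders, and every name and hyperparameter here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
basis = rng.normal(size=(2, 5))              # true 2-D subspace in 5-D
codes = rng.normal(size=(500, 2))
X = codes @ basis + 0.01 * rng.normal(size=(500, 5))

W = 0.1 * rng.normal(size=(5, 2))            # encoder weights
V = 0.1 * rng.normal(size=(2, 5))            # decoder weights
lr = 0.01
initial_mse = np.mean((X @ W @ V - X) ** 2)
for _ in range(2000):
    H = X @ W                                # 2-D embedding (the "code")
    err = H @ V - X                          # reconstruction error
    gV = H.T @ err / len(X)                  # gradient w.r.t. decoder
    gW = X.T @ (err @ V.T) / len(X)          # gradient w.r.t. encoder
    V -= lr * gV
    W -= lr * gW
final_mse = np.mean((X @ W @ V - X) ** 2)
print(final_mse < 0.1 * initial_mse)         # reconstruction improves
```

After training, the two code dimensions act as coordinates on the underlying plane, in the same way that the two axes in Figure 17.7 correspond to the two angles of rotation of the face.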
However, the objective is to discover such manifolds, and Figure 17.8 illus-
trates the images generated by a variational auto-encoder (Kingma and Welling,
2014a) when the two-dimensional auto-encoder code (representation) is varied