Chapter 12
Applications
In this chapter, we describe how to use deep learning to solve applications in
computer vision, speech recognition, natural language processing, and other areas
of commercial interest. We begin by discussing the large-scale neural network
implementations required for most serious AI applications. Next, we review several
specific application areas to which deep learning has been applied. While one
goal of deep learning is to design algorithms that are capable of solving a broad
variety of tasks, so far some degree of specialization is needed. For example, vision
tasks require processing a large number of input features (pixels) per example.
Language tasks require modeling a large number of possible values (words in the
vocabulary) per input feature.
12.1 Large-Scale Deep Learning
Deep learning is based on the philosophy of connectionism: while an individual
biological neuron or an individual feature in a machine learning model is not
intelligent, a large population of these neurons or features acting together can
exhibit intelligent behavior. It is important to emphasize that the number of neurons must be large. One of the key factors responsible for the improvement in neural networks' accuracy and in the complexity of the tasks they can solve between the 1980s and today is the dramatic increase in the size of the networks we use. As we saw in section 1.2.3, network sizes have
grown exponentially for the past three decades, yet artificial neural networks are
only as large as the nervous systems of insects.
Because the size of neural networks is critical, deep learning requires high-performance hardware and software infrastructure.
12.1.1 Fast CPU Implementations
Traditionally, neural networks were trained using the CPU of a single machine.
Today, this approach is generally considered insufficient. We now mostly use GPU
computing or the CPUs of many machines networked together. Before moving to
these expensive setups, researchers worked hard to demonstrate that CPUs could
not manage the high computational workload required by neural networks.
A description of how to implement efficient numerical CPU code is beyond
the scope of this book, but we emphasize here that careful implementation for
specific CPU families can yield large improvements. For example, in 2011, the best
CPUs available could run neural network workloads faster when using fixed-point
arithmetic rather than floating-point arithmetic. By creating a carefully tuned fixed-
point implementation, Vanhoucke et al. (2011) obtained a threefold speedup over
a strong floating-point system. Each new model of CPU has different performance
characteristics, so sometimes floating-point implementations can be faster too.
The important principle is that careful specialization of numerical computation
routines can yield a large payoff. Other strategies, besides choosing whether to use
fixed or floating point, include optimizing data structures to avoid cache misses
and using vector instructions. Many machine learning researchers neglect these
implementation details, but when the performance of an implementation restricts
the size of the model, the accuracy of the model suffers.
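The payoff from such low-level care can be illustrated even from a high-level language. The sketch below (hypothetical layer sizes; NumPy assumed to be available) contrasts a forward pass written with explicit Python loops against one that defers to a single vectorized matrix multiplication, which an optimized BLAS backs with SIMD instructions and cache blocking.

```python
import time
import numpy as np

# Hypothetical sizes for one fully connected layer.
n_in, n_out, batch = 1024, 1024, 128
W = np.random.randn(n_out, n_in).astype(np.float32)
x = np.random.randn(batch, n_in).astype(np.float32)

def forward_loop(W, x):
    # Explicit Python loops: poor cache behavior and no vector instructions.
    out = np.zeros((x.shape[0], W.shape[0]), dtype=np.float32)
    for b in range(x.shape[0]):
        for i in range(W.shape[0]):
            out[b, i] = np.dot(W[i], x[b])
    return out

def forward_vectorized(W, x):
    # One matrix multiplication dispatched to an optimized BLAS routine.
    return x @ W.T

t0 = time.perf_counter(); forward_loop(W, x)
t1 = time.perf_counter(); forward_vectorized(W, x)
t2 = time.perf_counter()
print(f"looped: {t1 - t0:.3f}s   vectorized: {t2 - t1:.3f}s")
```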
12.1.2 GPU Implementations
Most modern neural network implementations are based on graphics processing
units. Graphics processing units (GPUs) are specialized hardware components
that were originally developed for graphics applications. The consumer market for
video gaming systems spurred development of graphics processing hardware. The
performance characteristics needed for good video gaming systems turn out to be
beneficial for neural networks as well.
Video game rendering requires performing many operations in parallel quickly.
Models of characters and environments are specified via lists of 3-D coordinates of
vertices. Graphics cards must perform matrix multiplication and division on many
vertices in parallel to convert these 3-D coordinates into 2-D on-screen coordinates.
The graphics card must then perform many computations at each pixel in parallel to
determine the color of each pixel. In both cases, the computations are fairly simple
and do not involve much branching compared to the computational workload that
a CPU usually encounters. For example, each vertex in the same rigid object will
be multiplied by the same matrix; there is no need to evaluate an if statement per
vertex to determine which matrix to multiply by. The computations are also entirely
independent of each other, and thus may be parallelized easily. The computations
also involve processing massive buffers of memory, containing bitmaps describing
the texture (color pattern) of each object to be rendered. Together, this results in
graphics cards having been designed to have a high degree of parallelism and high
memory bandwidth, at the cost of having a lower clock speed and less branching
capability relative to traditional CPUs.
Neural network algorithms require the same performance characteristics as the
real-time graphics algorithms described above. Neural networks usually involve
large and numerous buffers of parameters, activation values, and gradient values,
each of which must be completely updated during every step of training. These
buffers are large enough to fall outside the cache of a traditional desktop computer,
so the memory bandwidth of the system often becomes the rate-limiting factor.
GPUs offer a compelling advantage over CPUs because of their high memory
bandwidth. Neural network training algorithms typically do not involve much
branching or sophisticated control, so they are appropriate for GPU hardware.
Since neural networks can be divided into multiple individual “neurons” that can be
processed independently from the other neurons in the same layer, neural networks
easily benefit from the parallelism of GPU computing.
GPU hardware was originally so specialized that it could be used only for
graphics tasks. Over time, GPU hardware became more flexible, allowing custom
subroutines to be used to transform the coordinates of vertices or to assign colors
to pixels. In principle, there was no requirement that these pixel values actually
be based on a rendering task. These GPUs could be used for scientific computing
by writing the output of a computation to a buffer of pixel values. Steinkrau et al.
(2005) implemented a two-layer fully connected neural network on a GPU and
reported a three-times speedup over their CPU-based baseline. Shortly thereafter,
Chellapilla et al. (2006) demonstrated that the same technique could be used to
accelerate supervised convolutional networks.
The popularity of graphics cards for neural network training exploded after the advent of general purpose GPUs. These GP-GPUs could execute arbitrary code, not just rendering subroutines. NVIDIA's CUDA programming language
provided a way to write this arbitrary code in a C-like language. With their
relatively convenient programming model, massive parallelism, and high memory
bandwidth, GP-GPUs now offer an ideal platform for neural network programming.
This platform was rapidly adopted by deep learning researchers soon after it became
available (Raina et al., 2009; Ciresan et al., 2010).
Writing efficient code for GP-GPUs remains a difficult task best left to special-
ists. The techniques required to obtain good performance on GPU are very different
from those used on CPU. For example, good CPU-based code is usually designed
to read information from the cache as much as possible. On GPU, most writable
memory locations are not cached, so it can actually be faster to compute the same
value twice, rather than compute it once and read it back from memory. GPU code
is also inherently multithreaded and the different threads must be coordinated
with each other carefully. For example, memory operations are faster if they can be coalesced. Coalesced reads or writes occur when several threads can each read or write a value that they need simultaneously, as part of a single memory transaction. Different models of GPUs are able to coalesce different kinds of read patterns and different kinds of write patterns. Typically, memory operations are easier to coalesce if among n threads, thread i accesses byte i + j of memory, and j is a multiple of some power of 2. The exact specifications differ between models of GPU. Another common consideration for GPUs is making sure that each thread in a group executes the same instruction simultaneously. This means that branching can be difficult on GPU. Threads are divided into small groups called warps. Each thread in a warp executes the same instruction during each cycle, so if different threads within the same warp need to execute different code paths, these different code paths must be traversed sequentially rather than in parallel.
Because of the difficulty of writing high-performance GPU code, researchers
should structure their workflow to avoid needing to write new GPU code to test
new models or algorithms. Typically, one can do this by building a software library
of high-performance operations like convolution and matrix multiplication, then
specifying models in terms of calls to this library of operations. For example,
the machine learning library Pylearn2 (Goodfellow et al., 2013c) specifies all its
machine learning algorithms in terms of calls to Theano (Bergstra et al., 2010;
Bastien et al., 2012) and cuda-convnet (Krizhevsky, 2010), which provide these
high-performance operations. This factored approach can also ease support for
multiple kinds of hardware. For example, the same Theano program can run on
either CPU or GPU, without needing to change any of the calls to Theano itself.
Other libraries like TensorFlow (Abadi et al., 2015) and Torch (Collobert et al.,
2011b) provide similar features.
12.1.3 Large-Scale Distributed Implementations
In many cases, the computational resources available on a single machine are
insufficient. We therefore want to distribute the workload of training and inference
across many machines.
Distributing inference is simple, because each input example we want to process
can be run by a separate machine. This is known as data parallelism.
It is also possible to get model parallelism, where multiple machines work together on a single data point, with each machine running a different part of the model. This is feasible for both inference and training.
Data parallelism during training is somewhat harder. We can increase the size
of the minibatch used for a single SGD step, but usually we get less than linear
returns in terms of optimization performance. It would be better to allow multiple
machines to compute multiple gradient descent steps in parallel. Unfortunately, the standard definition of gradient descent is as a completely sequential algorithm: the gradient at step t is a function of the parameters produced by step t − 1.

This can be solved using asynchronous stochastic gradient descent (Bengio et al., 2001; Recht et al., 2011). In this approach, several processor cores share the memory representing the parameters. Each core reads parameters without a lock, then computes a gradient, then increments the parameters without a lock.
This reduces the average amount of improvement that each gradient descent step
yields, because some of the cores overwrite each other’s progress, but the increased
rate of production of steps causes the learning process to be faster overall. Dean et al. (2012) pioneered the multimachine implementation of this lock-free approach to gradient descent, where the parameters are managed by a parameter server rather than stored in shared memory. Distributed asynchronous gradient descent
remains the primary strategy for training large deep networks and is used by
most major deep learning groups in industry (Chilimbi et al., 2014; Wu et al.,
2015). Academic deep learning researchers typically cannot afford the same scale
of distributed learning systems, but some research has focused on how to build
distributed networks with relatively low-cost hardware available in the university
setting (Coates et al., 2013).
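The single-machine, shared-memory version of this idea can be sketched in a few lines. The example below is a conceptual illustration only (Python threads are serialized by the interpreter lock, and the problem is a toy linear regression), but it shows the lock-free pattern: every worker reads and updates the same parameter vector without any synchronization.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
n_features = 10
true_w = rng.normal(size=n_features)
X = rng.normal(size=(10_000, n_features))
y = X @ true_w + 0.01 * rng.normal(size=10_000)

w = np.zeros(n_features)   # shared parameters, read and written without a lock
lr = 0.01

def worker(seed, n_steps):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = local_rng.integers(len(X))          # pick one training example
        grad = (X[i] @ w - y[i]) * X[i]         # gradient of the squared error
        w[:] = w - lr * grad                    # unlocked update of shared memory

threads = [threading.Thread(target=worker, args=(s, 5_000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance from true parameters:", np.linalg.norm(w - true_w))
```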
12.1.4 Model Compression
In many commercial applications, it is much more important that the time and
memory cost of running inference in a machine learning model be low than that
the time and memory cost of training be low. For applications that do not require
personalization, it is possible to train a model once, then deploy it to be used by
billions of users. In many cases, the end user is more resource constrained than
the developer. For example, one might train a speech recognition network with a
powerful computer cluster, then deploy it on mobile phones.
A key strategy for reducing the cost of inference is model compression (Buciluǎ et al., 2006). The basic idea of model compression is to replace the original, expensive model with a smaller model that requires less memory and runtime to store and evaluate.
Model compression is applicable when the size of the original model is driven
primarily by a need to prevent overfitting. In most cases, the model with the
lowest generalization error is an ensemble of several independently trained models.
Evaluating all n ensemble members is expensive. Sometimes, even a single model generalizes better if it is large (for example, if it is regularized with dropout).

These large models learn some function f(x), but do so using many more parameters than are necessary for the task. Their size is necessary only because of the limited number of training examples. As soon as we have fit this function f(x), we can generate a training set containing infinitely many examples, simply by applying f to randomly sampled points x. We then train the new, smaller model to match f(x) on these points. To most efficiently use the capacity of the new, small model, it is best to sample the new x points from a distribution resembling the actual test inputs that will be supplied to the model later. This can be done by corrupting training examples or by drawing points from a generative model trained on the original training set.

Alternatively, one can train the smaller model only on the original training points, but train it to copy other features of the model, such as its posterior distribution over the incorrect classes (Hinton et al., 2014, 2015).
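A minimal sketch of this procedure is given below. The "large model" here is a stand-in one-hidden-layer network with fixed random weights (not a trained ensemble), and the "small model" is a linear map fit by least squares; the point is only to show the workflow: label cheaply generated inputs with the large model, then fit the smaller model to those labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 20, 512

# Stand-in for the expensive teacher model f(x): a fixed random one-hidden-layer net.
W1 = rng.normal(size=(n_in, n_hidden))
W2 = rng.normal(size=(n_hidden, 1))

def large_model(x):
    return np.tanh(x @ W1) @ W2

# Original training inputs, plus extra points sampled near them to mimic the
# distribution of inputs the deployed model will actually see.
X_train = rng.normal(size=(200, n_in))
X_extra = X_train[rng.integers(200, size=5000)] + 0.1 * rng.normal(size=(5000, n_in))
X_all = np.vstack([X_train, X_extra])

# Label every point with the teacher, then fit the small student model to match.
targets = large_model(X_all)
W_small, *_ = np.linalg.lstsq(X_all, targets, rcond=None)

X_test = rng.normal(size=(100, n_in))
err = np.mean((X_test @ W_small - large_model(X_test)) ** 2)
print("student-teacher mean squared mismatch:", err)
```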
12.1.5 Dynamic Structure
One strategy for accelerating data-processing systems in general is to build systems that have dynamic structure in the graph describing the computation needed to process an input. Data-processing systems can dynamically determine which subset of many neural networks should be run on a given input. Individual neural networks can also exhibit dynamic structure internally by determining which subset of features (hidden units) to compute given information from the input. This form of dynamic structure inside neural networks is sometimes called conditional computation (Bengio, 2013; Bengio et al., 2013b). Since many components of the architecture may be relevant only for a small number of possible inputs, the
system can run faster by computing these features only when they are needed.
Dynamic structure of computations is a basic computer science principle applied
generally throughout the software engineering discipline. The simplest versions
of dynamic structure applied to neural networks are based on determining which
subset of some group of neural networks (or other machine learning models) should
be applied to a particular input.
A venerable strategy for accelerating inference in a classifier is to use a cascade of classifiers. The cascade strategy may be applied when the goal is to detect the presence of a rare object (or event). To know for sure that the object is present,
presence of a rare object (or event). To know for sure that the object is present,
we must use a sophisticated classifier with high capacity, which is expensive to run.
Because the object is rare, however, we can usually use much less computation
to reject inputs as not containing the object. In these situations, we can train a
sequence of classifiers. The first classifiers in the sequence have low capacity and
are trained to have high recall. In other words, they are trained to make sure
we do not wrongly reject an input when the object is present. The final classifier
is trained to have high precision. At test time, we run inference by running the
classifiers in a sequence, abandoning any example as soon as any one element in
the cascade rejects it. Overall, this allows us to verify the presence of objects with
high confidence, using a high capacity model, but does not force us to pay the cost
of full inference for every example. There are two different ways that the cascade
can achieve high capacity. One way is to make the later members of the cascade
individually have high capacity. In this case, the system as a whole obviously has
high capacity, because some of its individual members do. It is also possible to
make a cascade in which every individual model has low capacity but the system
as a whole has high capacity as a result of the combination of many small models.
Viola and Jones (2001) used a cascade of boosted decision trees to implement a
fast and robust face detector suitable for use in handheld digital cameras. Their
classifier localizes a face using essentially a sliding window approach in which many
windows are examined and rejected if they do not contain faces. Another version of
cascades uses the earlier models to implement a sort of hard attention mechanism:
the early members of the cascade localize an object, and later members of the
cascade perform further processing given the location of the object. For example,
Google transcribes address numbers from Street View imagery using a two-step
cascade that first locates the address number with one machine learning model
and then transcribes it with another (Goodfellow et al., 2014d).
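The control flow of cascade inference is simple enough to sketch directly. In the toy example below, the stage classifiers are hypothetical callables that return a score in [0, 1]; in a real detector the early stages would be cheap, high-recall models and the final stage an expensive, high-precision one.

```python
import numpy as np

def cascade_predict(x, stages, thresholds):
    """Return True only if every stage in the cascade accepts the input x."""
    for classifier, threshold in zip(stages, thresholds):
        if classifier(x) < threshold:
            return False          # rejected early: later, costlier stages never run
    return True

# Toy stand-in stages: progressively sharper (and, in a real system, costlier) scorers.
rng = np.random.default_rng(0)
stages = [
    lambda x: 1.0 / (1.0 + np.exp(-x.sum())),            # cheap, coarse screen
    lambda x: 1.0 / (1.0 + np.exp(-(x ** 2).sum() + 5)),  # more expensive final check
]
thresholds = [0.2, 0.8]   # early stages use low thresholds to keep recall high

x = rng.normal(size=8)
print("object detected:", cascade_predict(x, stages, thresholds))
```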
Decision trees themselves are an example of dynamic structure, because each
node in the tree determines which of its subtrees should be evaluated for each input.
A simple way to accomplish the union of deep learning and dynamic structure
is to train a decision tree in which each node uses a neural network to make the
splitting decision (Guo and Gelfand, 1992), though this has typically not been
done with the primary goal of accelerating inference computations.
In the same spirit, one can use a neural network called the gater to select which one out of several expert networks will be used to compute the output, given the current input. The first version of this idea is called the mixture of experts (Nowlan, 1990; Jacobs et al., 1991), in which the gater outputs a set of probabilities or weights (obtained via a softmax nonlinearity), one per expert, and the final output is obtained by the weighted combination of the output of the experts. In that case, the use of the gater does not offer a reduction in computational cost, but if a single expert is chosen by the gater for each example, we obtain the hard mixture of experts (Collobert et al., 2001, 2002), which
can considerably accelerate training and inference time. This strategy works well
when the number of gating decisions is small because it is not combinatorial. But
when we want to select different subsets of units or parameters, it is not possible
to use a “soft switch” because it requires enumerating (and computing outputs for)
all the gater configurations. To deal with this problem, several approaches have
been explored to train combinatorial gaters. Bengio et al. (2013b) experiment with
several estimators of the gradient on the gating probabilities, while Bacon et al.
(2015) and Bengio et al. (2015a) use reinforcement learning techniques (policy
gradient) to learn a form of conditional dropout on blocks of hidden units and get
an actual reduction in computational cost without negatively affecting the quality
of the approximation.
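The difference between the soft and the hard mixture can be seen in a small sketch. The experts below are hypothetical linear maps and the gater is a random linear-softmax layer; the soft version evaluates every expert, while the hard version evaluates only the most probable one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_experts = 16, 4, 3

experts = [rng.normal(size=(n_in, n_out)) for _ in range(n_experts)]
W_gate = rng.normal(size=(n_in, n_experts))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_mixture(x):
    weights = softmax(x @ W_gate)                 # one weight per expert
    return sum(w * (x @ E) for w, E in zip(weights, experts))

def hard_mixture(x):
    weights = softmax(x @ W_gate)
    k = int(np.argmax(weights))                   # pick one expert only
    return x @ experts[k]                         # the other experts never run

x = rng.normal(size=n_in)
print("soft output:", soft_mixture(x))
print("hard output:", hard_mixture(x))
```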
Another kind of dynamic structure is a switch, where a hidden unit can
receive input from different units depending on the context. This dynamic routing
approach can be interpreted as an attention mechanism (Olshausen et al., 1993).
So far, the use of a hard switch has not proven effective on large-scale applications.
Contemporary approaches instead use a weighted average over many possible
inputs, and thus do not achieve all the possible computational benefits of dynamic
structure. Contemporary attention mechanisms are described in section 12.4.5.1.
One major obstacle to using dynamically structured systems is the decreased
degree of parallelism that results from the system following different code branches
for different inputs. This means that few operations in the network can be described
as matrix multiplication or batch convolution on a minibatch of examples. We
can write more specialized subroutines that convolve each example with different
kernels or multiply each row of a design matrix by a different set of columns
of weights. Unfortunately, these more specialized subroutines are difficult to
implement efficiently. CPU implementations will be slow as a result of the lack
of cache coherence, and GPU implementations will be slow because of the lack of
coalesced memory transactions and the need to serialize warps when members of
a warp take different branches. In some cases, these issues can be mitigated by
partitioning the examples into groups that all take the same branch, then processing
these groups of examples simultaneously. This can be an acceptable strategy for
minimizing the time required to process a fixed number of examples in an offline
setting. In a real-time setting where examples must be processed continuously,
partitioning the workload can result in load-balancing issues. For example, if we
assign one machine to process the first step in a cascade and another machine to
process the last step in a cascade, then the first will tend to be overloaded, and the
last will tend to be underloaded. Similar issues arise if each machine is assigned to
implement different nodes of a neural decision tree.
12.1.6 Specialized Hardware Implementations of Deep Networks
Since the early days of neural networks research, hardware designers have worked
on specialized hardware implementations that could speed up training and/or
inference of neural network algorithms. See Lindsey and Lindblad (1994), Beiu et al. (2003), and Misra and Saha (2010) for early and more recent reviews of specialized hardware for deep networks.
Different forms of specialized hardware (Graf and Jackel, 1989; Mead and Ismail,
2012; Kim et al., 2009; Pham et al., 2012; Chen et al., 2014a,b) have been developed
over the last decades with ASICs (application-specific integrated circuits), either
with digital (based on binary representations of numbers), analog (Graf and Jackel,
1989; Mead and Ismail, 2012) (based on physical implementations of continuous
values as voltages or currents), or hybrid implementations (combining digital and
analog components). In recent years more flexible FPGA (field programmable gate array) implementations (where the particulars of the circuit can be written
on the chip after it has been built) have been developed.
Though software implementations on general-purpose processing units (CPUs
and GPUs) typically use 32 or 64 bits of precision to represent floating-point
numbers, it has long been known that it was possible to use less precision, at
least at inference time (Holt and Baker, 1991; Holi and Hwang, 1993; Presley
and Haggard, 1994; Simard and Graf, 1994; Wawrzynek et al., 1996; Savich et al.,
2007). This has become a more pressing issue in recent years as deep learning
has gained in popularity in industrial products, and as the great impact of faster
hardware was demonstrated with GPUs. Another factor that motivates current
research on specialized hardware for deep networks is that the rate of progress of
a single CPU or GPU core has slowed down, and most recent improvements in
computing speed have come from parallelization across cores (either in CPUs or
GPUs). This is very different from the situation of the 1990s (the previous neural
network era), when the hardware implementations of neural networks (which might
take two years from inception to availability of a chip) could not keep up with
the rapid progress and low prices of general-purpose CPUs. Building specialized
hardware is thus a way to push the envelope further, at a time when new hardware
designs are being developed for low-power devices such as phones, aiming for
general-public applications of deep learning (e.g., with speech, computer vision or
natural language).
Recent work on low-precision implementations of backprop-based neural nets
(Vanhoucke et al., 2011; Courbariaux et al., 2015; Gupta et al., 2015) suggests
that between 8 and 16 bits of precision can suffice for using or training deep
neural networks with back-propagation. What is clear is that more precision is
required during training than at inference time, and that some forms of dynamic
fixed-point representation of numbers can be used to reduce how many bits are
required per number. Traditional fixed-point numbers are restricted to a fixed
range (which corresponds to a given exponent in a floating-point representation).
Dynamic fixed-point representations share that range among a set of numbers
(such as all the weights in one layer). Using fixed-point rather than floating-point
representations and using fewer bits per number reduces the hardware surface area,
power requirements, and computing time needed for performing multiplications,
and multiplications are the most demanding of the operations needed to use or
train a modern deep network with backprop.
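A minimal sketch of the shared-scale idea follows, with assumptions: 8-bit signed storage, one scale per layer chosen from the largest absolute weight, and no treatment of activations or of training-time quantization.

```python
import numpy as np

def quantize_layer(weights, n_bits=8):
    """Map float weights to shared-scale integers; return integers and the scale."""
    max_int = 2 ** (n_bits - 1) - 1                 # e.g. 127 for 8 bits
    scale = np.abs(weights).max() / max_int         # one scale shared by the whole layer
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize_layer(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)

q, scale = quantize_layer(W)
W_hat = dequantize_layer(q, scale)
print("max absolute quantization error:", np.abs(W - W_hat).max())
print("storage: %d bytes (int8) vs %d bytes (float32)" % (q.nbytes, W.nbytes))
```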
12.2 Computer Vision
Computer vision has traditionally been one of the most active research areas for
deep learning applications, because vision is a task that is effortless for humans
and many animals but challenging for computers (Ballard et al., 1983). Many of
the most popular standard benchmark tasks for deep learning algorithms are forms
of object recognition or optical character recognition.
Computer vision is a very broad field encompassing a wide variety of ways to
process images and an amazing diversity of applications. Applications of computer
vision range from reproducing human visual abilities, such as recognizing faces,
to creating entirely new categories of visual abilities. As an example of the latter
category, one recent computer vision application is to recognize sound waves
from the vibrations they induce in objects visible in a video (Davis et al., 2014).
Most deep learning research on computer vision has focused not on such exotic
applications that expand the realm of what is possible with imagery but rather
on a small core of AI goals aimed at replicating human abilities. Most deep
learning for computer vision is used for object recognition or detection of some
form, whether this means reporting which object is present in an image, annotating
an image with bounding boxes around each object, transcribing a sequence of
symbols from an image, or labeling each pixel in an image with the identity of the
object it belongs to. Because generative modeling has been a guiding principle
of deep learning research, there is also a large body of work on image synthesis
using deep models. While image synthesis ex nihilo is usually not considered a
computer vision endeavor, models capable of image synthesis are usually useful for
image restoration, a computer vision task involving repairing defects in images or
removing objects from images.
12.2.1 Preprocessing
Many application areas require sophisticated preprocessing because the original
input comes in a form that is difficult for many deep learning architectures to
represent. Computer vision usually requires relatively little of this kind of pre-
processing. The images should be standardized so that their pixels all lie in the
same reasonable range, like [0,1] or [-1, 1]. Mixing images that lie in [0,1] with
images that lie in [0, 255] will usually result in failure. Formatting images to have
the same scale is the only kind of preprocessing that is strictly necessary. Many
computer vision architectures require images of a standard size, so images must be
cropped or scaled to fit that size. Even this rescaling is not always strictly necessary.
Some convolutional models accept variably sized inputs and dynamically adjust
the size of their pooling regions to keep the output size constant (Waibel et al.,
1989). Other convolutional models have variably sized output that automatically
scales in size with the input, such as models that denoise or label each pixel in an
image (Hadsell et al., 2007).
Dataset augmentation may be seen as a way of preprocessing the training set
only. Dataset augmentation is an excellent way to reduce the generalization error
of most computer vision models. A related idea applicable at test time is to show
the model many different versions of the same input (for example, the same image
cropped at slightly different locations) and have the different instantiations of the
model vote to determine the output. This latter idea can be interpreted as an
ensemble approach, and it helps to reduce generalization error.
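The voting idea at test time can be sketched as follows, with a hypothetical `model` callable standing in for a trained classifier that maps an image crop to class probabilities.

```python
import numpy as np

def multi_crop_predict(model, image, crop_size, offsets):
    """Average the model's predictions over several crops of one image."""
    probs = []
    for dy, dx in offsets:
        crop = image[dy:dy + crop_size, dx:dx + crop_size]
        probs.append(model(crop))
    return np.mean(probs, axis=0)

# Toy stand-in model: "probabilities" derived from mean pixel intensity.
def toy_model(crop):
    p = np.clip(crop.mean(), 0.0, 1.0)
    return np.array([p, 1.0 - p])

rng = np.random.default_rng(0)
image = rng.random((40, 40))
offsets = [(0, 0), (0, 8), (8, 0), (8, 8), (4, 4)]
print("averaged prediction:", multi_crop_predict(toy_model, image, 32, offsets))
```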
Other kinds of preprocessing are applied to both the training and the test set
with the goal of putting each example into a more canonical form to reduce the
amount of variation that the model needs to account for. Reducing the amount of
variation in the data can reduce both generalization error and the size of the model
needed to fit the training set. Simpler tasks may be solved by smaller models, and
simpler solutions are more likely to generalize well. Preprocessing of this kind is
usually designed to remove some kind of variability in the input data that is easy
for a human designer to describe and that the human designer is confident has no
relevance to the task. When training with large datasets and large models, this
kind of preprocessing is often unnecessary, and it is best to just let the model learn
which kinds of variability it should become invariant to. For example, the AlexNet
system for classifying ImageNet has only one preprocessing step: subtracting the
mean across training examples of each pixel (Krizhevsky et al., 2012).
12.2.1.1 Contrast Normalization
One of the most obvious sources of variation that can be safely removed for
many tasks is the amount of contrast in the image. Contrast simply refers to the
magnitude of the difference between the bright and the dark pixels in an image.
There are many ways of quantifying the contrast of an image. In the context of
deep learning, contrast usually refers to the standard deviation of the pixels in an
image or region of an image. Suppose we have an image represented by a tensor X ∈ ℝ^{r×c×3}, with X_{i,j,1} being the red intensity at row i and column j, X_{i,j,2} giving the green intensity, and X_{i,j,3} giving the blue intensity. Then the contrast of the entire image is given by

$$\sqrt{\frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3}\left(X_{i,j,k}-\bar{X}\right)^{2}}, \qquad (12.1)$$

where X̄ is the mean intensity of the entire image:

$$\bar{X} = \frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3} X_{i,j,k}. \qquad (12.2)$$
Global contrast normalization (GCN) aims to prevent images from having varying amounts of contrast by subtracting the mean from each image, then rescaling it so that the standard deviation across its pixels is equal to some constant s. This approach is complicated by the fact that no scaling factor can change the contrast of a zero-contrast image (one whose pixels all have equal
change the contrast of a zero-contrast image (one whose pixels all have equal
intensity). Images with very low but nonzero contrast often have little information
content. Dividing by the true standard deviation usually accomplishes nothing
more than amplifying sensor noise or compression artifacts in such cases. This
Figure 12.1: GCN maps examples onto a sphere. (Left) Raw input data may have any norm. (Center) GCN with λ = 0 maps all nonzero examples perfectly onto a sphere. Here we use s = 1 and ε = 10^−8. Because we use GCN based on normalizing the standard deviation rather than the L2 norm, the resulting sphere is not the unit sphere. (Right) Regularized GCN, with λ > 0, draws examples toward the sphere but does not completely discard the variation in their norm. We leave s and ε the same as before.
motivates introducing a small positive regularization parameter λ to bias the estimate of the standard deviation. Alternately, one can constrain the denominator to be at least ε. Given an input image X, GCN produces an output image X′, defined such that

$$X'_{i,j,k} = s\,\frac{X_{i,j,k}-\bar{X}}{\max\left\{\epsilon,\ \sqrt{\lambda + \frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3}\left(X_{i,j,k}-\bar{X}\right)^{2}}\right\}}. \qquad (12.3)$$
Datasets consisting of large images cropped to interesting objects are unlikely to contain any images with nearly constant intensity. In these cases, it is safe to practically ignore the small denominator problem by setting λ = 0 and avoid division by 0 in extremely rare cases by setting ε to an extremely low value like 10^−8. This is the approach used by Goodfellow et al. (2013a) on the CIFAR-10 dataset. Small images cropped randomly are more likely to have nearly constant intensity, making aggressive regularization more useful. Coates et al. (2011) used ε = 0 and λ = 10 on small, randomly selected patches drawn from CIFAR-10.

The scale parameter s can usually be set to 1, as done by Coates et al. (2011), or chosen to make each individual pixel have standard deviation across examples close to 1, as done by Goodfellow et al. (2013a).
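Equation 12.3 translates directly into a few lines of NumPy. The sketch below assumes an image stored as an (r, c, 3) array and exposes s, λ, and ε as arguments.

```python
import numpy as np

def global_contrast_normalize(X, s=1.0, lam=0.0, eps=1e-8):
    X = X.astype(np.float64)
    X_mean = X.mean()
    X_centered = X - X_mean
    # Contrast of the image: (regularized) standard deviation over all pixels and channels.
    contrast = np.sqrt(lam + np.mean(X_centered ** 2))
    return s * X_centered / max(contrast, eps)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3))
normalized = global_contrast_normalize(image, s=1.0, lam=10.0)
print("mean:", normalized.mean(), "std:", normalized.std())
```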
The standard deviation in equation 12.3 is just a rescaling of the L2 norm of the image (assuming the mean of the image has already been removed). It is preferable to define GCN in terms of standard deviation rather than L2 norm because the standard deviation includes division by the number of pixels, so GCN based on standard deviation allows the same s to be used regardless of image size. However, the observation that the L2 norm is proportional to the standard
deviation can help build a useful intuition. One can understand GCN as mapping
examples to a spherical shell. See figure 12.1 for an illustration. This can be a
useful property because neural networks are often better at responding to directions
in space rather than to exact locations. Responding to multiple distances in the
same direction requires hidden units with collinear weight vectors but different
biases. Such coordination can be difficult for the learning algorithm to discover.
Additionally, many shallow graphical models have problems with representing
multiple separated modes along the same line. GCN avoids these problems by
reducing each example to a direction rather than a direction and a distance.
Counterintuitively, there is a preprocessing operation known as sphering that
is not the same operation as GCN. Sphering does not refer to making the data
lie on a spherical shell, but rather refers to rescaling the principal components to
have equal variance, so that the multivariate normal distribution used by PCA has
spherical contours. Sphering is more commonly known as whitening.
Global contrast normalization will often fail to highlight image features we
would like to have stand out, such as edges and corners. If we have a scene with a
large dark area and a large bright area (such as a city square with half the image in
the shadow of a building), then global contrast normalization will ensure that there
is a large difference between the brightness of the dark area and the brightness
of the light area. It will not, however, ensure that edges within the dark region
stand out.
This motivates local contrast normalization. Local contrast normalization
ensures that the contrast is normalized across each small window, rather than over
the image as a whole. See figure 12.2 for a comparison of global and local contrast
normalization.
Various definitions of local contrast normalization are possible. In all cases,
one modifies each pixel by subtracting a mean of nearby pixels and dividing by
a standard deviation of nearby pixels. In some cases, this is literally the mean
and standard deviation of all pixels in a rectangular window centered on the
pixel to be modified (Pinto et al., 2008). In other cases, this is a weighted mean
and weighted standard deviation using Gaussian weights centered on the pixel to
be modified. In the case of color images, some strategies process different color
channels separately while others combine information from different channels to
normalize each pixel (Sermanet et al., 2012).

Figure 12.2: A comparison of global and local contrast normalization. Visually, the effects of global contrast normalization are subtle. It places all images on roughly the same scale, which reduces the burden on the learning algorithm to handle multiple scales. Local contrast normalization modifies the image much more, discarding all regions of constant intensity. This allows the model to focus on just the edges. Regions of fine texture, such as the houses in the second row, may lose some detail due to the bandwidth of the normalization kernel being too high. (Panels: input image, GCN, LCN.)
Local contrast normalization can usually be implemented efficiently by using
separable convolution (see section 9.8) to compute feature maps of local means and
local standard deviations, then using element-wise subtraction and element-wise
division on different feature maps.
Local contrast normalization is a differentiable operation and can also be used as
a nonlinearity applied to the hidden layers of a network, as well as a preprocessing
operation applied to the input.
As with global contrast normalization, we typically need to regularize local
contrast normalization to avoid division by zero. In fact, because local contrast
normalization typically acts on smaller windows, it is even more important to
regularize. Smaller windows are more likely to contain values that are all nearly
the same as each other, and thus more likely to have zero standard deviation.
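A minimal sketch of one common variant follows (Gaussian-weighted local mean and standard deviation, on a single-channel image, assuming SciPy is available for the separable Gaussian filter).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(image, sigma=4.0, eps=1e-4):
    image = image.astype(np.float64)
    local_mean = gaussian_filter(image, sigma)          # separable Gaussian blur
    centered = image - local_mean
    local_var = gaussian_filter(centered ** 2, sigma)   # local second moment
    local_std = np.sqrt(local_var)
    # Regularize so near-constant regions do not blow up to huge values.
    return centered / np.maximum(local_std, eps)

rng = np.random.default_rng(0)
image = rng.random((64, 64))
out = local_contrast_normalize(image)
print(out.shape, out.mean(), out.std())
```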
12.2.1.2 Dataset Augmentation
As described in section 7.4, it is easy to improve the generalization of a classifier
by increasing the size of the training set by adding extra copies of the training
examples that have been modified with transformations that do not change the
class. Object recognition is a classification task that is especially amenable to
this form of dataset augmentation because the class is invariant to so many
transformations and the input can be easily transformed with many geometric
operations. As described before, classifiers can benefit from random translations,
rotations, and in some cases, flips of the input to augment the dataset. In specialized
computer vision applications, more advanced transformations are commonly used
for dataset augmentation. These schemes include random perturbation of the
colors in an image (Krizhevsky et al., 2012) and nonlinear geometric distortions of
the input (LeCun et al., 1998b).
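A minimal sketch of such an augmentation pipeline, applied on the fly each time an example is drawn (the specific transformations and their magnitudes are illustrative, not taken from any particular system):

```python
import numpy as np

def augment(image, rng, crop_size=28):
    # Random horizontal flip: the class label is unchanged.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Random crop at a slightly different location each time.
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    image = image[top:top + crop_size, left:left + crop_size, :]
    # Random color perturbation: rescale each channel slightly.
    image = image * (1.0 + 0.1 * rng.normal(size=(1, 1, 3)))
    return np.clip(image, 0.0, 1.0)

rng = np.random.default_rng(0)
original = rng.random((32, 32, 3))
augmented = augment(original, rng)
print(augmented.shape)
```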
12.3 Speech Recognition
The task of speech recognition is to map an acoustic signal containing a spoken
natural language utterance into the corresponding sequence of words intended by
the speaker. Let
X
= (
x
(1)
, x
(2)
, . . . , x
(T )
) denote the sequence of acoustic input
vectors (traditionally produced by splitting the audio into 20ms frames). Most
speech recognition systems preprocess the input using specialized hand-designed
features, but some (Jaitly and Hinton, 2011) deep learning systems learn features
from raw input. Let
y
= (
y
1
, y
2
, . . . , y
N
) denote the target output sequence (usually
a sequence of words or characters). The
automatic speech recognition
(ASR)
task consists of creating a function
f
ASR
that computes the most probable linguistic
sequence y given the acoustic sequence X:
f
ASR
(X) = arg max
y
P
(y | X = X), (12.4)
where
P
is the true conditional distribution relating the inputs
X
to the targets
y
.
Since the 1980s and until about 2009–2012, state-of-the-art speech recognition
systems primarily combined hidden Markov models (HMMs) and Gaussian mixture
models (GMMs). GMMs modeled the association between acoustic features and
phonemes (Bahl et al., 1987), while HMMs modeled the sequence of phonemes.
The GMM-HMM model family treats acoustic waveforms as being generated
by the following process: first an HMM generates a sequence of phonemes and
discrete subphonemic states (such as the beginning, middle, and end of each
phoneme), then a GMM transforms each discrete symbol into a brief segment of
audio waveform. Although GMM-HMM systems dominated ASR until recently,
speech recognition was actually one of the first areas where neural networks were
applied, and numerous ASR systems from the late 1980s and early 1990s used
neural nets (Bourlard and Wellekens, 1989; Waibel et al., 1989; Robinson and
Fallside, 1991; Bengio et al., 1991, 1992; Konig et al., 1996). At the time, the
performance of ASR based on neural nets approximately matched the performance
of GMM-HMM systems. For example, Robinson and Fallside (1991) achieved 26
percent phoneme error rate on the TIMIT (Garofolo et al., 1993) corpus (with
39 phonemes to discriminate among), which was better than or comparable to
HMM-based systems. Since then, TIMIT has been a benchmark for phoneme
recognition, playing a role similar to the role MNIST plays for object recognition.
Nonetheless, because of the complex engineering involved in software systems for
speech recognition and the effort that had been invested in building these systems
on the basis of GMM-HMMs, the industry did not see a compelling argument
for switching to neural networks. As a consequence, until the late 2000s, both
academic and industrial research in using neural nets for speech recognition mostly
focused on using neural nets to learn extra features for GMM-HMM systems.
Later, with much larger and deeper models and much larger datasets, recognition
accuracy was dramatically improved by using neural networks to replace GMMs
for the task of associating acoustic features to phonemes (or subphonemic states).
Starting in 2009, speech researchers applied a form of deep learning based on
unsupervised learning to speech recognition. This approach to deep learning was
based on training undirected probabilistic models called restricted Boltzmann
machines (RBMs) to model the input data. RBMs are described in part III.
To solve speech recognition tasks, unsupervised pretraining was used to build
deep feedforward networks whose layers were each initialized by training an RBM.
These networks take spectral acoustic representations in a fixed-size input window
(around a center frame) and predict the conditional probabilities of HMM states
for that center frame. Training such deep networks helped to significantly improve
the recognition rate on TIMIT (Mohamed et al., 2009, 2012a), bringing down the
phoneme error rate from about 26 percent to 20.7 percent. See Mohamed et al.
(2012b) for an analysis of reasons for the success of these models. Extensions to the
basic phone recognition pipeline included the addition of speaker-adaptive features
(Mohamed et al., 2011) that further reduced the error rate. This was quickly
followed up by work to expand the architecture from phoneme recognition (which
is what TIMIT is focused on) to large-vocabulary speech recognition (Dahl et al.,
2012), which involves not just recognizing phonemes but also recognizing sequences
of words from a large vocabulary. Deep networks for speech recognition eventually
shifted from being based on pretraining and Boltzmann machines to being based
on techniques such as rectified linear units and dropout (Zeiler et al., 2013; Dahl
et al., 2013). By that time, several of the major speech groups in industry had
started exploring deep learning in collaboration with academic researchers. Hinton
et al. (2012a) describe the breakthroughs achieved by these collaborators, which
are now deployed in products such as mobile phones.
Later, as these groups explored larger and larger labeled datasets and incorpo-
rated some of the methods for initializing, training, and setting up the architecture
of deep nets, they realized that the unsupervised pretraining phase was either
unnecessary or did not bring any significant improvement.
These breakthroughs in word error rate for speech recognition were unprecedented (around 30 percent improvement) and followed a long period of about ten years during which error rates improved little with the traditional GMM-HMM technology, in spite of the continuously growing size of training sets (see figure 2.4 of Deng and Yu [2014]). This created
a rapid shift in the speech recognition community toward deep learning. In a
matter of roughly two years, most of the industrial products for speech recognition
incorporated deep neural networks, and this success spurred a new wave of research
into deep learning algorithms and architectures for ASR, which is ongoing today.
One of these innovations was the use of convolutional networks (Sainath et al.,
2013) that replicate weights across time and frequency, improving over the earlier
time-delay neural networks that replicated weights only across time. The new
two-dimensional convolutional models regard the input spectrogram not as one
long vector but as an image, with one axis corresponding to time and the other to
frequency of spectral components.
Another important push, still ongoing, has been toward end-to-end deep learning
speech recognition systems that completely remove the HMM. The first major
breakthrough in this direction came from Graves et al. (2013), who trained a deep
LSTM RNN (see section 10.10), using MAP inference over the frame-to-phoneme
alignment, as in LeCun et al. (1998b) and in the CTC framework (Graves et al.,
2006; Graves, 2012). A deep RNN (Graves et al., 2013) has state variables from
several layers at each time step, giving the unfolded graph two kinds of depth:
ordinary depth due to a stack of layers, and depth due to time unfolding. This
work brought the phoneme error rate on TIMIT to a record low of 17.7 percent.
See Pascanu et al. (2014a) and Chung et al. (2014) for other variants of deep RNNs,
applied in other settings.
Another contemporary step toward end-to-end deep learning ASR is to let the
system learn how to “align” the acoustic-level information with the phonetic-level
information (Chorowski et al., 2014; Lu et al., 2015).
12.4 Natural Language Processing
Natural language processing
(NLP) is the use of human languages, such as
English or French, by a computer. Computer programs typically read and emit
specialized languages designed to allow efficient and unambiguous parsing by simple
programs. More naturally occurring languages are often ambiguous and defy formal
description. Natural language processing includes applications such as machine
translation, in which the learner must read a sentence in one human language and
emit an equivalent sentence in another human language. Many NLP applications
are based on language models that define a probability distribution over sequences
of words, characters, or bytes in a natural language.
As with the other applications discussed in this chapter, very generic neural
network techniques can be successfully applied to natural language processing.
To achieve excellent performance and to scale well to large applications, however,
some domain-specific strategies become important. To build an efficient model of
natural language, we must usually use techniques that are specialized for processing
sequential data. In many cases, we choose to regard natural language as a sequence
of words, rather than a sequence of individual characters or bytes. Because the total
number of possible words is so large, word-based language models must operate on
an extremely high-dimensional and sparse discrete space. Several strategies have
been developed to make models of such a space efficient, in both a computational
and a statistical sense.
12.4.1 n-grams
A language model defines a probability distribution over sequences of tokens in a natural language. Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities. The earliest successful language models were based on models of fixed-length sequences of tokens called n-grams. An n-gram is a sequence of n tokens.

Models based on n-grams define the conditional probability of the n-th token given the preceding n − 1 tokens. The model uses products of these conditional distributions to define the probability distribution over longer sequences:

$$P(x_1, \ldots, x_\tau) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{\tau} P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}). \qquad (12.5)$$

This decomposition is justified by the chain rule of probability. The probability distribution over the initial sequence P(x_1, ..., x_{n−1}) may be modeled by a different model with a smaller value of n.
Training n-gram models is straightforward because the maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. Models based on n-grams have been the core
building block of statistical language modeling for many decades (Jelinek and
Mercer, 1980; Katz, 1987; Chen and Goodman, 1999).
For small values of n, models have particular names: unigram for n = 1, bigram for n = 2, and trigram for n = 3. These names derive from the Latin
prefixes for the corresponding numbers and the Greek suffix “-gram,” denoting
something that is written.
Usually we train both an n-gram model and an n−1 gram model simultaneously. This makes it easy to compute

$$P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}) = \frac{P_n(x_{t-n+1}, \ldots, x_t)}{P_{n-1}(x_{t-n+1}, \ldots, x_{t-1})} \qquad (12.6)$$

simply by looking up two stored probabilities. For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n−1}.
As an example, we demonstrate how a trigram model computes the probability of the sentence "THE DOG RAN AWAY." The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate P_3(THE DOG RAN). Finally, the last word may be predicted using the typical case, of using the conditional distribution P(AWAY | DOG RAN). Putting this together with equation 12.6, we obtain:

$$P(\text{THE DOG RAN AWAY}) = P_3(\text{THE DOG RAN})\, P_3(\text{DOG RAN AWAY}) / P_2(\text{DOG RAN}). \qquad (12.7)$$
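Maximum likelihood estimation of these probabilities amounts to counting. The toy sketch below collects trigram and bigram counts from a tiny corpus and evaluates the ratio in equation 12.6; it also makes the zero-count problem discussed next easy to see.

```python
from collections import Counter

corpus = "the dog ran away the dog ran home the cat ran away".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def conditional_prob(w1, w2, w3):
    """P(w3 | w1, w2) estimated from counts; zero if the trigram was never seen."""
    context = bigram_counts[(w1, w2)]
    if context == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context

print(conditional_prob("dog", "ran", "away"))   # 0.5 in this toy corpus
print(conditional_prob("dog", "ran", "home"))   # 0.5
print(conditional_prob("cat", "ran", "home"))   # 0.0: unseen trigram
```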
A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set counts is very likely to be zero in many cases, even though the tuple (x_{t−n+1}, ..., x_t) may appear in the test set. This can cause two different kinds of catastrophic outcomes. When P_{n−1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n−1} is nonzero but P_n is zero, the test log-likelihood is −∞. To avoid such catastrophic outcomes, most n-gram models employ some form of smoothing. Smoothing techniques
shift probability mass from the observed tuples to unobserved ones that are similar.
See Chen and Goodman (1999) for a review and empirical comparisons. One basic
technique consists of adding nonzero probability mass to all the possible next
symbol values. This method can be justified as Bayesian inference with a uniform
or Dirichlet prior over the count parameters. Another very popular idea is to form
a mixture model containing higher-order and lower-order n-gram models, with the higher-order models providing more capacity and the lower-order models being more likely to avoid counts of zero. Back-off methods look up the lower-order n-grams if the frequency of the context x_{t−1}, ..., x_{t−n+1} is too small to use the higher-order model. More formally, they estimate the distribution over x_t by using contexts x_{t−n+k}, ..., x_{t−1}, for increasing k, until a sufficiently reliable estimate is found.
Classical n-gram models are particularly vulnerable to the curse of dimensionality. There are |V|^n possible n-grams and |V| is often very large. Even with a massive training set and modest n, most n-grams will not occur in the training set. One way to view a classical n-gram model is that it is performing nearest neighbor lookup. In other words, it can be viewed as a local nonparametric predictor, similar to k-nearest neighbors. The statistical problems facing these extremely local pre-
dictors are described in section 5.11.2. The problem for a language model is even
more severe than usual, because any two different words have the same distance
from each other in one-hot vector space. It is thus difficult to leverage much
information from any “neighbors”—only training examples that repeat literally the
same context are useful for local generalization. To overcome these problems, a
language model must be able to share knowledge between one word and other
semantically similar words.
To improve the statistical efficiency of n-gram models, class-based language models (Brown et al., 1992; Ney and Kneser, 1993; Niesler et al., 1998) introduce
the notion of word categories and then share statistical strength between words that
are in the same category. The idea is to use a clustering algorithm to partition the
set of words into clusters or classes, based on their co-occurrence frequencies with
other words. The model can then use word class IDs rather than individual word
IDs to represent the context on the right side of the conditioning bar. Composite
models combining word-based and class-based models via mixing or back-off are
also possible. Although word classes provide a way to generalize between sequences
in which some word is replaced by another of the same class, much information is
lost in this representation.
12.4.2 Neural Language Models
Neural language models, or NLMs, are a class of language model designed to overcome the curse of dimensionality problem for modeling natural language sequences by using a distributed representation of words (Bengio et al., 2001). Unlike class-based n-gram models, neural language models are able to recognize
Figure 12.3: Two-dimensional visualizations of word embeddings obtained from a neural
machine translation model (Bahdanau et al., 2015), zooming in on specific areas where
semantically related words have embedding vectors that are close to each other. Countries
appear on the left and numbers on the right. Keep in mind that these embeddings are 2-D
for the purpose of visualization. In real applications, embeddings typically have higher
dimensionality and can simultaneously capture many kinds of similarity between words.
that two words are similar without losing the ability to encode each word as distinct
from the other. Neural language models share statistical strength between one
word (and its context) and other similar words and contexts. The distributed
representation the model learns for each word enables this sharing by allowing the
model to treat words that have features in common similarly. For example, if the word dog and the word cat map to representations that share many attributes, then sentences that contain the word cat can inform the predictions that will be made by the model for sentences that contain the word dog, and vice versa. Because there
are many such attributes, there are many ways in which generalization can happen,
transferring information from each training sentence to an exponentially large
number of semantically related sentences. The curse of dimensionality requires the
model to generalize to a number of sentences that is exponential in the sentence
length. The model counters this curse by relating each training sentence to an
exponential number of similar sentences.
We sometimes call these word representations word embeddings. In this interpretation, we view the raw symbols as points in a space of dimension equal to the vocabulary size. The word representations embed those points in a feature space of lower dimension. In the original space, every word is represented by a one-hot vector, so every pair of words is at Euclidean distance √2 from each
other. In the embedding space, words that frequently appear in similar contexts
(or any pair of words sharing some “features” learned by the model) are close to
each other. This often results in words with similar meanings being neighbors.
Figure 12.3 zooms in on specific areas of a learned word embedding space to show
how semantically similar words map to representations that are close to each other.
Neural networks in other domains also define embeddings. For example, a
hidden layer of a convolutional network provides an “image embedding.” Usually
NLP practitioners are much more interested in this idea of embeddings because
natural language does not originally lie in a real-valued vector space. The hidden
layer has provided a more qualitatively dramatic change in the way the data is
represented.
The basic idea of using distributed representations to improve models for
natural language processing is not restricted to neural networks. It may also be
used with graphical models that have distributed representations in the form of
multiple latent variables (Mnih and Hinton, 2007).
12.4.3 High-Dimensional Outputs
In many natural language applications, we often want our models to produce
words (rather than characters) as the fundamental unit of the output. For large
vocabularies, it can be very computationally expensive to represent an output
distribution over the choice of a word, because the vocabulary size is large. In many applications, V contains hundreds of thousands of words. The naive approach to representing such a distribution is to apply an affine transformation from a hidden representation to the output space, then apply the softmax function. Suppose we have a vocabulary V with size |V|. The weight matrix describing the linear component of this affine transformation is very large, because its output dimension is |V|. This imposes a high memory cost to represent the matrix, and a high computational cost to multiply by it. Because the softmax is normalized across all |V| outputs, it is necessary to perform the full matrix multiplication at training time as well as test time—we cannot calculate only the dot product with the weight vector for the correct output. The high computational costs of the output layer thus arise both at training time (to compute the likelihood and its gradient) and at test time (to compute probabilities for all or selected words). For specialized loss functions, the gradient can be computed efficiently (Vincent et al., 2015), but the standard cross-entropy loss applied to a traditional softmax output layer poses many difficulties.
Suppose that h is the top hidden layer used to predict the output probabilities ŷ. If we parametrize the transformation from h to ŷ with learned weights W and learned biases b, then the affine-softmax output layer performs the following computations:

a_i = b_i + \sum_j W_{ij} h_j, \quad \forall i \in \{1, \ldots, |V|\},   (12.8)

\hat{y}_i = \frac{e^{a_i}}{\sum_{i'=1}^{|V|} e^{a_{i'}}}.   (12.9)

If h contains n_h elements, then the above operation is O(|V| n_h). With n_h in the thousands and |V| in the hundreds of thousands, this operation dominates the computation of most neural language models.
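As a rough illustration of this cost, the following sketch (using NumPy, with arbitrary placeholder values for n_h and |V| rather than figures from any particular system) computes the affine-softmax output layer of equations 12.8 and 12.9; the single matrix-vector product already touches all |V| × n_h output weights for one example.

    import numpy as np

    n_h, vocab_size = 1024, 100000           # hypothetical sizes for n_h and |V|
    rng = np.random.default_rng(0)

    W = 0.01 * rng.standard_normal((vocab_size, n_h))    # output weights, |V| x n_h
    b = np.zeros(vocab_size)                             # output biases
    h = rng.standard_normal(n_h)                         # top hidden layer for one example

    a = b + W @ h                        # equation 12.8: O(|V| n_h) multiply-adds
    a -= a.max()                         # subtract the max for numerical stability
    y_hat = np.exp(a) / np.exp(a).sum()  # equation 12.9: softmax over all |V| words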
12.4.3.1 Use of a Short List
The first neural language models (Bengio et al., 2001, 2003) dealt with the high cost
of using a softmax over a large number of output words by limiting the vocabulary
size to 10,000 or 20,000 words. Schwenk and Gauvain (2002) and Schwenk (2007)
built upon this approach by splitting the vocabulary V into a shortlist L of most frequent words (handled by the neural net) and a tail T = V \ L of more rare words (handled by an n-gram model). To be able to combine the two predictions, the neural net also has to predict the probability that a word appearing after context C belongs to the tail list. This may be achieved by adding an extra sigmoid output unit to provide an estimate of P(i ∈ T | C). The extra output can then be used to achieve an estimate of the probability distribution over all words in V as follows:

P(y = i \mid C) = \mathbf{1}_{i \in L}\, P(y = i \mid C, i \in L)\,(1 - P(i \in T \mid C)) + \mathbf{1}_{i \in T}\, P(y = i \mid C, i \in T)\, P(i \in T \mid C),   (12.10)

where P(y = i | C, i ∈ L) is provided by the neural language model and P(y = i | C, i ∈ T) is provided by the n-gram model. With slight modification, this approach can also work using an extra output value in the neural language model's softmax layer, rather than a separate sigmoid unit.
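A minimal sketch of how equation 12.10 combines the two predictors is given below; the dictionaries and the argument p_tail are hypothetical stand-ins for the outputs of a trained neural net (over the shortlist) and a trained n-gram model (over the tail), not part of any published implementation.

    def combined_word_prob(word, shortlist, nlm_shortlist_probs, ngram_tail_probs, p_tail):
        """Equation 12.10: mix a neural LM over the shortlist L with an n-gram
        model over the tail T, using the predicted probability p_tail = P(i in T | C)."""
        if word in shortlist:
            # neural LM probability, rescaled by the probability of staying in L
            return nlm_shortlist_probs[word] * (1.0 - p_tail)
        # n-gram probability over the tail, rescaled by P(i in T | C)
        return ngram_tail_probs[word] * p_tail

    # toy usage with made-up numbers
    shortlist = {"the", "cat", "sat"}
    nlm = {"the": 0.5, "cat": 0.3, "sat": 0.2}        # sums to 1 over L
    ngram_tail = {"aardvark": 0.7, "zymurgy": 0.3}    # sums to 1 over T
    print(combined_word_prob("cat", shortlist, nlm, ngram_tail, p_tail=0.1))       # ~0.27
    print(combined_word_prob("aardvark", shortlist, nlm, ngram_tail, p_tail=0.1))  # ~0.07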
An obvious disadvantage of the short list approach is that the potential gener-
alization advantage of the neural language models is limited to the most frequent
words, where, arguably, it is the least useful. This disadvantage has stimulated
the exploration of alternative methods to deal with high-dimensional outputs,
described below.
12.4.3.2 Hierarchical Softmax
A classical approach (Goodman, 2001) to reducing the computational burden
of high-dimensional output layers over large vocabulary sets V is to decompose probabilities hierarchically. Instead of necessitating a number of computations proportional to |V| (and also proportional to the number of hidden units, n_h), the |V| factor can be reduced to as low as log |V|. Bengio (2002) and Morin and Bengio (2005) introduced this factorized approach in the context of neural language models.
One can think of this hierarchy as building categories of words, then categories
of categories of words, then categories of categories of categories of words, and so
on. These nested categories form a tree, with words at the leaves. In a balanced
tree, the tree has depth O(log |V|). The probability of choosing a word is given
by the product of the probabilities of choosing the branch leading to that word
at every node on a path from the root of the tree to the leaf containing the word.
Figure 12.4 illustrates a simple example. Mnih and Hinton (2009) also describe
how to use multiple paths to identify a single word in order to better model words
that have multiple meanings. Computing the probability of a word then involves
summation over all the paths that lead to that word.
To predict the conditional probabilities required at each node of the tree, we
typically use a logistic regression model at each node of the tree, and provide the
same context C as input to all these models. Because the correct output is encoded
in the training set, we can use supervised learning to train the logistic regression
models. This is typically done using a standard cross-entropy loss, corresponding
to maximizing the log-likelihood of the correct sequence of decisions.
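The sketch below (NumPy, with hypothetical node parameters and a hand-written path table rather than any particular learned tree) assembles the log-probability of one word from the per-node logistic regressions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def word_log_prob(context_h, path_bits, node_weights, node_biases):
        """Log-probability of one word under a hierarchical softmax.

        context_h    : hidden representation of the context C
        path_bits    : list of (node_id, bit) decisions from the root to the word's leaf
        node_weights : dict mapping node_id -> weight vector of that node's logistic regression
        node_biases  : dict mapping node_id -> bias of that node's logistic regression"""
        log_p = 0.0
        for node_id, bit in path_bits:
            # probability of taking the "1" (right) branch at this node, given the context
            p_right = sigmoid(node_weights[node_id] @ context_h + node_biases[node_id])
            log_p += np.log(p_right if bit == 1 else 1.0 - p_right)
        return log_p

    # toy usage: word w_4 in figure 12.4 has bits (1, 0, 0), passing through
    # the root (), super-class (1,), and class (1, 0)
    rng = np.random.default_rng(0)
    n_h = 8
    h = rng.standard_normal(n_h)
    nodes = [(), (1,), (1, 0)]
    weights = {node: rng.standard_normal(n_h) for node in nodes}
    biases = {node: 0.0 for node in nodes}
    print(word_log_prob(h, [((), 1), ((1,), 0), ((1, 0), 0)], weights, biases))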
Because the output log-likelihood can be computed efficiently (as low as log |V| rather than |V|), its gradients may also be computed efficiently. This includes not
only the gradient with respect to the output parameters but also the gradients
with respect to the hidden layer activations.
It is possible but usually not practical to optimize the tree structure to minimize
the expected number of computations. Tools from information theory specify how
to choose the optimal binary code given the relative frequencies of the words. To
do so, we could structure the tree so that the number of bits associated with a
word is approximately equal to the logarithm of the frequency of that word. In
practice, however, the computational savings are typically not worth the effort
because the computation of the output probabilities is only one part of the total
computation in the neural language model. For example, suppose there are l fully connected hidden layers of width n_h. Let n_b be the weighted average of the number
Figure 12.4: Illustration of a simple hierarchy of word categories, with 8 words w_0, . . . , w_7 organized into a three-level hierarchy. The leaves of the tree represent actual specific words. Internal nodes represent groups of words. Any node can be indexed by the sequence of binary decisions (0 = left, 1 = right) to reach the node from the root. Super-class (0) contains the classes (0, 0) and (0, 1), which respectively contain the sets of words {w_0, w_1} and {w_2, w_3}, and similarly super-class (1) contains the classes (1, 0) and (1, 1), which respectively contain the words (w_4, w_5) and (w_6, w_7). If the tree is sufficiently balanced, the maximum depth (number of binary decisions) is on the order of the logarithm of the number of words |V|: the choice of one out of |V| words can be obtained by doing O(log |V|) operations (one for each of the nodes on the path from the root). In this example, computing the probability of a word y can be done by multiplying three probabilities, associated with the binary decisions to move left or right at each node on the path from the root to a node y. Let b_i(y) be the i-th binary decision when traversing the tree toward the value y. The probability of sampling an output y decomposes into a product of conditional probabilities, using the chain rule for conditional probabilities, with each node indexed by the prefix of these bits. For example, node (1, 0) corresponds to the prefix (b_0(w_4) = 1, b_1(w_4) = 0), and the probability of w_4 can be decomposed as follows:

P(y = w_4) = P(b_0 = 1, b_1 = 0, b_2 = 0)   (12.11)
           = P(b_0 = 1)\, P(b_1 = 0 \mid b_0 = 1)\, P(b_2 = 0 \mid b_0 = 1, b_1 = 0).   (12.12)
of bits required to identify a word, with the weighting given by the frequency
of these words. In this example, the number of operations needed to compute the hidden activations grows as O(l n_h^2), while the output computations grow as O(n_h n_b). As long as n_b ≪ l n_h, we can reduce computation more by shrinking n_h than by shrinking n_b. Indeed, n_b is often small. Because the size of the vocabulary rarely exceeds a million words and log_2(10^6) ≈ 20, it is possible to reduce n_b to about 20, but n_h is often much larger, around 10^3 or more. Rather than carefully
optimizing a tree with a branching factor of 2, one can instead define a tree with
depth two and a branching factor of √|V|. Such a tree corresponds to simply
defining a set of mutually exclusive word classes. The simple approach based on a
tree of depth two captures most of the computational benefit of the hierarchical
strategy.
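A minimal sketch of this depth-two factorization follows, with an arbitrary toy assignment of words to classes; P(y | C) is computed as a softmax over classes times a softmax over the words within the chosen class, so no single softmax over all of |V| is ever evaluated.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def two_level_word_prob(h, word, word_to_class, class_words, W_class, b_class, W_word, b_word):
        """P(y = word | C) = P(class(word) | C) * P(word | class(word), C)."""
        c = word_to_class[word]
        p_class = softmax(W_class @ h + b_class)[c]        # softmax over the (few) classes
        members = class_words[c]                           # word ids belonging to class c
        scores = W_word[members] @ h + b_word[members]     # scores only for words in class c
        p_word_in_class = softmax(scores)[members.index(word)]
        return p_class * p_word_in_class

    # toy usage: 6 words split into 2 classes of 3; the probabilities sum to 1 over the vocabulary
    rng = np.random.default_rng(0)
    n_h = 4
    h = rng.standard_normal(n_h)
    word_to_class = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
    class_words = {0: [0, 1, 2], 1: [3, 4, 5]}
    W_class, b_class = rng.standard_normal((2, n_h)), np.zeros(2)
    W_word, b_word = rng.standard_normal((6, n_h)), np.zeros(6)
    print(sum(two_level_word_prob(h, w, word_to_class, class_words,
                                  W_class, b_class, W_word, b_word) for w in range(6)))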
One question that remains somewhat open is how to best define these word
classes, or how to define the word hierarchy in general. Early work used existing
hierarchies (Morin and Bengio, 2005), but the hierarchy can also be learned, ideally
jointly with the neural language model. Learning the hierarchy is difficult. An exact
optimization of the log-likelihood appears intractable because the choice of a word
hierarchy is a discrete one, not amenable to gradient-based optimization. However,
one could use discrete optimization to approximately optimize the partition of
words into word classes.
An important advantage of the hierarchical softmax is that it brings computa-
tional benefits both at training time and at test time, if at test time we want to
compute the probability of specific words.
Of course, computing the probability of all
|V|
words will remain expensive
even with the hierarchical softmax. Another important operation is selecting the
most likely word in a given context. Unfortunately the tree structure does not
provide an efficient and exact solution to this problem.
A disadvantage is that in practice the hierarchical softmax tends to give worse
test results than sampling-based methods, which we describe next. This may be
due to a poor choice of word classes.
12.4.3.3 Importance Sampling
One way to speed up the training of neural language models is to avoid explicitly
computing the contribution of the gradient from all the words that do not appear
in the next position. Every incorrect word should have low probability under the
model. It can be computationally costly to enumerate all these words. Instead, it
is possible to sample only a subset of the words. Using the notation introduced in
equation 12.8, the gradient can be written as follows:

\frac{\partial \log P(y \mid C)}{\partial \theta} = \frac{\partial \log \mathrm{softmax}_y(a)}{\partial \theta}   (12.13)

= \frac{\partial}{\partial \theta} \log \frac{e^{a_y}}{\sum_i e^{a_i}}   (12.14)

= \frac{\partial}{\partial \theta} \left( a_y - \log \sum_i e^{a_i} \right)   (12.15)

= \frac{\partial a_y}{\partial \theta} - \sum_i P(y = i \mid C)\, \frac{\partial a_i}{\partial \theta},   (12.16)

where a is the vector of presoftmax activations (or scores), with one element per word. The first term is the positive phase term, pushing a_y up, while the second term is the negative phase term, pushing a_i down for all i, with weight P(i | C). Since the negative phase term is an expectation, we can estimate it with a Monte Carlo sample. However, that would require sampling from the model itself. Sampling from the model requires computing P(i | C) for all i in the vocabulary, which is precisely what we are trying to avoid.
Instead of sampling from the model, we can sample from another distribution,
called the proposal distribution (denoted q), and use appropriate weights to correct for the bias introduced by sampling from the wrong distribution (Bengio and Sénécal, 2003; Bengio and Sénécal, 2008). This is an application of a more general technique called importance sampling, which we describe in more detail
in section 17.2. Unfortunately, even exact importance sampling is not efficient
because it requires computing weights p_i/q_i, where p_i = P(i | C), which can only be computed if all the scores a_i are computed. The solution adopted for this application is called biased importance sampling, where the importance weights are normalized to sum to 1. When negative word n_i is sampled, the associated gradient is weighted by

w_i = \frac{p_{n_i}/q_{n_i}}{\sum_{j=1}^{N} p_{n_j}/q_{n_j}}.   (12.17)
These weights are used to give the appropriate importance to the m negative samples from q used to form the estimated negative phase contribution to the gradient:

\sum_{i=1}^{|V|} P(i \mid C)\, \frac{\partial a_i}{\partial \theta} \approx \frac{1}{m} \sum_{i=1}^{m} w_i\, \frac{\partial a_{n_i}}{\partial \theta}.   (12.18)
A unigram or a bigram distribution works well as the proposal distribution q. It is
easy to estimate the parameters of such a distribution from data. After estimating
the parameters, it is also possible to sample from such a distribution very efficiently.
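A rough sketch of this estimator follows, written directly in terms of the negative-phase expectation Σ_i P(i | C) ∂a_i/∂θ, with a generic per-word gradient vector standing in for ∂a_i/∂θ; the vocabulary size, the sample count, and the uniform proposal are illustrative placeholders rather than values from Bengio and Sénécal (2003, 2008), and unnormalized scores are used in the weights so that the softmax normalizer cancels.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, m = 2000, 100                  # |V| and number of negative samples (illustrative)

    a = rng.standard_normal(vocab_size)        # presoftmax scores for one context
    q = np.ones(vocab_size) / vocab_size       # proposal; a unigram fit to data works well too
    g = rng.standard_normal(vocab_size)        # stands in for the per-word gradients da_i/dtheta

    # exact negative phase: sum_i P(i | C) * g_i, which requires the full softmax
    p_exact = np.exp(a - a.max())
    p_exact /= p_exact.sum()
    exact = p_exact @ g

    # biased (self-normalized) importance sampling estimate from m samples of q.
    # Using unnormalized scores exp(a_i) in the weights is what makes this practical:
    # the softmax normalizer cancels when the weights are renormalized to sum to 1.
    samples = rng.choice(vocab_size, size=m, p=q)
    w = np.exp(a[samples]) / q[samples]
    w /= w.sum()
    estimate = w @ g[samples]

    # a single estimate is noisy; increasing m (or averaging estimates) approaches the exact value
    print(exact, estimate)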
Importance sampling is not only useful for speeding up models with large
softmax outputs. More generally, it is useful for accelerating training with large
sparse output layers, where the output is a sparse vector rather than a 1-of-n choice. An example is a bag of words. A bag of words is a sparse vector v where v_i indicates the presence or absence of word i from the vocabulary in the document. Alternately, v_i can indicate the number of times that word i appears.
Machine learning models that emit such sparse vectors can be expensive to train
for a variety of reasons. Early in learning, the model may not actually choose to
make the output truly sparse. Moreover, the loss function we use for training might
most naturally be described in terms of comparing every element of the output to
every element of the target. This means that it is not always clear that there is a
computational benefit to using sparse outputs, because the model may choose to
make the majority of the output nonzero, and all these non-zero values need to be
compared to the corresponding training target, even if the training target is zero.
Dauphin et al. (2011) demonstrated that such models can be accelerated using
importance sampling. The efficient algorithm minimizes the reconstruction loss for
the “positive words” (those that are nonzero in the target) and an equal number of
“negative words.” The negative words are chosen randomly, using a heuristic to
sample words that are more likely to be mistaken. The bias introduced by this
heuristic oversampling can then be corrected using importance weights.
In all these cases, the computational complexity of gradient estimation for
the output layer is reduced to be proportional to the number of negative samples
rather than proportional to the size of the output vector.
12.4.3.4 Noise-Contrastive Estimation and Ranking Loss
Other approaches based on sampling have been proposed to reduce the computa-
tional cost of training neural language models with large vocabularies. An early
example is the ranking loss proposed by Collobert and Weston (2008a), which
views the output of the neural language model for each word as a score and tries to
make the score of the correct word a_y be ranked high in comparison to the other scores a_i. The ranking loss proposed then is

L = \sum_i \max(0,\, 1 - a_y + a_i).   (12.19)
The gradient is zero for the i-th term if the score of the observed word, a_y, is greater than the score of the negative word a_i by a margin of 1. One issue with
this criterion is that it does not provide estimated conditional probabilities, which
are useful in some applications, including speech recognition and text generation
(including conditional text generation tasks such as translation).
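A sketch of equation 12.19 and its subgradient with respect to the scores follows; the score vector here is an arbitrary placeholder for the output of a neural language model.

    import numpy as np

    def ranking_loss_and_grad(a, y):
        """Hinge ranking loss of equation 12.19 for one example.

        a : vector of scores, one per word in the vocabulary
        y : index of the observed (correct) word
        Returns the loss and its subgradient with respect to a."""
        margins = 1.0 - a[y] + a
        margins[y] = 0.0                   # skip i = y (a constant term with zero gradient)
        active = margins > 0               # negative words not yet beaten by a margin of 1
        loss = margins[active].sum()

        grad = np.zeros_like(a)
        grad[active] = 1.0                 # push the competing scores a_i down
        grad[y] = -active.sum()            # push the correct score a_y up
        return loss, grad

    a = np.array([2.0, 0.5, 1.8, -1.0])
    print(ranking_loss_and_grad(a, y=0))   # only word 2 violates the margin, so the loss is 0.8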
A more recently used training objective for neural language models is noise-
contrastive estimation, which is introduced in section 18.6. This approach has
been successfully applied to neural language models (Mnih and Teh, 2012; Mnih
and Kavukcuoglu, 2013).
12.4.4 Combining Neural Language Models with n-grams
A major advantage of n-gram models over neural networks is that n-gram models achieve high model capacity (by storing the frequencies of very many tuples), while requiring very little computation to process an example (by looking up only a few tuples that match the current context). If we use hash tables or trees to access the counts, the computation used for n-grams is almost independent
of capacity. In comparison, doubling a neural network’s number of parameters
typically also roughly doubles its computation time. Exceptions include models
that avoid using all parameters on each pass. Embedding layers index only a single
embedding in each pass, so we can increase the vocabulary size without increasing
the computation time per example. Some other models, such as tiled convolutional
networks, can add parameters while reducing the degree of parameter sharing to
maintain the same amount of computation. Typical neural network layers based
on matrix multiplication, however, use an amount of computation proportional to
the number of parameters.
One easy way to add capacity is thus to combine both approaches in an ensemble
consisting of a neural language model and an n-gram language model (Bengio et al., 2001, 2003). As with any ensemble, this technique can reduce test error if
the ensemble members make independent mistakes. The field of ensemble learning
provides many ways of combining the ensemble members’ predictions, including
uniform weighting and weights chosen on a validation set. Mikolov et al. (2011a)
extended the ensemble to include not just two models but a large array of models.
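In its simplest two-member form, such an ensemble is just a linear interpolation of the two predicted distributions; the sketch below assumes hypothetical predict functions for the two models, and the mixing weight would typically be chosen to minimize perplexity on a validation set.

    def interpolated_prob(word, context, neural_lm, ngram_lm, lam=0.5):
        """Linear interpolation of a neural LM and an n-gram LM.

        neural_lm(word, context) and ngram_lm(word, context) are assumed to return
        P(word | context) under each model; lam is tuned on a validation set."""
        return lam * neural_lm(word, context) + (1.0 - lam) * ngram_lm(word, context)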
It is also possible to pair a neural network with a maximum entropy model and
train both jointly (Mikolov et al., 2011b). This approach can be viewed as training
a neural network with an extra set of inputs that are connected directly to the
output and not connected to any other part of the model. The extra inputs are
indicators for the presence of particular n-grams in the input context, so these variables are very high dimensional and very sparse. The increase in model capacity is huge—the new portion of the architecture contains up to |V|^n parameters—but the amount of added computation needed to process an input is minimal because the extra inputs are very sparse.
12.4.5 Neural Machine Translation
Machine translation is the task of reading a sentence in one natural language and
emitting a sentence with the equivalent meaning in another language. Machine
translation systems often involve many components. At a high level, there is
often one component that proposes many candidate translations. Many of these
translations will not be grammatical due to differences between the languages. For
example, many languages put adjectives after nouns, so when translated to English
directly they yield phrases such as “apple red.” The proposal mechanism suggests
many variants of the suggested translation, ideally including “red apple.” A second
component of the translation system, a language model, evaluates the proposed
translations and can score “red apple” as better than “apple red.”
The earliest exploration of neural networks for machine translation already
incorporated the idea of encoder and decoder (Allen 1987; Chrisman 1991; Forcada
and Ñeco 1997), while the first large-scale competitive use of neural networks in
translation was to upgrade the language model of a translation system by using a
neural language model (Schwenk et al., 2006; Schwenk, 2010). Previously, most
machine translation systems had used an n-gram model for this component. The n-gram-based models used for machine translation include not just traditional back-off n-gram models (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999) but also maximum entropy language models (Berger et al., 1996), in which an affine-softmax layer predicts the next word given the presence of frequent n-grams in the context.
Traditional language models simply report the probability of a natural language
sentence. Because machine translation involves producing an output sentence given
an input sentence, it makes sense to extend the natural language model to be
conditional. As described in section 6.2.1.1, it is straightforward to extend a model
that defines a marginal distribution over some variable to define a conditional
distribution over that variable given a context C, where C might be a single variable or a list of variables. Devlin et al. (2014) beat the state-of-the-art in some statistical machine translation benchmarks by using an MLP to score a phrase t_1, t_2, . . . , t_k in the target language given a phrase s_1, s_2, . . . , s_n in the source language. The MLP estimates P(t_1, t_2, . . . , t_k | s_1, s_2, . . . , s_n). The estimate formed by this MLP replaces the estimate provided by conditional n-gram models.
[Figure 12.5 diagram: a source object (a French sentence or an image) is mapped by an encoder to an intermediate, semantic representation, which a decoder maps to an output object (an English sentence).]
Figure 12.5: The encoder-decoder architecture to map back and forth between a surface
representation (such as a sequence of words or an image) and a semantic representation.
By using the output of an encoder of data from one modality (such as the encoder mapping
from French sentences to hidden representations capturing the meaning of sentences) as
the input to a decoder for another modality (such as the decoder mapping from hidden
representations capturing the meaning of sentences to English), we can train systems that
translate from one modality to another. This idea has been applied successfully not just
to machine translation but also to caption generation from images.
A drawback of the MLP-based approach is that it requires the sequences to be
preprocessed to be of fixed length. To make the translation more flexible, we would
like to use a model that can accommodate variable length inputs and variable
length outputs. An RNN provides this ability. Section 10.2.4 describes several ways
of constructing an RNN that represents a conditional distribution over a sequence
given some input, and section 10.4 describes how to accomplish this conditioning
when the input is a sequence. In all cases, one model first reads the input sequence
and emits a data structure that summarizes the input sequence. We call this
summary the “context” C. The context C may be a list of vectors, or it may be a vector or tensor. The model that reads the input to produce C may be an RNN
(Cho et al., 2014a; Sutskever et al., 2014; Jean et al., 2014) or a convolutional
network (Kalchbrenner and Blunsom, 2013). A second model, usually an RNN,
then reads the context C and generates a sentence in the target language. This
general idea of an encoder-decoder framework for machine translation is illustrated
in figure 12.5.
To generate an entire sentence conditioned on the source sentence, the model
must have a way to represent the entire source sentence. Earlier models were able
to represent only individual words or phrases. From a representation learning point
of view, it can be useful to learn a representation in which sentences that have the
same meaning have similar representations regardless of whether they were written
in the source language or in the target language. This strategy was explored first
using a combination of convolutions and RNNs (Kalchbrenner and Blunsom, 2013).
Later work introduced the use of an RNN for scoring proposed translations (Cho
et al., 2014a) and for generating translated sentences (Sutskever et al., 2014). Jean
et al. (2014) scaled these models to larger vocabularies.
12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data
Using a fixed-size representation to capture all the semantic details of a very
long sentence of, say, 60 words is very difficult. It can be achieved by training a
sufficiently large RNN well enough and for long enough, as demonstrated by Cho
et al. (2014a) and Sutskever et al. (2014). A more efficient approach, however, is
to read the whole sentence or paragraph (to get the context and the gist of what
is being expressed), then produce the translated words one at a time, each time
focusing on a different part of the input sentence to gather the semantic details
required to produce the next output word. That is exactly the idea that Bahdanau
et al. (2015) first introduced. The attention mechanism used to focus on specific
parts of the input sequence at each time step is illustrated in figure 12.6.
We can think of an attention-based system as having three components:
1. A process that reads raw data (such as source words in a source sentence) and converts them into distributed representations, with one feature vector associated with each word position.

2. A list of feature vectors storing the output of the reader. This can be understood as a memory containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.

3. A process that exploits the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight).
The third component generates the translated sentence.
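A minimal sketch of the weighted-average attention of figure 12.6 follows; the encoder feature vectors and the bilinear scoring parameters are placeholders (Bahdanau et al. (2015) instead compute the relevance scores with a small neural network that compares the decoder state to each encoder state).

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def attention_context(decoder_state, encoder_states, W_score):
        """Weighted-average attention: c = sum_t alpha^(t) h^(t).

        decoder_state  : current decoder state (shape d)
        encoder_states : feature vectors h^(t), one per source position (shape T x d_h)
        W_score        : placeholder bilinear scoring parameters (shape d x d_h)"""
        scores = encoder_states @ (W_score.T @ decoder_state)   # one relevance score per h^(t)
        alpha = softmax(scores)                                 # weights in [0, 1], summing to 1
        c = alpha @ encoder_states                              # context vector: weighted average
        return c, alpha

    rng = np.random.default_rng(0)
    T, d, d_h = 6, 5, 4
    c, alpha = attention_context(rng.standard_normal(d),
                                 rng.standard_normal((T, d_h)),
                                 rng.standard_normal((d, d_h)))
    print(alpha.sum())   # 1.0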
When words in a sentence written in one language are aligned with correspond-
ing words in a translated sentence in another language, it becomes possible to relate
the corresponding word embeddings. Earlier work showed that one could learn a
kind of translation matrix relating the word embeddings in one language with the
word embeddings in another (Kočiský et al., 2014), yielding lower alignment error
Figure 12.6: A modern attention mechanism, as introduced by Bahdanau et al. (2015), is essentially a weighted average. A context vector c is formed by taking a weighted average of feature vectors h^(t) with weights α^(t). In some applications, the feature vectors h are hidden units of a neural network, but they may also be raw input to the model. The weights α^(t) are produced by the model itself. They are usually values in the interval [0, 1] and are intended to concentrate around just one h^(t) so that the weighted average approximates reading that one specific time step precisely. The weights α^(t) are usually produced by applying a softmax function to relevance scores emitted by another portion of the model. The attention mechanism is more expensive computationally than directly indexing the desired h^(t), but direct indexing cannot be trained with gradient descent. The attention mechanism based on weighted averages is a smooth, differentiable approximation that can be trained with existing optimization algorithms.
rates than traditional approaches based on the frequency counts in the phrase table.
There is even earlier work on learning cross-lingual word vectors (Klementiev et al.,
2012). Many extensions to this approach are possible. For example, more efficient
cross-lingual alignment (Gouws et al., 2014) allows training on larger datasets.
12.4.6 Historical Perspective
The idea of distributed representations for symbols was introduced by Rumelhart
et al. (1986a) in one of the first explorations of back-propagation, with symbols
corresponding to the identity of family members, and the neural network capturing
the relationships between family members, with training examples forming triplets
such as (Colin, Mother, Victoria). The first layer of the neural network learned a
representation of each family member. For example, the features for Colin might
represent which family tree Colin was in, what branch of that tree he was in,
what generation he was from, and so on. One can think of the neural network as
computing learned rules relating these attributes together to obtain the desired
predictions. The model can then make predictions such as inferring who is the
mother of Colin.
The idea of forming an embedding for a symbol was extended to the idea of an
embedding for a word by Deerwester et al. (1990). These embeddings were learned
using the SVD. Later, embeddings would be learned by neural networks.
The history of natural language processing is marked by transitions in the
popularity of different ways of representing the input to the model. Following
this early work on symbols and words, some of the earliest applications of neural
networks to NLP (Miikkulainen and Dyer, 1991; Schmidhuber and Heil, 1996)
represented the input as a sequence of characters.
Bengio et al. (2001) returned the focus to modeling words and introduced
neural language models, which produce interpretable word embeddings. These
neural models have scaled up from defining representations of a small set of symbols
in the 1980s to millions of words (including proper nouns and misspellings) in
modern applications. This computational scaling effort led to the invention of the
techniques described in section 12.4.3.
Initially, the use of words as the fundamental units of language models yielded
improved language modeling performance (Bengio et al., 2001). To this day,
new techniques continually push both character-based models (Sutskever et al.,
2011) and word-based models forward, with recent work (Gillick et al., 2015) even
modeling individual bytes of Unicode characters.
The ideas behind neural language models have been extended into several
natural language processing applications, such as parsing (Henderson, 2003, 2004;
Collobert, 2011), part-of-speech tagging, semantic role labeling, chunking, and
so on, sometimes using a single multitask learning architecture (Collobert and
Weston, 2008a; Collobert et al., 2011a) in which the word embeddings are shared
across tasks.
Two-dimensional visualizations of embeddings became a popular tool for an-
alyzing language models following the development of the t-SNE dimensionality
reduction algorithm (van der Maaten and Hinton, 2008) and its high-profile application to visualizing word embeddings by Joseph Turian in 2009.
12.5 Other Applications
In this section we cover a few other types of applications of deep learning that
are different from the standard object recognition, speech recognition, and natural
language processing tasks discussed above. Part III of this book will expand that
scope even further to tasks that remain primarily research areas.
12.5.1 Recommender Systems
One of the major families of applications of machine learning in the information
technology sector is the ability to make recommendations of items to potential
users or customers. Two major types of applications can be distinguished: online
advertising and item recommendations (often these recommendations are still for
the purpose of selling a product). Both rely on predicting the association between
a user and an item, either to predict the probability of some action (the user
buying the product, or some proxy for this action) or the expected gain (which
may depend on the value of the product) if an ad is shown or a recommendation
is made regarding that product to that user. The internet is currently financed
in great part by various forms of online advertising. Major parts of the economy
rely on online shopping. Companies including Amazon and eBay use machine
learning, including deep learning, for their product recommendations. Sometimes,
the items are not products that are actually for sale. Examples include selecting
posts to display on social network news feeds, recommending movies to watch,
recommending jokes, recommending advice from experts, matching players for
video games, or matching people in dating services.
Often, this association problem is handled like a supervised learning problem:
given some information about the item and about the user, predict the proxy of
interest (user clicks on ad, user enters a rating, user clicks on a “like” button, user
buys product, user spends some amount of money on the product, user spends
time visiting a page for the product, and so forth). This often ends up being
either a regression problem (predicting some conditional expected value) or a
probabilistic classification problem (predicting the conditional probability of some
discrete event).
The early work on recommender systems relied on minimal information as
inputs for these predictions: the user ID and the item ID. In this context, the
only way to generalize is to rely on the similarity between the patterns of values of
the target variable for different users or for different items. Suppose that user 1
and user 2 both like items A, B and C. From this, we may infer that user 1 and
user 2 have similar tastes. If user 1 likes item D, then this should be a strong
cue that user 2 will also like D. Algorithms based on this principle come under
the name of
collaborative filtering
. Both nonparametric approaches (such as
nearest neighbor methods based on the estimated similarity between patterns of
preferences) and parametric methods are possible. Parametric methods often rely
on learning a distributed representation (also called an embedding) for each user
and for each item. Bilinear prediction of the target variable (such as a rating) is a
simple parametric method that is highly successful and often found as a component
of state-of-the-art systems. The prediction is obtained by the dot product between
the user embedding and the item embedding (possibly corrected by constants that
depend only on either the user ID or the item ID). Let R̂ be the matrix containing our predictions, A a matrix with user embeddings in its rows, and B a matrix with item embeddings in its columns. Let b and c be vectors that contain respectively a kind of bias for each user (representing how grumpy or positive that user is in general) and for each item (representing its general popularity). The bilinear prediction is thus obtained as follows:

\hat{R}_{u,i} = b_u + c_i + \sum_j A_{u,j} B_{j,i}.   (12.20)
Typically one wants to minimize the squared error between predicted ratings R̂_{u,i} and actual ratings R_{u,i}. User embeddings and item embeddings can then be conveniently visualized when they are first reduced to a low dimension (two or three), or they can be used to compare users or items against each other, just like word embeddings. One way to obtain these embeddings is by performing a singular value decomposition of the matrix R of actual targets (such as ratings). This corresponds to factorizing R = UDV^⊤ (or a normalized variant) into the product of two factors, the lower rank matrices A = UD and B = V^⊤. One
problem with the SVD is that it treats the missing entries in an arbitrary way,
as if they corresponded to a target value of 0. Instead we would like to avoid
paying any cost for the predictions made on missing entries. Fortunately, the sum
of squared errors on the observed ratings can also be easily minimized by gradient-
based optimization. The SVD and the bilinear prediction of equation 12.20 both
performed very well in the competition for the Netflix prize (Bennett and Lanning,
2007), aiming at predicting ratings for films, based only on previous ratings by
a large set of anonymous users. Many machine learning experts participated in
this competition, which took place between 2006 and 2009. It raised the level of
research in recommender systems using advanced machine learning and yielded
improvements in recommender systems. Even though the simple bilinear prediction,
or SVD, did not win by itself, it was a component of the ensemble models presented
by most of the competitors, including the winners (Töscher et al., 2009; Koren,
2009).
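As a concrete illustration of the bilinear model of equation 12.20, the sketch below fits user and item embeddings by stochastic gradient descent on the squared error of the observed ratings only; the sizes, the learning rate, and the synthetic ratings are all arbitrary placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, k = 30, 40, 5               # illustrative sizes; k is the embedding dimension

    A = 0.1 * rng.standard_normal((n_users, k))   # user embeddings (rows of A)
    B = 0.1 * rng.standard_normal((k, n_items))   # item embeddings (columns of B)
    b_user = np.zeros(n_users)                    # per-user bias ("grumpiness")
    c_item = np.zeros(n_items)                    # per-item bias (popularity)

    # synthetic observed ratings (user, item, rating); missing entries incur no cost
    observed = [(u, i, float(rng.integers(1, 6)))
                for u, i in zip(rng.integers(0, n_users, 200), rng.integers(0, n_items, 200))]

    lr = 0.05
    for epoch in range(50):
        for u, i, r in observed:
            pred = b_user[u] + c_item[i] + A[u] @ B[:, i]   # equation 12.20
            err = pred - r
            A_u = A[u].copy()                               # keep the old value for B's update
            A[u]      -= lr * err * B[:, i]
            B[:, i]   -= lr * err * A_u
            b_user[u] -= lr * err
            c_item[i] -= lr * err

    mse = np.mean([(b_user[u] + c_item[i] + A[u] @ B[:, i] - r) ** 2 for u, i, r in observed])
    print(mse)   # small after training on the observed entries only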
Beyond these bilinear models with distributed representations, one of the first
uses of neural networks for collaborative filtering is based on the RBM undirected
probabilistic model (Salakhutdinov et al., 2007). RBMs were an important element
of the ensemble of methods that won the Netflix competition (Töscher et al., 2009;
Koren, 2009). More advanced variants on the idea of factorizing the ratings matrix
have also been explored in the neural networks community (Salakhutdinov and
Mnih, 2008).
Collaborative filtering systems have a basic limitation, however: when a new
item or a new user is introduced, its lack of rating history means that there is no
way to evaluate its similarity with other items or users, or the degree of association
between, say, that new user and existing items. This is called the problem of cold-
start recommendations. A general way of solving the cold-start recommendation
problem is to introduce extra information about the individual users and items. For
example, this extra information could be user profile information or features of each
item. Systems that use such information are called
content-based recommender
systems
. The mapping from a rich set of user features or item features to an
embedding can be learned through a deep learning architecture (Huang et al., 2013;
Elkahky et al., 2015).
Specialized deep learning architectures, such as convolutional networks, have
also been applied to learn to extract features from rich content, such as from
musical audio tracks for music recommendation (van den Oörd et al., 2013). In
that work, the convolutional net takes acoustic features as input and computes an
embedding for the associated song. The dot product between this song embedding
and the embedding for a user is then used to predict whether a user will listen to
the song.
12.5.1.1 Exploration versus Exploitation
When making recommendations to users, an issue arises that goes beyond ordinary
supervised learning and into the realm of reinforcement learning. Many recom-
mendation problems are most accurately described theoretically as contextual bandits (Langford and Zhang, 2008; Lu et al., 2010). The issue is that when we
use the recommendation system to collect data, we get a biased and incomplete
view of the preferences of users: we see the responses of users only to the items
recommended to them and not to the other items. In addition, in some cases we
may not get any information on users for whom no recommendation has been
made (for example, with ad auctions, it may be that the price proposed for an ad
was below a minimum price threshold, or does not win the auction, so the ad is
not shown at all). More importantly, we get no information about what outcome
would have resulted from recommending any of the other items. This would be like
training a classifier by picking one class ŷ for each training example x (typically the class with the highest probability according to the model) and then only getting as feedback whether this was the correct class or not. Clearly, each example conveys less information than in the supervised case, where the true label y is directly
accessible, so more examples are necessary. Worse, if we are not careful, we could
end up with a system that continues picking the wrong decisions even as more
and more data is collected, because the correct decision initially had a very low
probability: until the learner picks that correct decision, it does not learn about
the correct decision. This is similar to the situation in reinforcement learning
where only the reward for the selected action is observed. In general, reinforcement
learning can involve a sequence of many actions and many rewards. The bandits
scenario is a special case of reinforcement learning, in which the learner takes only
a single action and receives a single reward. The bandit problem is easier in the
sense that the learner knows which reward is associated with which action. In
the general reinforcement learning scenario, a high reward or a low reward might
have been caused by a recent action or by an action in the distant past. The term
contextual bandits refers to the case where the action is taken in the context of some input variable that can inform the decision. For example, we at least know the user identity, and we want to pick an item. The mapping from context to action is also called a policy. The feedback loop between the learner and the data
distribution (which now depends on the actions of the learner) is a central research
issue in the reinforcement learning and bandits literature.
Reinforcement learning requires choosing a trade-off between exploration and exploitation. Exploitation refers to taking actions that come from the current,
best version of the learned policy—actions that we know will achieve a high reward.
Exploration refers to taking actions specifically to obtain more training data. If we know that given context x, action a gives us a reward of 1, we do not know whether that is the best possible reward. We may want to exploit our current policy and continue taking action a to be relatively sure of obtaining a reward of 1. However, we may also want to explore by trying action a′. We do not know what will happen if we try action a′. We hope to get a reward of 2, but we run the risk of getting a reward of 0. Either way, we at least gain some knowledge.
Exploration can be implemented in many ways, ranging from occasionally
taking random actions intended to cover the entire space of possible actions, to
model-based approaches that compute a choice of action based on its expected
reward and the model’s amount of uncertainty about that reward.
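One of the simplest strategies of the first kind is epsilon-greedy action selection, sketched below; the context, the action set, and the estimated-reward function are hypothetical placeholders.

    import numpy as np

    def epsilon_greedy(context, actions, estimated_reward, epsilon=0.1, rng=None):
        """With probability epsilon, explore by trying a random action; otherwise
        exploit by taking the action with the highest estimated reward for this context."""
        if rng is None:
            rng = np.random.default_rng()
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]   # explore
        rewards = [estimated_reward(context, a) for a in actions]
        return actions[int(np.argmax(rewards))]          # exploit

    # toy usage: the reward model is a placeholder function of (context, action)
    pick = epsilon_greedy(context={"user": 7}, actions=["item_a", "item_b", "item_c"],
                          estimated_reward=lambda ctx, a: {"item_a": 0.2, "item_b": 0.5, "item_c": 0.1}[a])
    print(pick)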
Many factors determine the extent to which we prefer exploration or exploitation.
One of the most prominent factors is the time scale we are interested in. If the
agent has only a short amount of time to accrue reward, then we prefer more
exploitation. If the agent has a long time to accrue reward, then we begin with
more exploration so that future actions can be planned more effectively with more
knowledge. As time progresses and our learned policy improves, we move toward
more exploitation.
Supervised learning has no trade-off between exploration and exploitation
because the supervision signal always specifies which output is correct for each
input. There is no need to try out different outputs to determine if one is better
than the model’s current output—we always know that the label is the best output.
Another difficulty arising in the context of reinforcement learning, besides the
exploration-exploitation trade-off, is the difficulty of evaluating and comparing
different policies. Reinforcement learning involves interaction between the learner
and the environment. This feedback loop means that it is not straightforward to
evaluate the learner’s performance using a fixed set of test set input values. The
policy itself determines which inputs will be seen. Dudik et al. (2011) present
techniques for evaluating contextual bandits.
12.5.2 Knowledge Representation, Reasoning and Question
Answering
Deep learning approaches have been very successful in language modeling, machine
translation and natural language processing because of the use of embeddings for
symbols (Rumelhart et al., 1986a) and words (Deerwester et al., 1990; Bengio et al.,
2001). These embeddings represent semantic knowledge about individual words
and concepts. A research frontier is to develop embeddings for phrases and for
relations between words and facts. Search engines already use machine learning for
this purpose, but much more remains to be done to improve these more advanced
representations.
12.5.2.1 Knowledge, Relations and Question Answering
One interesting research direction is determining how distributed representations
can be trained to capture the relations between two entities. These relations allow us to formalize facts about objects and how objects interact with each other.

In mathematics, a binary relation is a set of ordered pairs of objects. Pairs that are in the set are said to have the relation while those not in the set do not. For example, we can define the relation "is less than" on the set of entities {1, 2, 3} by defining the set of ordered pairs S = {(1, 2), (1, 3), (2, 3)}. Once this relation is defined, we can use it like a verb. Because (1, 2) ∈ S, we say that 1 is less than 2. Because (2, 1) ∉ S, we cannot say that 2 is less than 1. Of course, the entities that are related to one another need not be numbers. We could define a relation is_a_type_of containing tuples like (dog, mammal).
In the context of AI, we think of a relation as a sentence in a syntactically
simple and highly structured language. The relation plays the role of a verb,
while two arguments to the relation play the role of its subject and object. These
sentences take the form of a triplet of tokens

(subject, verb, object)   (12.21)

with values

(entity_i, relation_j, entity_k).   (12.22)

We can also define an attribute, a concept analogous to a relation, but taking only one argument:

(entity_i, attribute_j).   (12.23)

For example, we could define the has_fur attribute, and apply it to entities like dog.
Many applications require representing relations and reasoning about them.
How should we best do this within the context of neural networks?
Machine learning models of course require training data. We can infer relations
between entities from training datasets consisting of unstructured natural language.
There are also structured databases that identify relations explicitly. A common
structure for these databases is the relational database, which stores this same kind of information, albeit not formatted as three token sentences. When a
database is intended to convey commonsense knowledge about everyday life or
expert knowledge about an application area to an artificial intelligence system, we
call the database a knowledge base. Knowledge bases range from general ones like Freebase, OpenCyc, WordNet, Wikibase,¹ and so forth, to more specialized knowledge bases like GeneOntology.² Representations for entities and relations can
be learned by considering each triplet in a knowledge base as a training example
and maximizing a training objective that captures their joint distribution (Bordes
et al., 2013a).
In addition to training data, we also need to define a model family to train.
A common approach is to extend neural language models to model entities and
relations. Neural language models learn a vector that provides a distributed
representation of each word. They also learn about interactions between words,
such as which word is likely to come after a sequence of words, by learning functions
of these vectors. We can extend this approach to entities and relations by learning
an embedding vector for each relation. In fact, the parallel between modeling
language and modeling knowledge encoded as relations is so close that researchers
have trained representations of such entities by using both knowledge bases and
natural language sentences (Bordes et al., 2011, 2012; Wang et al., 2014a), or by
combining data from multiple relational databases (Bordes et al., 2013b). Many
possibilities exist for the particular parametrization associated with such a model.
Early work on learning about relations between entities (Paccanaro and Hinton,
2000) posited highly constrained parametric forms (“linear relational embeddings”),
often using a different form of representation for the relation than for the entities.
For example, Paccanaro and Hinton (2000) and Bordes et al. (2011) used vectors for
entities and matrices for relations, with the idea that a relation acts like an operator
on entities. Alternatively, relations can be considered as any other entity (Bordes
et al., 2012), allowing us to make statements about relations, but more flexibility is
put in the machinery that combines them in order to model their joint distribution.
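As a generic illustration of the first style of parametrization (a vector per entity and a matrix per relation), the sketch below scores a triplet by how closely the relation matrix maps the subject embedding onto the object embedding; it is not a reproduction of any specific published model, and all parameters are random placeholders.

    import numpy as np

    def triplet_score(subject_vec, relation_mat, object_vec):
        """Score (subject, relation, object): higher when applying the relation
        matrix to the subject embedding lands near the object embedding."""
        return -np.linalg.norm(relation_mat @ subject_vec - object_vec)

    rng = np.random.default_rng(0)
    d = 10
    entities = {name: rng.standard_normal(d) for name in ("dog", "cat", "mammal")}
    relations = {"is_a_type_of": rng.standard_normal((d, d))}

    print(triplet_score(entities["dog"], relations["is_a_type_of"], entities["mammal"]))

Training would then push the scores of observed triplets above the scores of corrupted ones, for instance with a margin-based ranking criterion.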
A practical short-term application of such models is link prediction: predicting missing arcs in the knowledge graph. This is a form of generalization to new
facts based on old facts. Most of the knowledge bases that currently exist have
been constructed through manual labor, which tends to leave many and probably
the majority of true relations absent from the knowledge base. See Wang et al.
(2014b), Lin et al. (2015), and Garcia-Duran et al. (2015) for examples of such an
application.
¹ Respectively available from these web sites: freebase.com, cyc.com/opencyc, wordnet.princeton.edu, wikiba.se
² geneontology.org
Evaluating the performance of a model on a link prediction task is difficult
because we have only a dataset of positive examples (facts that are known to
be true). If the model proposes a fact that is not in the dataset, we are unsure
whether the model has made a mistake or discovered a new, previously unknown
fact. The metrics are thus somewhat imprecise and are based on testing how the
model ranks a held-out set of known true positive facts compared to other facts
that are less likely to be true. A common way to construct interesting examples
that are probably negative (facts that are probably false) is to begin with a true
fact and create corrupted versions of that fact, for example, by replacing one entity
in the relation with a different entity selected at random. The popular precision at
10 percent metric counts how many times the model ranks a “correct” fact among
the top 10 percent of all corrupted versions of that fact.
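A sketch of this evaluation protocol follows; the score function stands in for whichever trained model is being evaluated, and here the corruptions replace only the object entity.

    import numpy as np

    def rank_against_corruptions(true_fact, all_entities, score):
        """Rank a held-out true fact (s, r, o) against corrupted versions in which the
        object o is replaced by every other entity; rank 1 means it beats all corruptions."""
        s, r, o = true_fact
        true_score = score(s, r, o)
        corrupted_scores = [score(s, r, e) for e in all_entities if e != o]
        return 1 + sum(cs >= true_score for cs in corrupted_scores)

    def precision_at_top_fraction(ranks, n_candidates, fraction=0.10):
        """Fraction of held-out facts ranked within the top `fraction` of their candidate lists."""
        cutoff = max(1, int(fraction * n_candidates))
        return float(np.mean([r <= cutoff for r in ranks]))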
Another application of knowledge bases and distributed representations for
them is word-sense disambiguation (Navigli and Velardi, 2005; Bordes et al.,
2012), which is the task of deciding which sense of a word is the appropriate one
in some context.
Eventually, knowledge of relations combined with a reasoning process and an
understanding of natural language could allow us to build a general question-
answering system. A general question-answering system must be able to process
input information and remember important facts, organized in a way that enables
it to retrieve and reason about them later. This remains a difficult open problem
that can only be solved in restricted “toy” environments. Currently, the best
approach to remembering and retrieving specific declarative facts is to use an
explicit memory mechanism, as described in section 10.12. Memory networks were
first proposed to solve a toy question-answering task (Weston et al., 2014). Kumar
et al. (2015) have proposed an extension that uses GRU recurrent nets to read
the input into the memory and to produce the answer given the contents of the
memory.
Deep learning has been applied to many other applications besides the ones
described here, and will surely be applied to even more after this writing. It would
be impossible to describe anything remotely resembling a comprehensive coverage
of such a topic. This survey provides a representative sample of what is possible
as of this writing.
This concludes part II, which has described modern practices involving deep
networks, comprising all the most successful methods. Generally speaking, these
methods involve using the gradient of a cost function to find the parameters of a
model that approximates some desired function. With enough training data, this
approach is extremely powerful. We now turn to part III, in which we step into the
territory of research—methods that are designed to work with less training data
or to perform a greater variety of tasks, where the challenges are more difficult
and not as close to being solved as the situations we have described so far.