Chapter 12
Applications
In this chapter, we describe how to use deep learning to solve applications in
computer vision, speech recognition, natural language processing, and other areas
of commercial interest. We begin by discussing the large-scale neural network
implementations required for most serious AI applications. Next, we review several
specific application areas to which deep learning has been applied. While one
goal of deep learning is to design algorithms that are capable of solving a broad
variety of tasks, so far some degree of specialization is needed. For example, vision
tasks require processing a large number of input features (pixels) per example.
Language tasks require modeling a large number of possible values (words in the
vocabulary) per input feature.
12.1 Large-Scale Deep Learning
Deep learning is based on the philosophy of connectionism: while an individual
biological neuron or an individual feature in a machine learning model is not
intelligent, a large population of these neurons or features acting together can
exhibit intelligent behavior. It is important to emphasize that the number of neurons must be large. One of the key factors responsible for the improvement in neural networks' accuracy and in the complexity of the tasks they can solve between the 1980s and today is the dramatic increase in the size of the networks we use. As we saw in section 1.2.3, network sizes have
grown exponentially for the past three decades, yet artificial neural networks are
only as large as the nervous systems of insects.
Because the size of neural networks is critical, deep learning requires high-performance hardware and software infrastructure.
12.1.1 Fast CPU Implementations
Traditionally, neural networks were trained using the CPU of a single machine.
Today, this approach is generally considered insufficient. We now mostly use GPU
computing or the CPUs of many machines networked together. Before moving to
these expensive setups, researchers worked hard to demonstrate that CPUs could
not manage the high computational workload required by neural networks.
A description of how to implement efficient numerical CPU code is beyond
the scope of this book, but we emphasize here that careful implementation for
specific CPU families can yield large improvements. For example, in 2011, the best
CPUs available could run neural network workloads faster when using fixed-point
arithmetic rather than floating-point arithmetic. By creating a carefully tuned fixed-
point implementation, Vanhoucke et al. (2011) obtained a threefold speedup over
a strong floating-point system. Each new model of CPU has different performance
characteristics, so sometimes floating-point implementations can be faster too.
The important principle is that careful specialization of numerical computation
routines can yield a large payoff. Other strategies, besides choosing whether to use
fixed or floating point, include optimizing data structures to avoid cache misses
and using vector instructions. Many machine learning researchers neglect these
implementation details, but when the performance of an implementation restricts
the size of the model, the accuracy of the model suffers.
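The payoff from such low-level care can be illustrated even from a high-level language. The sketch below (hypothetical layer sizes; NumPy assumed to be available) contrasts a forward pass written with explicit Python loops against one that defers to a single vectorized matrix multiplication, which an optimized BLAS backs with SIMD instructions and cache blocking.

```python
import time
import numpy as np

# Hypothetical sizes for one fully connected layer.
n_in, n_out, batch = 1024, 1024, 128
W = np.random.randn(n_out, n_in).astype(np.float32)
x = np.random.randn(batch, n_in).astype(np.float32)

def forward_loop(W, x):
    # Explicit Python loops: poor cache behavior and no vector instructions.
    out = np.zeros((x.shape[0], W.shape[0]), dtype=np.float32)
    for b in range(x.shape[0]):
        for i in range(W.shape[0]):
            out[b, i] = np.dot(W[i], x[b])
    return out

def forward_vectorized(W, x):
    # One matrix multiplication dispatched to an optimized BLAS routine.
    return x @ W.T

t0 = time.perf_counter(); forward_loop(W, x)
t1 = time.perf_counter(); forward_vectorized(W, x)
t2 = time.perf_counter()
print(f"looped: {t1 - t0:.3f}s   vectorized: {t2 - t1:.3f}s")
```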
12.1.2 GPU Implementations
Most modern neural network implementations are based on graphics processing
units. Graphics processing units (GPUs) are specialized hardware components
that were originally developed for graphics applications. The consumer market for
video gaming systems spurred development of graphics processing hardware. The
performance characteristics needed for good video gaming systems turn out to be
beneficial for neural networks as well.
Video game rendering requires performing many operations in parallel quickly.
Models of characters and environments are specified via lists of 3-D coordinates of
vertices. Graphics cards must perform matrix multiplication and division on many
vertices in parallel to convert these 3-D coordinates into 2-D on-screen coordinates.
The graphics card must then perform many computations at each pixel in parallel to
determine the color of each pixel. In both cases, the computations are fairly simple
and do not involve much branching compared to the computational workload that
a CPU usually encounters. For example, each vertex in the same rigid object will
be multiplied by the same matrix; there is no need to evaluate an if statement per
vertex to determine which matrix to multiply by. The computations are also entirely
independent of each other, and thus may be parallelized easily. The computations
also involve processing massive buffers of memory, containing bitmaps describing
the texture (color pattern) of each object to be rendered. Together, this results in
graphics cards having been designed to have a high degree of parallelism and high
memory bandwidth, at the cost of having a lower clock speed and less branching
capability relative to traditional CPUs.
Neural network algorithms require the same performance characteristics as the
real-time graphics algorithms described above. Neural networks usually involve
large and numerous buffers of parameters, activation values, and gradient values,
each of which must be completely updated during every step of training. These
buffers are large enough to fall outside the cache of a traditional desktop computer,
so the memory bandwidth of the system often becomes the rate-limiting factor.
GPUs offer a compelling advantage over CPUs because of their high memory
bandwidth. Neural network training algorithms typically do not involve much
branching or sophisticated control, so they are appropriate for GPU hardware.
Since neural networks can be divided into multiple individual “neurons” that can be
processed independently from the other neurons in the same layer, neural networks
easily benefit from the parallelism of GPU computing.
GPU hardware was originally so specialized that it could be used only for
graphics tasks. Over time, GPU hardware became more flexible, allowing custom
subroutines to be used to transform the coordinates of vertices or to assign colors
to pixels. In principle, there was no requirement that these pixel values actually
be based on a rendering task. These GPUs could be used for scientific computing
by writing the output of a computation to a buffer of pixel values. Steinkrau et al.
(2005) implemented a two-layer fully connected neural network on a GPU and
reported a three-times speedup over their CPU-based baseline. Shortly thereafter,
Chellapilla et al. (2006) demonstrated that the same technique could be used to
accelerate supervised convolutional networks.
The popularity of graphics cards for neural network training exploded after the advent of general purpose GPUs. These GP-GPUs could execute arbitrary code, not just rendering subroutines. NVIDIA's CUDA programming language
provided a way to write this arbitrary code in a C-like language. With their
relatively convenient programming model, massive parallelism, and high memory
bandwidth, GP-GPUs now offer an ideal platform for neural network programming.
This platform was rapidly adopted by deep learning researchers soon after it became
available (Raina et al., 2009; Ciresan et al., 2010).
Writing efficient code for GP-GPUs remains a difficult task best left to special-
ists. The techniques required to obtain good performance on GPU are very different
from those used on CPU. For example, good CPU-based code is usually designed
to read information from the cache as much as possible. On GPU, most writable
memory locations are not cached, so it can actually be faster to compute the same
value twice, rather than compute it once and read it back from memory. GPU code
is also inherently multithreaded and the different threads must be coordinated
with each other carefully. For example, memory operations are faster if they can be coalesced. Coalesced reads or writes occur when several threads can each read or write a value that they need simultaneously, as part of a single memory transaction. Different models of GPUs are able to coalesce different kinds of read patterns and different kinds of write patterns. Typically, memory operations are easier to coalesce if among n threads, thread i accesses byte i + j of memory, and j is a multiple of some power of 2. The exact specifications differ between models of GPU. Another common consideration for GPUs is making sure that each thread in a group executes the same instruction simultaneously. This means that branching can be difficult on GPU. Threads are divided into small groups called warps. Each thread in a warp executes the same instruction during each cycle, so if different threads within the same warp need to execute different code paths, these different code paths must be traversed sequentially rather than in parallel.
Because of the difficulty of writing high-performance GPU code, researchers
should structure their workflow to avoid needing to write new GPU code to test
new models or algorithms. Typically, one can do this by building a software library
of high-performance operations like convolution and matrix multiplication, then
specifying models in terms of calls to this library of operations. For example,
the machine learning library Pylearn2 (Goodfellow et al., 2013c) specifies all its
machine learning algorithms in terms of calls to Theano (Bergstra et al., 2010;
Bastien et al., 2012) and cuda-convnet (Krizhevsky, 2010), which provide these
high-performance operations. This factored approach can also ease support for
multiple kinds of hardware. For example, the same Theano program can run on
either CPU or GPU, without needing to change any of the calls to Theano itself.
Other libraries like TensorFlow (Abadi et al., 2015) and Torch (Collobert et al.,
2011b) provide similar features.
12.1.3 Large-Scale Distributed Implementations
In many cases, the computational resources available on a single machine are
insufficient. We therefore want to distribute the workload of training and inference
across many machines.
Distributing inference is simple, because each input example we want to process
can be run by a separate machine. This is known as data parallelism.
It is also possible to get model parallelism, where multiple machines work together on a single data point, with each machine running a different part of the model. This is feasible for both inference and training.
Data parallelism during training is somewhat harder. We can increase the size
of the minibatch used for a single SGD step, but usually we get less than linear
returns in terms of optimization performance. It would be better to allow multiple
machines to compute multiple gradient descent steps in parallel. Unfortunately, the standard definition of gradient descent is as a completely sequential algorithm: the gradient at step t is a function of the parameters produced by step t − 1.

This can be solved using asynchronous stochastic gradient descent (Bengio et al., 2001; Recht et al., 2011). In this approach, several processor cores share the memory representing the parameters. Each core reads parameters without a lock, then computes a gradient, then increments the parameters without a lock.
This reduces the average amount of improvement that each gradient descent step
yields, because some of the cores overwrite each other’s progress, but the increased
rate of production of steps causes the learning process to be faster overall. Dean et al. (2012) pioneered the multimachine implementation of this lock-free approach to gradient descent, where the parameters are managed by a parameter server rather than stored in shared memory. Distributed asynchronous gradient descent
remains the primary strategy for training large deep networks and is used by
most major deep learning groups in industry (Chilimbi et al., 2014; Wu et al.,
2015). Academic deep learning researchers typically cannot afford the same scale
of distributed learning systems, but some research has focused on how to build
distributed networks with relatively low-cost hardware available in the university
setting (Coates et al., 2013).
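The single-machine, shared-memory version of this idea can be sketched in a few lines. The example below is a conceptual illustration only (Python threads are serialized by the interpreter lock, and the problem is a toy linear regression), but it shows the lock-free pattern: every worker reads and updates the same parameter vector without any synchronization.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
n_features = 10
true_w = rng.normal(size=n_features)
X = rng.normal(size=(10_000, n_features))
y = X @ true_w + 0.01 * rng.normal(size=10_000)

w = np.zeros(n_features)   # shared parameters, read and written without a lock
lr = 0.01

def worker(seed, n_steps):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = local_rng.integers(len(X))          # pick one training example
        grad = (X[i] @ w - y[i]) * X[i]         # gradient of the squared error
        w[:] = w - lr * grad                    # unlocked update of shared memory

threads = [threading.Thread(target=worker, args=(s, 5_000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance from true parameters:", np.linalg.norm(w - true_w))
```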
12.1.4 Model Compression
In many commercial applications, it is much more important that the time and
memory cost of running inference in a machine learning model be low than that
the time and memory cost of training be low. For applications that do not require
personalization, it is possible to train a model once, then deploy it to be used by
billions of users. In many cases, the end user is more resource constrained than
the developer. For example, one might train a speech recognition network with a
powerful computer cluster, then deploy it on mobile phones.
A key strategy for reducing the cost of inference is model compression (Buciluǎ et al., 2006). The basic idea of model compression is to replace the original, expensive model with a smaller model that requires less memory and runtime to store and evaluate.
Model compression is applicable when the size of the original model is driven
primarily by a need to prevent overfitting. In most cases, the model with the
lowest generalization error is an ensemble of several independently trained models.
Evaluating all n ensemble members is expensive. Sometimes, even a single model generalizes better if it is large (for example, if it is regularized with dropout).

These large models learn some function f(x), but do so using many more parameters than are necessary for the task. Their size is necessary only because of the limited number of training examples. As soon as we have fit this function f(x), we can generate a training set containing infinitely many examples, simply by applying f to randomly sampled points x. We then train the new, smaller model to match f(x) on these points. To most efficiently use the capacity of the new, small model, it is best to sample the new x points from a distribution resembling the actual test inputs that will be supplied to the model later. This can be done by corrupting training examples or by drawing points from a generative model trained on the original training set.

Alternatively, one can train the smaller model only on the original training points, but train it to copy other features of the model, such as its posterior distribution over the incorrect classes (Hinton et al., 2014, 2015).
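A minimal sketch of this procedure is given below. The "large model" here is a stand-in one-hidden-layer network with fixed random weights (not a trained ensemble), and the "small model" is a linear map fit by least squares; the point is only to show the workflow: label cheaply generated inputs with the large model, then fit the smaller model to those labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 20, 512

# Stand-in for the expensive teacher model f(x): a fixed random one-hidden-layer net.
W1 = rng.normal(size=(n_in, n_hidden))
W2 = rng.normal(size=(n_hidden, 1))

def large_model(x):
    return np.tanh(x @ W1) @ W2

# Original training inputs, plus extra points sampled near them to mimic the
# distribution of inputs the deployed model will actually see.
X_train = rng.normal(size=(200, n_in))
X_extra = X_train[rng.integers(200, size=5000)] + 0.1 * rng.normal(size=(5000, n_in))
X_all = np.vstack([X_train, X_extra])

# Label every point with the teacher, then fit the small student model to match.
targets = large_model(X_all)
W_small, *_ = np.linalg.lstsq(X_all, targets, rcond=None)

X_test = rng.normal(size=(100, n_in))
err = np.mean((X_test @ W_small - large_model(X_test)) ** 2)
print("student-teacher mean squared mismatch:", err)
```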
12.1.5 Dynamic Structure
One strategy for accelerating data-processing systems in general is to build systems that have dynamic structure in the graph describing the computation needed to process an input. Data-processing systems can dynamically determine which subset of many neural networks should be run on a given input. Individual neural networks can also exhibit dynamic structure internally by determining which subset of features (hidden units) to compute given information from the input. This form of dynamic structure inside neural networks is sometimes called conditional computation (Bengio, 2013; Bengio et al., 2013b). Since many components of the architecture may be relevant only for a small number of possible inputs, the
system can run faster by computing these features only when they are needed.
Dynamic structure of computations is a basic computer science principle applied
generally throughout the software engineering discipline. The simplest versions
of dynamic structure applied to neural networks are based on determining which
subset of some group of neural networks (or other machine learning models) should
be applied to a particular input.
A venerable strategy for accelerating inference in a classifier is to use a cascade of classifiers. The cascade strategy may be applied when the goal is to detect the presence of a rare object (or event). To know for sure that the object is present,
presence of a rare object (or event). To know for sure that the object is present,
we must use a sophisticated classifier with high capacity, which is expensive to run.
Because the object is rare, however, we can usually use much less computation
to reject inputs as not containing the object. In these situations, we can train a
sequence of classifiers. The first classifiers in the sequence have low capacity and
are trained to have high recall. In other words, they are trained to make sure
we do not wrongly reject an input when the object is present. The final classifier
is trained to have high precision. At test time, we run inference by running the
classifiers in a sequence, abandoning any example as soon as any one element in
the cascade rejects it. Overall, this allows us to verify the presence of objects with
high confidence, using a high capacity model, but does not force us to pay the cost
of full inference for every example. There are two different ways that the cascade
can achieve high capacity. One way is to make the later members of the cascade
individually have high capacity. In this case, the system as a whole obviously has
high capacity, because some of its individual members do. It is also possible to
make a cascade in which every individual model has low capacity but the system
as a whole has high capacity as a result of the combination of many small models.
Viola and Jones (2001) used a cascade of boosted decision trees to implement a
fast and robust face detector suitable for use in handheld digital cameras. Their
classifier localizes a face using essentially a sliding window approach in which many
windows are examined and rejected if they do not contain faces. Another version of
cascades uses the earlier models to implement a sort of hard attention mechanism:
the early members of the cascade localize an object, and later members of the
cascade perform further processing given the location of the object. For example,
Google transcribes address numbers from Street View imagery using a two-step
cascade that first locates the address number with one machine learning model
and then transcribes it with another (Goodfellow et al., 2014d).
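The control flow of cascade inference is simple enough to sketch directly. In the toy example below, the stage classifiers are hypothetical callables that return a score in [0, 1]; in a real detector the early stages would be cheap, high-recall models and the final stage an expensive, high-precision one.

```python
import numpy as np

def cascade_predict(x, stages, thresholds):
    """Return True only if every stage in the cascade accepts the input x."""
    for classifier, threshold in zip(stages, thresholds):
        if classifier(x) < threshold:
            return False          # rejected early: later, costlier stages never run
    return True

# Toy stand-in stages: progressively sharper (and, in a real system, costlier) scorers.
rng = np.random.default_rng(0)
stages = [
    lambda x: 1.0 / (1.0 + np.exp(-x.sum())),            # cheap, coarse screen
    lambda x: 1.0 / (1.0 + np.exp(-(x ** 2).sum() + 5)),  # more expensive final check
]
thresholds = [0.2, 0.8]   # early stages use low thresholds to keep recall high

x = rng.normal(size=8)
print("object detected:", cascade_predict(x, stages, thresholds))
```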
Decision trees themselves are an example of dynamic structure, because each
node in the tree determines which of its subtrees should be evaluated for each input.
A simple way to accomplish the union of deep learning and dynamic structure
is to train a decision tree in which each node uses a neural network to make the
splitting decision (Guo and Gelfand, 1992), though this has typically not been
done with the primary goal of accelerating inference computations.
In the same spirit, one can use a neural network called the gater to select which one out of several expert networks will be used to compute the output, given the current input. The first version of this idea is called the mixture of experts (Nowlan, 1990; Jacobs et al., 1991), in which the gater outputs a set of probabilities or weights (obtained via a softmax nonlinearity), one per expert, and the final output is obtained by the weighted combination of the output of the experts. In that case, the use of the gater does not offer a reduction in computational cost, but if a single expert is chosen by the gater for each example, we obtain the hard mixture of experts (Collobert et al., 2001, 2002), which
can considerably accelerate training and inference time. This strategy works well
when the number of gating decisions is small because it is not combinatorial. But
when we want to select different subsets of units or parameters, it is not possible
to use a “soft switch” because it requires enumerating (and computing outputs for)
all the gater configurations. To deal with this problem, several approaches have
been explored to train combinatorial gaters. Bengio et al. (2013b) experiment with
several estimators of the gradient on the gating probabilities, while Bacon et al.
(2015) and Bengio et al. (2015a) use reinforcement learning techniques (policy
gradient) to learn a form of conditional dropout on blocks of hidden units and get
an actual reduction in computational cost without negatively affecting the quality
of the approximation.
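The difference between the soft and the hard mixture can be seen in a small sketch. The experts below are hypothetical linear maps and the gater is a random linear-softmax layer; the soft version evaluates every expert, while the hard version evaluates only the most probable one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_experts = 16, 4, 3

experts = [rng.normal(size=(n_in, n_out)) for _ in range(n_experts)]
W_gate = rng.normal(size=(n_in, n_experts))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_mixture(x):
    weights = softmax(x @ W_gate)                 # one weight per expert
    return sum(w * (x @ E) for w, E in zip(weights, experts))

def hard_mixture(x):
    weights = softmax(x @ W_gate)
    k = int(np.argmax(weights))                   # pick one expert only
    return x @ experts[k]                         # the other experts never run

x = rng.normal(size=n_in)
print("soft output:", soft_mixture(x))
print("hard output:", hard_mixture(x))
```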
Another kind of dynamic structure is a switch, where a hidden unit can
receive input from different units depending on the context. This dynamic routing
approach can be interpreted as an attention mechanism (Olshausen et al., 1993).
So far, the use of a hard switch has not proven effective on large-scale applications.
Contemporary approaches instead use a weighted average over many possible
inputs, and thus do not achieve all the possible computational benefits of dynamic
structure. Contemporary attention mechanisms are described in section 12.4.5.1.
One major obstacle to using dynamically structured systems is the decreased
degree of parallelism that results from the system following different code branches
for different inputs. This means that few operations in the network can be described
as matrix multiplication or batch convolution on a minibatch of examples. We
can write more specialized subroutines that convolve each example with different
kernels or multiply each row of a design matrix by a different set of columns
of weights. Unfortunately, these more specialized subroutines are difficult to
implement efficiently. CPU implementations will be slow as a result of the lack
of cache coherence, and GPU implementations will be slow because of the lack of
coalesced memory transactions and the need to serialize warps when members of
a warp take different branches. In some cases, these issues can be mitigated by
partitioning the examples into groups that all take the same branch, then processing
these groups of examples simultaneously. This can be an acceptable strategy for
minimizing the time required to process a fixed number of examples in an offline
setting. In a real-time setting where examples must be processed continuously,
partitioning the workload can result in load-balancing issues. For example, if we
assign one machine to process the first step in a cascade and another machine to
process the last step in a cascade, then the first will tend to be overloaded, and the
last will tend to be underloaded. Similar issues arise if each machine is assigned to
implement different nodes of a neural decision tree.
12.1.6 Specialized Hardware Implementations of Deep Networks
Since the early days of neural networks research, hardware designers have worked
on specialized hardware implementations that could speed up training and/or
inference of neural network algorithms. See Lindsey and Lindblad (1994), Beiu et al. (2003), and Misra and Saha (2010) for early and more recent reviews of specialized hardware for deep networks.
Different forms of specialized hardware (Graf and Jackel, 1989; Mead and Ismail,
2012; Kim et al., 2009; Pham et al., 2012; Chen et al., 2014a,b) have been developed
over the last decades with ASICs (application-specific integrated circuits), either
with digital (based on binary representations of numbers), analog (Graf and Jackel,
1989; Mead and Ismail, 2012) (based on physical implementations of continuous
values as voltages or currents), or hybrid implementations (combining digital and
analog components). In recent years more flexible FPGA (field programmable gate array) implementations (where the particulars of the circuit can be written
on the chip after it has been built) have been developed.
Though software implementations on general-purpose processing units (CPUs
and GPUs) typically use 32 or 64 bits of precision to represent floating-point
numbers, it has long been known that it was possible to use less precision, at
least at inference time (Holt and Baker, 1991; Holi and Hwang, 1993; Presley
and Haggard, 1994; Simard and Graf, 1994; Wawrzynek et al., 1996; Savich et al.,
2007). This has become a more pressing issue in recent years as deep learning
has gained in popularity in industrial products, and as the great impact of faster
hardware was demonstrated with GPUs. Another factor that motivates current
research on specialized hardware for deep networks is that the rate of progress of
a single CPU or GPU core has slowed down, and most recent improvements in
computing speed have come from parallelization across cores (either in CPUs or
GPUs). This is very different from the situation of the 1990s (the previous neural
network era), when the hardware implementations of neural networks (which might
take two years from inception to availability of a chip) could not keep up with
the rapid progress and low prices of general-purpose CPUs. Building specialized
hardware is thus a way to push the envelope further, at a time when new hardware
designs are being developed for low-power devices such as phones, aiming for
general-public applications of deep learning (e.g., with speech, computer vision or
natural language).
Recent work on low-precision implementations of backprop-based neural nets
(Vanhoucke et al., 2011; Courbariaux et al., 2015; Gupta et al., 2015) suggests
that between 8 and 16 bits of precision can suffice for using or training deep
neural networks with back-propagation. What is clear is that more precision is
required during training than at inference time, and that some forms of dynamic
fixed-point representation of numbers can be used to reduce how many bits are
required per number. Traditional fixed-point numbers are restricted to a fixed
range (which corresponds to a given exponent in a floating-point representation).
Dynamic fixed-point representations share that range among a set of numbers
(such as all the weights in one layer). Using fixed-point rather than floating-point
representations and using fewer bits per number reduces the hardware surface area,
power requirements, and computing time needed for performing multiplications,
and multiplications are the most demanding of the operations needed to use or
train a modern deep network with backprop.
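A minimal sketch of the shared-scale idea follows, with assumptions: 8-bit signed storage, one scale per layer chosen from the largest absolute weight, and no treatment of activations or of training-time quantization.

```python
import numpy as np

def quantize_layer(weights, n_bits=8):
    """Map float weights to shared-scale integers; return integers and the scale."""
    max_int = 2 ** (n_bits - 1) - 1                 # e.g. 127 for 8 bits
    scale = np.abs(weights).max() / max_int         # one scale shared by the whole layer
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize_layer(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)

q, scale = quantize_layer(W)
W_hat = dequantize_layer(q, scale)
print("max absolute quantization error:", np.abs(W - W_hat).max())
print("storage: %d bytes (int8) vs %d bytes (float32)" % (q.nbytes, W.nbytes))
```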
12.2 Computer Vision
Computer vision has traditionally been one of the most active research areas for
deep learning applications, because vision is a task that is effortless for humans
and many animals but challenging for computers (Ballard et al., 1983). Many of
the most popular standard benchmark tasks for deep learning algorithms are forms
of object recognition or optical character recognition.
Computer vision is a very broad field encompassing a wide variety of ways to
process images and an amazing diversity of applications. Applications of computer
vision range from reproducing human visual abilities, such as recognizing faces,
to creating entirely new categories of visual abilities. As an example of the latter
category, one recent computer vision application is to recognize sound waves
from the vibrations they induce in objects visible in a video (Davis et al., 2014).
Most deep learning research on computer vision has focused not on such exotic
applications that expand the realm of what is possible with imagery but rather
on a small core of AI goals aimed at replicating human abilities. Most deep
learning for computer vision is used for object recognition or detection of some
form, whether this means reporting which object is present in an image, annotating
an image with bounding boxes around each object, transcribing a sequence of
symbols from an image, or labeling each pixel in an image with the identity of the
object it belongs to. Because generative modeling has been a guiding principle
of deep learning research, there is also a large body of work on image synthesis
using deep models. While image synthesis ex nihilo is usually not considered a
computer vision endeavor, models capable of image synthesis are usually useful for
image restoration, a computer vision task involving repairing defects in images or
removing objects from images.
12.2.1 Preprocessing
Many application areas require sophisticated preprocessing because the original
input comes in a form that is difficult for many deep learning architectures to
represent. Computer vision usually requires relatively little of this kind of pre-
processing. The images should be standardized so that their pixels all lie in the
same reasonable range, like [0,1] or [-1, 1]. Mixing images that lie in [0,1] with
images that lie in [0, 255] will usually result in failure. Formatting images to have
the same scale is the only kind of preprocessing that is strictly necessary. Many
computer vision architectures require images of a standard size, so images must be
cropped or scaled to fit that size. Even this rescaling is not always strictly necessary.
Some convolutional models accept variably sized inputs and dynamically adjust
the size of their pooling regions to keep the output size constant (Waibel et al.,
1989). Other convolutional models have variably sized output that automatically
scales in size with the input, such as models that denoise or label each pixel in an
image (Hadsell et al., 2007).
Dataset augmentation may be seen as a way of preprocessing the training set
only. Dataset augmentation is an excellent way to reduce the generalization error
of most computer vision models. A related idea applicable at test time is to show
the model many different versions of the same input (for example, the same image
cropped at slightly different locations) and have the different instantiations of the
model vote to determine the output. This latter idea can be interpreted as an
ensemble approach, and it helps to reduce generalization error.
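The voting idea at test time can be sketched as follows, with a hypothetical `model` callable standing in for a trained classifier that maps an image crop to class probabilities.

```python
import numpy as np

def multi_crop_predict(model, image, crop_size, offsets):
    """Average the model's predictions over several crops of one image."""
    probs = []
    for dy, dx in offsets:
        crop = image[dy:dy + crop_size, dx:dx + crop_size]
        probs.append(model(crop))
    return np.mean(probs, axis=0)

# Toy stand-in model: "probabilities" derived from mean pixel intensity.
def toy_model(crop):
    p = np.clip(crop.mean(), 0.0, 1.0)
    return np.array([p, 1.0 - p])

rng = np.random.default_rng(0)
image = rng.random((40, 40))
offsets = [(0, 0), (0, 8), (8, 0), (8, 8), (4, 4)]
print("averaged prediction:", multi_crop_predict(toy_model, image, 32, offsets))
```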
Other kinds of preprocessing are applied to both the training and the test set
with the goal of putting each example into a more canonical form to reduce the
amount of variation that the model needs to account for. Reducing the amount of
variation in the data can reduce both generalization error and the size of the model
needed to fit the training set. Simpler tasks may be solved by smaller models, and
simpler solutions are more likely to generalize well. Preprocessing of this kind is
usually designed to remove some kind of variability in the input data that is easy
for a human designer to describe and that the human designer is confident has no
relevance to the task. When training with large datasets and large models, this
kind of preprocessing is often unnecessary, and it is best to just let the model learn
which kinds of variability it should become invariant to. For example, the AlexNet
system for classifying ImageNet has only one preprocessing step: subtracting the
mean across training examples of each pixel (Krizhevsky et al., 2012).
12.2.1.1 Contrast Normalization
One of the most obvious sources of variation that can be safely removed for
many tasks is the amount of contrast in the image. Contrast simply refers to the
magnitude of the difference between the bright and the dark pixels in an image.
There are many ways of quantifying the contrast of an image. In the context of
deep learning, contrast usually refers to the standard deviation of the pixels in an
image or region of an image. Suppose we have an image represented by a tensor X ∈ ℝ^{r×c×3}, with X_{i,j,1} being the red intensity at row i and column j, X_{i,j,2} giving the green intensity, and X_{i,j,3} giving the blue intensity. Then the contrast of the entire image is given by

$$\sqrt{\frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3}\left(X_{i,j,k}-\bar{X}\right)^{2}}, \qquad (12.1)$$

where X̄ is the mean intensity of the entire image:

$$\bar{X} = \frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3} X_{i,j,k}. \qquad (12.2)$$
Global contrast normalization (GCN) aims to prevent images from having varying amounts of contrast by subtracting the mean from each image, then rescaling it so that the standard deviation across its pixels is equal to some constant s. This approach is complicated by the fact that no scaling factor can change the contrast of a zero-contrast image (one whose pixels all have equal
change the contrast of a zero-contrast image (one whose pixels all have equal
intensity). Images with very low but nonzero contrast often have little information
content. Dividing by the true standard deviation usually accomplishes nothing
more than amplifying sensor noise or compression artifacts in such cases. This
Figure 12.1: GCN maps examples onto a sphere. (Left) Raw input data may have any norm. (Center) GCN with λ = 0 maps all nonzero examples perfectly onto a sphere. Here we use s = 1 and ε = 10^−8. Because we use GCN based on normalizing the standard deviation rather than the L2 norm, the resulting sphere is not the unit sphere. (Right) Regularized GCN, with λ > 0, draws examples toward the sphere but does not completely discard the variation in their norm. We leave s and ε the same as before.
motivates introducing a small positive regularization parameter λ to bias the estimate of the standard deviation. Alternately, one can constrain the denominator to be at least ε. Given an input image X, GCN produces an output image X′, defined such that

$$X'_{i,j,k} = s\,\frac{X_{i,j,k}-\bar{X}}{\max\left\{\epsilon,\ \sqrt{\lambda + \frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3}\left(X_{i,j,k}-\bar{X}\right)^{2}}\right\}}. \qquad (12.3)$$
Datasets consisting of large images cropped to interesting objects are unlikely to contain any images with nearly constant intensity. In these cases, it is safe to practically ignore the small denominator problem by setting λ = 0 and avoid division by 0 in extremely rare cases by setting ε to an extremely low value like 10^−8. This is the approach used by Goodfellow et al. (2013a) on the CIFAR-10 dataset. Small images cropped randomly are more likely to have nearly constant intensity, making aggressive regularization more useful. Coates et al. (2011) used ε = 0 and λ = 10 on small, randomly selected patches drawn from CIFAR-10.

The scale parameter s can usually be set to 1, as done by Coates et al. (2011), or chosen to make each individual pixel have standard deviation across examples close to 1, as done by Goodfellow et al. (2013a).
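Equation 12.3 translates directly into a few lines of NumPy. The sketch below assumes an image stored as an (r, c, 3) array and exposes s, λ, and ε as arguments.

```python
import numpy as np

def global_contrast_normalize(X, s=1.0, lam=0.0, eps=1e-8):
    X = X.astype(np.float64)
    X_mean = X.mean()
    X_centered = X - X_mean
    # Contrast of the image: (regularized) standard deviation over all pixels and channels.
    contrast = np.sqrt(lam + np.mean(X_centered ** 2))
    return s * X_centered / max(contrast, eps)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3))
normalized = global_contrast_normalize(image, s=1.0, lam=10.0)
print("mean:", normalized.mean(), "std:", normalized.std())
```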
The standard deviation in equation 12.3 is just a rescaling of the L2 norm of the image (assuming the mean of the image has already been removed). It is preferable to define GCN in terms of standard deviation rather than L2 norm because the standard deviation includes division by the number of pixels, so GCN based on standard deviation allows the same s to be used regardless of image size. However, the observation that the L2 norm is proportional to the standard
deviation can help build a useful intuition. One can understand GCN as mapping
examples to a spherical shell. See figure 12.1 for an illustration. This can be a
useful property because neural networks are often better at responding to directions
in space rather than to exact locations. Responding to multiple distances in the
same direction requires hidden units with collinear weight vectors but different
biases. Such coordination can be difficult for the learning algorithm to discover.
Additionally, many shallow graphical models have problems with representing
multiple separated modes along the same line. GCN avoids these problems by
reducing each example to a direction rather than a direction and a distance.
Counterintuitively, there is a preprocessing operation known as sphering that
is not the same operation as GCN. Sphering does not refer to making the data
lie on a spherical shell, but rather refers to rescaling the principal components to
have equal variance, so that the multivariate normal distribution used by PCA has
spherical contours. Sphering is more commonly known as whitening.
Global contrast normalization will often fail to highlight image features we
would like to have stand out, such as edges and corners. If we have a scene with a
large dark area and a large bright area (such as a city square with half the image in
the shadow of a building), then global contrast normalization will ensure that there
is a large difference between the brightness of the dark area and the brightness
of the light area. It will not, however, ensure that edges within the dark region
stand out.
This motivates local contrast normalization. Local contrast normalization
ensures that the contrast is normalized across each small window, rather than over
the image as a whole. See figure 12.2 for a comparison of global and local contrast
normalization.
Various definitions of local contrast normalization are possible. In all cases,
one modifies each pixel by subtracting a mean of nearby pixels and dividing by
a standard deviation of nearby pixels. In some cases, this is literally the mean
and standard deviation of all pixels in a rectangular window centered on the
pixel to be modified (Pinto et al., 2008). In other cases, this is a weighted mean
and weighted standard deviation using Gaussian weights centered on the pixel to
be modified. In the case of color images, some strategies process different color
channels separately while others combine information from different channels to
normalize each pixel (Sermanet et al., 2012).

Figure 12.2: A comparison of global and local contrast normalization. Visually, the effects of global contrast normalization are subtle. It places all images on roughly the same scale, which reduces the burden on the learning algorithm to handle multiple scales. Local contrast normalization modifies the image much more, discarding all regions of constant intensity. This allows the model to focus on just the edges. Regions of fine texture, such as the houses in the second row, may lose some detail due to the bandwidth of the normalization kernel being too high. (Panels: input image, GCN, LCN.)
Local contrast normalization can usually be implemented efficiently by using
separable convolution (see section 9.8) to compute feature maps of local means and
local standard deviations, then using element-wise subtraction and element-wise
division on different feature maps.
Local contrast normalization is a differentiable operation and can also be used as
a nonlinearity applied to the hidden layers of a network, as well as a preprocessing
operation applied to the input.
As with global contrast normalization, we typically need to regularize local
contrast normalization to avoid division by zero. In fact, because local contrast
normalization typically acts on smaller windows, it is even more important to
regularize. Smaller windows are more likely to contain values that are all nearly
the same as each other, and thus more likely to have zero standard deviation.
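A minimal sketch of one common variant follows (Gaussian-weighted local mean and standard deviation, on a single-channel image, assuming SciPy is available for the separable Gaussian filter).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(image, sigma=4.0, eps=1e-4):
    image = image.astype(np.float64)
    local_mean = gaussian_filter(image, sigma)          # separable Gaussian blur
    centered = image - local_mean
    local_var = gaussian_filter(centered ** 2, sigma)   # local second moment
    local_std = np.sqrt(local_var)
    # Regularize so near-constant regions do not blow up to huge values.
    return centered / np.maximum(local_std, eps)

rng = np.random.default_rng(0)
image = rng.random((64, 64))
out = local_contrast_normalize(image)
print(out.shape, out.mean(), out.std())
```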
12.2.1.2 Dataset Augmentation
As described in section 7.4, it is easy to improve the generalization of a classifier
by increasing the size of the training set by adding extra copies of the training
examples that have been modified with transformations that do not change the
class. Object recognition is a classification task that is especially amenable to
this form of dataset augmentation because the class is invariant to so many
transformations and the input can be easily transformed with many geometric
operations. As described before, classifiers can benefit from random translations,
rotations, and in some cases, flips of the input to augment the dataset. In specialized
computer vision applications, more advanced transformations are commonly used
for dataset augmentation. These schemes include random perturbation of the
colors in an image (Krizhevsky et al., 2012) and nonlinear geometric distortions of
the input (LeCun et al., 1998b).
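A minimal sketch of such an augmentation pipeline, applied on the fly each time an example is drawn (the specific transformations and their magnitudes are illustrative, not taken from any particular system):

```python
import numpy as np

def augment(image, rng, crop_size=28):
    # Random horizontal flip: the class label is unchanged.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Random crop at a slightly different location each time.
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    image = image[top:top + crop_size, left:left + crop_size, :]
    # Random color perturbation: rescale each channel slightly.
    image = image * (1.0 + 0.1 * rng.normal(size=(1, 1, 3)))
    return np.clip(image, 0.0, 1.0)

rng = np.random.default_rng(0)
original = rng.random((32, 32, 3))
augmented = augment(original, rng)
print(augmented.shape)
```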
12.3 Speech Recognition
The task of speech recognition is to map an acoustic signal containing a spoken
natural language utterance into the corresponding sequence of words intended by
the speaker. Let
X
= (
x
(1)
, x
(2)
, . . . , x
(T )
) denote the sequence of acoustic input
vectors (traditionally produced by splitting the audio into 20ms frames). Most
speech recognition systems preprocess the input using specialized hand-designed
features, but some (Jaitly and Hinton, 2011) deep learning systems learn features
from raw input. Let
y
= (
y
1
, y
2
, . . . , y
N
) denote the target output sequence (usually
a sequence of words or characters). The
automatic speech recognition
(ASR)
task consists of creating a function
f
ASR
that computes the most probable linguistic
sequence y given the acoustic sequence X:
f
ASR
(X) = arg max
y
P
(y | X = X), (12.4)
where
P
is the true conditional distribution relating the inputs
X
to the targets
y
.
Since the 1980s and until about 2009–2012, state-of-the-art speech recognition
systems primarily combined hidden Markov models (HMMs) and Gaussian mixture
models (GMMs). GMMs modeled the association between acoustic features and
phonemes (Bahl et al., 1987), while HMMs modeled the sequence of phonemes.
The GMM-HMM model family treats acoustic waveforms as being generated
by the following process: first an HMM generates a sequence of phonemes and
discrete subphonemic states (such as the beginning, middle, and end of each
phoneme), then a GMM transforms each discrete symbol into a brief segment of
audio waveform. Although GMM-HMM systems dominated ASR until recently,
speech recognition was actually one of the first areas where neural networks were
applied, and numerous ASR systems from the late 1980s and early 1990s used
neural nets (Bourlard and Wellekens, 1989; Waibel et al., 1989; Robinson and
Fallside, 1991; Bengio et al., 1991, 1992; Konig et al., 1996). At the time, the
performance of ASR based on neural nets approximately matched the performance
of GMM-HMM systems. For example, Robinson and Fallside (1991) achieved 26
percent phoneme error rate on the TIMIT (Garofolo et al., 1993) corpus (with
39 phonemes to discriminate among), which was better than or comparable to
HMM-based systems. Since then, TIMIT has been a benchmark for phoneme
recognition, playing a role similar to the role MNIST plays for object recognition.
Nonetheless, because of the complex engineering involved in software systems for
speech recognition and the effort that had been invested in building these systems
on the basis of GMM-HMMs, the industry did not see a compelling argument
for switching to neural networks. As a consequence, until the late 2000s, both
academic and industrial research in using neural nets for speech recognition mostly
focused on using neural nets to learn extra features for GMM-HMM systems.
Later, with much larger and deeper models and much larger datasets, recognition
accuracy was dramatically improved by using neural networks to replace GMMs
for the task of associating acoustic features to phonemes (or subphonemic states).
Starting in 2009, speech researchers applied a form of deep learning based on
unsupervised learning to speech recognition. This approach to deep learning was
based on training undirected probabilistic models called restricted Boltzmann
machines (RBMs) to model the input data. RBMs are described in part III.
To solve speech recognition tasks, unsupervised pretraining was used to build
deep feedforward networks whose layers were each initialized by training an RBM.
These networks take spectral acoustic representations in a fixed-size input window
(around a center frame) and predict the conditional probabilities of HMM states
for that center frame. Training such deep networks helped to significantly improve
the recognition rate on TIMIT (Mohamed et al., 2009, 2012a), bringing down the
phoneme error rate from about 26 percent to 20.7 percent. See Mohamed et al.
(2012b) for an analysis of reasons for the success of these models. Extensions to the
basic phone recognition pipeline included the addition of speaker-adaptive features
(Mohamed et al., 2011) that further reduced the error rate. This was quickly
followed up by work to expand the architecture from phoneme recognition (which
is what TIMIT is focused on) to large-vocabulary speech recognition (Dahl et al.,
2012), which involves not just recognizing phonemes but also recognizing sequences
of words from a large vocabulary. Deep networks for speech recognition eventually
shifted from being based on pretraining and Boltzmann machines to being based
on techniques such as rectified linear units and dropout (Zeiler et al., 2013; Dahl
et al., 2013). By that time, several of the major speech groups in industry had
started exploring deep learning in collaboration with academic researchers. Hinton
et al. (2012a) describe the breakthroughs achieved by these collaborators, which
are now deployed in products such as mobile phones.
Later, as these groups explored larger and larger labeled datasets and incorpo-
rated some of the methods for initializing, training, and setting up the architecture
of deep nets, they realized that the unsupervised pretraining phase was either
unnecessary or did not bring any significant improvement.
These breakthroughs in word error rate for speech recognition were unprecedented (around 30 percent improvement) and followed a long period of about ten years during which error rates improved little with the traditional GMM-HMM technology, in spite of the continuously growing size of training sets (see figure 2.4 of Deng and Yu [2014]). This created
a rapid shift in the speech recognition community toward deep learning. In a
matter of roughly two years, most of the industrial products for speech recognition
incorporated deep neural networks, and this success spurred a new wave of research
into deep learning algorithms and architectures for ASR, which is ongoing today.
One of these innovations was the use of convolutional networks (Sainath et al.,
2013) that replicate weights across time and frequency, improving over the earlier
time-delay neural networks that replicated weights only across time. The new
two-dimensional convolutional models regard the input spectrogram not as one
long vector but as an image, with one axis corresponding to time and the other to
frequency of spectral components.
Another important push, still ongoing, has been toward end-to-end deep learning
speech recognition systems that completely remove the HMM. The first major
breakthrough in this direction came from Graves et al. (2013), who trained a deep
LSTM RNN (see section 10.10), using MAP inference over the frame-to-phoneme
alignment, as in LeCun et al. (1998b) and in the CTC framework (Graves et al.,
2006; Graves, 2012). A deep RNN (Graves et al., 2013) has state variables from
several layers at each time step, giving the unfolded graph two kinds of depth:
ordinary depth due to a stack of layers, and depth due to time unfolding. This
work brought the phoneme error rate on TIMIT to a record low of 17.7 percent.
See Pascanu et al. (2014a) and Chung et al. (2014) for other variants of deep RNNs,
applied in other settings.
Another contemporary step toward end-to-end deep learning ASR is to let the
system learn how to “align” the acoustic-level information with the phonetic-level
information (Chorowski et al., 2014; Lu et al., 2015).
12.4 Natural Language Processing
Natural language processing
(NLP) is the use of human languages, such as
English or French, by a computer. Computer programs typically read and emit
specialized languages designed to allow efficient and unambiguous parsing by simple
programs. More naturally occurring languages are often ambiguous and defy formal
description. Natural language processing includes applications such as machine
translation, in which the learner must read a sentence in one human language and
emit an equivalent sentence in another human language. Many NLP applications
are based on language models that define a probability distribution over sequences
of words, characters, or bytes in a natural language.
As with the other applications discussed in this chapter, very generic neural
network techniques can be successfully applied to natural language processing.
To achieve excellent performance and to scale well to large applications, however,
some domain-specific strategies become important. To build an efficient model of
natural language, we must usually use techniques that are specialized for processing
sequential data. In many cases, we choose to regard natural language as a sequence
of words, rather than a sequence of individual characters or bytes. Because the total
number of possible words is so large, word-based language models must operate on
an extremely high-dimensional and sparse discrete space. Several strategies have
been developed to make models of such a space efficient, in both a computational
and a statistical sense.
12.4.1 n-grams
A language model defines a probability distribution over sequences of tokens in a natural language. Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities. The earliest successful language models were based on models of fixed-length sequences of tokens called n-grams. An n-gram is a sequence of n tokens.

Models based on n-grams define the conditional probability of the n-th token given the preceding n − 1 tokens. The model uses products of these conditional distributions to define the probability distribution over longer sequences:

$$P(x_1, \ldots, x_\tau) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{\tau} P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}). \qquad (12.5)$$

This decomposition is justified by the chain rule of probability. The probability distribution over the initial sequence P(x_1, ..., x_{n−1}) may be modeled by a different model with a smaller value of n.
Training n-gram models is straightforward because the maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. Models based on n-grams have been the core
building block of statistical language modeling for many decades (Jelinek and
Mercer, 1980; Katz, 1987; Chen and Goodman, 1999).
For small values of n, models have particular names: unigram for n = 1, bigram for n = 2, and trigram for n = 3. These names derive from the Latin
prefixes for the corresponding numbers and the Greek suffix “-gram,” denoting
something that is written.
Usually we train both an n-gram model and an n−1 gram model simultaneously. This makes it easy to compute

$$P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}) = \frac{P_n(x_{t-n+1}, \ldots, x_t)}{P_{n-1}(x_{t-n+1}, \ldots, x_{t-1})} \qquad (12.6)$$

simply by looking up two stored probabilities. For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n−1}.
As an example, we demonstrate how a trigram model computes the probability of the sentence "THE DOG RAN AWAY." The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate P_3(THE DOG RAN). Finally, the last word may be predicted using the typical case, of using the conditional distribution P(AWAY | DOG RAN). Putting this together with equation 12.6, we obtain:

$$P(\text{THE DOG RAN AWAY}) = P_3(\text{THE DOG RAN})\, P_3(\text{DOG RAN AWAY}) / P_2(\text{DOG RAN}). \qquad (12.7)$$
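Maximum likelihood estimation of these probabilities amounts to counting. The toy sketch below collects trigram and bigram counts from a tiny corpus and evaluates the ratio in equation 12.6; it also makes the zero-count problem discussed next easy to see.

```python
from collections import Counter

corpus = "the dog ran away the dog ran home the cat ran away".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def conditional_prob(w1, w2, w3):
    """P(w3 | w1, w2) estimated from counts; zero if the trigram was never seen."""
    context = bigram_counts[(w1, w2)]
    if context == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context

print(conditional_prob("dog", "ran", "away"))   # 0.5 in this toy corpus
print(conditional_prob("dog", "ran", "home"))   # 0.5
print(conditional_prob("cat", "ran", "home"))   # 0.0: unseen trigram
```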
A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set counts is very likely to be zero in many cases, even though the tuple (x_{t−n+1}, ..., x_t) may appear in the test set. This can cause two different kinds of catastrophic outcomes. When P_{n−1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n−1} is nonzero but P_n is zero, the test log-likelihood is −∞. To avoid such catastrophic outcomes, most n-gram models employ some form of smoothing. Smoothing techniques
shift probability mass from the observed tuples to unobserved ones that are similar.
See Chen and Goodman (1999) for a review and empirical comparisons. One basic
technique consists of adding nonzero probability mass to all the possible next
symbol values. This method can be justified as Bayesian inference with a uniform
or Dirichlet prior over the count parameters. Another very popular idea is to form
a mixture model containing higher-order and lower-order n-gram models, with the higher-order models providing more capacity and the lower-order models being more likely to avoid counts of zero. Back-off methods look up the lower-order n-grams if the frequency of the context x_{t−1}, ..., x_{t−n+1} is too small to use the higher-order model. More formally, they estimate the distribution over x_t by using contexts x_{t−n+k}, ..., x_{t−1}, for increasing k, until a sufficiently reliable estimate is found.
Classical n-gram models are particularly vulnerable to the curse of dimensionality. There are |V|^n possible n-grams and |V| is often very large. Even with a massive training set and modest n, most n-grams will not occur in the training set. One way to view a classical n-gram model is that it is performing nearest neighbor lookup. In other words, it can be viewed as a local nonparametric predictor, similar to k-nearest neighbors. The statistical problems facing these extremely local pre-
dictors are described in section 5.11.2. The problem for a language model is even
more severe than usual, because any two different words have the same distance
from each other in one-hot vector space. It is thus difficult to leverage much
information from any “neighbors”—only training examples that repeat literally the
same context are useful for local generalization. To overcome these problems, a
language model must be able to share knowledge between one word and other
semantically similar words.
To improve the statistical efficiency of n-gram models, class-based language models (Brown et al., 1992; Ney and Kneser, 1993; Niesler et al., 1998) introduce
the notion of word categories and then share statistical strength between words that
are in the same category. The idea is to use a clustering algorithm to partition the
set of words into clusters or classes, based on their co-occurrence frequencies with
other words. The model can then use word class IDs rather than individual word
IDs to represent the context on the right side of the conditioning bar. Composite
models combining word-based and class-based models via mixing or back-off are
also possible. Although word classes provide a way to generalize between sequences
in which some word is replaced by another of the same class, much information is
lost in this representation.
12.4.2 Neural Language Models
Neural language models, or NLMs, are a class of language model designed to overcome the curse of dimensionality problem for modeling natural language sequences by using a distributed representation of words (Bengio et al., 2001). Unlike class-based n-gram models, neural language models are able to recognize
Figure 12.3: Two-dimensional visualizations of word embeddings obtained from a neural
machine translation model (Bahdanau et al., 2015), zooming in on specific areas where
semantically related words have embedding vectors that are close to each other. Countries
appear on the left and numbers on the right. Keep in mind that these embeddings are 2-D
for the purpose of visualization. In real applications, embeddings typically have higher
dimensionality and can simultaneously capture many kinds of similarity between words.
that two words are similar without losing the ability to encode each word as distinct
from the other. Neural language models share statistical strength between one
word (and its context) and other similar words and contexts. The distributed
representation the model learns for each word enables this sharing by allowing the
model to treat words that have features in common similarly. For example, if the word dog and the word cat map to representations that share many attributes, then sentences that contain the word cat can inform the predictions that will be made by the model for sentences that contain the word dog, and vice versa. Because there
are many such attributes, there are many ways in which generalization can happen,
transferring information from each training sentence to an exponentially large
number of semantically related sentences. The curse of dimensionality requires the
model to generalize to a number of sentences that is exponential in the sentence
length. The model counters this curse by relating each training sentence to an
exponential number of similar sentences.
We sometimes call these word representations word embeddings. In this interpretation, we view the raw symbols as points in a space of dimension equal to the vocabulary size. The word representations embed those points in a feature space of lower dimension. In the original space, every word is represented by a one-hot vector, so every pair of words is at Euclidean distance √2 from each
other. In the embedding space, words that frequently appear in similar contexts
(or any pair of words sharing some “features” learned by the model) are close to
each other. This often results in words with similar meanings being neighbors.
Figure 12.3 zooms in on specific areas of a learned word embedding space to show
how semantically similar words map to representations that are close to each other.
Neural networks in other domains also define embeddings. For example, a
hidden layer of a convolutional network provides an “image embedding.” Usually
NLP practitioners are much more interested in this idea of embeddings because
natural language does not originally lie in a real-valued vector space. The hidden
layer has provided a more qualitatively dramatic change in the way the data is
represented.
The basic idea of using distributed representations to improve models for
natural language processing is not restricted to neural networks. It may also be
used with graphical models that have distributed representations in the form of
multiple latent variables (Mnih and Hinton, 2007).
12.4.3 High-Dimensional Outputs
In many natural language applications, we often want our models to produce
words (rather than characters) as the fundamental unit of the output. For large
vocabularies, it can be very computationally expensive to represent an output
distribution over the choice of a word, because the vocabulary size is large. In many applications, V contains hundreds of thousands of words. The naive approach to representing such a distribution is to apply an affine transformation from a hidden representation to the output space, then apply the softmax function. Suppose we have a vocabulary V with size |V|. The weight matrix describing the linear component of this affine transformation is very large, because its output dimension is |V|. This imposes a high memory cost to represent the matrix, and a high computational cost to multiply by it. Because the softmax is normalized across all |V| outputs, it is necessary to perform the full matrix multiplication at training time as well as test time—we cannot calculate only the dot product with the weight vector for the correct output. The high computational costs of the output layer thus arise both at training time (to compute the likelihood and its gradient) and at test time (to compute probabilities for all or selected words). For specialized loss functions, the gradient can be computed efficiently (Vincent et al., 2015), but the standard cross-entropy loss applied to a traditional softmax output layer poses many difficulties.
Suppose that h is the top hidden layer used to predict the output probabilities ŷ. If we parametrize the transformation from h to ŷ with learned weights W and learned biases b, then the affine-softmax output layer performs the following computations:

a_i = b_i + \sum_j W_{ij} h_j, \quad \forall i \in \{1, \ldots, |V|\},   (12.8)

\hat{y}_i = \frac{e^{a_i}}{\sum_{i'=1}^{|V|} e^{a_{i'}}}.   (12.9)

If h contains n_h elements, then the above operation is O(|V| n_h). With n_h in the thousands and |V| in the hundreds of thousands, this operation dominates the computation of most neural language models.
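As a rough illustration of this cost, the following sketch (using NumPy, with arbitrary placeholder values for n_h and |V| rather than figures from any particular system) computes the affine-softmax output layer of equations 12.8 and 12.9; the single matrix-vector product already touches all |V| × n_h output weights for one example.

    import numpy as np

    n_h, vocab_size = 1024, 100000           # hypothetical sizes for n_h and |V|
    rng = np.random.default_rng(0)

    W = 0.01 * rng.standard_normal((vocab_size, n_h))    # output weights, |V| x n_h
    b = np.zeros(vocab_size)                             # output biases
    h = rng.standard_normal(n_h)                         # top hidden layer for one example

    a = b + W @ h                        # equation 12.8: O(|V| n_h) multiply-adds
    a -= a.max()                         # subtract the max for numerical stability
    y_hat = np.exp(a) / np.exp(a).sum()  # equation 12.9: softmax over all |V| words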
12.4.3.1 Use of a Short List
The first neural language models (Bengio et al., 2001, 2003) dealt with the high cost
of using a softmax over a large number of output words by limiting the vocabulary
size to 10,000 or 20,000 words. Schwenk and Gauvain (2002) and Schwenk (2007)
built upon this approach by splitting the vocabulary V into a shortlist L of most frequent words (handled by the neural net) and a tail T = V \ L of more rare words (handled by an n-gram model). To be able to combine the two predictions, the neural net also has to predict the probability that a word appearing after context C belongs to the tail list. This may be achieved by adding an extra sigmoid output unit to provide an estimate of P(i ∈ T | C). The extra output can then be used to achieve an estimate of the probability distribution over all words in V as follows:

P(y = i \mid C) = \mathbf{1}_{i \in L}\, P(y = i \mid C, i \in L)\,(1 - P(i \in T \mid C)) + \mathbf{1}_{i \in T}\, P(y = i \mid C, i \in T)\, P(i \in T \mid C),   (12.10)

where P(y = i | C, i ∈ L) is provided by the neural language model and P(y = i | C, i ∈ T) is provided by the n-gram model. With slight modification, this approach can also work using an extra output value in the neural language model's softmax layer, rather than a separate sigmoid unit.
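A minimal sketch of how equation 12.10 combines the two predictors is given below; the dictionaries and the argument p_tail are hypothetical stand-ins for the outputs of a trained neural net (over the shortlist) and a trained n-gram model (over the tail), not part of any published implementation.

    def combined_word_prob(word, shortlist, nlm_shortlist_probs, ngram_tail_probs, p_tail):
        """Equation 12.10: mix a neural LM over the shortlist L with an n-gram
        model over the tail T, using the predicted probability p_tail = P(i in T | C)."""
        if word in shortlist:
            # neural LM probability, rescaled by the probability of staying in L
            return nlm_shortlist_probs[word] * (1.0 - p_tail)
        # n-gram probability over the tail, rescaled by P(i in T | C)
        return ngram_tail_probs[word] * p_tail

    # toy usage with made-up numbers
    shortlist = {"the", "cat", "sat"}
    nlm = {"the": 0.5, "cat": 0.3, "sat": 0.2}        # sums to 1 over L
    ngram_tail = {"aardvark": 0.7, "zymurgy": 0.3}    # sums to 1 over T
    print(combined_word_prob("cat", shortlist, nlm, ngram_tail, p_tail=0.1))       # ~0.27
    print(combined_word_prob("aardvark", shortlist, nlm, ngram_tail, p_tail=0.1))  # ~0.07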
An obvious disadvantage of the short list approach is that the potential gener-
alization advantage of the neural language models is limited to the most frequent
words, where, arguably, it is the least useful. This disadvantage has stimulated
the exploration of alternative methods to deal with high-dimensional outputs,
described below.
12.4.3.2 Hierarchical Softmax
A classical approach (Goodman, 2001) to reducing the computational burden
of high-dimensional output layers over large vocabulary sets V is to decompose probabilities hierarchically. Instead of necessitating a number of computations proportional to |V| (and also proportional to the number of hidden units, n_h), the |V| factor can be reduced to as low as log |V|. Bengio (2002) and Morin and Bengio (2005) introduced this factorized approach in the context of neural language models.
One can think of this hierarchy as building categories of words, then categories
of categories of words, then categories of categories of categories of words, and so
on. These nested categories form a tree, with words at the leaves. In a balanced
tree, the tree has depth O(log |V|). The probability of choosing a word is given
by the product of the probabilities of choosing the branch leading to that word
at every node on a path from the root of the tree to the leaf containing the word.
Figure 12.4 illustrates a simple example. Mnih and Hinton (2009) also describe
how to use multiple paths to identify a single word in order to better model words
that have multiple meanings. Computing the probability of a word then involves
summation over all the paths that lead to that word.
To predict the conditional probabilities required at each node of the tree, we
typically use a logistic regression model at each node of the tree, and provide the
same context C as input to all these models. Because the correct output is encoded
in the training set, we can use supervised learning to train the logistic regression
models. This is typically done using a standard cross-entropy loss, corresponding
to maximizing the log-likelihood of the correct sequence of decisions.
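The sketch below (NumPy, with hypothetical node parameters and a hand-written path table rather than any particular learned tree) assembles the log-probability of one word from the per-node logistic regressions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def word_log_prob(context_h, path_bits, node_weights, node_biases):
        """Log-probability of one word under a hierarchical softmax.

        context_h    : hidden representation of the context C
        path_bits    : list of (node_id, bit) decisions from the root to the word's leaf
        node_weights : dict mapping node_id -> weight vector of that node's logistic regression
        node_biases  : dict mapping node_id -> bias of that node's logistic regression"""
        log_p = 0.0
        for node_id, bit in path_bits:
            # probability of taking the "1" (right) branch at this node, given the context
            p_right = sigmoid(node_weights[node_id] @ context_h + node_biases[node_id])
            log_p += np.log(p_right if bit == 1 else 1.0 - p_right)
        return log_p

    # toy usage: word w_4 in figure 12.4 has bits (1, 0, 0), passing through
    # the root (), super-class (1,), and class (1, 0)
    rng = np.random.default_rng(0)
    n_h = 8
    h = rng.standard_normal(n_h)
    nodes = [(), (1,), (1, 0)]
    weights = {node: rng.standard_normal(n_h) for node in nodes}
    biases = {node: 0.0 for node in nodes}
    print(word_log_prob(h, [((), 1), ((1,), 0), ((1, 0), 0)], weights, biases))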
Because the output log-likelihood can be computed efficiently (as low as log |V| rather than |V|), its gradients may also be computed efficiently. This includes not
only the gradient with respect to the output parameters but also the gradients
with respect to the hidden layer activations.
It is possible but usually not practical to optimize the tree structure to minimize
the expected number of computations. Tools from information theory specify how
to choose the optimal binary code given the relative frequencies of the words. To
do so, we could structure the tree so that the number of bits associated with a
word is approximately equal to the logarithm of the frequency of that word. In
practice, however, the computational savings are typically not worth the effort
because the computation of the output probabilities is only one part of the total
computation in the neural language model. For example, suppose there are l fully connected hidden layers of width n_h. Let n_b be the weighted average of the number
Figure 12.4: Illustration of a simple hierarchy of word categories, with 8 words w_0, . . . , w_7 organized into a three-level hierarchy. The leaves of the tree represent actual specific words. Internal nodes represent groups of words. Any node can be indexed by the sequence of binary decisions (0 = left, 1 = right) to reach the node from the root. Super-class (0) contains the classes (0, 0) and (0, 1), which respectively contain the sets of words {w_0, w_1} and {w_2, w_3}, and similarly super-class (1) contains the classes (1, 0) and (1, 1), which respectively contain the words (w_4, w_5) and (w_6, w_7). If the tree is sufficiently balanced, the maximum depth (number of binary decisions) is on the order of the logarithm of the number of words |V|: the choice of one out of |V| words can be obtained by doing O(log |V|) operations (one for each of the nodes on the path from the root). In this example, computing the probability of a word y can be done by multiplying three probabilities, associated with the binary decisions to move left or right at each node on the path from the root to a node y. Let b_i(y) be the i-th binary decision when traversing the tree toward the value y. The probability of sampling an output y decomposes into a product of conditional probabilities, using the chain rule for conditional probabilities, with each node indexed by the prefix of these bits. For example, node (1, 0) corresponds to the prefix (b_0(w_4) = 1, b_1(w_4) = 0), and the probability of w_4 can be decomposed as follows:

P(y = w_4) = P(b_0 = 1, b_1 = 0, b_2 = 0)   (12.11)
           = P(b_0 = 1)\, P(b_1 = 0 \mid b_0 = 1)\, P(b_2 = 0 \mid b_0 = 1, b_1 = 0).   (12.12)
of bits required to identify a word, with the weighting given by the frequency
of these words. In this example, the number of operations needed to compute the hidden activations grows as O(l n_h^2), while the output computations grow as O(n_h n_b). As long as n_b ≪ l n_h, we can reduce computation more by shrinking n_h than by shrinking n_b. Indeed, n_b is often small. Because the size of the vocabulary rarely exceeds a million words and log_2(10^6) ≈ 20, it is possible to reduce n_b to about 20, but n_h is often much larger, around 10^3 or more. Rather than carefully
optimizing a tree with a branching factor of 2, one can instead define a tree with
depth two and a branching factor of √|V|. Such a tree corresponds to simply
defining a set of mutually exclusive word classes. The simple approach based on a
tree of depth two captures most of the computational benefit of the hierarchical
strategy.
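A minimal sketch of this depth-two factorization follows, with an arbitrary toy assignment of words to classes; P(y | C) is computed as a softmax over classes times a softmax over the words within the chosen class, so no single softmax over all of |V| is ever evaluated.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def two_level_word_prob(h, word, word_to_class, class_words, W_class, b_class, W_word, b_word):
        """P(y = word | C) = P(class(word) | C) * P(word | class(word), C)."""
        c = word_to_class[word]
        p_class = softmax(W_class @ h + b_class)[c]        # softmax over the (few) classes
        members = class_words[c]                           # word ids belonging to class c
        scores = W_word[members] @ h + b_word[members]     # scores only for words in class c
        p_word_in_class = softmax(scores)[members.index(word)]
        return p_class * p_word_in_class

    # toy usage: 6 words split into 2 classes of 3; the probabilities sum to 1 over the vocabulary
    rng = np.random.default_rng(0)
    n_h = 4
    h = rng.standard_normal(n_h)
    word_to_class = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
    class_words = {0: [0, 1, 2], 1: [3, 4, 5]}
    W_class, b_class = rng.standard_normal((2, n_h)), np.zeros(2)
    W_word, b_word = rng.standard_normal((6, n_h)), np.zeros(6)
    print(sum(two_level_word_prob(h, w, word_to_class, class_words,
                                  W_class, b_class, W_word, b_word) for w in range(6)))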
One question that remains somewhat open is how to best define these word
classes, or how to define the word hierarchy in general. Early work used existing
hierarchies (Morin and Bengio, 2005), but the hierarchy can also be learned, ideally
jointly with the neural language model. Learning the hierarchy is difficult. An exact
optimization of the log-likelihood appears intractable because the choice of a word
hierarchy is a discrete one, not amenable to gradient-based optimization. However,
one could use discrete optimization to approximately optimize the partition of
words into word classes.
An important advantage of the hierarchical softmax is that it brings computa-
tional benefits both at training time and at test time, if at test time we want to
compute the probability of specific words.
Of course, computing the probability of all
|V|
words will remain expensive
even with the hierarchical softmax. Another important operation is selecting the
most likely word in a given context. Unfortunately the tree structure does not
provide an efficient and exact solution to this problem.
A disadvantage is that in practice the hierarchical softmax tends to give worse
test results than sampling-based methods, which we describe next. This may be
due to a poor choice of word classes.
12.4.3.3 Importance Sampling
One way to speed up the training of neural language models is to avoid explicitly
computing the contribution of the gradient from all the words that do not appear
in the next position. Every incorrect word should have low probability under the
model. It can be computationally costly to enumerate all these words. Instead, it
is possible to sample only a subset of the words. Using the notation introduced in
equation 12.8, the gradient can be written as follows:

\frac{\partial \log P(y \mid C)}{\partial \theta} = \frac{\partial \log \mathrm{softmax}_y(a)}{\partial \theta}   (12.13)

= \frac{\partial}{\partial \theta} \log \frac{e^{a_y}}{\sum_i e^{a_i}}   (12.14)

= \frac{\partial}{\partial \theta} \left( a_y - \log \sum_i e^{a_i} \right)   (12.15)

= \frac{\partial a_y}{\partial \theta} - \sum_i P(y = i \mid C)\, \frac{\partial a_i}{\partial \theta},   (12.16)

where a is the vector of presoftmax activations (or scores), with one element per word. The first term is the positive phase term, pushing a_y up, while the second term is the negative phase term, pushing a_i down for all i, with weight P(i | C). Since the negative phase term is an expectation, we can estimate it with a Monte Carlo sample. However, that would require sampling from the model itself. Sampling from the model requires computing P(i | C) for all i in the vocabulary, which is precisely what we are trying to avoid.
Instead of sampling from the model, we can sample from another distribution,
called the proposal distribution (denoted q), and use appropriate weights to correct for the bias introduced by sampling from the wrong distribution (Bengio and Sénécal, 2003; Bengio and Sénécal, 2008). This is an application of a more general technique called importance sampling, which we describe in more detail
in section 17.2. Unfortunately, even exact importance sampling is not efficient
because it requires computing weights p_i/q_i, where p_i = P(i | C), which can only be computed if all the scores a_i are computed. The solution adopted for this application is called biased importance sampling, where the importance weights are normalized to sum to 1. When negative word n_i is sampled, the associated gradient is weighted by

w_i = \frac{p_{n_i}/q_{n_i}}{\sum_{j=1}^{N} p_{n_j}/q_{n_j}}.   (12.17)
These weights are used to give the appropriate importance to the m negative samples from q used to form the estimated negative phase contribution to the gradient:

\sum_{i=1}^{|V|} P(i \mid C)\, \frac{\partial a_i}{\partial \theta} \approx \frac{1}{m} \sum_{i=1}^{m} w_i\, \frac{\partial a_{n_i}}{\partial \theta}.   (12.18)
A unigram or a bigram distribution works well as the proposal distribution q. It is
easy to estimate the parameters of such a distribution from data. After estimating
the parameters, it is also possible to sample from such a distribution very efficiently.
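A rough sketch of this estimator follows, written directly in terms of the negative-phase expectation Σ_i P(i | C) ∂a_i/∂θ, with a generic per-word gradient vector standing in for ∂a_i/∂θ; the vocabulary size, the sample count, and the uniform proposal are illustrative placeholders rather than values from Bengio and Sénécal (2003, 2008), and unnormalized scores are used in the weights so that the softmax normalizer cancels.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, m = 2000, 100                  # |V| and number of negative samples (illustrative)

    a = rng.standard_normal(vocab_size)        # presoftmax scores for one context
    q = np.ones(vocab_size) / vocab_size       # proposal; a unigram fit to data works well too
    g = rng.standard_normal(vocab_size)        # stands in for the per-word gradients da_i/dtheta

    # exact negative phase: sum_i P(i | C) * g_i, which requires the full softmax
    p_exact = np.exp(a - a.max())
    p_exact /= p_exact.sum()
    exact = p_exact @ g

    # biased (self-normalized) importance sampling estimate from m samples of q.
    # Using unnormalized scores exp(a_i) in the weights is what makes this practical:
    # the softmax normalizer cancels when the weights are renormalized to sum to 1.
    samples = rng.choice(vocab_size, size=m, p=q)
    w = np.exp(a[samples]) / q[samples]
    w /= w.sum()
    estimate = w @ g[samples]

    # a single estimate is noisy; increasing m (or averaging estimates) approaches the exact value
    print(exact, estimate)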
Importance sampling is not only useful for speeding up models with large
softmax outputs. More generally, it is useful for accelerating training with large
sparse output layers, where the output is a sparse vector rather than a 1-of-n choice. An example is a bag of words. A bag of words is a sparse vector v where v_i indicates the presence or absence of word i from the vocabulary in the document. Alternately, v_i can indicate the number of times that word i appears.
Machine learning models that emit such sparse vectors can be expensive to train
for a variety of reasons. Early in learning, the model may not actually choose to
make the output truly sparse. Moreover, the loss function we use for training might
most naturally be described in terms of comparing every element of the output to
every element of the target. This means that it is not always clear that there is a
computational benefit to using sparse outputs, because the model may choose to
make the majority of the output nonzero, and all these non-zero values need to be
compared to the corresponding training target, even if the training target is zero.
Dauphin et al. (2011) demonstrated that such models can be accelerated using
importance sampling. The efficient algorithm minimizes the reconstruction loss for
the “positive words” (those that are nonzero in the target) and an equal number of
“negative words.” The negative words are chosen randomly, using a heuristic to
sample words that are more likely to be mistaken. The bias introduced by this
heuristic oversampling can then be corrected using importance weights.
In all these cases, the computational complexity of gradient estimation for
the output layer is reduced to be proportional to the number of negative samples
rather than proportional to the size of the output vector.
12.4.3.4 Noise-Contrastive Estimation and Ranking Loss
Other approaches based on sampling have been proposed to reduce the computa-
tional cost of training neural language models with large vocabularies. An early
example is the ranking loss proposed by Collobert and Weston (2008a), which
views the output of the neural language model for each word as a score and tries to
make the score of the correct word a_y be ranked high in comparison to the other scores a_i. The ranking loss proposed then is

L = \sum_i \max(0,\, 1 - a_y + a_i).   (12.19)
The gradient is zero for the i-th term if the score of the observed word, a_y, is greater than the score of the negative word a_i by a margin of 1. One issue with
this criterion is that it does not provide estimated conditional probabilities, which
are useful in some applications, including speech recognition and text generation
(including conditional text generation tasks such as translation).
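A sketch of equation 12.19 and its subgradient with respect to the scores follows; the score vector here is an arbitrary placeholder for the output of a neural language model.

    import numpy as np

    def ranking_loss_and_grad(a, y):
        """Hinge ranking loss of equation 12.19 for one example.

        a : vector of scores, one per word in the vocabulary
        y : index of the observed (correct) word
        Returns the loss and its subgradient with respect to a."""
        margins = 1.0 - a[y] + a
        margins[y] = 0.0                   # skip i = y (a constant term with zero gradient)
        active = margins > 0               # negative words not yet beaten by a margin of 1
        loss = margins[active].sum()

        grad = np.zeros_like(a)
        grad[active] = 1.0                 # push the competing scores a_i down
        grad[y] = -active.sum()            # push the correct score a_y up
        return loss, grad

    a = np.array([2.0, 0.5, 1.8, -1.0])
    print(ranking_loss_and_grad(a, y=0))   # only word 2 violates the margin, so the loss is 0.8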
A more recently used training objective for neural language models is noise-
contrastive estimation, which is introduced in section 18.6. This approach has
been successfully applied to neural language models (Mnih and Teh, 2012; Mnih
and Kavukcuoglu, 2013).
12.4.4 Combining Neural Language Models with n-grams
A major advantage of n-gram models over neural networks is that n-gram models achieve high model capacity (by storing the frequencies of very many tuples), while requiring very little computation to process an example (by looking up only a few tuples that match the current context). If we use hash tables or trees to access the counts, the computation used for n-grams is almost independent
of capacity. In comparison, doubling a neural network’s number of parameters
typically also roughly doubles its computation time. Exceptions include models
that avoid using all parameters on each pass. Embedding layers index only a single
embedding in each pass, so we can increase the vocabulary size without increasing
the computation time per example. Some other models, such as tiled convolutional
networks, can add parameters while reducing the degree of parameter sharing to
maintain the same amount of computation. Typical neural network layers based
on matrix multiplication, however, use an amount of computation proportional to
the number of parameters.
One easy way to add capacity is thus to combine both approaches in an ensemble
consisting of a neural language model and an n-gram language model (Bengio et al., 2001, 2003). As with any ensemble, this technique can reduce test error if
the ensemble members make independent mistakes. The field of ensemble learning
provides many ways of combining the ensemble members’ predictions, including
uniform weighting and weights chosen on a validation set. Mikolov et al. (2011a)
extended the ensemble to include not just two models but a large array of models.
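In its simplest two-member form, such an ensemble is just a linear interpolation of the two predicted distributions; the sketch below assumes hypothetical predict functions for the two models, and the mixing weight would typically be chosen to minimize perplexity on a validation set.

    def interpolated_prob(word, context, neural_lm, ngram_lm, lam=0.5):
        """Linear interpolation of a neural LM and an n-gram LM.

        neural_lm(word, context) and ngram_lm(word, context) are assumed to return
        P(word | context) under each model; lam is tuned on a validation set."""
        return lam * neural_lm(word, context) + (1.0 - lam) * ngram_lm(word, context)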
It is also possible to pair a neural network with a maximum entropy model and
train both jointly (Mikolov et al., 2011b). This approach can be viewed as training
a neural network with an extra set of inputs that are connected directly to the
output and not connected to any other part of the model. The extra inputs are
indicators for the presence of particular n-grams in the input context, so these variables are very high dimensional and very sparse. The increase in model capacity is huge—the new portion of the architecture contains up to |V|^n parameters—but the amount of added computation needed to process an input is minimal because the extra inputs are very sparse.
12.4.5 Neural Machine Translation
Machine translation is the task of reading a sentence in one natural language and
emitting a sentence with the equivalent meaning in another language. Machine
translation systems often involve many components. At a high level, there is
often one component that proposes many candidate translations. Many of these
translations will not be grammatical due to differences between the languages. For
example, many languages put adjectives after nouns, so when translated to English
directly they yield phrases such as “apple red.” The proposal mechanism suggests
many variants of the suggested translation, ideally including “red apple.” A second
component of the translation system, a language model, evaluates the proposed
translations and can score “red apple” as better than “apple red.”
The earliest exploration of neural networks for machine translation already
incorporated the idea of encoder and decoder (Allen 1987; Chrisman 1991; Forcada
and Ñeco 1997), while the first large-scale competitive use of neural networks in
translation was to upgrade the language model of a translation system by using a
neural language model (Schwenk et al., 2006; Schwenk, 2010). Previously, most
machine translation systems had used an n-gram model for this component. The n-gram-based models used for machine translation include not just traditional back-off n-gram models (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999) but also maximum entropy language models (Berger et al., 1996), in which an affine-softmax layer predicts the next word given the presence of frequent n-grams in the context.
Traditional language models simply report the probability of a natural language
sentence. Because machine translation involves producing an output sentence given
an input sentence, it makes sense to extend the natural language model to be
conditional. As described in section 6.2.1.1, it is straightforward to extend a model
that defines a marginal distribution over some variable to define a conditional
distribution over that variable given a context C, where C might be a single variable or a list of variables. Devlin et al. (2014) beat the state-of-the-art in some statistical machine translation benchmarks by using an MLP to score a phrase t_1, t_2, . . . , t_k in the target language given a phrase s_1, s_2, . . . , s_n in the source language. The MLP estimates P(t_1, t_2, . . . , t_k | s_1, s_2, . . . , s_n). The estimate formed by this MLP replaces the estimate provided by conditional n-gram models.
[Figure 12.5 diagram: a source object (a French sentence or an image) is mapped by an encoder to an intermediate, semantic representation, which a decoder maps to an output object (an English sentence).]
Figure 12.5: The encoder-decoder architecture to map back and forth between a surface
representation (such as a sequence of words or an image) and a semantic representation.
By using the output of an encoder of data from one modality (such as the encoder mapping
from French sentences to hidden representations capturing the meaning of sentences) as
the input to a decoder for another modality (such as the decoder mapping from hidden
representations capturing the meaning of sentences to English), we can train systems that
translate from one modality to another. This idea has been applied successfully not just
to machine translation but also to caption generation from images.
A drawback of the MLP-based approach is that it requires the sequences to be
preprocessed to be of fixed length. To make the translation more flexible, we would
like to use a model that can accommodate variable length inputs and variable
length outputs. An RNN provides this ability. Section 10.2.4 describes several ways
of constructing an RNN that represents a conditional distribution over a sequence
given some input, and section 10.4 describes how to accomplish this conditioning
when the input is a sequence. In all cases, one model first reads the input sequence
and emits a data structure that summarizes the input sequence. We call this
summary the “context” C. The context C may be a list of vectors, or it may be a vector or tensor. The model that reads the input to produce C may be an RNN
(Cho et al., 2014a; Sutskever et al., 2014; Jean et al., 2014) or a convolutional
network (Kalchbrenner and Blunsom, 2013). A second model, usually an RNN,
then reads the context C and generates a sentence in the target language. This
general idea of an encoder-decoder framework for machine translation is illustrated
in figure 12.5.
To generate an entire sentence conditioned on the source sentence, the model
must have a way to represent the entire source sentence. Earlier models were able
to represent only individual words or phrases. From a representation learning point
of view, it can be useful to learn a representation in which sentences that have the
same meaning have similar representations regardless of whether they were written
in the source language or in the target language. This strategy was explored first
using a combination of convolutions and RNNs (Kalchbrenner and Blunsom, 2013).
Later work introduced the use of an RNN for scoring proposed translations (Cho
et al., 2014a) and for generating translated sentences (Sutskever et al., 2014). Jean
et al. (2014) scaled these models to larger vocabularies.
12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data
Using a fixed-size representation to capture all the semantic details of a very
long sentence of, say, 60 words is very difficult. It can be achieved by training a
sufficiently large RNN well enough and for long enough, as demonstrated by Cho
et al. (2014a) and Sutskever et al. (2014). A more efficient approach, however, is
to read the whole sentence or paragraph (to get the context and the gist of what
is being expressed), then produce the translated words one at a time, each time
focusing on a different part of the input sentence to gather the semantic details
required to produce the next output word. That is exactly the idea that Bahdanau
et al. (2015) first introduced. The attention mechanism used to focus on specific
parts of the input sequence at each time step is illustrated in figure 12.6.
We can think of an attention-based system as having three components:
1. A process that reads raw data (such as source words in a source sentence) and converts them into distributed representations, with one feature vector associated with each word position.

2. A list of feature vectors storing the output of the reader. This can be understood as a memory containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.

3. A process that exploits the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight).
The third component generates the translated sentence.
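A minimal sketch of the weighted-average attention of figure 12.6 follows; the encoder feature vectors and the bilinear scoring parameters are placeholders (Bahdanau et al. (2015) instead compute the relevance scores with a small neural network that compares the decoder state to each encoder state).

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def attention_context(decoder_state, encoder_states, W_score):
        """Weighted-average attention: c = sum_t alpha^(t) h^(t).

        decoder_state  : current decoder state (shape d)
        encoder_states : feature vectors h^(t), one per source position (shape T x d_h)
        W_score        : placeholder bilinear scoring parameters (shape d x d_h)"""
        scores = encoder_states @ (W_score.T @ decoder_state)   # one relevance score per h^(t)
        alpha = softmax(scores)                                 # weights in [0, 1], summing to 1
        c = alpha @ encoder_states                              # context vector: weighted average
        return c, alpha

    rng = np.random.default_rng(0)
    T, d, d_h = 6, 5, 4
    c, alpha = attention_context(rng.standard_normal(d),
                                 rng.standard_normal((T, d_h)),
                                 rng.standard_normal((d, d_h)))
    print(alpha.sum())   # 1.0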
When words in a sentence written in one language are aligned with correspond-
ing words in a translated sentence in another language, it becomes possible to relate
the corresponding word embeddings. Earlier work showed that one could learn a
kind of translation matrix relating the word embeddings in one language with the
word embeddings in another (Kočiský et al., 2014), yielding lower alignment error
Figure 12.6: A modern attention mechanism, as introduced by Bahdanau et al. (2015), is essentially a weighted average. A context vector c is formed by taking a weighted average of feature vectors h^(t) with weights α^(t). In some applications, the feature vectors h are hidden units of a neural network, but they may also be raw input to the model. The weights α^(t) are produced by the model itself. They are usually values in the interval [0, 1] and are intended to concentrate around just one h^(t) so that the weighted average approximates reading that one specific time step precisely. The weights α^(t) are usually produced by applying a softmax function to relevance scores emitted by another portion of the model. The attention mechanism is more expensive computationally than directly indexing the desired h^(t), but direct indexing cannot be trained with gradient descent. The attention mechanism based on weighted averages is a smooth, differentiable approximation that can be trained with existing optimization algorithms.
rates than traditional approaches based on the frequency counts in the phrase table.
There is even earlier work on learning cross-lingual word vectors (Klementiev et al.,
2012). Many extensions to this approach are possible. For example, more efficient
cross-lingual alignment (Gouws et al., 2014) allows training on larger datasets.
12.4.6 Historical Perspective
The idea of distributed representations for symbols was introduced by Rumelhart
et al. (1986a) in one of the first explorations of back-propagation, with symbols
corresponding to the identity of family members, and the neural network capturing
the relationships between family members, with training examples forming triplets
such as (Colin, Mother, Victoria). The first layer of the neural network learned a
representation of each family member. For example, the features for Colin might
represent which family tree Colin was in, what branch of that tree he was in,
what generation he was from, and so on. One can think of the neural network as
computing learned rules relating these attributes together to obtain the desired
predictions. The model can then make predictions such as inferring who is the
mother of Colin.
The idea of forming an embedding for a symbol was extended to the idea of an
embedding for a word by Deerwester et al. (1990). These embeddings were learned
using the SVD. Later, embeddings would be learned by neural networks.
The history of natural language processing is marked by transitions in the
popularity of different ways of representing the input to the model. Following
this early work on symbols and words, some of the earliest applications of neural
networks to NLP (Miikkulainen and Dyer, 1991; Schmidhuber and Heil, 1996)
represented the input as a sequence of characters.
Bengio et al. (2001) returned the focus to modeling words and introduced
neural language models, which produce interpretable word embeddings. These
neural models have scaled up from defining representations of a small set of symbols
in the 1980s to millions of words (including proper nouns and misspellings) in
modern applications. This computational scaling effort led to the invention of the
techniques described in section 12.4.3.
Initially, the use of words as the fundamental units of language models yielded
improved language modeling performance (Bengio et al., 2001). To this day,
new techniques continually push both character-based models (Sutskever et al.,
2011) and word-based models forward, with recent work (Gillick et al., 2015) even
modeling individual bytes of Unicode characters.
The ideas behind neural language models have been extended into several
natural language processing applications, such as parsing (Henderson, 2003, 2004;
Collobert, 2011), part-of-speech tagging, semantic role labeling, chunking, and
so on, sometimes using a single multitask learning architecture (Collobert and
Weston, 2008a; Collobert et al., 2011a) in which the word embeddings are shared
across tasks.
Two-dimensional visualizations of embeddings became a popular tool for an-
alyzing language models following the development of the t-SNE dimensionality
reduction algorithm (van der Maaten and Hinton, 2008) and its high-profile application to visualizing word embeddings by Joseph Turian in 2009.
12.5 Other Applications
In this section we cover a few other types of applications of deep learning that
are different from the standard object recognition, speech recognition, and natural
language processing tasks discussed above. Part III of this book will expand that
scope even further to tasks that remain primarily research areas.
12.5.1 Recommender Systems
One of the major families of applications of machine learning in the information
technology sector is the ability to make recommendations of items to potential
users or customers. Two major types of applications can be distinguished: online
advertising and item recommendations (often these recommendations are still for
the purpose of selling a product). Both rely on predicting the association between
a user and an item, either to predict the probability of some action (the user
buying the product, or some proxy for this action) or the expected gain (which
may depend on the value of the product) if an ad is shown or a recommendation
is made regarding that product to that user. The internet is currently financed
in great part by various forms of online advertising. Major parts of the economy
rely on online shopping. Companies including Amazon and eBay use machine
learning, including deep learning, for their product recommendations. Sometimes,
the items are not products that are actually for sale. Examples include selecting
posts to display on social network news feeds, recommending movies to watch,
recommending jokes, recommending advice from experts, matching players for
video games, or matching people in dating services.
Often, this association problem is handled like a supervised learning problem:
given some information about the item and about the user, predict the proxy of
interest (user clicks on ad, user enters a rating, user clicks on a “like” button, user
buys product, user spends some amount of money on the product, user spends
time visiting a page for the product, and so forth). This often ends up being
either a regression problem (predicting some conditional expected value) or a
probabilistic classification problem (predicting the conditional probability of some
discrete event).
The early work on recommender systems relied on minimal information as
inputs for these predictions: the user ID and the item ID. In this context, the
only way to generalize is to rely on the similarity between the patterns of values of
the target variable for different users or for different items. Suppose that user 1
and user 2 both like items A, B and C. From this, we may infer that user 1 and
user 2 have similar tastes. If user 1 likes item D, then this should be a strong
cue that user 2 will also like D. Algorithms based on this principle come under
the name of
collaborative filtering
. Both nonparametric approaches (such as
nearest neighbor methods based on the estimated similarity between patterns of
preferences) and parametric methods are possible. Parametric methods often rely
on learning a distributed representation (also called an embedding) for each user
and for each item. Bilinear prediction of the target variable (such as a rating) is a
simple parametric method that is highly successful and often found as a component
of state-of-the-art systems. The prediction is obtained by the dot product between
the user embedding and the item embedding (possibly corrected by constants that
depend only on either the user ID or the item ID). Let R̂ be the matrix containing our predictions, A a matrix with user embeddings in its rows, and B a matrix with item embeddings in its columns. Let b and c be vectors that contain respectively a kind of bias for each user (representing how grumpy or positive that user is in general) and for each item (representing its general popularity). The bilinear prediction is thus obtained as follows:

\hat{R}_{u,i} = b_u + c_i + \sum_j A_{u,j} B_{j,i}.   (12.20)
Typically one wants to minimize the squared error between predicted ratings R̂_{u,i} and actual ratings R_{u,i}. User embeddings and item embeddings can then be conveniently visualized when they are first reduced to a low dimension (two or three), or they can be used to compare users or items against each other, just like word embeddings. One way to obtain these embeddings is by performing a singular value decomposition of the matrix R of actual targets (such as ratings). This corresponds to factorizing R = UDV^⊤ (or a normalized variant) into the product of two factors, the lower rank matrices A = UD and B = V^⊤. One
problem with the SVD is that it treats the missing entries in an arbitrary way,
as if they corresponded to a target value of 0. Instead we would like to avoid
paying any cost for the predictions made on missing entries. Fortunately, the sum
of squared errors on the observed ratings can also be easily minimized by gradient-
based optimization. The SVD and the bilinear prediction of equation 12.20 both
performed very well in the competition for the Netflix prize (Bennett and Lanning,
2007), aiming at predicting ratings for films, based only on previous ratings by
a large set of anonymous users. Many machine learning experts participated in
this competition, which took place between 2006 and 2009. It raised the level of
research in recommender systems using advanced machine learning and yielded
improvements in recommender systems. Even though the simple bilinear prediction,
or SVD, did not win by itself, it was a component of the ensemble models presented
by most of the competitors, including the winners (Töscher et al., 2009; Koren,
2009).
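As a concrete illustration of the bilinear model of equation 12.20, the sketch below fits user and item embeddings by stochastic gradient descent on the squared error of the observed ratings only; the sizes, the learning rate, and the synthetic ratings are all arbitrary placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, k = 30, 40, 5               # illustrative sizes; k is the embedding dimension

    A = 0.1 * rng.standard_normal((n_users, k))   # user embeddings (rows of A)
    B = 0.1 * rng.standard_normal((k, n_items))   # item embeddings (columns of B)
    b_user = np.zeros(n_users)                    # per-user bias ("grumpiness")
    c_item = np.zeros(n_items)                    # per-item bias (popularity)

    # synthetic observed ratings (user, item, rating); missing entries incur no cost
    observed = [(u, i, float(rng.integers(1, 6)))
                for u, i in zip(rng.integers(0, n_users, 200), rng.integers(0, n_items, 200))]

    lr = 0.05
    for epoch in range(50):
        for u, i, r in observed:
            pred = b_user[u] + c_item[i] + A[u] @ B[:, i]   # equation 12.20
            err = pred - r
            A_u = A[u].copy()                               # keep the old value for B's update
            A[u]      -= lr * err * B[:, i]
            B[:, i]   -= lr * err * A_u
            b_user[u] -= lr * err
            c_item[i] -= lr * err

    mse = np.mean([(b_user[u] + c_item[i] + A[u] @ B[:, i] - r) ** 2 for u, i, r in observed])
    print(mse)   # small after training on the observed entries only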
Beyond these bilinear models with distributed representations, one of the first
uses of neural networks for collaborative filtering is based on the RBM undirected
probabilistic model (Salakhutdinov et al., 2007). RBMs were an important element
of the ensemble of methods that won the Netflix competition (Töscher et al., 2009;
Koren, 2009). More advanced variants on the idea of factorizing the ratings matrix
have also been explored in the neural networks community (Salakhutdinov and
Mnih, 2008).
Collaborative filtering systems have a basic limitation, however: when a new
item or a new user is introduced, its lack of rating history means that there is no
way to evaluate its similarity with other items or users, or the degree of association
between, say, that new user and existing items. This is called the problem of cold-
start recommendations. A general way of solving the cold-start recommendation
problem is to introduce extra information about the individual users and items. For
example, this extra information could be user profile information or features of each
item. Systems that use such information are called
content-based recommender
systems
. The mapping from a rich set of user features or item features to an
embedding can be learned through a deep learning architecture (Huang et al., 2013;
Elkahky et al., 2015).
Specialized deep learning architectures, such as convolutional networks, have
also been applied to learn to extract features from rich content, such as from
musical audio tracks for music recommendation (van den Oörd et al., 2013). In
that work, the convolutional net takes acoustic features as input and computes an
embedding for the associated song. The dot product between this song embedding
and the embedding for a user is then used to predict whether a user will listen to
the song.
12.5.1.1 Exploration versus Exploitation
When making recommendations to users, an issue arises that goes beyond ordinary
supervised learning and into the realm of reinforcement learning. Many recom-
mendation problems are most accurately described theoretically as contextual bandits (Langford and Zhang, 2008; Lu et al., 2010). The issue is that when we
use the recommendation system to collect data, we get a biased and incomplete
view of the preferences of users: we see the responses of users only to the items
recommended to them and not to the other items. In addition, in some cases we
may not get any information on users for whom no recommendation has been
made (for example, with ad auctions, it may be that the price proposed for an ad
was below a minimum price threshold, or does not win the auction, so the ad is
not shown at all). More importantly, we get no information about what outcome
would have resulted from recommending any of the other items. This would be like
training a classifier by picking one class ŷ for each training example x (typically the class with the highest probability according to the model) and then only getting as feedback whether this was the correct class or not. Clearly, each example conveys less information than in the supervised case, where the true label y is directly
accessible, so more examples are necessary. Worse, if we are not careful, we could
end up with a system that continues picking the wrong decisions even as more
and more data is collected, because the correct decision initially had a very low
probability: until the learner picks that correct decision, it does not learn about
the correct decision. This is similar to the situation in reinforcement learning
where only the reward for the selected action is observed. In general, reinforcement
learning can involve a sequence of many actions and many rewards. The bandits
scenario is a special case of reinforcement learning, in which the learner takes only
a single action and receives a single reward. The bandit problem is easier in the
sense that the learner knows which reward is associated with which action. In
the general reinforcement learning scenario, a high reward or a low reward might
have been caused by a recent action or by an action in the distant past. The term
contextual bandits refers to the case where the action is taken in the context of some input variable that can inform the decision. For example, we at least know the user identity, and we want to pick an item. The mapping from context to action is also called a policy. The feedback loop between the learner and the data
distribution (which now depends on the actions of the learner) is a central research
issue in the reinforcement learning and bandits literature.
Reinforcement learning requires choosing a trade-off between exploration and exploitation. Exploitation refers to taking actions that come from the current,
best version of the learned policy—actions that we know will achieve a high reward.
Exploration refers to taking actions specifically to obtain more training data. If we know that given context x, action a gives us a reward of 1, we do not know whether that is the best possible reward. We may want to exploit our current policy and continue taking action a to be relatively sure of obtaining a reward of 1. However, we may also want to explore by trying action a′. We do not know what will happen if we try action a′. We hope to get a reward of 2, but we run the risk of getting a reward of 0. Either way, we at least gain some knowledge.
Exploration can be implemented in many ways, ranging from occasionally
taking random actions intended to cover the entire space of possible actions, to
model-based approaches that compute a choice of action based on its expected
reward and the model’s amount of uncertainty about that reward.
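One of the simplest strategies of the first kind is epsilon-greedy action selection, sketched below; the context, the action set, and the estimated-reward function are hypothetical placeholders.

    import numpy as np

    def epsilon_greedy(context, actions, estimated_reward, epsilon=0.1, rng=None):
        """With probability epsilon, explore by trying a random action; otherwise
        exploit by taking the action with the highest estimated reward for this context."""
        if rng is None:
            rng = np.random.default_rng()
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]   # explore
        rewards = [estimated_reward(context, a) for a in actions]
        return actions[int(np.argmax(rewards))]          # exploit

    # toy usage: the reward model is a placeholder function of (context, action)
    pick = epsilon_greedy(context={"user": 7}, actions=["item_a", "item_b", "item_c"],
                          estimated_reward=lambda ctx, a: {"item_a": 0.2, "item_b": 0.5, "item_c": 0.1}[a])
    print(pick)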
Many factors determine the extent to which we prefer exploration or exploitation.
One of the most prominent factors is the time scale we are interested in. If the
agent has only a short amount of time to accrue reward, then we prefer more
exploitation. If the agent has a long time to accrue reward, then we begin with
more exploration so that future actions can be planned more effectively with more
knowledge. As time progresses and our learned policy improves, we move toward
more exploitation.
Supervised learning has no trade-off between exploration and exploitation
because the supervision signal always specifies which output is correct for each
input. There is no need to try out different outputs to determine if one is better
than the model’s current output—we always know that the label is the best output.
Another difficulty arising in the context of reinforcement learning, besides the
exploration-exploitation trade-off, is the difficulty of evaluating and comparing
different policies. Reinforcement learning involves interaction between the learner
and the environment. This feedback loop means that it is not straightforward to
evaluate the learner’s performance using a fixed set of test set input values. The
policy itself determines which inputs will be seen. Dudik et al. (2011) present
techniques for evaluating contextual bandits.
12.5.2 Knowledge Representation, Reasoning and Question
Answering
Deep learning approaches have been very successful in language modeling, machine
translation and natural language processing because of the use of embeddings for
symbols (Rumelhart et al., 1986a) and words (Deerwester et al., 1990; Bengio et al.,
2001). These embeddings represent semantic knowledge about individual words
and concepts. A research frontier is to develop embeddings for phrases and for
relations between words and facts. Search engines already use machine learning for
this purpose, but much more remains to be done to improve these more advanced
representations.
12.5.2.1 Knowledge, Relations and Question Answering
One interesting research direction is determining how distributed representations
can be trained to capture the relations between two entities. These relations allow us to formalize facts about objects and how objects interact with each other.

In mathematics, a binary relation is a set of ordered pairs of objects. Pairs that are in the set are said to have the relation while those not in the set do not. For example, we can define the relation "is less than" on the set of entities {1, 2, 3} by defining the set of ordered pairs S = {(1, 2), (1, 3), (2, 3)}. Once this relation is defined, we can use it like a verb. Because (1, 2) ∈ S, we say that 1 is less than 2. Because (2, 1) ∉ S, we cannot say that 2 is less than 1. Of course, the entities that are related to one another need not be numbers. We could define a relation is_a_type_of containing tuples like (dog, mammal).
In the context of AI, we think of a relation as a sentence in a syntactically
simple and highly structured language. The relation plays the role of a verb,
while two arguments to the relation play the role of its subject and object. These
sentences take the form of a triplet of tokens

(subject, verb, object)   (12.21)

with values

(entity_i, relation_j, entity_k).   (12.22)

We can also define an attribute, a concept analogous to a relation, but taking only one argument:

(entity_i, attribute_j).   (12.23)

For example, we could define the has_fur attribute, and apply it to entities like dog.
Many applications require representing relations and reasoning about them.
How should we best do this within the context of neural networks?
Machine learning models of course require training data. We can infer relations
between entities from training datasets consisting of unstructured natural language.
There are also structured databases that identify relations explicitly. A common
structure for these databases is the relational database, which stores this same kind of information, albeit not formatted as three token sentences. When a
database is intended to convey commonsense knowledge about everyday life or
expert knowledge about an application area to an artificial intelligence system, we
call the database a knowledge base. Knowledge bases range from general ones like Freebase, OpenCyc, WordNet, Wikibase,¹ and so forth, to more specialized knowledge bases like GeneOntology.² Representations for entities and relations can
be learned by considering each triplet in a knowledge base as a training example
and maximizing a training objective that captures their joint distribution (Bordes
et al., 2013a).
In addition to training data, we also need to define a model family to train.
A common approach is to extend neural language models to model entities and
relations. Neural language models learn a vector that provides a distributed
representation of each word. They also learn about interactions between words,
such as which word is likely to come after a sequence of words, by learning functions
of these vectors. We can extend this approach to entities and relations by learning
an embedding vector for each relation. In fact, the parallel between modeling
language and modeling knowledge encoded as relations is so close that researchers
have trained representations of such entities by using both knowledge bases and
natural language sentences (Bordes et al., 2011, 2012; Wang et al., 2014a), or by
combining data from multiple relational databases (Bordes et al., 2013b). Many
possibilities exist for the particular parametrization associated with such a model.
Early work on learning about relations between entities (Paccanaro and Hinton,
2000) posited highly constrained parametric forms (“linear relational embeddings”),
often using a different form of representation for the relation than for the entities.
For example, Paccanaro and Hinton (2000) and Bordes et al. (2011) used vectors for
entities and matrices for relations, with the idea that a relation acts like an operator
on entities. Alternatively, relations can be considered as any other entity (Bordes
et al., 2012), allowing us to make statements about relations, but more flexibility is
put in the machinery that combines them in order to model their joint distribution.
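As a generic illustration of the first style of parametrization (a vector per entity and a matrix per relation), the sketch below scores a triplet by how closely the relation matrix maps the subject embedding onto the object embedding; it is not a reproduction of any specific published model, and all parameters are random placeholders.

    import numpy as np

    def triplet_score(subject_vec, relation_mat, object_vec):
        """Score (subject, relation, object): higher when applying the relation
        matrix to the subject embedding lands near the object embedding."""
        return -np.linalg.norm(relation_mat @ subject_vec - object_vec)

    rng = np.random.default_rng(0)
    d = 10
    entities = {name: rng.standard_normal(d) for name in ("dog", "cat", "mammal")}
    relations = {"is_a_type_of": rng.standard_normal((d, d))}

    print(triplet_score(entities["dog"], relations["is_a_type_of"], entities["mammal"]))

Training would then push the scores of observed triplets above the scores of corrupted ones, for instance with a margin-based ranking criterion.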
A practical short-term application of such models is link prediction: predicting missing arcs in the knowledge graph. This is a form of generalization to new
facts based on old facts. Most of the knowledge bases that currently exist have
been constructed through manual labor, which tends to leave many and probably
the majority of true relations absent from the knowledge base. See Wang et al.
(2014b), Lin et al. (2015), and Garcia-Duran et al. (2015) for examples of such an
application.
¹ Respectively available from these web sites: freebase.com, cyc.com/opencyc, wordnet.princeton.edu, wikiba.se
² geneontology.org
Evaluating the performance of a model on a link prediction task is difficult
because we have only a dataset of positive examples (facts that are known to
be true). If the model proposes a fact that is not in the dataset, we are unsure
whether the model has made a mistake or discovered a new, previously unknown
fact. The metrics are thus somewhat imprecise and are based on testing how the
model ranks a held-out set of known true positive facts compared to other facts
that are less likely to be true. A common way to construct interesting examples
that are probably negative (facts that are probably false) is to begin with a true
fact and create corrupted versions of that fact, for example, by replacing one entity
in the relation with a different entity selected at random. The popular precision at
10 percent metric counts how many times the model ranks a “correct” fact among
the top 10 percent of all corrupted versions of that fact.
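A sketch of this evaluation protocol follows; the score function stands in for whichever trained model is being evaluated, and here the corruptions replace only the object entity.

    import numpy as np

    def rank_against_corruptions(true_fact, all_entities, score):
        """Rank a held-out true fact (s, r, o) against corrupted versions in which the
        object o is replaced by every other entity; rank 1 means it beats all corruptions."""
        s, r, o = true_fact
        true_score = score(s, r, o)
        corrupted_scores = [score(s, r, e) for e in all_entities if e != o]
        return 1 + sum(cs >= true_score for cs in corrupted_scores)

    def precision_at_top_fraction(ranks, n_candidates, fraction=0.10):
        """Fraction of held-out facts ranked within the top `fraction` of their candidate lists."""
        cutoff = max(1, int(fraction * n_candidates))
        return float(np.mean([r <= cutoff for r in ranks]))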
Another application of knowledge bases and distributed representations for
them is word-sense disambiguation (Navigli and Velardi, 2005; Bordes et al.,
2012), which is the task of deciding which sense of a word is the appropriate one
in some context.
Eventually, knowledge of relations combined with a reasoning process and an
understanding of natural language could allow us to build a general question-
answering system. A general question-answering system must be able to process
input information and remember important facts, organized in a way that enables
it to retrieve and reason about them later. This remains a difficult open problem
that can only be solved in restricted “toy” environments. Currently, the best
approach to remembering and retrieving specific declarative facts is to use an
explicit memory mechanism, as described in section 10.12. Memory networks were
first proposed to solve a toy question-answering task (Weston et al., 2014). Kumar
et al. (2015) have proposed an extension that uses GRU recurrent nets to read
the input into the memory and to produce the answer given the contents of the
memory.
Deep learning has been applied to many other applications besides the ones
described here, and will surely be applied to even more after this writing. It would
be impossible to describe anything remotely resembling a comprehensive coverage
of such a topic. This survey provides a representative sample of what is possible
as of this writing.
This concludes part II, which has described modern practices involving deep
networks, comprising all the most successful methods. Generally speaking, these
methods involve using the gradient of a cost function to find the parameters of a
model that approximates some desired function. With enough training data, this
approach is extremely powerful. We now turn to part III, in which we step into the
territory of research—methods that are designed to work with less training data
or to perform a greater variety of tasks, where the challenges are more difficult
and not as close to being solved as the situations we have described so far.