Chapter 7
Regularization for Deep Learning
A central problem in machine learning is how to make an algorithm that will
perform well not just on the training data, but also on new inputs. Many strategies
used in machine learning are explicitly designed to reduce the test error, possibly
at the expense of increased training error. These strategies are known collectively
as regularization. A great many forms of regularization are available to the deep
learning practitioner. In fact, developing more effective regularization strategies
has been one of the major research efforts in the field.
Chapter 5 introduced the basic concepts of generalization, underfitting, overfit-
ting, bias, variance and regularization. If you are not already familiar with these
notions, please refer to that chapter before continuing with this one.
In this chapter, we describe regularization in more detail, focusing on regular-
ization strategies for deep models or models that may be used as building blocks
to form deep models.
Some sections of this chapter deal with standard concepts in machine learning.
If you are already familiar with these concepts, feel free to skip the relevant
sections. However, most of this chapter is concerned with the extension of these
basic concepts to the particular case of neural networks.
In section 5.2.2, we defined regularization as “any modification we make to a
learning algorithm that is intended to reduce its generalization error but not its
training error.” There are many regularization strategies. Some put extra con-
straints on a machine learning model, such as adding restrictions on the
parameter
values. Some add extra terms in the objective function that can be thought of as
corresponding to a soft constraint on the parameter values. If chosen carefully,
these extra constraints and penalties can lead to improved performance on the
test set. Sometimes these constraints and penalties are designed to encode specific
kinds of prior knowledge. Other times, these constraints and penalties are designed
to express a generic preference for a simpler model class in order to promote
generalization. Sometimes penalties and constraints are necessary to make an
underdetermined problem determined. Other forms of regularization, known as
ensemble methods, combine multiple hypotheses that explain the training data.
In the context of deep learning, most regularization strategies are based on
regularizing estimators. Regularization of an estimator works by trading increased
bias for reduced variance. An effective regularizer is one that makes a profitable
trade, reducing variance significantly while not overly increasing the bias. When we
discussed generalization and overfitting in chapter 5, we focused on three situations,
where the model family being trained either (1) excluded the true data-generating
process—corresponding to underfitting and inducing bias, or (2) matched the true
data-generating process, or (3) included the generating process but also many
other possible generating processes—the overfitting regime where variance rather
than bias dominates the estimation error. The goal of regularization is to take a
model from the third regime into the second regime.
In practice, an overly complex model family does not necessarily include the
target function or the true data-generating process, or even a close approximation
of either. We almost never have access to the true data-generating process so
we can never know for sure if the model family being estimated includes the
generating process or not. Most applications of deep learning algorithms, however,
are to domains where the true data-generating process is almost certainly outside
the model family. Deep learning algorithms are typically applied to extremely
complicated domains such as images, audio sequences and text, for which the true
generation process essentially involves simulating the entire universe. To some
extent, we are always trying to fit a square peg (the data-generating process) into
a round hole (our model family).
What this means is that controlling the complexity of the model is not a
simple matter of finding the model of the right size, with the right number of
parameters. Instead, we might find—and indeed in practical deep learning scenarios,
we almost always do find—that the best fitting model (in the sense of minimizing
generalization error) is a large model that has been regularized appropriately.
We now review several strategies for how to create such a large, deep regularized
model.
7.1 Parameter Norm Penalties
Regularization has been used for decades prior to the advent of deep learning. Linear
models such as linear regression and logistic regression allow simple, straightforward,
and effective regularization strategies.
Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ),    (7.1)
where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term, Ω, relative to the standard objective function J. Setting α to 0 results in no regularization. Larger values of α correspond to more regularization.

When our training algorithm minimizes the regularized objective function J̃, it will decrease both the original objective J on the training data and some measure of the size of the parameters θ (or some subset of the parameters). Different choices for the parameter norm Ω can result in different solutions being preferred. In this section, we discuss the effects of the various norms when used as penalties on the model parameters.
Before delving into the regularization behavior of different norms, we note
that for neural networks, we typically choose to use a parameter norm penalty
that penalizes only the weights of the affine transformation at each layer and
leaves the biases unregularized. The biases typically require less data than the
weights to fit accurately. Each weight specifies how two variables interact. Fitting
the weight well requires observing both variables in a variety of conditions. Each
bias controls only a single variable. This means that we do not induce too much
variance by leaving the biases unregularized. Also, regularizing the bias parameters
can introduce a significant amount of underfitting. We therefore use the vector
w
to indicate all the weights that should be affected by a norm penalty, while
the vector
θ
denotes all the parameters, including both
w
and the unregularized
parameters.
In the context of neural networks, it is sometimes desirable to use a separate
penalty with a different
α
coefficient for each layer of the network. Because it can
be expensive to search for the correct value of multiple hyperparameters, it is still
reasonable to use the same weight decay at all layers just to reduce the size of
search space.
7.1.1 L² Parameter Regularization

We have already seen, in section 5.2.2, one of the simplest and most common kinds of parameter norm penalty: the L² parameter norm penalty commonly known as weight decay. This regularization strategy drives the weights closer to the origin¹ by adding a regularization term Ω(θ) = ½‖w‖₂² to the objective function. In other academic communities, L² regularization is also known as ridge regression or Tikhonov regularization.
We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so θ is just w. Such a model has the following total objective function:

J̃(w; X, y) = (α/2) w⊤w + J(w; X, y),    (7.2)

with the corresponding parameter gradient

∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y).    (7.3)

To take a single gradient step to update the weights, we perform this update (where ε is the learning rate):

w ← w − ε(αw + ∇_w J(w; X, y)).    (7.4)

Written another way, the update is

w ← (1 − εα)w − ε∇_w J(w; X, y).    (7.5)

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step. But what happens over the entire course of training?
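The shrink-then-step view of equation 7.5 is easy to state in code. Below is a minimal sketch, assuming a generic grad_J function that returns ∇_w J(w; X, y); the names and data are illustrative, not from the text.

```python
import numpy as np

def weight_decay_step(w, grad_J, X, y, lr=0.1, alpha=1e-2):
    """One gradient step on the L2-regularized objective (equation 7.5).

    Equivalent views:
      w <- w - lr * (alpha * w + grad_J(w, X, y))       # eq. 7.4
      w <- (1 - lr * alpha) * w - lr * grad_J(w, X, y)  # eq. 7.5
    """
    return (1.0 - lr * alpha) * w - lr * grad_J(w, X, y)

# Example with a linear least-squares objective J(w) = 0.5 * ||Xw - y||^2,
# whose gradient is X^T (Xw - y).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
grad_J = lambda w, X, y: X.T @ (X @ w - y)

w = np.zeros(5)
for _ in range(100):
    w = weight_decay_step(w, grad_J, X, y)
```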
We will further simplify the analysis by making a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, w∗ = arg min_w J(w). If the objective function is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect. The approximation Ĵ is given by

Ĵ(θ) = J(w∗) + ½(w − w∗)⊤H(w − w∗),    (7.6)

where H is the Hessian matrix of J with respect to w evaluated at w∗. There is no first-order term in this quadratic approximation, because w∗ is defined to be a minimum, where the gradient vanishes. Likewise, because w∗ is the location of a minimum of J, we can conclude that H is positive semidefinite.

¹More generally, we could regularize the parameters to be near any specific point in space and, surprisingly, still get a regularization effect, but better results will be obtained for a value closer to the true one, with zero being a default value that makes sense when we do not know if the correct value should be positive or negative. Since it is far more common to regularize the model parameters toward zero, we will focus on this special case in our exposition.
The minimum of Ĵ occurs where its gradient

∇_w Ĵ(w) = H(w − w∗)    (7.7)

is equal to 0.
To study the effect of weight decay, we modify equation 7.7 by adding the weight decay gradient. We can now solve for the minimum of the regularized version of Ĵ. We use the variable w̃ to represent the location of the minimum.

αw̃ + H(w̃ − w∗) = 0    (7.8)
(H + αI)w̃ = Hw∗    (7.9)
w̃ = (H + αI)⁻¹Hw∗    (7.10)
As α approaches 0, the regularized solution w̃ approaches w∗. But what happens as α grows? Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors, Q, such that H = QΛQ⊤. Applying the decomposition to equation 7.10, we obtain

w̃ = (QΛQ⊤ + αI)⁻¹QΛQ⊤w∗    (7.11)
  = [Q(Λ + αI)Q⊤]⁻¹QΛQ⊤w∗    (7.12)
  = Q(Λ + αI)⁻¹ΛQ⊤w∗.    (7.13)

We see that the effect of weight decay is to rescale w∗ along the axes defined by the eigenvectors of H. Specifically, the component of w∗ that is aligned with the i-th eigenvector of H is rescaled by a factor of λᵢ/(λᵢ + α). (You may wish to review how this kind of scaling works, first explained in figure 2.3.)
Along the directions where the eigenvalues of H are relatively large, for example, where λᵢ ≫ α, the effect of regularization is relatively small. Yet components with λᵢ ≪ α will be shrunk to have nearly zero magnitude. This effect is illustrated in figure 7.1.
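To make the rescaling concrete, here is a small numerical sketch (assuming a hand-built quadratic with Hessian H; the values are illustrative) that compares the closed-form regularized minimum (H + αI)⁻¹Hw∗ of equation 7.10 with the per-eigendirection shrinkage factors λᵢ/(λᵢ + α) of equation 7.13:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A.T @ A + np.eye(4)          # a positive definite Hessian
w_star = rng.normal(size=4)      # unregularized minimum w*
alpha = 0.5

# Closed-form minimum of the regularized quadratic (equation 7.10).
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same solution expressed in the eigenbasis of H (equation 7.13):
# each component of Q^T w* is shrunk by lambda_i / (lambda_i + alpha).
lam, Q = np.linalg.eigh(H)
shrink = lam / (lam + alpha)
w_tilde_eig = Q @ (shrink * (Q.T @ w_star))

assert np.allclose(w_tilde, w_tilde_eig)
```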
Figure 7.1: An illustration of the effect of L² (or weight decay) regularization on the value of the optimal w. The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the L² regularizer. At the point w̃, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of J is small. The objective function does not increase much when moving horizontally away from w∗. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis. The regularizer pulls w₁ close to zero. In the second dimension, the objective function is very sensitive to movements away from w∗. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of w₂ relatively little.
Only directions along which the parameters contribute significantly to reducing
the objective function are preserved relatively intact. In directions that do not
contribute to reducing the objective function, a small eigenvalue of the Hessian
tells us that movement in this direction will not significantly increase the gradient.
Components of the weight vector corresponding to such unimportant directions
are decayed away through the use of the regularization throughout training.
So far we have discussed weight decay in terms of its effect on the optimization
of an abstract, general quadratic cost function. How do these effects relate to
machine learning in particular? We can find out by studying linear regression, a
model for which the true cost function is quadratic and therefore amenable to the
same kind of analysis we have used so far. Applying the analysis again, we will
be able to obtain a special case of the same results, but with the solution now
phrased in terms of the training data. For linear regression, the cost function is
the sum of squared errors:
(Xw − y)⊤(Xw − y).    (7.14)
When we add L² regularization, the objective function changes to

(Xw − y)⊤(Xw − y) + ½αw⊤w.    (7.15)

This changes the normal equations for the solution from

w = (X⊤X)⁻¹X⊤y    (7.16)

to

w = (X⊤X + αI)⁻¹X⊤y.    (7.17)

The matrix X⊤X in equation 7.16 is proportional to the covariance matrix (1/m)X⊤X. Using L² regularization replaces this matrix with (X⊤X + αI)⁻¹ in equation 7.17. The new matrix is the same as the original one, but with the addition of α to the diagonal. The diagonal entries of this matrix correspond to the variance of each input feature. We can see that L² regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
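A minimal sketch of the two closed-form solutions (equations 7.16 and 7.17), using numpy; the data here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)
alpha = 1.0

# Ordinary least squares (equation 7.16).
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Weight decay / ridge regression (equation 7.17): alpha is added
# to the diagonal of X^T X before solving.
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

print(np.linalg.norm(w_ridge), "<=", np.linalg.norm(w_ols))
```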
7.1.2 L¹ Regularization

While L² weight decay is the most common form of weight decay, there are other ways to penalize the size of the model parameters. Another option is to use L¹ regularization.
Formally, L¹ regularization on the model parameters w is defined as

Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|,    (7.18)

that is, as the sum of absolute values of the individual parameters.² We will now discuss the effect of L¹ regularization on the simple linear regression model, with no bias parameter, that we studied in our analysis of L² regularization. In particular, we are interested in delineating the differences between L¹ and L² forms of regularization. As with L² weight decay, L¹ weight decay controls the strength of the regularization by scaling the penalty using a positive hyperparameter α. Thus, the regularized objective function J̃(w; X, y) is given by

J̃(w; X, y) = α‖w‖₁ + J(w; X, y),    (7.19)

with the corresponding gradient (actually, subgradient)

∇_w J̃(w; X, y) = α sign(w) + ∇_w J(X, y; w),    (7.20)

where sign(w) is simply the sign of w applied element-wise.
By inspecting equation 7.20, we can see immediately that the effect of L¹ regularization is quite different from that of L² regularization. Specifically, we can see that the regularization contribution to the gradient no longer scales linearly with each wᵢ; instead it is a constant factor with a sign equal to sign(wᵢ). One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J(X, y; w) as we did for L² regularization.
Our simple linear model has a quadratic cost function that we can represent via its Taylor series. Alternately, we could imagine that this is a truncated Taylor series approximating the cost function of a more sophisticated model. The gradient in this setting is given by

∇_w Ĵ(w) = H(w − w∗),    (7.21)

where, again, H is the Hessian matrix of J with respect to w evaluated at w∗. Because the L¹ penalty does not admit clean algebraic expressions in the case of a fully general Hessian, we will also make the further simplifying assumption that the Hessian is diagonal, H = diag([H₁,₁, . . . , Hₙ,ₙ]), where each Hᵢ,ᵢ > 0.
²As with L² regularization, we could regularize the parameters toward a value that is not zero, but instead toward some parameter value w^(o). In that case the L¹ regularization would introduce the term Ω(θ) = ‖w − w^(o)‖₁ = Σᵢ |wᵢ − wᵢ^(o)|.
This assumption holds if the data for the linear regression problem has been
preprocessed to remove all correlation between the input features, which may be
accomplished using PCA.
Our quadratic approximation of the L¹ regularized objective function decomposes into a sum over the parameters:

Ĵ(w; X, y) = J(w∗; X, y) + Σᵢ [ ½Hᵢ,ᵢ(wᵢ − wᵢ∗)² + α|wᵢ| ].    (7.22)

The problem of minimizing this approximate cost function has an analytical solution (for each dimension i), with the following form:

wᵢ = sign(wᵢ∗) max{ |wᵢ∗| − α/Hᵢ,ᵢ , 0 }.    (7.23)
Consider the situation where wᵢ∗ > 0 for all i. There are two possible outcomes:

1. The case where wᵢ∗ ≤ α/Hᵢ,ᵢ. Here the optimal value of wᵢ under the regularized objective is simply wᵢ = 0. This occurs because the contribution of J(w; X, y) to the regularized objective J̃(w; X, y) is overwhelmed—in direction i—by the L¹ regularization, which pushes the value of wᵢ to zero.

2. The case where wᵢ∗ > α/Hᵢ,ᵢ. In this case, the regularization does not move the optimal value of wᵢ to zero but instead just shifts it in that direction by a distance equal to α/Hᵢ,ᵢ.

A similar process happens when wᵢ∗ < 0, but with the L¹ penalty making wᵢ less negative by α/Hᵢ,ᵢ, or 0.
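Equation 7.23 is a per-coordinate soft-thresholding operation. A minimal numpy sketch, assuming the diagonal Hessian entries are available as an array h (the values below are illustrative):

```python
import numpy as np

def l1_shrink(w_star, h, alpha):
    """Minimize 0.5*h_i*(w_i - w*_i)^2 + alpha*|w_i| per coordinate (eq. 7.23).

    Components with |w*_i| <= alpha / h_i are set exactly to zero;
    the rest are shifted toward zero by alpha / h_i.
    """
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h, 0.0)

w_star = np.array([0.8, -0.05, 0.3, -2.0])
h      = np.array([1.0,  1.0,  0.5,  2.0])
print(l1_shrink(w_star, h, alpha=0.2))   # small components become exactly 0
```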
In comparison to L² regularization, L¹ regularization results in a solution that is more sparse. Sparsity in this context refers to the fact that some parameters have an optimal value of zero. The sparsity of L¹ regularization is a qualitatively different behavior than arises with L² regularization. Equation 7.13 gave the solution w̃ for L² regularization. If we revisit that equation using the assumption of a diagonal and positive definite Hessian H that we introduced for our analysis of L¹ regularization, we find that w̃ᵢ = [Hᵢ,ᵢ/(Hᵢ,ᵢ + α)] wᵢ∗. If wᵢ∗ was nonzero, then w̃ᵢ remains nonzero. This demonstrates that L² regularization does not cause the parameters to become sparse, while L¹ regularization may do so for large enough α.
The sparsity property induced by L¹ regularization has been used extensively as a feature selection mechanism. Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used. In particular, the well known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model integrates an L¹ penalty with a linear model and a least-squares cost function. The L¹ penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
In section 5.6.1, we saw that many regularization strategies can be interpreted as MAP Bayesian inference, and that in particular, L² regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights. For L¹ regularization, the penalty αΩ(w) = α Σᵢ |wᵢ| used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution (equation 3.26) over w ∈ ℝⁿ:

log p(w) = Σᵢ log Laplace(wᵢ; 0, 1/α) = −α‖w‖₁ + n log α − n log 2.    (7.24)

From the point of view of learning via maximization with respect to w, we can ignore the log α − log 2 terms because they do not depend on w.
7.2 Norm Penalties as Constrained Optimization
Consider the cost function regularized by a parameter norm penalty:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).    (7.25)

Recall from section 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized Lagrange function

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).    (7.26)

The solution to the constrained problem is given by

θ∗ = arg min_θ max_{α, α≥0} L(θ, α).    (7.27)
As described in section 4.4, solving this problem requires modifying both θ and α. Section 4.5 provides a worked example of linear regression with an L² constraint. Many different procedures are possible—some may use gradient descent, while others may use analytical solutions for where the gradient is zero—but in all procedures α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k. All positive α encourage Ω(θ) to shrink. The optimal value α∗ will encourage Ω(θ) to shrink, but not so strongly to make Ω(θ) become less than k.

To gain some insight into the effect of the constraint, we can fix α∗ and view the problem as just a function of θ:

θ∗ = arg min_θ L(θ, α∗) = arg min_θ J(θ; X, y) + α∗Ω(θ).    (7.28)
This is exactly the same as the regularized training problem of minimizing J̃. We can thus think of a parameter norm penalty as imposing a constraint on the weights. If Ω is the L² norm, then the weights are constrained to lie in an L² ball. If Ω is the L¹ norm, then the weights are constrained to lie in a region of limited L¹ norm. Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient α because the value of α does not directly tell us the value of k. In principle, one can solve for k, but the relationship between k and α depends on the form of J. While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing α in order to grow or shrink the constraint region. Larger α will result in a smaller constraint region. Smaller α will result in a larger constraint region.
Sometimes we may wish to use explicit constraints rather than penalties. As
described in section 4.4, we can modify algorithms such as stochastic gradient
descent to take a step downhill on
J
(
θ
) and then project
θ
back to the nearest
point that satisfies Ω(
θ
)
< k
. This can be useful if we have an idea of what value
of
k
is appropriate and do not want to spend time searching for the value of
α
that
corresponds to this k.
Another reason to use explicit constraints and reprojection rather than enforcing
constraints with penalties is that penalties can cause nonconvex optimization
procedures to get stuck in local minima corresponding to small
θ
. When training
neural networks, this usually manifests as neural networks that train with several
“dead units.” These are units that do not contribute much to the behavior of the
function learned by the network because the weights going into or out of them are
all very small. When training with a penalty on the norm of the weights, these
configurations can be locally optimal, even if it is possible to significantly reduce
J
by making the weights larger. Explicit constraints implemented by reprojection
can work much better in these cases because they do not encourage the weights
to approach the origin. Explicit constraints implemented by reprojection have an
effect only when the weights become large and attempt to leave the constraint
region.
Finally, explicit constraints with reprojection can be useful because they impose
some stability on the optimization procedure. When using high learning rates, it
is possible to encounter a positive feedback loop in which large weights induce
large gradients, which then induce a large update to the weights. If these updates
consistently increase the size of the weights, then
θ
rapidly moves away from
the origin until numerical overflow occurs. Explicit constraints with reprojection
prevent this feedback loop from continuing to increase the magnitude of the weights
without bound. Hinton et al. (2012c) recommend using constraints combined with a
high learning rate to enable rapid exploration of parameter space while maintaining
some stability.
In particular, Hinton et al. (2012c) recommend a strategy introduced by Srebro
and Shraibman (2005): constraining the norm of each column of the weight matrix
of a neural net layer, rather than constraining the Frobenius norm of the entire
weight matrix. Constraining the norm of each column separately prevents any one
hidden unit from having very large weights. If we converted this constraint into a
penalty in a Lagrange function, it would be similar to
L
2
weight decay but with a
separate KKT multiplier for the weights of each hidden unit. Each of these KKT
multipliers would be dynamically updated separately to make each hidden unit
obey the constraint. In practice, column norm limitation is always implemented as
an explicit constraint with reprojection.
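A minimal sketch of this kind of explicit constraint with reprojection: after each gradient step, each column of a layer's weight matrix is rescaled (if necessary) so that its L² norm does not exceed a maximum value. The names and threshold are illustrative, not from the text.

```python
import numpy as np

def project_columns(W, max_norm=3.0):
    """Reproject each column of W onto the L2 ball of radius max_norm.

    Columns whose norm is already within the limit are left untouched, so
    the constraint only acts when a hidden unit's weights try to grow large.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

# Typical use inside a training loop (grad is assumed to be computed elsewhere):
# W = W - lr * grad
# W = project_columns(W, max_norm=3.0)
```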
7.3 Regularization and Under-Constrained Problems
In some cases, regularization is necessary for machine learning problems to be properly defined. Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix X⊤X. This is not possible when X⊤X is singular. This matrix can be singular whenever the data-generating distribution truly has no variance in some direction, or when no variance is observed in some direction because there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting X⊤X + αI instead. This regularized matrix is guaranteed to be invertible.
These linear problems have closed form solutions when the relevant matrix
is invertible. It is also possible for a problem with no closed form solution to be
underdetermined. An example is logistic regression applied to a problem where
the classes are linearly separable. If a weight vector
w
is able to achieve perfect
classification, then 2
w
will also achieve perfect classification and higher likelihood.
An iterative optimization procedure like stochastic gradient descent will continually
increase the magnitude of
w
and, in theory, will never halt. In practice, a numerical
implementation of gradient descent will eventually reach sufficiently large weights
to cause numerical overflow, at which point its behavior will depend on how the
programmer has decided to handle values that are not real numbers.
Most forms of regularization are able to guarantee the convergence of iterative
methods applied to underdetermined problems. For example, weight decay will
cause gradient descent to quit increasing the magnitude of the weights when the
slope of the likelihood is equal to the weight decay coefficient.
The idea of using regularization to solve underdetermined problems extends
beyond machine learning. The same idea is useful for several basic linear algebra
problems.
As we saw in section 2.9, we can solve underdetermined linear equations using the Moore-Penrose pseudoinverse. Recall that one definition of the pseudoinverse X⁺ of a matrix X is

X⁺ = lim_{α→0} (X⊤X + αI)⁻¹X⊤.    (7.29)

We can now recognize equation 7.29 as performing linear regression with weight decay. Specifically, equation 7.29 is the limit of equation 7.17 as the regularization coefficient shrinks to zero. We can thus interpret the pseudoinverse as stabilizing underdetermined problems using regularization.
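The limit in equation 7.29 can be checked numerically: for a small α, the ridge solution of equation 7.17 is close to the pseudoinverse solution. A small illustrative sketch, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))   # fewer examples than features: X^T X is singular
y = rng.normal(size=3)
alpha = 1e-8

w_pinv  = np.linalg.pinv(X) @ y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)

print(np.max(np.abs(w_pinv - w_ridge)))   # tiny for small alpha
```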
7.4 Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on
more data. Of course, in practice, the amount of data we have is limited. One way
to get around this problem is to create fake data and add it to the training set.
For some machine learning tasks, it is reasonably straightforward to create new
fake data.
This approach is easiest for classification. A classifier needs to take a complicat-
ed, high-dimensional input
x
and summarize it with a single category identity
y
.
This means that the main task facing a classifier is to be invariant to a wide variety
of transformations. We can generate new (
x, y
) pairs easily just by transforming
the x inputs in our training set.
This approach is not as readily applicable to many other tasks. For example, it
is difficult to generate new fake data for a density estimation task unless we have
already solved the density estimation problem.
Dataset augmentation has been a particularly effective technique for a specific
classification problem: object recognition. Images are high dimensional and include
an enormous range of factors of variation, many of which can be easily simulated.
Operations like translating the training images a few pixels in each direction can
often greatly improve generalization, even if the model has already been designed to
be partially translation invariant by using the convolution and pooling techniques
described in chapter 9. Many other operations, such as rotating the image or
scaling the image, have also proved quite effective.
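As a small illustration (not from the text), a random shift of a few pixels can be applied directly to image arrays; np.roll is used here purely as a sketch of the idea:

```python
import numpy as np

def random_translate(image, max_shift=2, rng=np.random.default_rng()):
    """Return a copy of `image` shifted by a few pixels in each direction.

    A crude augmentation sketch: pixels that wrap around are ignored here,
    but a real pipeline would pad or crop instead of rolling.
    """
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, shift=(dy, dx), axis=(0, 1))

# Augmenting a batch of 28x28 grayscale images while keeping the labels fixed:
rng = np.random.default_rng(0)
batch = rng.random(size=(8, 28, 28))
augmented = np.stack([random_translate(img, rng=rng) for img in batch])
```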
One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between “b” and “d” and the difference between “6” and “9,” so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks.
There are also transformations that we would like our classifiers to be invariant
to but that are not easy to perform. For example, out-of-plane rotation cannot be
implemented as a simple geometric operation on the input pixels.
Dataset augmentation is effective for speech recognition tasks as well (Jaitly
and Hinton, 2013).
Injecting noise in the input to a neural network (Sietsma and Dow, 1991)
can also be seen as a form of data augmentation. For many classification and
even some regression tasks, the task should still be possible to solve even if small
random noise is added to the input. Neural networks prove not to be very robust
to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness
of neural networks is simply to train them with random noise applied to their
inputs. Input noise injection is part of some unsupervised learning algorithms,
such as the denoising autoencoder (Vincent et al., 2008). Noise injection also
works when the noise is applied to the hidden units, which can be seen as doing
dataset augmentation at multiple levels of abstraction. Poole et al. (2014) recently
showed that this approach can be highly effective provided that the magnitude of
the noise is carefully tuned. Dropout, a powerful regularization strategy that will
be described in section 7.12, can be seen as a process of constructing new inputs
by multiplying by noise.
When comparing machine learning benchmark results, taking the effect of
dataset augmentation into account is important. Often, hand-designed dataset
augmentation schemes can dramatically reduce the generalization error of a ma-
chine learning technique. To compare the performance of one machine learning
algorithm to another, it is necessary to perform controlled experiments. When
comparing machine learning algorithm A and machine learning algorithm B, make
sure that both algorithms are evaluated using the same hand-designed dataset
augmentation schemes. Suppose that algorithm A performs poorly with no dataset
augmentation, and algorithm B performs well when combined with numerous syn-
thetic transformations of the input. In such a case the synthetic transformations
likely caused the improved performance, rather than the use of machine learning
algorithm B. Sometimes deciding whether an experiment has been properly con-
trolled requires subjective judgment. For example, machine learning algorithms
that inject noise into the input are performing a form of dataset augmentation.
Usually, operations that are generally applicable (such as adding Gaussian noise to
the input) are considered part of the machine learning algorithm, while operations
that are specific to one application domain (such as randomly cropping an image)
are considered to be separate preprocessing steps.
7.5 Noise Robustness
Section 7.4 has motivated the use of noise applied to the inputs as a dataset
augmentation strategy. For some models, the addition of noise with infinitesimal
variance at the input of the model is equivalent to imposing a penalty on the
norm of the weights (Bishop, 1995a,b). In the general case, it is important to
remember that noise injection can be much more powerful than simply shrinking
the parameters, especially when the noise is added to the hidden units. Noise
applied to the hidden units is such an important topic that it merits its own
separate discussion; the dropout algorithm described in section 7.12 is the main
development of that approach.
Another way that noise has been used in the service of regularizing models
is by adding it to the weights. This technique has been used primarily in the
context of recurrent neural networks (Jim et al., 1996; Graves, 2011). This can
be interpreted as a stochastic implementation of Bayesian inference over the
weights. The Bayesian treatment of learning would consider the model weights
to be uncertain and representable via a probability distribution that reflects this
uncertainty. Adding noise to the weights is a practical, stochastic way to reflect
this uncertainty.
Noise applied to the weights can also be interpreted as equivalent (under some
assumptions) to a more traditional form of regularization, encouraging stability of
the function to be learned. Consider the regression setting, where we wish to train
a function ŷ(x) that maps a set of features x to a scalar using the least-squares cost function between the model predictions ŷ(x) and the true values y:

J = E_{p(x,y)} [(ŷ(x) − y)²].    (7.30)

The training set consists of m labeled examples {(x⁽¹⁾, y⁽¹⁾), . . . , (x⁽ᵐ⁾, y⁽ᵐ⁾)}.
We now assume that with each input presentation we also include a random perturbation ε_W ∼ N(ε; 0, ηI) of the network weights. Let us imagine that we have a standard l-layer MLP. We denote the perturbed model as ŷ_{ε_W}(x). Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network. The objective function thus becomes

J̃_W = E_{p(x,y,ε_W)} [(ŷ_{ε_W}(x) − y)²]    (7.31)
    = E_{p(x,y,ε_W)} [ŷ²_{ε_W}(x) − 2y ŷ_{ε_W}(x) + y²].    (7.32)

For small η, the minimization of J with added weight noise (with covariance ηI) is equivalent to minimization of J with an additional regularization term: η E_{p(x,y)}[‖∇_W ŷ(x)‖²]. This form of regularization encourages the parameters to go to regions of parameter space where small perturbations of the weights have a relatively small influence on the output. In other words, it pushes the model into regions where the model is relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions (Hochreiter and Schmidhuber, 1995). In the simplified case of linear regression (where, for instance, ŷ(x) = w⊤x + b), this regularization term collapses into η E_{p(x)}[‖x‖²], which is not a function of parameters and therefore does not contribute to the gradient of J̃_W with respect to the model parameters.
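A minimal sketch of this kind of weight-noise injection for a tiny linear regression model; sampling ε_W ∼ N(0, ηI) at each presentation is the only change to an otherwise ordinary SGD loop (the model and data here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

w = np.zeros(10)
eta, lr = 1e-3, 1e-2
for step in range(1000):
    i = rng.integers(len(X))
    # Sample a fresh weight perturbation for this presentation.
    eps_w = rng.normal(scale=np.sqrt(eta), size=w.shape)
    w_noisy = w + eps_w
    # Squared-error gradient evaluated at the perturbed weights.
    err = w_noisy @ X[i] - y[i]
    w -= lr * err * X[i]
```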
7.5.1 Injecting Noise at the Output Targets
Most datasets have some number of mistakes in the y labels. It can be harmful to maximize log p(y | x) when y is a mistake. One way to prevent this is to explicitly model the noise on the labels. For example, we can assume that for some small constant ε, the training set label y is correct with probability 1 − ε, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples. For example, label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ε/(k − 1) and 1 − ε, respectively. The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with a softmax classifier and hard targets may actually never converge—the softmax can never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever. It is possible to prevent this scenario using other regularization strategies like weight decay. Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification. This strategy has been used since the 1980s and continues to be featured prominently in modern neural networks (Szegedy et al., 2015).
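A minimal sketch of constructing label-smoothed targets for a k-class problem; the ε value is illustrative:

```python
import numpy as np

def smooth_labels(labels, k, eps=0.1):
    """Replace hard one-hot targets with label-smoothed targets.

    The correct class gets probability 1 - eps; the remaining eps is
    spread evenly over the other k - 1 classes (eps / (k - 1) each).
    """
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

print(smooth_labels(np.array([0, 2]), k=3, eps=0.1))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]]
```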
7.6 Semi-Supervised Learning
In the paradigm of semi-supervised learning, both unlabeled examples from
P
(
x
)
and labeled examples from
P
(
x, y
) are used to estimate
P
(
y | x
) or predict
y
from x.
In the context of deep learning, semi-supervised learning usually refers to
learning a representation
h
=
f
(
x
)
.
The goal is to learn a representation so
that examples from the same class have similar representations. Unsupervised
learning can provide useful clues for how to group examples in representation
space. Examples that cluster tightly in the input space should be mapped to
similar representations. A linear classifier in the new space may achieve better
generalization in many cases (Belkin and Niyogi, 2002; Chapelle et al., 2003). A
long-standing variant of this approach is the application of principal components
analysis as a preprocessing step before applying a classifier (on the projected data).
Instead of having separate unsupervised and supervised components in the
model, one can construct models in which a generative model of either
P
(
x
) or
P
(
x, y
) shares parameters with a discriminative model of
P
(
y | x
). One can
then trade off the supervised criterion
log P
(
y | x
) with the unsupervised or
generative one (such as
log P
(
x
) or
log P
(
x, y
)). The generative criterion then
expresses a particular form of prior belief about the solution to the supervised
learning problem (Lasserre et al., 2006), namely that the structure of
P
(
x
) is
connected to the structure of
P
(
y | x
) in a way that is captured by the shared
parametrization. By controlling how much of the generative criterion is included
in the total criterion, one can find a better trade-off than with a purely generative
or a purely discriminative training criterion (Lasserre et al., 2006; Larochelle and
Bengio, 2008).
Salakhutdinov and Hinton (2008) describe a method for learning the kernel
function of a kernel machine used for regression, in which the usage of unlabeled
examples for modeling P (x) improves P (y | x) quite significantly.
See Chapelle et al. (2006) for more information about semi-supervised learning.
7.7 Multitask Learning
Multitask learning (Caruana, 1993) is a way to improve generalization by pooling
the examples (which can be seen as soft constraints imposed on the parameters)
arising out of several tasks. In the same way that additional training examples
put more pressure on the parameters of the model toward values that generalize
well, when part of a model is shared across tasks, that part of the model is more
constrained toward good values (assuming the sharing is justified), often yielding
better generalization.
Figure 7.2 illustrates a very common form of multitask learning, in which
different supervised tasks (predicting
y
(i)
given
x
) share the same input
x
, as well
as some intermediate-level representation
h
(shared)
, capturing a common pool of
factors. The model can generally be divided into two kinds of parts and associated
parameters:
1.
Task-specific parameters (which only benefit from the examples of their task
to achieve good generalization). These are the upper layers of the neural
network in figure 7.2.
2.
Generic parameters, shared across all the tasks (which benefit from the
pooled data of all the tasks). These are the lower layers of the neural network
in figure 7.2.
Improved generalization and generalization error bounds (Baxter, 1995) can be
achieved because of the shared parameters, for which statistical strength can be
greatly improved (in proportion with the increased number of examples for the
shared parameters, compared to the scenario of single-task models). Of course this
will happen only if some assumptions about the statistical relationship between
the different tasks are valid, meaning that there is something shared across some
of the tasks.
From the point of view of deep learning, the underlying prior belief is the
following: among the factors that explain the variations observed in the data
associated with the different tasks, some are shared across two or more tasks.
7.8 Early Stopping
When training large models with sufficient representational capacity to overfit
the task, we often observe that training error decreases steadily over time, but
Figure 7.2: Multitask learning can be cast in several ways in deep learning frameworks, and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters (associated respectively with the weights into and from h^(1) and h^(2)) can be learned on top of those yielding a shared representation h^(shared). The underlying assumption is that there exists a common pool of factors that explain the variations in the input x, while each task is associated with a subset of these factors. In this example, it is additionally assumed that top-level hidden units h^(1) and h^(2) are specialized to each task (respectively predicting y^(1) and y^(2)), while some intermediate-level representation h^(shared) is shared across all tasks. In the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks (h^(3)): these are the factors that explain some of the input variations but are not relevant for predicting y^(1) or y^(2).
[Figure 7.3 plot: loss (negative log-likelihood) versus time (epochs, 0–250), showing the training set loss and the validation set loss.]
Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over
time (indicated as number of training iterations over the dataset, or
epochs
). In this
example, we train a maxout network on MNIST. Observe that the training objective
decreases consistently over time, but the validation set average loss eventually begins to
increase again, forming an asymmetric U-shaped curve.
validation set error begins to rise again. See figure 7.3 for an example of this
behavior, which occurs reliably.
This means we can obtain a model with better validation set error (and thus,
hopefully better test set error) by returning to the parameter setting at the point in
time with the lowest validation set error. Every time the error on the validation set
improves, we store a copy of the model parameters. When the training algorithm
terminates, we return these parameters, rather than the latest parameters. The
algorithm terminates when no parameters have improved over the best recorded
validation error for some pre-specified number of iterations. This procedure is
specified more formally in algorithm 7.1.
This strategy is known as
early stopping
. It is probably the most commonly
used form of regularization in deep learning. Its popularity is due to both its
effectiveness and its simplicity.
One way to think of early stopping is as a very efficient hyperparameter selection
algorithm. In this view, the number of training steps is just another hyperparameter.
We can see in figure 7.3 that this hyperparameter has a U-shaped validation set
performance curve. Most hyperparameters that control model capacity have such a
U-shaped validation set performance curve, as illustrated in figure 5.3. In the case of
early stopping, we are controlling the effective capacity of the model by determining
how many steps it can take to fit the training set. Most hyperparameters must be
chosen using an expensive guess and check process, where we set a hyperparameter
at the start of training, then run training for several steps to see its effect. The
“training time” hyperparameter is unique in that by definition, a single run of
training tries out many values of the hyperparameter. The only significant cost
to choosing this hyperparameter automatically via early stopping is running the
validation set evaluation periodically during training. Ideally, this is done in
parallel to the training process on a separate machine, separate CPU, or separate
GPU from the main training process. If such resources are not available, then the
cost of these periodic evaluations may be reduced by using a validation set that is
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

Let n be the number of steps between evaluations.
Let p be the “patience,” the number of times to observe worsening validation set error before giving up.
Let θₒ be the initial parameters.
θ ← θₒ
i ← 0
j ← 0
v ← ∞
θ∗ ← θ
i∗ ← i
while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v′ ← ValidationSetError(θ)
    if v′ < v then
        j ← 0
        θ∗ ← θ
        i∗ ← i
        v ← v′
    else
        j ← j + 1
    end if
end while
Best parameters are θ∗, best number of training steps is i∗.
small compared to the training set or by evaluating the validation set error less
frequently and obtaining a lower-resolution estimate of the optimal training time.
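A minimal Python sketch of the patience-based loop in algorithm 7.1, assuming hypothetical train_for_n_steps and validation_error helpers supplied by the surrounding training code:

```python
import copy

def early_stopping(theta0, train_for_n_steps, validation_error, n=100, patience=5):
    """Patience-based early stopping (a sketch of algorithm 7.1).

    `train_for_n_steps(theta, n)` and `validation_error(theta)` are assumed
    helpers; theta is assumed to be a copyable parameter object.
    """
    theta, i, j = copy.deepcopy(theta0), 0, 0
    best_v, best_theta, best_i = float("inf"), copy.deepcopy(theta0), 0
    while j < patience:
        theta = train_for_n_steps(theta, n)
        i += n
        v = validation_error(theta)
        if v < best_v:
            best_v, best_theta, best_i, j = v, copy.deepcopy(theta), i, 0
        else:
            j += 1
    return best_theta, best_i
```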
An additional cost to early stopping is the need to maintain a copy of the
best parameters. This cost is generally negligible, because it is acceptable to store
these parameters in a slower and larger form of memory (for example, training in
GPU memory, but storing the optimal parameters in host memory or on a disk
drive). Since the best parameters are written to infrequently and never read during
training, these occasional slow writes have little effect on the total training time.
Early stopping is an unobtrusive form of regularization, in that it requires
almost no change in the underlying training procedure, the objective function,
or the set of allowable parameter values. This means that it is easy to use early
stopping without damaging the learning dynamics. This is in contrast to weight
decay, where one must be careful not to use too much weight decay and trap the
network in a bad local minimum corresponding to a solution with pathologically
small weights.
Early stopping may be used either alone or in conjunction with other regulariza-
tion strategies. Even when using regularization strategies that modify the objective
function to encourage better generalization, it is rare for the best generalization to
occur at a local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not
fed to the model. To best exploit this extra data, one can perform extra training
after the initial training with early stopping has completed. In the second, extra
training step, all the training data is included. There are two basic strategies one
can use for this second training procedure.
One strategy (algorithm 7.2) is to initialize the model again and retrain on all
the data. In this second training pass, we train for the same number of steps as
the early stopping procedure determined was optimal in the first pass. There are
some subtleties associated with this procedure. For example, there is not a good
way of knowing whether to retrain for the same number of parameter updates or
the same number of passes through the dataset. On the second round of training,
each pass through the dataset will require more parameter updates because the
training set is bigger.
Another strategy for using all the data is to keep the parameters obtained from
the first round of training and then continue training, but now using all the data.
At this stage, we now no longer have a guide for when to stop in terms of a number
of steps. Instead, we can monitor the average loss function on the validation set
and continue training until it falls below the value of the training set objective at
which the early stopping procedure halted. This strategy avoids the high cost of
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i∗, the optimal number of steps.
Set θ to random values again.
Train on X^(train) and y^(train) for i∗ steps.
Algorithm 7.3 Meta-algorithm using early stopping to determine at what objective value we start to overfit, then continue training until that value is reached.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
ε ← J(θ, X^(subtrain), y^(subtrain))
while J(θ, X^(valid), y^(valid)) > ε do
    Train on X^(train) and y^(train) for n steps.
end while
retraining the model from scratch but is not as well behaved. For example, the
objective on the validation set may not ever reach the target value, so this strategy
is not even guaranteed to terminate. This procedure is presented more formally in
algorithm 7.3.
Early stopping is also useful because it reduces the computational cost of the
training procedure. Besides the obvious reduction in cost due to limiting the number
of training iterations, it also has the benefit of providing regularization without
requiring the addition of penalty terms to the cost function or the computation of
the gradients of such additional terms.
How early stopping acts as a regularizer:
So far we have stated that early
stopping is a regularization strategy, but we have supported this claim only by
showing learning curves where the validation set error has a U-shaped curve. What
Figure 7.4: An illustration of the effect of early stopping. (Left) The solid contour lines indicate the contours of the negative log-likelihood. The dashed line indicates the trajectory taken by SGD beginning from the origin. Rather than stopping at the point w∗ that minimizes the cost, early stopping results in the trajectory stopping at an earlier point w̃. (Right) An illustration of the effect of L² regularization for comparison. The dashed circles indicate the contours of the L² penalty, which causes the minimum of the total cost to lie nearer the origin than the minimum of the unregularized cost.
is the actual mechanism by which early stopping regularizes the model? Bishop (1995a) and Sjöberg and Ljung (1995) argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θₒ, as illustrated in figure 7.4. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) with learning rate ε. We can view the product ετ as a measure of effective capacity. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θₒ. In this sense, ετ behaves as if it were the reciprocal of the coefficient used for weight decay.

Indeed, we can show how—in the case of a simple linear model with a quadratic error function and simple gradient descent—early stopping is equivalent to L² regularization.
To compare with classical L² regularization, we examine a simple setting where the only parameters are linear weights (θ = w). We can model the cost function J with a quadratic approximation in the neighborhood of the empirically optimal value of the weights w∗:

Ĵ(θ) = J(w∗) + ½(w − w∗)⊤H(w − w∗),    (7.33)

where H is the Hessian matrix of J with respect to w evaluated at w∗. Given the assumption that w∗ is a minimum of J(w), we know that H is positive semidefinite.
Under a local Taylor series approximation, the gradient is given by

∇_w Ĵ(w) = H(w − w∗).    (7.34)

We are going to study the trajectory followed by the parameter vector during training. For simplicity, let us set the initial parameter vector to the origin,³ w^(0) = 0. Let us study the approximate behavior of gradient descent on J by analyzing gradient descent on Ĵ:

w^(τ) = w^(τ−1) − ε∇_w Ĵ(w^(τ−1))    (7.35)
      = w^(τ−1) − εH(w^(τ−1) − w∗),    (7.36)
w^(τ) − w∗ = (I − εH)(w^(τ−1) − w∗).    (7.37)

Let us now rewrite this expression in the space of the eigenvectors of H, exploiting the eigendecomposition of H: H = QΛQ⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors.

w^(τ) − w∗ = (I − εQΛQ⊤)(w^(τ−1) − w∗)    (7.38)
Q⊤(w^(τ) − w∗) = (I − εΛ)Q⊤(w^(τ−1) − w∗)    (7.39)

Assuming that w^(0) = 0 and that ε is chosen to be small enough to guarantee |1 − ελᵢ| < 1, the parameter trajectory during training after τ parameter updates is as follows:

Q⊤w^(τ) = [I − (I − εΛ)^τ]Q⊤w∗.    (7.40)
Now, the expression for Q⊤w̃ in equation 7.13 for L² regularization can be rearranged as

Q⊤w̃ = (Λ + αI)⁻¹ΛQ⊤w∗,    (7.41)
Q⊤w̃ = [I − (Λ + αI)⁻¹α]Q⊤w∗.    (7.42)

Comparing equation 7.40 and equation 7.42, we see that if the hyperparameters ε, α, and τ are chosen such that

(I − εΛ)^τ = (Λ + αI)⁻¹α,    (7.43)

then L² regularization and early stopping can be seen as equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all λᵢ are small (that is, ελᵢ ≪ 1 and λᵢ/α ≪ 1) then

τ ≈ 1/(εα),    (7.44)
α ≈ 1/(τε).    (7.45)

That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L² regularization parameter, and the inverse of ετ plays the role of the weight decay coefficient.

³For neural networks, to obtain symmetry breaking between hidden units, we cannot initialize all the parameters to 0, as discussed in section 6.2. However, the argument holds for any other initial value w^(0).
Parameter values corresponding to directions of significant curvature (of the
objective function) are regularized less than directions of less curvature. Of course,
in the context of early stopping, this really means that parameters that correspond
to directions of significant curvature tend to learn early relative to parameters
corresponding to directions of less curvature.
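As a quick numerical check of this correspondence (purely illustrative, with hand-picked small eigenvalues), one can compare the τ-step gradient descent factor of equation 7.40 with the weight-decay shrinkage factor of equation 7.13 when α ≈ 1/(τε):

```python
import numpy as np

lam = np.array([0.001, 0.002, 0.005])   # small eigenvalues of H
w_star_q = np.array([1.0, -2.0, 0.5])   # Q^T w* in the eigenbasis
eps, tau = 0.1, 200
alpha = 1.0 / (tau * eps)

early_stop   = (1 - (1 - eps * lam) ** tau) * w_star_q   # equation 7.40
weight_decay = lam / (lam + alpha) * w_star_q            # equation 7.13

print(early_stop)
print(weight_decay)   # close, though not identical, when eps*lam and lam/alpha are small
```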
The derivations in this section have shown that a trajectory of length
τ
ends
at a point that corresponds to a minimum of the
L
2
-regularized objective. Early
stopping is of course more than the mere restriction of the trajectory length;
instead, early stopping typically involves monitoring the validation set error in
order to stop the trajectory at a particularly good point in space. Early stopping
therefore has the advantage over weight decay in that it automatically determines
the correct amount of regularization while weight decay requires many training
experiments with different values of its hyperparameter.
7.9 Parameter Tying and Parameter Sharing
Thus far, in this chapter, when we have discussed adding constraints or penalties
to the parameters, we have always done so with respect to a fixed region or point.
For example,
L
2
regularization (or weight decay) penalizes model parameters for
deviating from the fixed value of zero. Sometimes, however, we may need other
ways to express our prior knowledge about suitable values of the model parameters.
Sometimes we might not know precisely what values the parameters should take,
but we know, from knowledge of the domain and model architecture, that there
should be some dependencies between the model parameters.
A common type of dependency that we often want to express is that certain
parameters should be close to one another. Consider the following scenario:
we have two models performing the same classification task (with the same set of classes) but with somewhat different input distributions. Formally, we have model A with parameters w^(A) and model B with parameters w^(B). The two models map the input to two different but related outputs: ŷ^(A) = f(w^(A), x) and ŷ^(B) = g(w^(B), x).
Let us imagine that the tasks are similar enough (perhaps with similar input and output distributions) that we believe the model parameters should be close to each other: ∀i, w_i^(A) should be close to w_i^(B). We can leverage this information through regularization. Specifically, we can use a parameter norm penalty of the form Ω(w^(A), w^(B)) = ‖w^(A) − w^(B)‖_2^2. Here we used an L^2 penalty, but other choices are also possible.
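As a minimal sketch added here (not the book's code; the function names are illustrative), the penalty above and its gradients can be computed directly from the two weight vectors and added to each model's task loss:

```python
import numpy as np

def tied_penalty(w_a, w_b, alpha=0.1):
    """Omega(w_a, w_b) = alpha * ||w_a - w_b||_2^2, pulling the two models together."""
    return alpha * np.sum((w_a - w_b) ** 2)

def tied_penalty_grads(w_a, w_b, alpha=0.1):
    """Gradients of the penalty with respect to each model's parameters."""
    diff = w_a - w_b
    return 2 * alpha * diff, -2 * alpha * diff

w_a = np.array([1.0, 2.0, 3.0])
w_b = np.array([1.5, 1.0, 3.5])
print(tied_penalty(w_a, w_b))
# In training, these gradients would be added to each model's task-loss gradients.
```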
This kind of approach was proposed by Lasserre et al. (2006), who regularized
the parameters of one model, trained as a classifier in a supervised paradigm, to
be close to the parameters of another model, trained in an unsupervised paradigm
(to capture the distribution of the observed input data). The architectures were
constructed such that many of the parameters in the classifier model could be
paired to corresponding parameters in the unsupervised model.
While a parameter norm penalty is one way to regularize parameters to be
close to one another, the more popular way is to use constraints: to force sets
of parameters to be equal. This method of regularization is often referred to as parameter sharing, because we interpret the various models or model components
as sharing a unique set of parameters. A significant advantage of parameter sharing
over regularizing the parameters to be close (via a norm penalty) is that only a
subset of the parameters (the unique set) needs to be stored in memory. In certain
models—such as the convolutional neural network—this can lead to significant
reduction in the memory footprint of the model.
7.9.1 Convolutional Neural Networks
By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision.
Natural images have many statistical properties that are invariant to translation.
For example, a photo of a cat remains a photo of a cat if it is translated one pixel
to the right. CNNs take this property into account by sharing parameters across
multiple image locations. The same feature (a hidden unit with the same weights)
is computed over different locations in the input. This means that we can find a cat with the same cat detector whether the cat appears at column i or column i + 1 in the image.
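As a rough illustration added here (1-D rather than 2-D images, and not code from the book), the same small filter is applied at every position of the input, so the detector's parameter count is the filter size rather than one weight per location:

```python
import numpy as np

def conv1d_valid(x, w):
    """Apply the same filter w at every position of x (shared parameters)."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

x = np.array([0., 0., 1., 0., 0., 1., 0.])   # the same pattern at two locations
w = np.array([0., 1., 0.])                   # one detector: 3 parameters in total
print(conv1d_valid(x, w))                    # responds wherever the pattern appears
```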
Parameter sharing has enabled CNNs to dramatically lower the number of
unique model parameters and to significantly increase network sizes without
requiring a corresponding increase in training data. It remains one of the best
examples of how to effectively incorporate domain knowledge into the network
architecture.
CNNs are discussed in more detail in chapter 9.
7.10 Sparse Representations
Weight decay acts by placing a penalty directly on the model parameters. Another
strategy is to place a penalty on the activations of the units in a neural network,
encouraging their activations to be sparse. This indirectly imposes a complicated
penalty on the model parameters.
We have already discussed (in section 7.1.2) how L^1 penalization induces
a sparse parametrization—meaning that many of the parameters become zero
(or close to zero). Representational sparsity, on the other hand, describes a
representation where many of the elements of the representation are zero (or close
to zero). A simplified view of this distinction can be illustrated in the context of
linear regression:
[ 18 ]   [ 4  0  0 −2  0  0 ] [  2 ]
[  5 ]   [ 0  0 −1  0  3  0 ] [  3 ]
[ 15 ] = [ 0  5  0  0  0  0 ] [ −2 ]    (7.46)
[ −9 ]   [ 1  0  0 −1  0 −4 ] [ −5 ]
[ −3 ]   [ 1  0  0  0 −5  0 ] [  1 ]
                              [  4 ]
 y ∈ R^m       A ∈ R^(m×n)     x ∈ R^n

[ −14 ]   [  3 −1  2 −5  4  1 ] [  0 ]
[   1 ]   [  4  2 −3 −1  1  3 ] [  2 ]
[  19 ] = [ −1  5  4  2 −3 −2 ] [  0 ]    (7.47)
[   2 ]   [  3  1  2 −3  0 −3 ] [  0 ]
[  23 ]   [ −5  4 −2  2 −5 −1 ] [ −3 ]
                                [  0 ]
 y ∈ R^m        B ∈ R^(m×n)      h ∈ R^n
In the first expression, we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.
Representational regularization is accomplished by the same sorts of mechanisms
that we have used in parameter regularization.
Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation. This penalty is denoted Ω(h). As before, we denote the regularized loss function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(h),    (7.48)

where α ∈ [0, ∞) weights the relative contribution of the norm penalty term, with larger values of α corresponding to more regularization.
Just as an L^1 penalty on the parameters induces parameter sparsity, an L^1 penalty on the elements of the representation induces representational sparsity: Ω(h) = ‖h‖_1 = Σ_i |h_i|. Of course, the L^1 penalty is only one choice of penalty that can result in a sparse representation. Others include the penalty derived from a Student t prior on the representation (Olshausen and Field, 1996; Bergstra, 2011) and KL divergence penalties (Larochelle and Bengio, 2008), which are especially useful for representations with elements constrained to lie on the unit interval. Lee et al. (2008) and Goodfellow et al. (2009) both provide examples of strategies based on regularizing the average activation across several examples, (1/m) Σ_i h^(i), to be near some target value, such as a vector with .01 for each entry.
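A minimal sketch added here (illustrative assumptions throughout, not the book's code) of attaching an L^1 sparsity penalty on a hidden representation to a simple squared-error task loss:

```python
import numpy as np

def sparse_penalized_loss(x, y, W, V, alpha=0.01):
    """Task loss plus an L1 penalty on the hidden representation h."""
    h = np.maximum(0.0, W @ x)                      # hidden representation (ReLU)
    y_hat = V @ h                                   # linear output layer
    task_loss = 0.5 * np.sum((y_hat - y) ** 2)
    sparsity_penalty = alpha * np.sum(np.abs(h))    # Omega(h) = ||h||_1
    return task_loss + sparsity_penalty

rng = np.random.default_rng(0)
W, V = rng.normal(size=(8, 4)), rng.normal(size=(2, 8))
x, y = rng.normal(size=4), rng.normal(size=2)
print(sparse_penalized_loss(x, y, W, V))
```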
Other approaches obtain representational sparsity with a hard constraint on the activation values. For example, orthogonal matching pursuit (Pati et al., 1993) encodes an input x with the representation h that solves the constrained optimization problem

arg min_{h, ‖h‖_0 < k} ‖x − Wh‖²,    (7.49)

where ‖h‖_0 is the number of nonzero entries of h. This problem can be solved efficiently when W is constrained to be orthogonal. This method is often called OMP-k, with the value of k specified to indicate the number of nonzero features allowed. Coates and Ng (2011) demonstrated that OMP-1 can be a very effective feature extractor for deep architectures.
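As a rough illustration added here (an assumption of this presentation, not the original text): when W is orthogonal, ‖x − Wh‖² = ‖W^⊤x − h‖², so OMP-k reduces to keeping the k largest-magnitude coefficients of W^⊤x:

```python
import numpy as np

def omp_k_orthogonal(x, W, k):
    """k-sparse code for x when W is orthogonal: keep the k largest coefficients."""
    c = W.T @ x                          # since ||x - W h||^2 = ||W^T x - h||^2
    h = np.zeros_like(c)
    keep = np.argsort(np.abs(c))[-k:]    # indices of the k largest-magnitude entries
    h[keep] = c[keep]
    return h

W = np.linalg.qr(np.random.default_rng(0).normal(size=(6, 6)))[0]   # orthogonal dictionary
x = np.random.default_rng(1).normal(size=6)
print(omp_k_orthogonal(x, W, k=1))       # OMP-1 representation of x
```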
Essentially any model that has hidden units can be made sparse. Throughout
this book, we see many examples of sparsity regularization used in various contexts.
7.11 Bagging and Other Ensemble Methods
Bagging
(short for
bootstrap aggregating
) is a technique for reducing general-
ization error by combining several models (Breiman, 1994). The idea is to train
several different models separately, then have all the models vote on the output for
test examples. This is an example of a general strategy in machine learning called
model averaging
. Techniques employing this strategy are known as
ensemble
methods.
The reason that model averaging works is that different models will usually
not make all the same errors on the test set.
Consider for example a set of k regression models. Suppose that each model makes an error ε_i on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances E[ε_i²] = v and covariances E[ε_i ε_j] = c. Then the error made by the average prediction of all the ensemble models is (1/k) Σ_i ε_i. The expected squared error of the ensemble predictor is

E[ ((1/k) Σ_i ε_i)² ] = (1/k²) E[ Σ_i ( ε_i² + Σ_{j≠i} ε_i ε_j ) ]    (7.50)
                      = (1/k) v + ((k − 1)/k) c.    (7.51)
In the case where the errors are perfectly correlated and c = v, the mean squared error reduces to v, so the model averaging does not help at all. In the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only (1/k) v. This means that the expected squared error of the ensemble is inversely proportional to the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than its members.
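The following sketch (a numerical check added here, not part of the text) verifies equation 7.51 by simulation: perfectly correlated errors leave the ensemble error near v, while independent errors shrink it toward v/k.

```python
import numpy as np

def ensemble_mse(k, v, c, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[(mean of k correlated errors)^2]."""
    rng = np.random.default_rng(seed)
    cov = np.full((k, k), c) + np.eye(k) * (v - c)   # variance v, covariance c
    errors = rng.multivariate_normal(np.zeros(k), cov, size=n_samples)
    return np.mean(errors.mean(axis=1) ** 2)

k, v = 10, 1.0
print(ensemble_mse(k, v, c=v))      # ~ v: perfectly correlated errors, no benefit
print(ensemble_mse(k, v, c=0.0))    # ~ v/k: independent errors
print(v / k + (k - 1) / k * 0.3)    # prediction of equation 7.51 for c = 0.3
print(ensemble_mse(k, v, c=0.3))
```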
Different ensemble methods construct the ensemble of models in different ways.
For example, each member of the ensemble could be formed by training a completely
different kind of model using a different algorithm or objective function. Bagging
is a method that allows the same kind of model, training algorithm and objective
function to be reused several times.
Specifically, bagging involves constructing
k
different datasets. Each dataset
has the same number of examples as the original dataset, but each dataset is
constructed by sampling with replacement from the original dataset. This means
that, with high probability, each dataset is missing some of the examples from the
[Figure 7.5 panels: the original dataset, a first resampled dataset used to train the first ensemble member, and a second resampled dataset used to train the second ensemble member.]
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an 8 detector on
the dataset depicted above, containing an 8, a 6 and a 9. Suppose we make two different
resampled datasets. The bagging training procedure is to construct each of these datasets
by sampling with replacement. The first dataset omits the 9 and repeats the 8. On this
dataset, the detector learns that a loop on top of the digit corresponds to an 8. On the
second dataset, we repeat the 9 and omit the 6. In this case, the detector learns that a
loop on the bottom of the digit corresponds to an 8. Each of these individual classification
rules is brittle, but if we average their output, then the detector is robust, achieving
maximal confidence only when both loops of the 8 are present.
original dataset and contains several duplicate examples.^4 Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models. See figure 7.5 for an example.
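A minimal sketch added here (not the book's code) of constructing the resampled datasets; it also checks the fraction of omitted examples discussed in footnote 4:

```python
import numpy as np

def bootstrap_datasets(X, y, k, seed=0):
    """Construct k resampled datasets by sampling with replacement."""
    rng = np.random.default_rng(seed)
    m = len(X)
    for _ in range(k):
        idx = rng.integers(0, m, size=m)   # m draws with replacement
        yield X[idx], y[idx]

m = 10_000
X, y = np.arange(m), np.arange(m)
for X_i, y_i in bootstrap_datasets(X, y, k=1):
    missing = 1 - len(np.unique(X_i)) / m
    print(missing)                          # close to 1/e ~ 0.368 for large m
```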
Neural networks reach a wide enough variety of solution points that they can
often benefit from model averaging even if all the models are trained on the same
dataset. Differences in random initialization, in random selection of minibatches,
in hyperparameters, or in outcomes of nondeterministic implementations of neural
networks are often enough to cause different members of the ensemble to make
partially independent errors.
Model averaging is an extremely powerful and reliable method for reducing
generalization error. Its use is usually discouraged when benchmarking algorithms
for scientific papers, because any machine learning algorithm can benefit substan-
tially from model averaging at the price of increased computation and memory.
^4 When both the original and the resampled dataset contain m examples, the exact proportion of examples missing from the new dataset is (1 − 1/m)^m. This is the chance that a particular example is not chosen among the m possible source examples for all m draws used to create the new dataset. As m approaches infinity, this quantity converges to 1/e, which is slightly larger than 1/3.
For this reason, benchmark comparisons are usually made using a single model.
Machine learning contests are usually won by methods using model averag-
ing over dozens of models. A recent prominent example is the Netflix Grand
Prize (Koren, 2009).
Not all techniques for constructing ensembles are designed to make the ensemble
more regularized than the individual models. For example, a technique called
boosting
(Freund and Schapire, 1996b,a) constructs an ensemble with higher
capacity than the individual models. Boosting has been applied to build ensembles
of neural networks (Schwenk and Bengio, 1998) by incrementally adding neural
networks to the ensemble. Boosting has also been applied by interpreting an individual neural network as an ensemble (Bengio et al., 2006a), incrementally adding hidden units to the network.
7.12 Dropout
Dropout
(Srivastava et al., 2014) provides a computationally inexpensive but
powerful method of regularizing a broad family of models. To a first approximation,
dropout can be thought of as a method of making bagging practical for ensembles
of very many large neural networks. Bagging involves training multiple models
and evaluating multiple models on each test example. This seems impractical
when each model is a large neural network, since training and evaluating such
networks is costly in terms of runtime and memory. It is common to use ensembles
of five to ten neural networks—Szegedy et al. (2014a) used six to win the ILSVRC—
but more than this rapidly becomes unwieldy. Dropout provides an inexpensive
approximation to training and evaluating a bagged ensemble of exponentially many
neural networks.
Specifically, dropout trains the ensemble consisting of all subnetworks that
can be formed by removing nonoutput units from an underlying base network, as
illustrated in figure 7.6. In most modern neural networks, based on a series of
affine transformations and nonlinearities, we can effectively remove a unit from a
network by multiplying its output value by zero. This procedure requires some
slight modification for models such as radial basis function networks, which take
the difference between the unit’s state and some reference value. Here, we present
the dropout algorithm in terms of multiplication by zero for simplicity, but it can
be trivially modified to work with other operations that remove a unit from the
network.
Recall that to learn with bagging, we define
k
different models, construct
k
[Figure 7.6 panels: the base network and the ensemble of sixteen subnetworks formed by dropping out subsets of its units.]
Figure 7.6: Dropout trains an ensemble consisting of all subnetworks that can be con-
structed by removing nonoutput units from an underlying base network. Here, we begin
with a base network with two visible units and two hidden units. There are sixteen
possible subsets of these four units. We show all sixteen subnetworks that may be formed
by dropping out different subsets of units from the original network. In this small example,
a large proportion of the resulting networks have no input units or no path connecting
the input to the output. This problem becomes insignificant for networks with wider
layers, where the probability of dropping all possible paths from inputs to outputs becomes
smaller.
different datasets by sampling from the training set with replacement, and then
train model
i
on dataset
i
. Dropout aims to approximate this process, but with an
exponentially large number of neural networks. Specifically, to train with dropout,
we use a minibatch-based learning algorithm that makes small steps, such as
stochastic gradient descent. Each time we load an example into a minibatch, we
randomly sample a different binary mask to apply to all the input and hidden
units in the network. The mask for each unit is sampled independently from all
the others. The probability of sampling a mask value of one (causing a unit to be
included) is a hyperparameter fixed before training begins. It is not a function
of the current value of the model parameters or the input example. Typically, an
input unit is included with probability 0.8, and a hidden unit is included with
probability 0.5. We then run forward propagation, back-propagation, and the
learning update as usual. Figure 7.7 illustrates how to run forward propagation
with dropout.
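A minimal sketch added here (illustrative, not the book's code) of one dropout forward pass for a tiny network like the one in figure 7.7: a binary mask is sampled for every input and hidden unit and multiplied into the corresponding activation.

```python
import numpy as np

def dropout_forward(x, W1, W2, rng, p_input=0.8, p_hidden=0.5):
    """One forward pass with dropout masks on the input and hidden units."""
    mu_x = rng.binomial(1, p_input, size=x.shape)    # include each input w.p. 0.8
    h = np.maximum(0.0, W1 @ (mu_x * x))             # hidden layer of the subnetwork
    mu_h = rng.binomial(1, p_hidden, size=h.shape)   # include each hidden unit w.p. 0.5
    return W2 @ (mu_h * h)                           # output of the sampled subnetwork

rng = np.random.default_rng(0)
x = rng.normal(size=2)
W1, W2 = rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
print(dropout_forward(x, W1, W2, rng))
```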
More formally, suppose that a mask vector µ specifies which units to include, and J(θ, µ) defines the cost of the model defined by parameters θ and mask µ. Then dropout training consists of minimizing E_µ J(θ, µ). The expectation contains exponentially many terms, but we can obtain an unbiased estimate of its gradient by sampling values of µ.
Dropout training is not quite the same as bagging training. In the case of
bagging, the models are all independent. In the case of dropout, the models
share parameters, with each model inheriting a different subset of parameters
from the parent neural network. This parameter sharing makes it possible to
represent an exponential number of models with a tractable amount of memory.
In the case of bagging, each model is trained to convergence on its respective
training set. In the case of dropout, typically most models are not explicitly
trained at all—usually, the model is large enough that it would be infeasible to
sample all possible subnetworks within the lifetime of the universe. Instead, a tiny
fraction of the possible subnetworks are each trained for a single step, and the
parameter sharing causes the remaining subnetworks to arrive at good settings of
the parameters. These are the only differences. Beyond these, dropout follows the
bagging algorithm. For example, the training set encountered by each subnetwork
is indeed a subset of the original training set sampled with replacement.
To make a prediction, a bagged ensemble must accumulate votes from all
its members. We refer to this process as
inference
in this context. So far, our
description of bagging and dropout has not required that the model be explicitly
probabilistic. Now, we assume that the model’s role is to output a probability
distribution. In the case of bagging, each model
i
produces a probability distribution
Figure 7.7: An example of forward propagation through a feedforward network using dropout. (Top) In this example, we use a feedforward network with two input units, one hidden layer with two hidden units, and one output unit. (Bottom) To perform forward propagation with dropout, we randomly sample a vector µ with one entry for each input or hidden unit in the network. The entries of µ are binary and are sampled independently from each other. The probability of each entry being 1 is a hyperparameter, usually 0.5 for the hidden layers and 0.8 for the input. Each unit in the network is multiplied by the corresponding mask, and then forward propagation continues through the rest of the network as usual. This is equivalent to randomly selecting one of the subnetworks from figure 7.6 and running forward propagation through it.
p^(i)(y | x). The prediction of the ensemble is given by the arithmetic mean of all these distributions,

(1/k) Σ_{i=1}^{k} p^(i)(y | x).    (7.52)
In the case of dropout, each submodel defined by mask vector µ defines a probability distribution p(y | x, µ). The arithmetic mean over all masks is given by

Σ_µ p(µ) p(y | x, µ),    (7.53)

where p(µ) is the probability distribution that was used to sample µ at training time.
Because this sum includes an exponential number of terms, it is intractable to
evaluate except when the structure of the model permits some form of simplification.
So far, deep neural nets are not known to permit any tractable simplification.
Instead, we can approximate the inference with sampling, by averaging together
the output from many masks. Even 10–20 masks are often sufficient to obtain
good performance.
An even better approach, however, allows us to obtain a good approximation to
the predictions of the entire ensemble, at the cost of only one forward propagation.
To do so, we change to using the geometric mean rather than the arithmetic mean of
the ensemble members’ predicted distributions. Warde-Farley et al. (2014) present
arguments and empirical evidence that the geometric mean performs comparably
to the arithmetic mean in this context.
The geometric mean of multiple probability distributions is not guaranteed to be
a probability distribution. To guarantee that the result is a probability distribution,
we impose the requirement that none of the submodels assigns probability 0 to any
event, and we renormalize the resulting distribution. The unnormalized probability
distribution defined directly by the geometric mean is given by

p̃_ensemble(y | x) = [ ∏_µ p(y | x, µ) ]^(1/2^d),    (7.54)

where d is the number of units that may be dropped. Here we use a uniform distribution over µ to simplify the presentation, but nonuniform distributions are also possible. To make predictions we must renormalize the ensemble:
p_ensemble(y | x) = p̃_ensemble(y | x) / Σ_{y′} p̃_ensemble(y′ | x).    (7.55)
A key insight (Hinton et al., 2012c) involved in dropout is that we can approximate p_ensemble by evaluating p(y | x) in one model: the model with all units, but with the weights going out of unit i multiplied by the probability of including unit i. The motivation for this modification is to capture the right expected value of the output from that unit. We call this approach the weight scaling inference rule.
There is not yet any theoretical argument for the accuracy of this approximate inference rule in deep nonlinear networks, but empirically it performs very well. Because we usually use an inclusion probability of 1/2, the weight scaling rule usually amounts to dividing the weights by 2 at the end of training, and then using the model as usual. Another way to achieve the same result is to multiply the states of the units by 2 during training. Either way, the goal is to make sure that the expected total input to a unit at test time is roughly the same as the expected total input to that unit at train time, even though half the units at train time are missing on average.
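The sketch below (an illustration added here with assumed names, not the book's code) contrasts weight scaling inference with Monte Carlo mask averaging for a small network with a linear output layer; for such a network the two coincide in expectation, while for deep nonlinear networks weight scaling is only an approximation.

```python
import numpy as np

def predict_weight_scaling(x, W1, W2, p_hidden=0.5):
    """One forward pass approximating the ensemble: scale the hidden activations."""
    h = np.maximum(0.0, W1 @ x)
    return W2 @ (p_hidden * h)            # same as multiplying the outgoing weights W2

def predict_monte_carlo(x, W1, W2, n_masks=20, p_hidden=0.5, seed=0):
    """Average the outputs over sampled dropout masks on the hidden layer."""
    rng = np.random.default_rng(seed)
    h = np.maximum(0.0, W1 @ x)
    outs = [W2 @ (rng.binomial(1, p_hidden, size=h.shape) * h) for _ in range(n_masks)]
    return np.mean(outs, axis=0)

rng = np.random.default_rng(1)
x, W1, W2 = rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
print(predict_weight_scaling(x, W1, W2))
print(predict_monte_carlo(x, W1, W2, n_masks=200))
```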
For many classes of models that do not have nonlinear hidden units, the weight
scaling inference rule is exact. For a simple example, consider a softmax regression
classifier with n input variables represented by the vector v:
P(y = y | v) = softmax(W^⊤ v + b)_y.    (7.56)

We can index into the family of submodels by element-wise multiplication of the input with a binary vector d:

P(y = y | v; d) = softmax(W^⊤ (d ⊙ v) + b)_y.    (7.57)
The ensemble predictor is defined by renormalizing the geometric mean over all ensemble members' predictions:

P_ensemble(y = y | v) = P̃_ensemble(y = y | v) / Σ_{y′} P̃_ensemble(y = y′ | v),    (7.58)

where

P̃_ensemble(y = y | v) = [ ∏_{d∈{0,1}^n} P(y = y | v; d) ]^(1/2^n).    (7.59)
To see that the weight scaling rule is exact, we can simplify P̃_ensemble:

P̃_ensemble(y = y | v) = [ ∏_{d∈{0,1}^n} P(y = y | v; d) ]^(1/2^n)    (7.60)
                       = [ ∏_{d∈{0,1}^n} softmax(W^⊤ (d ⊙ v) + b)_y ]^(1/2^n)    (7.61)
                       = [ ∏_{d∈{0,1}^n} exp(W_{y,:}^⊤ (d ⊙ v) + b_y) / Σ_{y′} exp(W_{y′,:}^⊤ (d ⊙ v) + b_{y′}) ]^(1/2^n)    (7.62)
                       = [ ∏_{d∈{0,1}^n} exp(W_{y,:}^⊤ (d ⊙ v) + b_y) ]^(1/2^n) / [ ∏_{d∈{0,1}^n} Σ_{y′} exp(W_{y′,:}^⊤ (d ⊙ v) + b_{y′}) ]^(1/2^n)    (7.63)
Because P̃ will be normalized, we can safely ignore multiplication by factors that are constant with respect to y:

P̃_ensemble(y = y | v) ∝ [ ∏_{d∈{0,1}^n} exp(W_{y,:}^⊤ (d ⊙ v) + b_y) ]^(1/2^n)    (7.64)
                       = exp( (1/2^n) Σ_{d∈{0,1}^n} W_{y,:}^⊤ (d ⊙ v) + b_y )    (7.65)
                       = exp( (1/2) W_{y,:}^⊤ v + b_y ).    (7.66)
Substituting this back into equation 7.58, we obtain a softmax classifier with weights (1/2)W.
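A small numerical check added here (not part of the text) that, for softmax regression, the renormalized geometric mean over all 2^n masks equals a single softmax evaluation with weights W/2:

```python
import numpy as np
from itertools import product

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, n_classes = 4, 3
W, b = rng.normal(size=(n, n_classes)), rng.normal(size=n_classes)
v = rng.normal(size=n)

# Renormalized geometric mean of softmax(W^T (d*v) + b) over all 2^n binary masks d.
log_probs = [np.log(softmax(W.T @ (np.array(d) * v) + b))
             for d in product([0, 1], repeat=n)]
geo = np.exp(np.mean(log_probs, axis=0))
ensemble = geo / geo.sum()

# Weight scaling rule: a single evaluation with weights W / 2.
scaled = softmax((W / 2).T @ v + b)

print(np.allclose(ensemble, scaled))   # True
```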
The weight scaling rule is also exact in other settings, including regression
networks with conditionally normal outputs as well as deep networks that have
hidden layers without nonlinearities. However, the weight scaling rule is only an
approximation for deep models that have nonlinearities. Though the approximation
has not been theoretically characterized, it often works well, empirically. Goodfellow
et al. (2013a) found experimentally that the weight scaling approximation can work
better (in terms of classification accuracy) than Monte Carlo approximations to
the ensemble predictor. This held true even when the Monte Carlo approximation
was allowed to sample up to 1,000 subnetworks. Gal and Ghahramani (2015) found
that some models obtain better classification accuracy using twenty samples and
the Monte Carlo approximation. It appears that the optimal choice of inference
approximation is problem dependent.
Srivastava et al. (2014) showed that dropout is more effective than other
standard computationally inexpensive regularizers, such as weight decay, filter
norm constraints, and sparse activity regularization. Dropout may also be combined
with other forms of regularization to yield a further improvement.
One advantage of dropout is that it is very computationally cheap. Using dropout during training requires only O(n) computation per example per update, to generate n random binary numbers and multiply them by the state. Depending on the implementation, it may also require O(n) memory to store these binary numbers until the back-propagation stage. Running inference in the trained model has the same cost per example as if dropout were not used, though we must pay the cost of dividing the weights by 2 once before beginning to run inference on examples.
Another significant advantage of dropout is that it does not significantly limit
the type of model or training procedure that can be used. It works well with nearly
any model that uses a distributed representation and can be trained with stochastic
gradient descent. This includes feedforward neural networks, probabilistic models
such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent
neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a). Many other
regularization strategies of comparable power impose more severe restrictions on
the architecture of the model.
Though the cost per step of applying dropout to a specific model is negligible,
the cost of using dropout in a complete system can be significant. Because dropout
is a regularization technique, it reduces the effective capacity of a model. To offset
this effect, we must increase the size of the model. Typically the optimal validation
set error is much lower when using dropout, but this comes at the cost of a much
larger model and many more iterations of the training algorithm. For very large
datasets, regularization confers little reduction in generalization error. In these
cases, the computational cost of using dropout and larger models may outweigh
the benefit of regularization.
When extremely few labeled training examples are available, dropout is less
effective. Bayesian neural networks (Neal, 1996) outperform dropout on the
Alternative Splicing Dataset (Xiong et al., 2011), where fewer than 5,000 examples
are available (Srivastava et al., 2014). When additional unlabeled data is available,
unsupervised feature learning can gain an advantage over dropout.
Wager et al. (2013) showed that, when applied to linear regression, dropout is equivalent to L^2 weight decay, with a different weight decay coefficient for each input feature. The magnitude of each feature's weight decay coefficient is determined by its variance. Similar results hold for other linear models. For deep models, dropout is not equivalent to weight decay.
The stochasticity used while training with dropout is not necessary for the
approach’s success. It is just a means of approximating the sum over all submodels.
Wang and Manning (2013) derived analytical approximations to this marginaliza-
tion. Their approximation, known as
fast dropout
, resulted in faster convergence
time due to the reduced stochasticity in the computation of the gradient. This
method can also be applied at test time, as a more principled (but also more
computationally expensive) approximation to the average over all sub-networks
than the weight scaling approximation. Fast dropout has been used to nearly
match the performance of standard dropout on small neural network problems, but
has not yet yielded a significant improvement or been applied to a large problem.
Just as stochasticity is not necessary to achieve the regularizing effect of
dropout, it is also not sufficient. To demonstrate this, Warde-Farley et al. (2014)
designed control experiments using a method called
dropout boosting
, which
they designed to use exactly the same mask noise as traditional dropout but lack
its regularizing effect. Dropout boosting trains the entire ensemble to jointly
maximize the log-likelihood on the training set. In the same sense that traditional
dropout is analogous to bagging, this approach is analogous to boosting. As
intended, experiments with dropout boosting show almost no regularization effect
compared to training the entire network as a single model. This demonstrates that
the interpretation of dropout as bagging has value beyond the interpretation of
dropout as robustness to noise. The regularization effect of the bagged ensemble is
only achieved when the stochastically sampled ensemble members are trained to
perform well independently of each other.
Dropout has inspired other stochastic approaches to training exponentially
large ensembles of models that share weights. DropConnect is a special case of
dropout where each product between a single scalar weight and a single hidden
unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic
pooling is a form of randomized pooling (see section 9.3) for building ensembles
of convolutional networks, with each convolutional network attending to different
spatial locations of each feature map. So far, dropout remains the most widely
used implicit ensemble method.
One of the key insights of dropout is that training a network with stochastic
behavior and making predictions by averaging over multiple stochastic decisions
implements a form of bagging with parameter sharing. Earlier, we described
dropout as bagging an ensemble of models formed by including or excluding units.
Yet this model averaging strategy does not need to be based on inclusion and
exclusion. In principle, any kind of random modification is admissible. In practice,
we must choose modification families that neural networks are able to learn to
resist. Ideally, we should also use model families that allow a fast approximate
inference rule. We can think of any form of modification parametrized by a vector µ as training an ensemble consisting of p(y | x, µ) for all possible values of µ. There is no requirement that µ have a finite number of values. For example, µ can be real valued. Srivastava et al. (2014) showed that multiplying the weights by µ ∼ N(1, I) can outperform dropout based on binary masks. Because E[µ] = 1, the standard network automatically implements approximate inference in the ensemble, without needing any weight scaling.
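A brief sketch added here (an illustration, applied to hidden activations rather than to individual weights for simplicity) of this multiplicative Gaussian-noise variant; because E[µ] = 1, the unmodified network is used directly at test time:

```python
import numpy as np

def gaussian_dropout(h, rng, sigma=0.5, train=True):
    """Multiply activations by mu ~ N(1, sigma^2) at train time; identity at test time."""
    if not train:
        return h                        # E[mu] = 1, so no rescaling is needed
    mu = rng.normal(loc=1.0, scale=sigma, size=h.shape)
    return mu * h

rng = np.random.default_rng(0)
h = np.array([1.0, 2.0, 3.0])
print(gaussian_dropout(h, rng))         # noisy activations during training
print(gaussian_dropout(h, rng, train=False))
```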
So far we have described dropout purely as a means of performing efficient,
approximate bagging. Another view of dropout goes further than this. Dropout
trains not just a bagged ensemble of models, but an ensemble of models that share
hidden units. This means each hidden unit must be able to perform well regardless
of which other hidden units are in the model. Hidden units must be prepared to be
swapped and interchanged between models. Hinton et al. (2012c) were inspired by
an idea from biology: sexual reproduction, which involves swapping genes between
two different organisms, creates evolutionary pressure for genes to become not
just good but readily swapped between different organisms. Such genes and such
features are robust to changes in their environment because they are not able to
incorrectly adapt to unusual features of any one organism or model. Dropout
thus regularizes each hidden unit to be not merely a good feature but a feature
that is good in many contexts. Warde-Farley et al. (2014) compared dropout
training to training of large ensembles and concluded that dropout offers additional
improvements to generalization error beyond those obtained by ensembles of
independent models.
It is important to understand that a large portion of the power of dropout
arises from the fact that the masking noise is applied to the hidden units. This
can be seen as a form of highly intelligent, adaptive destruction of the information
content of the input rather than destruction of the raw values of the input. For
example, if the model learns a hidden unit h_i that detects a face by finding the nose, then dropping h_i corresponds to erasing the information that there is a nose in the image. The model must learn another h_i that either redundantly encodes the presence of a nose or detects the face by another feature, such as the mouth.
Traditional noise injection techniques that add unstructured noise at the input are
not able to randomly erase the information about a nose from an image of a face
unless the magnitude of the noise is so great that nearly all the information in
the image is removed. Destroying extracted features rather than original values
allows the destruction process to make use of all the knowledge about the input
distribution that the model has acquired so far.
Another important aspect of dropout is that the noise is multiplicative. If the noise were additive with fixed scale, then a rectified linear hidden unit h_i with added noise ε could simply learn to have h_i become very large in order to make the added noise ε insignificant by comparison. Multiplicative noise does not allow such a pathological solution to the noise robustness problem.
Another deep learning algorithm, batch normalization, reparametrizes the model
in a way that introduces both additive and multiplicative noise on the hidden
units at training time. The primary purpose of batch normalization is to improve
optimization, but the noise can have a regularizing effect, and sometimes makes
dropout unnecessary. Batch normalization is described further in section 8.7.1.
7.13 Adversarial Training
In many cases, neural networks have begun to reach human performance when
evaluated on an i.i.d. test set. It is natural therefore to wonder whether these
models have obtained a true human-level understanding of these tasks. To probe
the level of understanding a network has of the underlying task, we can search
for examples that the model misclassifies. Szegedy et al. (2014b) found that even
neural networks that perform at human level accuracy have a nearly 100 percent
error rate on examples that are intentionally constructed by using an optimization
procedure to search for an input x′ near a data point x such that the model output is very different at x′. In many cases, x′ can be so similar to x that a
[Figure 7.8 panels: x, classified "panda" with 57.7% confidence; sign(∇_x J(θ, x, y)), classified "nematode" with 8.2% confidence; x + .007 × sign(∇_x J(θ, x, y)), classified "gibbon" with 99.3% confidence.]
Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet
(Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose
elements are equal to the sign of the elements of the gradient of the cost function with
respect to the input, we can change GoogLeNet’s classification of the image. Reproduced
with permission from Goodfellow et al. (2014b).
human observer cannot tell the difference between the original example and the adversarial example, but the network can make highly different predictions. See figure 7.8 for an example.
Adversarial examples have many implications, for example, in computer security,
that are beyond the scope of this chapter. They are interesting in the context of
regularization, however, because one can reduce the error rate on the original i.i.d.
test set via
adversarial training
—training on adversarially perturbed examples
from the training set (Szegedy et al., 2014b; Goodfellow et al., 2014b).
Goodfellow et al. (2014b) showed that one of the primary causes of these
adversarial examples is excessive linearity. Neural networks are built out of
primarily linear building blocks. In some experiments the overall function they
implement proves to be highly linear as a result. These linear functions are easy
to optimize. Unfortunately, the value of a linear function can change very rapidly if it has numerous inputs. If we change each input by ε, then a linear function with weights w can change by as much as ε‖w‖_1, which can be a very large amount if w is high dimensional. Adversarial training discourages this highly sensitive locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data. This can be seen as a way of explicitly introducing a local constancy prior into supervised neural nets.
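A minimal sketch added here (illustrative assumptions throughout; shown for logistic regression, where the input gradient has a simple closed form) of generating a perturbation in the fast-gradient-sign style of figure 7.8:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps=0.007):
    """Perturb x by eps * sign of the gradient of the logistic loss w.r.t. x."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w                 # gradient of -[y log p + (1-y) log(1-p)]
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.normal(size=10), 0.0
x, y = rng.normal(size=10), 1.0
x_adv = fgsm_perturb(x, y, w, b)
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))   # the prediction moves away from y
```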
Adversarial training helps to illustrate the power of using a large function
family in combination with aggressive regularization. Purely linear models, like
logistic regression, are not able to resist adversarial examples because they are
forced to be linear. Neural networks are able to represent functions that can range
from nearly linear to nearly locally constant and thus have the flexibility to capture
linear trends in the training data while still learning to resist local perturbation.
Adversarial examples also provide a means of accomplishing semi-supervised learning. At a point x that is not associated with a label in the dataset, the model itself assigns some label ŷ. The model's label ŷ may not be the true label, but if the model is high quality, then ŷ has a high probability of providing the true label. We can seek an adversarial example x′ that causes the classifier to output a label y′ with y′ ≠ ŷ. Adversarial examples generated using not the true label but a label provided by a trained model are called virtual adversarial examples (Miyato et al., 2015). The classifier may then be trained to assign the same label to x and x′. This encourages the classifier to learn a function that is robust to small changes anywhere along the manifold where the unlabeled data lie.
robust to small changes anywhere along the manifold where the unlabeled data lie.
The assumption motivating this approach is that different classes usually lie on
disconnected manifolds, and a small perturbation should not be able to jump from
one class manifold to another class manifold.
7.14 Tangent Distance, Tangent Prop and Manifold
Tangent Classifier
Many machine learning algorithms aim to overcome the curse of dimensionality
by assuming that the data lies near a low-dimensional manifold, as described in
section 5.11.3.
One of the early attempts to take advantage of the manifold hypothesis is the
tangent distance
algorithm (Simard et al., 1993, 1998). It is a nonparametric
nearest neighbor algorithm in which the metric used is not the generic Euclidean
distance but one that is derived from knowledge of the manifolds near which
probability concentrates. It is assumed that we are trying to classify examples, and
that examples on the same manifold share the same category. Since the classifier
should be invariant to the local factors of variation that correspond to movement
on the manifold, it would make sense to use as nearest neighbor distance between points x_1 and x_2 the distance between the manifolds M_1 and M_2 to which they respectively belong. Although that may be computationally difficult (it would require solving an optimization problem, to find the nearest pair of points on M_1 and M_2), a cheap alternative that makes sense locally is to approximate M_i by its tangent plane at x_i and measure the distance between the two tangents, or between a tangent plane and a point. That can be achieved by solving a low-dimensional linear system (in the dimension of the manifolds). Of course, this algorithm requires one to specify the tangent vectors.
In a related spirit, the tangent prop algorithm (Simard et al., 1992) (figure 7.9) trains a neural net classifier with an extra penalty to make each output f(x) of the neural net locally invariant to known factors of variation. These factors of variation correspond to movement along the manifold near which examples of the same class concentrate. Local invariance is achieved by requiring ∇_x f(x) to be orthogonal to the known manifold tangent vectors v^(i) at x, or equivalently that the directional derivative of f at x in the directions v^(i) be small by adding a regularization penalty Ω:

Ω(f) = Σ_i ( (∇_x f(x))^⊤ v^(i) )².    (7.67)
This regularizer can of course be scaled by an appropriate hyperparameter, and for
most neural networks, we would need to sum over many outputs rather than the lone
output
f
(
x
) described here for simplicity. As with the tangent distance algorithm,
the tangent vectors are derived a priori, usually from the formal knowledge of
the effect of transformations, such as translation, rotation, and scaling in images.
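A minimal sketch added here (not the book's code; a finite-difference estimate stands in for the analytic gradient) of the tangent prop penalty in equation 7.67:

```python
import numpy as np

def tangent_prop_penalty(f, x, tangents, delta=1e-4):
    """Sum of squared directional derivatives of f along the given tangent vectors."""
    penalty = 0.0
    for v in tangents:
        directional = (f(x + delta * v) - f(x - delta * v)) / (2 * delta)
        penalty += directional ** 2       # ((grad_x f(x))^T v)^2, estimated numerically
    return penalty

# Hypothetical usage with a scalar-output model f and one tangent direction.
f = lambda x: np.tanh(x @ np.array([0.5, -1.0]))
print(tangent_prop_penalty(f, np.array([1.0, 2.0]), [np.array([1.0, 1.0])]))
```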
[Figure 7.9: a plot in the (x_1, x_2) plane showing a tangent vector and a normal vector at a point on a class manifold.]
Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al.,
1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the
classifier output function
f
(
x
). Each curve represents the manifold for a different class,
illustrated here as a one-dimensional manifold embedded in a two-dimensional space.
On one curve, we have chosen a single point and drawn a vector that is tangent to the
class manifold (parallel to and touching the manifold) and a vector that is normal to the
class manifold (orthogonal to the manifold). In multiple dimensions there may be many
tangent directions and many normal directions. We expect the classification function to
change rapidly as it moves in the direction normal to the manifold, and not to change as
it moves along the class manifold. Both tangent propagation and the manifold tangent
classifier regularize
f
(
x
) to not change very much as
x
moves along the manifold. Tangent
propagation requires the user to manually specify functions that compute the tangent
directions (such as specifying that small translations of images remain in the same class
manifold), while the manifold tangent classifier estimates the manifold tangent directions
by training an autoencoder to fit the training data. The use of autoencoders to estimate
manifolds is described in chapter 14.
Tangent prop has been used not just for supervised learning (Simard et al., 1992)
but also in the context of reinforcement learning (Thrun, 1995).
Tangent propagation is closely related to dataset augmentation. In both
cases, the user of the algorithm encodes his or her prior knowledge of the task
by specifying a set of transformations that should not alter the output of the
network. The difference is that in the case of dataset augmentation, the network is
explicitly trained to correctly classify distinct inputs that were created by applying
more than an infinitesimal amount of these transformations. Tangent propagation
does not require explicitly visiting a new input point. Instead, it analytically
regularizes the model to resist perturbation in the directions corresponding to
the specified transformation. While this analytical approach is intellectually
elegant, it has two major drawbacks. First, it only regularizes the model to resist
infinitesimal perturbation. Explicit dataset augmentation confers resistance to
larger perturbations. Second, the infinitesimal approach poses difficulties for models
based on rectified linear units. These models can only shrink their derivatives
by turning units off or shrinking their weights. They are not able to shrink their
derivatives by saturating at a high value with large weights, as sigmoid or
tanh
units can. Dataset augmentation works well with rectified linear units because
different subsets of rectified units can activate for different transformed versions of
each original input.
Tangent propagation is also related to
double backprop
(Drucker and LeCun,
1992) and adversarial training (Szegedy et al., 2014b; Goodfellow et al., 2014b).
Double backprop regularizes the Jacobian to be small, while adversarial training
finds inputs near the original inputs and trains the model to produce the same
output on these as on the original inputs. Tangent propagation and dataset
augmentation using manually specified transformations both require that the
model be invariant to certain specified directions of change in the input. Double
backprop and adversarial training both require that the model should be invariant
to all directions of change in the input as long as the change is small. Just as dataset
augmentation is the non-infinitesimal version of tangent propagation, adversarial
training is the non-infinitesimal version of double backprop.
The manifold tangent classifier (Rifai et al., 2011c) eliminates the need to
know the tangent vectors a priori. As we will see in chapter 14, autoencoders can
estimate the manifold tangent vectors. The manifold tangent classifier makes use
of this technique to avoid needing user-specified tangent vectors. As illustrated
in figure 14.10, these estimated tangent vectors go beyond the classical invariants
that arise out of the geometry of images (such as translation, rotation, and scaling)
and include factors that must be learned because they are object-specific (such as
moving body parts). The algorithm proposed with the manifold tangent classifier
is therefore simple: (1) use an autoencoder to learn the manifold structure by
unsupervised learning, and (2) use these tangents to regularize a neural net classifier
as in tangent prop (equation 7.67).
In this chapter, we have described most of the general strategies used to
regularize neural networks. Regularization is a central theme of machine learning
and as such will be revisited periodically in most of the remaining chapters. Another
central theme of machine learning is optimization, described next.