Chapter 7
Regularization for Deep Learning
A central problem in machine learning is how to make an algorithm that will
perform well not just on the training data, but also on new inputs. Many strategies
used in machine learning are explicitly designed to reduce the test error, possibly
at the expense of increased training error. These strategies are known collectively
as regularization. A great many forms of regularization are available to the deep
learning practitioner. In fact, developing more effective regularization strategies
has been one of the major research efforts in the field.
Chapter 5 introduced the basic concepts of generalization, underfitting, overfit-
ting, bias, variance and regularization. If you are not already familiar with these
notions, please refer to that chapter before continuing with this one.
In this chapter, we describe regularization in more detail, focusing on regular-
ization strategies for deep models or models that may be used as building blocks
to form deep models.
Some sections of this chapter deal with standard concepts in machine learning.
If you are already familiar with these concepts, feel free to skip the relevant
sections. However, most of this chapter is concerned with the extension of these
basic concepts to the particular case of neural networks.
In section 5.2.2, we defined regularization as “any modification we make to a
learning algorithm that is intended to reduce its generalization error but not its
training error.” There are many regularization strategies. Some put extra con-
straints on a machine learning model, such as adding restrictions on the
parameter
values. Some add extra terms in the objective function that can be thought of as
corresponding to a soft constraint on the parameter values. If chosen carefully,
these extra constraints and penalties can lead to improved performance on the
test set. Sometimes these constraints and penalties are designed to encode specific
kinds of prior knowledge. Other times, these constraints and penalties are designed
to express a generic preference for a simpler model class in order to promote
generalization. Sometimes penalties and constraints are necessary to make an
underdetermined problem determined. Other forms of regularization, known as
ensemble methods, combine multiple hypotheses that explain the training data.
In the context of deep learning, most regularization strategies are based on
regularizing estimators. Regularization of an estimator works by trading increased
bias for reduced variance. An effective regularizer is one that makes a profitable
trade, reducing variance significantly while not overly increasing the bias. When we
discussed generalization and overfitting in chapter 5, we focused on three situations,
where the model family being trained either (1) excluded the true data-generating
process—corresponding to underfitting and inducing bias, or (2) matched the true
data-generating process, or (3) included the generating process but also many
other possible generating processes—the overfitting regime where variance rather
than bias dominates the estimation error. The goal of regularization is to take a
model from the third regime into the second regime.
In practice, an overly complex model family does not necessarily include the
target function or the true data-generating process, or even a close approximation
of either. We almost never have access to the true data-generating process so
we can never know for sure if the model family being estimated includes the
generating process or not. Most applications of deep learning algorithms, however,
are to domains where the true data-generating process is almost certainly outside
the model family. Deep learning algorithms are typically applied to extremely
complicated domains such as images, audio sequences and text, for which the true
generation process essentially involves simulating the entire universe. To some
extent, we are always trying to fit a square peg (the data-generating process) into
a round hole (our model family).
What this means is that controlling the complexity of the model is not a
simple matter of finding the model of the right size, with the right number of
parameters. Instead, we might find—and indeed in practical deep learning scenarios,
we almost always do find—that the best fitting model (in the sense of minimizing
generalization error) is a large model that has been regularized appropriately.
We now review several strategies for how to create such a large, deep regularized
model.
7.1 Parameter Norm Penalties
Regularization has been used for decades prior to the advent of deep learning. Linear
models such as linear regression and logistic regression allow simple, straightforward,
and effective regularization strategies.
Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ),    (7.1)
where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term, Ω, relative to the standard objective function J. Setting α to 0 results in no regularization. Larger values of α correspond to more regularization.

When our training algorithm minimizes the regularized objective function J̃, it will decrease both the original objective J on the training data and some measure of the size of the parameters θ (or some subset of the parameters). Different choices for the parameter norm Ω can result in different solutions being preferred. In this section, we discuss the effects of the various norms when used as penalties on the model parameters.
Before delving into the regularization behavior of different norms, we note
that for neural networks, we typically choose to use a parameter norm penalty
that penalizes only the weights of the affine transformation at each layer and
leaves the biases unregularized. The biases typically require less data than the
weights to fit accurately. Each weight specifies how two variables interact. Fitting
the weight well requires observing both variables in a variety of conditions. Each
bias controls only a single variable. This means that we do not induce too much
variance by leaving the biases unregularized. Also, regularizing the bias parameters
can introduce a significant amount of underfitting. We therefore use the vector
w
to indicate all the weights that should be affected by a norm penalty, while
the vector
θ
denotes all the parameters, including both
w
and the unregularized
parameters.
In the context of neural networks, it is sometimes desirable to use a separate
penalty with a different
α
coefficient for each layer of the network. Because it can
be expensive to search for the correct value of multiple hyperparameters, it is still
reasonable to use the same weight decay at all layers just to reduce the size of
search space.
7.1.1 L² Parameter Regularization

We have already seen, in section 5.2.2, one of the simplest and most common kinds of parameter norm penalty: the L² parameter norm penalty commonly known as weight decay. This regularization strategy drives the weights closer to the origin¹ by adding a regularization term Ω(θ) = ½‖w‖₂² to the objective function. In other academic communities, L² regularization is also known as ridge regression or Tikhonov regularization.
We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so θ is just w. Such a model has the following total objective function:

J̃(w; X, y) = (α/2) w⊤w + J(w; X, y),    (7.2)

with the corresponding parameter gradient

∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y).    (7.3)

To take a single gradient step to update the weights, we perform this update (where ε is the learning rate):

w ← w − ε(αw + ∇_w J(w; X, y)).    (7.4)

Written another way, the update is

w ← (1 − εα)w − ε∇_w J(w; X, y).    (7.5)

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step. But what happens over the entire course of training?
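The shrink-then-step view of equation 7.5 is easy to state in code. Below is a minimal sketch, assuming a generic grad_J function that returns ∇_w J(w; X, y); the names and data are illustrative, not from the text.

```python
import numpy as np

def weight_decay_step(w, grad_J, X, y, lr=0.1, alpha=1e-2):
    """One gradient step on the L2-regularized objective (equation 7.5).

    Equivalent views:
      w <- w - lr * (alpha * w + grad_J(w, X, y))       # eq. 7.4
      w <- (1 - lr * alpha) * w - lr * grad_J(w, X, y)  # eq. 7.5
    """
    return (1.0 - lr * alpha) * w - lr * grad_J(w, X, y)

# Example with a linear least-squares objective J(w) = 0.5 * ||Xw - y||^2,
# whose gradient is X^T (Xw - y).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
grad_J = lambda w, X, y: X.T @ (X @ w - y)

w = np.zeros(5)
for _ in range(100):
    w = weight_decay_step(w, grad_J, X, y)
```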
We will further simplify the analysis by making a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, w∗ = arg min_w J(w). If the objective function is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect. The approximation Ĵ is given by

Ĵ(θ) = J(w∗) + ½(w − w∗)⊤H(w − w∗),    (7.6)

where H is the Hessian matrix of J with respect to w evaluated at w∗. There is no first-order term in this quadratic approximation, because w∗ is defined to be a minimum, where the gradient vanishes. Likewise, because w∗ is the location of a minimum of J, we can conclude that H is positive semidefinite.

¹More generally, we could regularize the parameters to be near any specific point in space and, surprisingly, still get a regularization effect, but better results will be obtained for a value closer to the true one, with zero being a default value that makes sense when we do not know if the correct value should be positive or negative. Since it is far more common to regularize the model parameters toward zero, we will focus on this special case in our exposition.
The minimum of Ĵ occurs where its gradient

∇_w Ĵ(w) = H(w − w∗)    (7.7)

is equal to 0.
To study the effect of weight decay, we modify equation 7.7 by adding the weight decay gradient. We can now solve for the minimum of the regularized version of Ĵ. We use the variable w̃ to represent the location of the minimum.

αw̃ + H(w̃ − w∗) = 0    (7.8)
(H + αI)w̃ = Hw∗    (7.9)
w̃ = (H + αI)⁻¹Hw∗    (7.10)
As α approaches 0, the regularized solution w̃ approaches w∗. But what happens as α grows? Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors, Q, such that H = QΛQ⊤. Applying the decomposition to equation 7.10, we obtain

w̃ = (QΛQ⊤ + αI)⁻¹QΛQ⊤w∗    (7.11)
  = [Q(Λ + αI)Q⊤]⁻¹QΛQ⊤w∗    (7.12)
  = Q(Λ + αI)⁻¹ΛQ⊤w∗.    (7.13)

We see that the effect of weight decay is to rescale w∗ along the axes defined by the eigenvectors of H. Specifically, the component of w∗ that is aligned with the i-th eigenvector of H is rescaled by a factor of λᵢ/(λᵢ + α). (You may wish to review how this kind of scaling works, first explained in figure 2.3.)
Along the directions where the eigenvalues of H are relatively large, for example, where λᵢ ≫ α, the effect of regularization is relatively small. Yet components with λᵢ ≪ α will be shrunk to have nearly zero magnitude. This effect is illustrated in figure 7.1.
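To make the rescaling concrete, here is a small numerical sketch (assuming a hand-built quadratic with Hessian H; the values are illustrative) that compares the closed-form regularized minimum (H + αI)⁻¹Hw∗ of equation 7.10 with the per-eigendirection shrinkage factors λᵢ/(λᵢ + α) of equation 7.13:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A.T @ A + np.eye(4)          # a positive definite Hessian
w_star = rng.normal(size=4)      # unregularized minimum w*
alpha = 0.5

# Closed-form minimum of the regularized quadratic (equation 7.10).
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same solution expressed in the eigenbasis of H (equation 7.13):
# each component of Q^T w* is shrunk by lambda_i / (lambda_i + alpha).
lam, Q = np.linalg.eigh(H)
shrink = lam / (lam + alpha)
w_tilde_eig = Q @ (shrink * (Q.T @ w_star))

assert np.allclose(w_tilde, w_tilde_eig)
```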
Figure 7.1: An illustration of the effect of L² (or weight decay) regularization on the value of the optimal w. The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the L² regularizer. At the point w̃, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of J is small. The objective function does not increase much when moving horizontally away from w∗. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis. The regularizer pulls w₁ close to zero. In the second dimension, the objective function is very sensitive to movements away from w∗. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of w₂ relatively little.
Only directions along which the parameters contribute significantly to reducing
the objective function are preserved relatively intact. In directions that do not
contribute to reducing the objective function, a small eigenvalue of the Hessian
tells us that movement in this direction will not significantly increase the gradient.
Components of the weight vector corresponding to such unimportant directions
are decayed away through the use of the regularization throughout training.
So far we have discussed weight decay in terms of its effect on the optimization
of an abstract, general quadratic cost function. How do these effects relate to
machine learning in particular? We can find out by studying linear regression, a
model for which the true cost function is quadratic and therefore amenable to the
same kind of analysis we have used so far. Applying the analysis again, we will
be able to obtain a special case of the same results, but with the solution now
phrased in terms of the training data. For linear regression, the cost function is
the sum of squared errors:
(Xw − y)⊤(Xw − y).    (7.14)
When we add L² regularization, the objective function changes to

(Xw − y)⊤(Xw − y) + ½αw⊤w.    (7.15)

This changes the normal equations for the solution from

w = (X⊤X)⁻¹X⊤y    (7.16)

to

w = (X⊤X + αI)⁻¹X⊤y.    (7.17)

The matrix X⊤X in equation 7.16 is proportional to the covariance matrix (1/m)X⊤X. Using L² regularization replaces this matrix with (X⊤X + αI)⁻¹ in equation 7.17. The new matrix is the same as the original one, but with the addition of α to the diagonal. The diagonal entries of this matrix correspond to the variance of each input feature. We can see that L² regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
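A minimal sketch of the two closed-form solutions (equations 7.16 and 7.17), using numpy; the data here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)
alpha = 1.0

# Ordinary least squares (equation 7.16).
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Weight decay / ridge regression (equation 7.17): alpha is added
# to the diagonal of X^T X before solving.
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

print(np.linalg.norm(w_ridge), "<=", np.linalg.norm(w_ols))
```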
7.1.2 L¹ Regularization

While L² weight decay is the most common form of weight decay, there are other ways to penalize the size of the model parameters. Another option is to use L¹ regularization.
Formally, L¹ regularization on the model parameters w is defined as

Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|,    (7.18)

that is, as the sum of absolute values of the individual parameters.² We will now discuss the effect of L¹ regularization on the simple linear regression model, with no bias parameter, that we studied in our analysis of L² regularization. In particular, we are interested in delineating the differences between L¹ and L² forms of regularization. As with L² weight decay, L¹ weight decay controls the strength of the regularization by scaling the penalty using a positive hyperparameter α. Thus, the regularized objective function J̃(w; X, y) is given by

J̃(w; X, y) = α‖w‖₁ + J(w; X, y),    (7.19)

with the corresponding gradient (actually, subgradient)

∇_w J̃(w; X, y) = α sign(w) + ∇_w J(X, y; w),    (7.20)

where sign(w) is simply the sign of w applied element-wise.
By inspecting equation 7.20, we can see immediately that the effect of L¹ regularization is quite different from that of L² regularization. Specifically, we can see that the regularization contribution to the gradient no longer scales linearly with each wᵢ; instead it is a constant factor with a sign equal to sign(wᵢ). One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J(X, y; w) as we did for L² regularization.
Our simple linear model has a quadratic cost function that we can represent via its Taylor series. Alternately, we could imagine that this is a truncated Taylor series approximating the cost function of a more sophisticated model. The gradient in this setting is given by

∇_w Ĵ(w) = H(w − w∗),    (7.21)

where, again, H is the Hessian matrix of J with respect to w evaluated at w∗. Because the L¹ penalty does not admit clean algebraic expressions in the case of a fully general Hessian, we will also make the further simplifying assumption that the Hessian is diagonal, H = diag([H₁,₁, . . . , Hₙ,ₙ]), where each Hᵢ,ᵢ > 0.
²As with L² regularization, we could regularize the parameters toward a value that is not zero, but instead toward some parameter value w^(o). In that case the L¹ regularization would introduce the term Ω(θ) = ‖w − w^(o)‖₁ = Σᵢ |wᵢ − wᵢ^(o)|.
This assumption holds if the data for the linear regression problem has been
preprocessed to remove all correlation between the input features, which may be
accomplished using PCA.
Our quadratic approximation of the L¹ regularized objective function decomposes into a sum over the parameters:

Ĵ(w; X, y) = J(w∗; X, y) + Σᵢ [ ½Hᵢ,ᵢ(wᵢ − wᵢ∗)² + α|wᵢ| ].    (7.22)

The problem of minimizing this approximate cost function has an analytical solution (for each dimension i), with the following form:

wᵢ = sign(wᵢ∗) max{ |wᵢ∗| − α/Hᵢ,ᵢ , 0 }.    (7.23)
Consider the situation where wᵢ∗ > 0 for all i. There are two possible outcomes:

1. The case where wᵢ∗ ≤ α/Hᵢ,ᵢ. Here the optimal value of wᵢ under the regularized objective is simply wᵢ = 0. This occurs because the contribution of J(w; X, y) to the regularized objective J̃(w; X, y) is overwhelmed—in direction i—by the L¹ regularization, which pushes the value of wᵢ to zero.

2. The case where wᵢ∗ > α/Hᵢ,ᵢ. In this case, the regularization does not move the optimal value of wᵢ to zero but instead just shifts it in that direction by a distance equal to α/Hᵢ,ᵢ.

A similar process happens when wᵢ∗ < 0, but with the L¹ penalty making wᵢ less negative by α/Hᵢ,ᵢ, or 0.
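Equation 7.23 is a per-coordinate soft-thresholding operation. A minimal numpy sketch, assuming the diagonal Hessian entries are available as an array h (the values below are illustrative):

```python
import numpy as np

def l1_shrink(w_star, h, alpha):
    """Minimize 0.5*h_i*(w_i - w*_i)^2 + alpha*|w_i| per coordinate (eq. 7.23).

    Components with |w*_i| <= alpha / h_i are set exactly to zero;
    the rest are shifted toward zero by alpha / h_i.
    """
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h, 0.0)

w_star = np.array([0.8, -0.05, 0.3, -2.0])
h      = np.array([1.0,  1.0,  0.5,  2.0])
print(l1_shrink(w_star, h, alpha=0.2))   # small components become exactly 0
```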
In comparison to L² regularization, L¹ regularization results in a solution that is more sparse. Sparsity in this context refers to the fact that some parameters have an optimal value of zero. The sparsity of L¹ regularization is a qualitatively different behavior than arises with L² regularization. Equation 7.13 gave the solution w̃ for L² regularization. If we revisit that equation using the assumption of a diagonal and positive definite Hessian H that we introduced for our analysis of L¹ regularization, we find that w̃ᵢ = [Hᵢ,ᵢ/(Hᵢ,ᵢ + α)] wᵢ∗. If wᵢ∗ was nonzero, then w̃ᵢ remains nonzero. This demonstrates that L² regularization does not cause the parameters to become sparse, while L¹ regularization may do so for large enough α.
The sparsity property induced by L¹ regularization has been used extensively as a feature selection mechanism. Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used. In particular, the well known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model integrates an L¹ penalty with a linear model and a least-squares cost function. The L¹ penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
In section 5.6.1, we saw that many regularization strategies can be interpreted as MAP Bayesian inference, and that in particular, L² regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights. For L¹ regularization, the penalty αΩ(w) = α Σᵢ |wᵢ| used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution (equation 3.26) over w ∈ ℝⁿ:

log p(w) = Σᵢ log Laplace(wᵢ; 0, 1/α) = −α‖w‖₁ + n log α − n log 2.    (7.24)

From the point of view of learning via maximization with respect to w, we can ignore the log α − log 2 terms because they do not depend on w.
7.2 Norm Penalties as Constrained Optimization
Consider the cost function regularized by a parameter norm penalty:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).    (7.25)

Recall from section 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized Lagrange function

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).    (7.26)

The solution to the constrained problem is given by

θ∗ = arg min_θ max_{α, α≥0} L(θ, α).    (7.27)
As described in section 4.4, solving this problem requires modifying both θ and α. Section 4.5 provides a worked example of linear regression with an L² constraint. Many different procedures are possible—some may use gradient descent, while others may use analytical solutions for where the gradient is zero—but in all procedures α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k. All positive α encourage Ω(θ) to shrink. The optimal value α∗ will encourage Ω(θ) to shrink, but not so strongly to make Ω(θ) become less than k.

To gain some insight into the effect of the constraint, we can fix α∗ and view the problem as just a function of θ:

θ∗ = arg min_θ L(θ, α∗) = arg min_θ J(θ; X, y) + α∗Ω(θ).    (7.28)
This is exactly the same as the regularized training problem of minimizing J̃. We can thus think of a parameter norm penalty as imposing a constraint on the weights. If Ω is the L² norm, then the weights are constrained to lie in an L² ball. If Ω is the L¹ norm, then the weights are constrained to lie in a region of limited L¹ norm. Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient α because the value of α does not directly tell us the value of k. In principle, one can solve for k, but the relationship between k and α depends on the form of J. While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing α in order to grow or shrink the constraint region. Larger α will result in a smaller constraint region. Smaller α will result in a larger constraint region.
Sometimes we may wish to use explicit constraints rather than penalties. As
described in section 4.4, we can modify algorithms such as stochastic gradient
descent to take a step downhill on
J
(
θ
) and then project
θ
back to the nearest
point that satisfies Ω(
θ
)
< k
. This can be useful if we have an idea of what value
of
k
is appropriate and do not want to spend time searching for the value of
α
that
corresponds to this k.
Another reason to use explicit constraints and reprojection rather than enforcing
constraints with penalties is that penalties can cause nonconvex optimization
procedures to get stuck in local minima corresponding to small
θ
. When training
neural networks, this usually manifests as neural networks that train with several
“dead units.” These are units that do not contribute much to the behavior of the
function learned by the network because the weights going into or out of them are
all very small. When training with a penalty on the norm of the weights, these
configurations can be locally optimal, even if it is possible to significantly reduce
J
by making the weights larger. Explicit constraints implemented by reprojection
can work much better in these cases because they do not encourage the weights
to approach the origin. Explicit constraints implemented by reprojection have an
effect only when the weights become large and attempt to leave the constraint
region.
Finally, explicit constraints with reprojection can be useful because they impose
some stability on the optimization procedure. When using high learning rates, it
is possible to encounter a positive feedback loop in which large weights induce
large gradients, which then induce a large update to the weights. If these updates
consistently increase the size of the weights, then
θ
rapidly moves away from
the origin until numerical overflow occurs. Explicit constraints with reprojection
prevent this feedback loop from continuing to increase the magnitude of the weights
without bound. Hinton et al. (2012c) recommend using constraints combined with a
high learning rate to enable rapid exploration of parameter space while maintaining
some stability.
In particular, Hinton et al. (2012c) recommend a strategy introduced by Srebro
and Shraibman (2005): constraining the norm of each column of the weight matrix
of a neural net layer, rather than constraining the Frobenius norm of the entire
weight matrix. Constraining the norm of each column separately prevents any one
hidden unit from having very large weights. If we converted this constraint into a
penalty in a Lagrange function, it would be similar to
L
2
weight decay but with a
separate KKT multiplier for the weights of each hidden unit. Each of these KKT
multipliers would be dynamically updated separately to make each hidden unit
obey the constraint. In practice, column norm limitation is always implemented as
an explicit constraint with reprojection.
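A minimal sketch of this kind of explicit constraint with reprojection: after each gradient step, each column of a layer's weight matrix is rescaled (if necessary) so that its L² norm does not exceed a maximum value. The names and threshold are illustrative, not from the text.

```python
import numpy as np

def project_columns(W, max_norm=3.0):
    """Reproject each column of W onto the L2 ball of radius max_norm.

    Columns whose norm is already within the limit are left untouched, so
    the constraint only acts when a hidden unit's weights try to grow large.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

# Typical use inside a training loop (grad is assumed to be computed elsewhere):
# W = W - lr * grad
# W = project_columns(W, max_norm=3.0)
```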
7.3 Regularization and Under-Constrained Problems
In some cases, regularization is necessary for machine learning problems to be properly defined. Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix X⊤X. This is not possible when X⊤X is singular. This matrix can be singular whenever the data-generating distribution truly has no variance in some direction, or when no variance is observed in some direction because there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting X⊤X + αI instead. This regularized matrix is guaranteed to be invertible.
These linear problems have closed form solutions when the relevant matrix
is invertible. It is also possible for a problem with no closed form solution to be
underdetermined. An example is logistic regression applied to a problem where
the classes are linearly separable. If a weight vector
w
is able to achieve perfect
classification, then 2
w
will also achieve perfect classification and higher likelihood.
An iterative optimization procedure like stochastic gradient descent will continually
increase the magnitude of
w
and, in theory, will never halt. In practice, a numerical
implementation of gradient descent will eventually reach sufficiently large weights
to cause numerical overflow, at which point its behavior will depend on how the
programmer has decided to handle values that are not real numbers.
Most forms of regularization are able to guarantee the convergence of iterative
methods applied to underdetermined problems. For example, weight decay will
cause gradient descent to quit increasing the magnitude of the weights when the
slope of the likelihood is equal to the weight decay coefficient.
The idea of using regularization to solve underdetermined problems extends
beyond machine learning. The same idea is useful for several basic linear algebra
problems.
As we saw in section 2.9, we can solve underdetermined linear equations using the Moore-Penrose pseudoinverse. Recall that one definition of the pseudoinverse X⁺ of a matrix X is

X⁺ = lim_{α→0} (X⊤X + αI)⁻¹X⊤.    (7.29)

We can now recognize equation 7.29 as performing linear regression with weight decay. Specifically, equation 7.29 is the limit of equation 7.17 as the regularization coefficient shrinks to zero. We can thus interpret the pseudoinverse as stabilizing underdetermined problems using regularization.
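The limit in equation 7.29 can be checked numerically: for a small α, the ridge solution of equation 7.17 is close to the pseudoinverse solution. A small illustrative sketch, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))   # fewer examples than features: X^T X is singular
y = rng.normal(size=3)
alpha = 1e-8

w_pinv  = np.linalg.pinv(X) @ y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)

print(np.max(np.abs(w_pinv - w_ridge)))   # tiny for small alpha
```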
7.4 Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on
more data. Of course, in practice, the amount of data we have is limited. One way
to get around this problem is to create fake data and add it to the training set.
For some machine learning tasks, it is reasonably straightforward to create new
fake data.
This approach is easiest for classification. A classifier needs to take a complicat-
ed, high-dimensional input
x
and summarize it with a single category identity
y
.
This means that the main task facing a classifier is to be invariant to a wide variety
of transformations. We can generate new (
x, y
) pairs easily just by transforming
the x inputs in our training set.
This approach is not as readily applicable to many other tasks. For example, it
is difficult to generate new fake data for a density estimation task unless we have
already solved the density estimation problem.
Dataset augmentation has been a particularly effective technique for a specific
classification problem: object recognition. Images are high dimensional and include
an enormous range of factors of variation, many of which can be easily simulated.
Operations like translating the training images a few pixels in each direction can
often greatly improve generalization, even if the model has already been designed to
be partially translation invariant by using the convolution and pooling techniques
described in chapter 9. Many other operations, such as rotating the image or
scaling the image, have also proved quite effective.
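As a small illustration (not from the text), a random shift of a few pixels can be applied directly to image arrays; np.roll is used here purely as a sketch of the idea:

```python
import numpy as np

def random_translate(image, max_shift=2, rng=np.random.default_rng()):
    """Return a copy of `image` shifted by a few pixels in each direction.

    A crude augmentation sketch: pixels that wrap around are ignored here,
    but a real pipeline would pad or crop instead of rolling.
    """
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, shift=(dy, dx), axis=(0, 1))

# Augmenting a batch of 28x28 grayscale images while keeping the labels fixed:
rng = np.random.default_rng(0)
batch = rng.random(size=(8, 28, 28))
augmented = np.stack([random_translate(img, rng=rng) for img in batch])
```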
One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between “b” and “d” and the difference between “6” and “9,” so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks.
There are also transformations that we would like our classifiers to be invariant
to but that are not easy to perform. For example, out-of-plane rotation cannot be
implemented as a simple geometric operation on the input pixels.
Dataset augmentation is effective for speech recognition tasks as well (Jaitly
and Hinton, 2013).
Injecting noise in the input to a neural network (Sietsma and Dow, 1991)
can also be seen as a form of data augmentation. For many classification and
even some regression tasks, the task should still be possible to solve even if small
random noise is added to the input. Neural networks prove not to be very robust
to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness
of neural networks is simply to train them with random noise applied to their
inputs. Input noise injection is part of some unsupervised learning algorithms,
such as the denoising autoencoder (Vincent et al., 2008). Noise injection also
works when the noise is applied to the hidden units, which can be seen as doing
dataset augmentation at multiple levels of abstraction. Poole et al. (2014) recently
showed that this approach can be highly effective provided that the magnitude of
the noise is carefully tuned. Dropout, a powerful regularization strategy that will
be described in section 7.12, can be seen as a process of constructing new inputs
by multiplying by noise.
When comparing machine learning benchmark results, taking the effect of
dataset augmentation into account is important. Often, hand-designed dataset
augmentation schemes can dramatically reduce the generalization error of a ma-
chine learning technique. To compare the performance of one machine learning
algorithm to another, it is necessary to perform controlled experiments. When
comparing machine learning algorithm A and machine learning algorithm B, make
sure that both algorithms are evaluated using the same hand-designed dataset
augmentation schemes. Suppose that algorithm A performs poorly with no dataset
augmentation, and algorithm B performs well when combined with numerous syn-
thetic transformations of the input. In such a case the synthetic transformations
likely caused the improved performance, rather than the use of machine learning
algorithm B. Sometimes deciding whether an experiment has been properly con-
trolled requires subjective judgment. For example, machine learning algorithms
that inject noise into the input are performing a form of dataset augmentation.
Usually, operations that are generally applicable (such as adding Gaussian noise to
the input) are considered part of the machine learning algorithm, while operations
that are specific to one application domain (such as randomly cropping an image)
are considered to be separate preprocessing steps.
7.5 Noise Robustness
Section 7.4 has motivated the use of noise applied to the inputs as a dataset
augmentation strategy. For some models, the addition of noise with infinitesimal
variance at the input of the model is equivalent to imposing a penalty on the
norm of the weights (Bishop, 1995a,b). In the general case, it is important to
remember that noise injection can be much more powerful than simply shrinking
the parameters, especially when the noise is added to the hidden units. Noise
applied to the hidden units is such an important topic that it merits its own
separate discussion; the dropout algorithm described in section 7.12 is the main
development of that approach.
Another way that noise has been used in the service of regularizing models
is by adding it to the weights. This technique has been used primarily in the
context of recurrent neural networks (Jim et al., 1996; Graves, 2011). This can
be interpreted as a stochastic implementation of Bayesian inference over the
weights. The Bayesian treatment of learning would consider the model weights
to be uncertain and representable via a probability distribution that reflects this
uncertainty. Adding noise to the weights is a practical, stochastic way to reflect
this uncertainty.
Noise applied to the weights can also be interpreted as equivalent (under some
assumptions) to a more traditional form of regularization, encouraging stability of
the function to be learned. Consider the regression setting, where we wish to train
a function ŷ(x) that maps a set of features x to a scalar using the least-squares cost function between the model predictions ŷ(x) and the true values y:

J = E_{p(x,y)} [(ŷ(x) − y)²].    (7.30)

The training set consists of m labeled examples {(x⁽¹⁾, y⁽¹⁾), . . . , (x⁽ᵐ⁾, y⁽ᵐ⁾)}.
We now assume that with each input presentation we also include a random perturbation ε_W ∼ N(ε; 0, ηI) of the network weights. Let us imagine that we have a standard l-layer MLP. We denote the perturbed model as ŷ_{ε_W}(x). Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network. The objective function thus becomes

J̃_W = E_{p(x,y,ε_W)} [(ŷ_{ε_W}(x) − y)²]    (7.31)
    = E_{p(x,y,ε_W)} [ŷ²_{ε_W}(x) − 2y ŷ_{ε_W}(x) + y²].    (7.32)

For small η, the minimization of J with added weight noise (with covariance ηI) is equivalent to minimization of J with an additional regularization term: η E_{p(x,y)}[‖∇_W ŷ(x)‖²]. This form of regularization encourages the parameters to go to regions of parameter space where small perturbations of the weights have a relatively small influence on the output. In other words, it pushes the model into regions where the model is relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions (Hochreiter and Schmidhuber, 1995). In the simplified case of linear regression (where, for instance, ŷ(x) = w⊤x + b), this regularization term collapses into η E_{p(x)}[‖x‖²], which is not a function of parameters and therefore does not contribute to the gradient of J̃_W with respect to the model parameters.
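A minimal sketch of this kind of weight-noise injection for a tiny linear regression model; sampling ε_W ∼ N(0, ηI) at each presentation is the only change to an otherwise ordinary SGD loop (the model and data here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

w = np.zeros(10)
eta, lr = 1e-3, 1e-2
for step in range(1000):
    i = rng.integers(len(X))
    # Sample a fresh weight perturbation for this presentation.
    eps_w = rng.normal(scale=np.sqrt(eta), size=w.shape)
    w_noisy = w + eps_w
    # Squared-error gradient evaluated at the perturbed weights.
    err = w_noisy @ X[i] - y[i]
    w -= lr * err * X[i]
```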
7.5.1 Injecting Noise at the Output Targets
Most datasets have some number of mistakes in the y labels. It can be harmful to maximize log p(y | x) when y is a mistake. One way to prevent this is to explicitly model the noise on the labels. For example, we can assume that for some small constant ε, the training set label y is correct with probability 1 − ε, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples. For example, label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ε/(k − 1) and 1 − ε, respectively. The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with a softmax classifier and hard targets may actually never converge—the softmax can never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever. It is possible to prevent this scenario using other regularization strategies like weight decay. Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification. This strategy has been used since the 1980s and continues to be featured prominently in modern neural networks (Szegedy et al., 2015).
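A minimal sketch of constructing label-smoothed targets for a k-class problem; the ε value is illustrative:

```python
import numpy as np

def smooth_labels(labels, k, eps=0.1):
    """Replace hard one-hot targets with label-smoothed targets.

    The correct class gets probability 1 - eps; the remaining eps is
    spread evenly over the other k - 1 classes (eps / (k - 1) each).
    """
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

print(smooth_labels(np.array([0, 2]), k=3, eps=0.1))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]]
```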
7.6 Semi-Supervised Learning
In the paradigm of semi-supervised learning, both unlabeled examples from
P
(
x
)
and labeled examples from
P
(
x, y
) are used to estimate
P
(
y | x
) or predict
y
from x.
In the context of deep learning, semi-supervised learning usually refers to
learning a representation
h
=
f
(
x
)
.
The goal is to learn a representation so
that examples from the same class have similar representations. Unsupervised
learning can provide useful clues for how to group examples in representation
space. Examples that cluster tightly in the input space should be mapped to
similar representations. A linear classifier in the new space may achieve better
generalization in many cases (Belkin and Niyogi, 2002; Chapelle et al., 2003). A
long-standing variant of this approach is the application of principal components
analysis as a preprocessing step before applying a classifier (on the projected data).
Instead of having separate unsupervised and supervised components in the
model, one can construct models in which a generative model of either
P
(
x
) or
P
(
x, y
) shares parameters with a discriminative model of
P
(
y | x
). One can
then trade off the supervised criterion
log P
(
y | x
) with the unsupervised or
generative one (such as
log P
(
x
) or
log P
(
x, y
)). The generative criterion then
expresses a particular form of prior belief about the solution to the supervised
learning problem (Lasserre et al., 2006), namely that the structure of
P
(
x
) is
connected to the structure of
P
(
y | x
) in a way that is captured by the shared
parametrization. By controlling how much of the generative criterion is included
in the total criterion, one can find a better trade-off than with a purely generative
or a purely discriminative training criterion (Lasserre et al., 2006; Larochelle and
Bengio, 2008).
Salakhutdinov and Hinton (2008) describe a method for learning the kernel
function of a kernel machine used for regression, in which the usage of unlabeled
examples for modeling P (x) improves P (y | x) quite significantly.
See Chapelle et al. (2006) for more information about semi-supervised learning.
7.7 Multitask Learning
Multitask learning (Caruana, 1993) is a way to improve generalization by pooling
the examples (which can be seen as soft constraints imposed on the parameters)
arising out of several tasks. In the same way that additional training examples
put more pressure on the parameters of the model toward values that generalize
well, when part of a model is shared across tasks, that part of the model is more
constrained toward good values (assuming the sharing is justified), often yielding
better generalization.
Figure 7.2 illustrates a very common form of multitask learning, in which
different supervised tasks (predicting
y
(i)
given
x
) share the same input
x
, as well
as some intermediate-level representation
h
(shared)
, capturing a common pool of
factors. The model can generally be divided into two kinds of parts and associated
parameters:
1.
Task-specific parameters (which only benefit from the examples of their task
to achieve good generalization). These are the upper layers of the neural
network in figure 7.2.
2.
Generic parameters, shared across all the tasks (which benefit from the
pooled data of all the tasks). These are the lower layers of the neural network
in figure 7.2.
Improved generalization and generalization error bounds (Baxter, 1995) can be
achieved because of the shared parameters, for which statistical strength can be
greatly improved (in proportion with the increased number of examples for the
shared parameters, compared to the scenario of single-task models). Of course this
will happen only if some assumptions about the statistical relationship between
the different tasks are valid, meaning that there is something shared across some
of the tasks.
From the point of view of deep learning, the underlying prior belief is the
following: among the factors that explain the variations observed in the data
associated with the different tasks, some are shared across two or more tasks.
7.8 Early Stopping
When training large models with sufficient representational capacity to overfit
the task, we often observe that training error decreases steadily over time, but
Figure 7.2: Multitask learning can be cast in several ways in deep learning frameworks, and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters (associated respectively with the weights into and from h^(1) and h^(2)) can be learned on top of those yielding a shared representation h^(shared). The underlying assumption is that there exists a common pool of factors that explain the variations in the input x, while each task is associated with a subset of these factors. In this example, it is additionally assumed that top-level hidden units h^(1) and h^(2) are specialized to each task (respectively predicting y^(1) and y^(2)), while some intermediate-level representation h^(shared) is shared across all tasks. In the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks (h^(3)): these are the factors that explain some of the input variations but are not relevant for predicting y^(1) or y^(2).
[Figure 7.3 plot: loss (negative log-likelihood) versus time (epochs, 0–250), showing the training set loss and the validation set loss.]
Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over
time (indicated as number of training iterations over the dataset, or
epochs
). In this
example, we train a maxout network on MNIST. Observe that the training objective
decreases consistently over time, but the validation set average loss eventually begins to
increase again, forming an asymmetric U-shaped curve.
validation set error begins to rise again. See figure 7.3 for an example of this
behavior, which occurs reliably.
This means we can obtain a model with better validation set error (and thus,
hopefully better test set error) by returning to the parameter setting at the point in
time with the lowest validation set error. Every time the error on the validation set
improves, we store a copy of the model parameters. When the training algorithm
terminates, we return these parameters, rather than the latest parameters. The
algorithm terminates when no parameters have improved over the best recorded
validation error for some pre-specified number of iterations. This procedure is
specified more formally in algorithm 7.1.
This strategy is known as
early stopping
. It is probably the most commonly
used form of regularization in deep learning. Its popularity is due to both its
effectiveness and its simplicity.
One way to think of early stopping is as a very efficient hyperparameter selection
algorithm. In this view, the number of training steps is just another hyperparameter.
We can see in figure 7.3 that this hyperparameter has a U-shaped validation set
performance curve. Most hyperparameters that control model capacity have such a
U-shaped validation set performance curve, as illustrated in figure 5.3. In the case of
early stopping, we are controlling the effective capacity of the model by determining
how many steps it can take to fit the training set. Most hyperparameters must be
chosen using an expensive guess and check process, where we set a hyperparameter
at the start of training, then run training for several steps to see its effect. The
“training time” hyperparameter is unique in that by definition, a single run of
training tries out many values of the hyperparameter. The only significant cost
to choosing this hyperparameter automatically via early stopping is running the
validation set evaluation periodically during training. Ideally, this is done in
parallel to the training process on a separate machine, separate CPU, or separate
GPU from the main training process. If such resources are not available, then the
cost of these periodic evaluations may be reduced by using a validation set that is
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

Let n be the number of steps between evaluations.
Let p be the “patience,” the number of times to observe worsening validation set error before giving up.
Let θₒ be the initial parameters.
θ ← θₒ
i ← 0
j ← 0
v ← ∞
θ∗ ← θ
i∗ ← i
while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v′ ← ValidationSetError(θ)
    if v′ < v then
        j ← 0
        θ∗ ← θ
        i∗ ← i
        v ← v′
    else
        j ← j + 1
    end if
end while
Best parameters are θ∗, best number of training steps is i∗.
small compared to the training set or by evaluating the validation set error less
frequently and obtaining a lower-resolution estimate of the optimal training time.
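A minimal Python sketch of the patience-based loop in algorithm 7.1, assuming hypothetical train_for_n_steps and validation_error helpers supplied by the surrounding training code:

```python
import copy

def early_stopping(theta0, train_for_n_steps, validation_error, n=100, patience=5):
    """Patience-based early stopping (a sketch of algorithm 7.1).

    `train_for_n_steps(theta, n)` and `validation_error(theta)` are assumed
    helpers; theta is assumed to be a copyable parameter object.
    """
    theta, i, j = copy.deepcopy(theta0), 0, 0
    best_v, best_theta, best_i = float("inf"), copy.deepcopy(theta0), 0
    while j < patience:
        theta = train_for_n_steps(theta, n)
        i += n
        v = validation_error(theta)
        if v < best_v:
            best_v, best_theta, best_i, j = v, copy.deepcopy(theta), i, 0
        else:
            j += 1
    return best_theta, best_i
```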
An additional cost to early stopping is the need to maintain a copy of the
best parameters. This cost is generally negligible, because it is acceptable to store
these parameters in a slower and larger form of memory (for example, training in
GPU memory, but storing the optimal parameters in host memory or on a disk
drive). Since the best parameters are written to infrequently and never read during
training, these occasional slow writes have little effect on the total training time.
Early stopping is an unobtrusive form of regularization, in that it requires
almost no change in the underlying training procedure, the objective function,
or the set of allowable parameter values. This means that it is easy to use early
stopping without damaging the learning dynamics. This is in contrast to weight
decay, where one must be careful not to use too much weight decay and trap the
network in a bad local minimum corresponding to a solution with pathologically
small weights.
Early stopping may be used either alone or in conjunction with other regulariza-
tion strategies. Even when using regularization strategies that modify the objective
function to encourage better generalization, it is rare for the best generalization to
occur at a local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not
fed to the model. To best exploit this extra data, one can perform extra training
after the initial training with early stopping has completed. In the second, extra
training step, all the training data is included. There are two basic strategies one
can use for this second training procedure.
One strategy (algorithm 7.2) is to initialize the model again and retrain on all
the data. In this second training pass, we train for the same number of steps as
the early stopping procedure determined was optimal in the first pass. There are
some subtleties associated with this procedure. For example, there is not a good
way of knowing whether to retrain for the same number of parameter updates or
the same number of passes through the dataset. On the second round of training,
each pass through the dataset will require more parameter updates because the
training set is bigger.
Another strategy for using all the data is to keep the parameters obtained from
the first round of training and then continue training, but now using all the data.
At this stage, we now no longer have a guide for when to stop in terms of a number
of steps. Instead, we can monitor the average loss function on the validation set
and continue training until it falls below the value of the training set objective at
which the early stopping procedure halted. This strategy avoids the high cost of
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i∗, the optimal number of steps.
Set θ to random values again.
Train on X^(train) and y^(train) for i∗ steps.
Algorithm 7.3 Meta-algorithm using early stopping to determine at what objective value we start to overfit, then continue training until that value is reached.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
ε ← J(θ, X^(subtrain), y^(subtrain))
while J(θ, X^(valid), y^(valid)) > ε do
    Train on X^(train) and y^(train) for n steps.
end while
retraining the model from scratch but is not as well behaved. For example, the
objective on the validation set may not ever reach the target value, so this strategy
is not even guaranteed to terminate. This procedure is presented more formally in
algorithm 7.3.
Early stopping is also useful because it reduces the computational cost of the
training procedure. Besides the obvious reduction in cost due to limiting the number
of training iterations, it also has the benefit of providing regularization without
requiring the addition of penalty terms to the cost function or the computation of
the gradients of such additional terms.
How early stopping acts as a regularizer:
So far we have stated that early
stopping is a regularization strategy, but we have supported this claim only by
showing learning curves where the validation set error has a U-shaped curve. What
Figure 7.4: An illustration of the effect of early stopping. (Left) The solid contour lines indicate the contours of the negative log-likelihood. The dashed line indicates the trajectory taken by SGD beginning from the origin. Rather than stopping at the point w∗ that minimizes the cost, early stopping results in the trajectory stopping at an earlier point w̃. (Right) An illustration of the effect of L² regularization for comparison. The dashed circles indicate the contours of the L² penalty, which causes the minimum of the total cost to lie nearer the origin than the minimum of the unregularized cost.
is the actual mechanism by which early stopping regularizes the model? Bishop (1995a) and Sjöberg and Ljung (1995) argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θₒ, as illustrated in figure 7.4. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) with learning rate ε. We can view the product ετ as a measure of effective capacity. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θₒ. In this sense, ετ behaves as if it were the reciprocal of the coefficient used for weight decay.

Indeed, we can show how—in the case of a simple linear model with a quadratic error function and simple gradient descent—early stopping is equivalent to L² regularization.
To compare with classical L² regularization, we examine a simple setting where the only parameters are linear weights (θ = w). We can model the cost function J with a quadratic approximation in the neighborhood of the empirically optimal value of the weights w∗:

Ĵ(θ) = J(w∗) + ½(w − w∗)⊤H(w − w∗),    (7.33)

where H is the Hessian matrix of J with respect to w evaluated at w∗. Given the assumption that w∗ is a minimum of J(w), we know that H is positive semidefinite.
Under a local Taylor series approximation, the gradient is given by

∇_w Ĵ(w) = H(w − w∗).    (7.34)

We are going to study the trajectory followed by the parameter vector during training. For simplicity, let us set the initial parameter vector to the origin,³ w^(0) = 0. Let us study the approximate behavior of gradient descent on J by analyzing gradient descent on Ĵ:

w^(τ) = w^(τ−1) − ε∇_w Ĵ(w^(τ−1))    (7.35)
      = w^(τ−1) − εH(w^(τ−1) − w∗),    (7.36)
w^(τ) − w∗ = (I − εH)(w^(τ−1) − w∗).    (7.37)

Let us now rewrite this expression in the space of the eigenvectors of H, exploiting the eigendecomposition of H: H = QΛQ⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors.

w^(τ) − w∗ = (I − εQΛQ⊤)(w^(τ−1) − w∗)    (7.38)
Q⊤(w^(τ) − w∗) = (I − εΛ)Q⊤(w^(τ−1) − w∗)    (7.39)

Assuming that w^(0) = 0 and that ε is chosen to be small enough to guarantee |1 − ελᵢ| < 1, the parameter trajectory during training after τ parameter updates is as follows:

Q⊤w^(τ) = [I − (I − εΛ)^τ]Q⊤w∗.    (7.40)
Now, the expression for Q⊤w̃ in equation 7.13 for L² regularization can be rearranged as

Q⊤w̃ = (Λ + αI)⁻¹ΛQ⊤w∗,    (7.41)
Q⊤w̃ = [I − (Λ + αI)⁻¹α]Q⊤w∗.    (7.42)

Comparing equation 7.40 and equation 7.42, we see that if the hyperparameters ε, α, and τ are chosen such that

(I − εΛ)^τ = (Λ + αI)⁻¹α,    (7.43)

then L² regularization and early stopping can be seen as equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all λᵢ are small (that is, ελᵢ ≪ 1 and λᵢ/α ≪ 1) then

τ ≈ 1/(εα),    (7.44)
α ≈ 1/(τε).    (7.45)

That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L² regularization parameter, and the inverse of ετ plays the role of the weight decay coefficient.

³For neural networks, to obtain symmetry breaking between hidden units, we cannot initialize all the parameters to 0, as discussed in section 6.2. However, the argument holds for any other initial value w^(0).
Parameter values corresponding to directions of significant curvature (of the
objective function) are regularized less than directions of less curvature. Of course,
in the context of early stopping, this really means that parameters that correspond
to directions of significant curvature tend to learn early relative to parameters
corresponding to directions of less curvature.
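As a quick numerical check of this correspondence (purely illustrative, with hand-picked small eigenvalues), one can compare the τ-step gradient descent factor of equation 7.40 with the weight-decay shrinkage factor of equation 7.13 when α ≈ 1/(τε):

```python
import numpy as np

lam = np.array([0.001, 0.002, 0.005])   # small eigenvalues of H
w_star_q = np.array([1.0, -2.0, 0.5])   # Q^T w* in the eigenbasis
eps, tau = 0.1, 200
alpha = 1.0 / (tau * eps)

early_stop   = (1 - (1 - eps * lam) ** tau) * w_star_q   # equation 7.40
weight_decay = lam / (lam + alpha) * w_star_q            # equation 7.13

print(early_stop)
print(weight_decay)   # close, though not identical, when eps*lam and lam/alpha are small
```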
The derivations in this section have shown that a trajectory of length
τ
ends
at a point that corresponds to a minimum of the
L
2
-regularized objective. Early
stopping is of course more than the mere restriction of the trajectory length;
instead, early stopping typically involves monitoring the validation set error in
order to stop the trajectory at a particularly good point in space. Early stopping
therefore has the advantage over weight decay in that it automatically determines
the correct amount of regularization while weight decay requires many training
experiments with different values of its hyperparameter.
7.9 Parameter Tying and Parameter Sharing
Thus far, in this chapter, when we have discussed adding constraints or penalties
to the parameters, we have always done so with respect to a fixed region or point.
For example,
L
2
regularization (or weight decay) penalizes model parameters for
deviating from the fixed value of zero. Sometimes, however, we may need other
ways to express our prior knowledge about suitable values of the model parameters.
Sometimes we might not know precisely what values the parameters should take,
but we know, from knowledge of the domain and model architecture, that there
should be some dependencies between the model parameters.
A common type of dependency that we often want to express is that certain
parameters should be close to one another. Consider the following scenario:
we have two models performing the same classification task (with the same set of classes) but with somewhat different input distributions. Formally, we have model A with parameters w^(A) and model B with parameters w^(B). The two models map the input to two different but related outputs: ŷ^(A) = f(w^(A), x) and ŷ^(B) = g(w^(B), x).
Let us imagine that the tasks are similar enough (perhaps with similar input and output distributions) that we believe the model parameters should be close to each other: ∀i, w_i^(A) should be close to w_i^(B). We can leverage this information through regularization. Specifically, we can use a parameter norm penalty of the form Ω(w^(A), w^(B)) = ‖w^(A) − w^(B)‖_2^2. Here we used an L^2 penalty, but other choices are also possible.
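As a minimal sketch added here (not the book's code; the function names are illustrative), the penalty above and its gradients can be computed directly from the two weight vectors and added to each model's task loss:

```python
import numpy as np

def tied_penalty(w_a, w_b, alpha=0.1):
    """Omega(w_a, w_b) = alpha * ||w_a - w_b||_2^2, pulling the two models together."""
    return alpha * np.sum((w_a - w_b) ** 2)

def tied_penalty_grads(w_a, w_b, alpha=0.1):
    """Gradients of the penalty with respect to each model's parameters."""
    diff = w_a - w_b
    return 2 * alpha * diff, -2 * alpha * diff

w_a = np.array([1.0, 2.0, 3.0])
w_b = np.array([1.5, 1.0, 3.5])
print(tied_penalty(w_a, w_b))
# In training, these gradients would be added to each model's task-loss gradients.
```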
This kind of approach was proposed by Lasserre et al. (2006), who regularized
the parameters of one model, trained as a classifier in a supervised paradigm, to
be close to the parameters of another model, trained in an unsupervised paradigm
(to capture the distribution of the observed input data). The architectures were
constructed such that many of the parameters in the classifier model could be
paired to corresponding parameters in the unsupervised model.
While a parameter norm penalty is one way to regularize parameters to be
close to one another, the more popular way is to use constraints: to force sets
of parameters to be equal. This method of regularization is often referred to as parameter sharing, because we interpret the various models or model components
as sharing a unique set of parameters. A significant advantage of parameter sharing
over regularizing the parameters to be close (via a norm penalty) is that only a
subset of the parameters (the unique set) needs to be stored in memory. In certain
models—such as the convolutional neural network—this can lead to significant
reduction in the memory footprint of the model.
7.9.1 Convolutional Neural Networks
By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision.
Natural images have many statistical properties that are invariant to translation.
For example, a photo of a cat remains a photo of a cat if it is translated one pixel
to the right. CNNs take this property into account by sharing parameters across
multiple image locations. The same feature (a hidden unit with the same weights)
is computed over different locations in the input. This means that we can find a cat with the same cat detector whether the cat appears at column i or column i + 1 in the image.
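As a rough illustration added here (1-D rather than 2-D images, and not code from the book), the same small filter is applied at every position of the input, so the detector's parameter count is the filter size rather than one weight per location:

```python
import numpy as np

def conv1d_valid(x, w):
    """Apply the same filter w at every position of x (shared parameters)."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

x = np.array([0., 0., 1., 0., 0., 1., 0.])   # the same pattern at two locations
w = np.array([0., 1., 0.])                   # one detector: 3 parameters in total
print(conv1d_valid(x, w))                    # responds wherever the pattern appears
```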
Parameter sharing has enabled CNNs to dramatically lower the number of
unique model parameters and to significantly increase network sizes without
requiring a corresponding increase in training data. It remains one of the best
examples of how to effectively incorporate domain knowledge into the network
architecture.
CNNs are discussed in more detail in chapter 9.
7.10 Sparse Representations
Weight decay acts by placing a penalty directly on the model parameters. Another
strategy is to place a penalty on the activations of the units in a neural network,
encouraging their activations to be sparse. This indirectly imposes a complicated
penalty on the model parameters.
We have already discussed (in section 7.1.2) how L^1 penalization induces
a sparse parametrization—meaning that many of the parameters become zero
(or close to zero). Representational sparsity, on the other hand, describes a
representation where many of the elements of the representation are zero (or close
to zero). A simplified view of this distinction can be illustrated in the context of
linear regression:
[ 18 ]   [ 4  0  0 −2  0  0 ] [  2 ]
[  5 ]   [ 0  0 −1  0  3  0 ] [  3 ]
[ 15 ] = [ 0  5  0  0  0  0 ] [ −2 ]    (7.46)
[ −9 ]   [ 1  0  0 −1  0 −4 ] [ −5 ]
[ −3 ]   [ 1  0  0  0 −5  0 ] [  1 ]
                              [  4 ]
 y ∈ R^m       A ∈ R^(m×n)     x ∈ R^n

[ −14 ]   [  3 −1  2 −5  4  1 ] [  0 ]
[   1 ]   [  4  2 −3 −1  1  3 ] [  2 ]
[  19 ] = [ −1  5  4  2 −3 −2 ] [  0 ]    (7.47)
[   2 ]   [  3  1  2 −3  0 −3 ] [  0 ]
[  23 ]   [ −5  4 −2  2 −5 −1 ] [ −3 ]
                                [  0 ]
 y ∈ R^m        B ∈ R^(m×n)      h ∈ R^n
In the first expression, we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.
Representational regularization is accomplished by the same sorts of mechanisms
that we have used in parameter regularization.
Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation. This penalty is denoted Ω(h). As before, we denote the regularized loss function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(h),    (7.48)

where α ∈ [0, ∞) weights the relative contribution of the norm penalty term, with larger values of α corresponding to more regularization.
Just as an L^1 penalty on the parameters induces parameter sparsity, an L^1 penalty on the elements of the representation induces representational sparsity: Ω(h) = ‖h‖_1 = Σ_i |h_i|. Of course, the L^1 penalty is only one choice of penalty that can result in a sparse representation. Others include the penalty derived from a Student t prior on the representation (Olshausen and Field, 1996; Bergstra, 2011) and KL divergence penalties (Larochelle and Bengio, 2008), which are especially useful for representations with elements constrained to lie on the unit interval. Lee et al. (2008) and Goodfellow et al. (2009) both provide examples of strategies based on regularizing the average activation across several examples, (1/m) Σ_i h^(i), to be near some target value, such as a vector with .01 for each entry.
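A minimal sketch added here (illustrative assumptions throughout, not the book's code) of attaching an L^1 sparsity penalty on a hidden representation to a simple squared-error task loss:

```python
import numpy as np

def sparse_penalized_loss(x, y, W, V, alpha=0.01):
    """Task loss plus an L1 penalty on the hidden representation h."""
    h = np.maximum(0.0, W @ x)                      # hidden representation (ReLU)
    y_hat = V @ h                                   # linear output layer
    task_loss = 0.5 * np.sum((y_hat - y) ** 2)
    sparsity_penalty = alpha * np.sum(np.abs(h))    # Omega(h) = ||h||_1
    return task_loss + sparsity_penalty

rng = np.random.default_rng(0)
W, V = rng.normal(size=(8, 4)), rng.normal(size=(2, 8))
x, y = rng.normal(size=4), rng.normal(size=2)
print(sparse_penalized_loss(x, y, W, V))
```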
Other approaches obtain representational sparsity with a hard constraint on the activation values. For example, orthogonal matching pursuit (Pati et al., 1993) encodes an input x with the representation h that solves the constrained optimization problem

arg min_{h, ‖h‖_0 < k} ‖x − Wh‖²,    (7.49)

where ‖h‖_0 is the number of nonzero entries of h. This problem can be solved efficiently when W is constrained to be orthogonal. This method is often called OMP-k, with the value of k specified to indicate the number of nonzero features allowed. Coates and Ng (2011) demonstrated that OMP-1 can be a very effective feature extractor for deep architectures.
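As a rough illustration added here (an assumption of this presentation, not the original text): when W is orthogonal, ‖x − Wh‖² = ‖W^⊤x − h‖², so OMP-k reduces to keeping the k largest-magnitude coefficients of W^⊤x:

```python
import numpy as np

def omp_k_orthogonal(x, W, k):
    """k-sparse code for x when W is orthogonal: keep the k largest coefficients."""
    c = W.T @ x                          # since ||x - W h||^2 = ||W^T x - h||^2
    h = np.zeros_like(c)
    keep = np.argsort(np.abs(c))[-k:]    # indices of the k largest-magnitude entries
    h[keep] = c[keep]
    return h

W = np.linalg.qr(np.random.default_rng(0).normal(size=(6, 6)))[0]   # orthogonal dictionary
x = np.random.default_rng(1).normal(size=6)
print(omp_k_orthogonal(x, W, k=1))       # OMP-1 representation of x
```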
Essentially any model that has hidden units can be made sparse. Throughout
this book, we see many examples of sparsity regularization used in various contexts.
7.11 Bagging and Other Ensemble Methods
Bagging
(short for
bootstrap aggregating
) is a technique for reducing general-
ization error by combining several models (Breiman, 1994). The idea is to train
several different models separately, then have all the models vote on the output for
test examples. This is an example of a general strategy in machine learning called
model averaging
. Techniques employing this strategy are known as
ensemble
methods.
The reason that model averaging works is that different models will usually
not make all the same errors on the test set.
Consider for example a set of k regression models. Suppose that each model makes an error ε_i on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances E[ε_i²] = v and covariances E[ε_i ε_j] = c. Then the error made by the average prediction of all the ensemble models is (1/k) Σ_i ε_i. The expected squared error of the ensemble predictor is

E[ ((1/k) Σ_i ε_i)² ] = (1/k²) E[ Σ_i ( ε_i² + Σ_{j≠i} ε_i ε_j ) ]    (7.50)
                      = (1/k) v + ((k − 1)/k) c.    (7.51)
In the case where the errors are perfectly correlated and c = v, the mean squared error reduces to v, so the model averaging does not help at all. In the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only (1/k) v. This means that the expected squared error of the ensemble is inversely proportional to the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than its members.
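The following sketch (a numerical check added here, not part of the text) verifies equation 7.51 by simulation: perfectly correlated errors leave the ensemble error near v, while independent errors shrink it toward v/k.

```python
import numpy as np

def ensemble_mse(k, v, c, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[(mean of k correlated errors)^2]."""
    rng = np.random.default_rng(seed)
    cov = np.full((k, k), c) + np.eye(k) * (v - c)   # variance v, covariance c
    errors = rng.multivariate_normal(np.zeros(k), cov, size=n_samples)
    return np.mean(errors.mean(axis=1) ** 2)

k, v = 10, 1.0
print(ensemble_mse(k, v, c=v))      # ~ v: perfectly correlated errors, no benefit
print(ensemble_mse(k, v, c=0.0))    # ~ v/k: independent errors
print(v / k + (k - 1) / k * 0.3)    # prediction of equation 7.51 for c = 0.3
print(ensemble_mse(k, v, c=0.3))
```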
Different ensemble methods construct the ensemble of models in different ways.
For example, each member of the ensemble could be formed by training a completely
different kind of model using a different algorithm or objective function. Bagging
is a method that allows the same kind of model, training algorithm and objective
function to be reused several times.
Specifically, bagging involves constructing
k
different datasets. Each dataset
has the same number of examples as the original dataset, but each dataset is
constructed by sampling with replacement from the original dataset. This means
that, with high probability, each dataset is missing some of the examples from the
[Figure 7.5 panels: the original dataset, a first resampled dataset used to train the first ensemble member, and a second resampled dataset used to train the second ensemble member.]
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an 8 detector on
the dataset depicted above, containing an 8, a 6 and a 9. Suppose we make two different
resampled datasets. The bagging training procedure is to construct each of these datasets
by sampling with replacement. The first dataset omits the 9 and repeats the 8. On this
dataset, the detector learns that a loop on top of the digit corresponds to an 8. On the
second dataset, we repeat the 9 and omit the 6. In this case, the detector learns that a
loop on the bottom of the digit corresponds to an 8. Each of these individual classification
rules is brittle, but if we average their output, then the detector is robust, achieving
maximal confidence only when both loops of the 8 are present.
original dataset and contains several duplicate examples.^4 Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models. See figure 7.5 for an example.
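A minimal sketch added here (not the book's code) of constructing the resampled datasets; it also checks the fraction of omitted examples discussed in footnote 4:

```python
import numpy as np

def bootstrap_datasets(X, y, k, seed=0):
    """Construct k resampled datasets by sampling with replacement."""
    rng = np.random.default_rng(seed)
    m = len(X)
    for _ in range(k):
        idx = rng.integers(0, m, size=m)   # m draws with replacement
        yield X[idx], y[idx]

m = 10_000
X, y = np.arange(m), np.arange(m)
for X_i, y_i in bootstrap_datasets(X, y, k=1):
    missing = 1 - len(np.unique(X_i)) / m
    print(missing)                          # close to 1/e ~ 0.368 for large m
```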
Neural networks reach a wide enough variety of solution points that they can
often benefit from model averaging even if all the models are trained on the same
dataset. Differences in random initialization, in random selection of minibatches,
in hyperparameters, or in outcomes of nondeterministic implementations of neural
networks are often enough to cause different members of the ensemble to make
partially independent errors.
Model averaging is an extremely powerful and reliable method for reducing
generalization error. Its use is usually discouraged when benchmarking algorithms
for scientific papers, because any machine learning algorithm can benefit substan-
tially from model averaging at the price of increased computation and memory.
^4 When both the original and the resampled dataset contain m examples, the exact proportion of examples missing from the new dataset is (1 − 1/m)^m. This is the chance that a particular example is not chosen among the m possible source examples for all m draws used to create the new dataset. As m approaches infinity, this quantity converges to 1/e, which is slightly larger than 1/3.
For this reason, benchmark comparisons are usually made using a single model.
Machine learning contests are usually won by methods using model averag-
ing over dozens of models. A recent prominent example is the Netflix Grand
Prize (Koren, 2009).
Not all techniques for constructing ensembles are designed to make the ensemble
more regularized than the individual models. For example, a technique called
boosting
(Freund and Schapire, 1996b,a) constructs an ensemble with higher
capacity than the individual models. Boosting has been applied to build ensembles
of neural networks (Schwenk and Bengio, 1998) by incrementally adding neural
networks to the ensemble. Boosting has also been applied by interpreting an individual neural network as an ensemble (Bengio et al., 2006a), incrementally adding hidden units to the network.
7.12 Dropout
Dropout
(Srivastava et al., 2014) provides a computationally inexpensive but
powerful method of regularizing a broad family of models. To a first approximation,
dropout can be thought of as a method of making bagging practical for ensembles
of very many large neural networks. Bagging involves training multiple models
and evaluating multiple models on each test example. This seems impractical
when each model is a large neural network, since training and evaluating such
networks is costly in terms of runtime and memory. It is common to use ensembles
of five to ten neural networks—Szegedy et al. (2014a) used six to win the ILSVRC—
but more than this rapidly becomes unwieldy. Dropout provides an inexpensive
approximation to training and evaluating a bagged ensemble of exponentially many
neural networks.
Specifically, dropout trains the ensemble consisting of all subnetworks that
can be formed by removing nonoutput units from an underlying base network, as
illustrated in figure 7.6. In most modern neural networks, based on a series of
affine transformations and nonlinearities, we can effectively remove a unit from a
network by multiplying its output value by zero. This procedure requires some
slight modification for models such as radial basis function networks, which take
the difference between the unit’s state and some reference value. Here, we present
the dropout algorithm in terms of multiplication by zero for simplicity, but it can
be trivially modified to work with other operations that remove a unit from the
network.
Recall that to learn with bagging, we define
k
different models, construct
k
[Figure 7.6 panels: the base network and the ensemble of sixteen subnetworks formed by dropping out subsets of its units.]
Figure 7.6: Dropout trains an ensemble consisting of all subnetworks that can be con-
structed by removing nonoutput units from an underlying base network. Here, we begin
with a base network with two visible units and two hidden units. There are sixteen
possible subsets of these four units. We show all sixteen subnetworks that may be formed
by dropping out different subsets of units from the original network. In this small example,
a large proportion of the resulting networks have no input units or no path connecting
the input to the output. This problem becomes insignificant for networks with wider
layers, where the probability of dropping all possible paths from inputs to outputs becomes
smaller.
different datasets by sampling from the training set with replacement, and then
train model
i
on dataset
i
. Dropout aims to approximate this process, but with an
exponentially large number of neural networks. Specifically, to train with dropout,
we use a minibatch-based learning algorithm that makes small steps, such as
stochastic gradient descent. Each time we load an example into a minibatch, we
randomly sample a different binary mask to apply to all the input and hidden
units in the network. The mask for each unit is sampled independently from all
the others. The probability of sampling a mask value of one (causing a unit to be
included) is a hyperparameter fixed before training begins. It is not a function
of the current value of the model parameters or the input example. Typically, an
input unit is included with probability 0.8, and a hidden unit is included with
probability 0.5. We then run forward propagation, back-propagation, and the
learning update as usual. Figure 7.7 illustrates how to run forward propagation
with dropout.
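A minimal sketch added here (illustrative, not the book's code) of one dropout forward pass for a tiny network like the one in figure 7.7: a binary mask is sampled for every input and hidden unit and multiplied into the corresponding activation.

```python
import numpy as np

def dropout_forward(x, W1, W2, rng, p_input=0.8, p_hidden=0.5):
    """One forward pass with dropout masks on the input and hidden units."""
    mu_x = rng.binomial(1, p_input, size=x.shape)    # include each input w.p. 0.8
    h = np.maximum(0.0, W1 @ (mu_x * x))             # hidden layer of the subnetwork
    mu_h = rng.binomial(1, p_hidden, size=h.shape)   # include each hidden unit w.p. 0.5
    return W2 @ (mu_h * h)                           # output of the sampled subnetwork

rng = np.random.default_rng(0)
x = rng.normal(size=2)
W1, W2 = rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
print(dropout_forward(x, W1, W2, rng))
```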
More formally, suppose that a mask vector µ specifies which units to include, and J(θ, µ) defines the cost of the model defined by parameters θ and mask µ. Then dropout training consists of minimizing E_µ J(θ, µ). The expectation contains exponentially many terms, but we can obtain an unbiased estimate of its gradient by sampling values of µ.
Dropout training is not quite the same as bagging training. In the case of
bagging, the models are all independent. In the case of dropout, the models
share parameters, with each model inheriting a different subset of parameters
from the parent neural network. This parameter sharing makes it possible to
represent an exponential number of models with a tractable amount of memory.
In the case of bagging, each model is trained to convergence on its respective
training set. In the case of dropout, typically most models are not explicitly
trained at all—usually, the model is large enough that it would be infeasible to
sample all possible subnetworks within the lifetime of the universe. Instead, a tiny
fraction of the possible subnetworks are each trained for a single step, and the
parameter sharing causes the remaining subnetworks to arrive at good settings of
the parameters. These are the only differences. Beyond these, dropout follows the
bagging algorithm. For example, the training set encountered by each subnetwork
is indeed a subset of the original training set sampled with replacement.
To make a prediction, a bagged ensemble must accumulate votes from all
its members. We refer to this process as
inference
in this context. So far, our
description of bagging and dropout has not required that the model be explicitly
probabilistic. Now, we assume that the model’s role is to output a probability
distribution. In the case of bagging, each model
i
produces a probability distribution
Figure 7.7: An example of forward propagation through a feedforward network using dropout. (Top) In this example, we use a feedforward network with two input units, one hidden layer with two hidden units, and one output unit. (Bottom) To perform forward propagation with dropout, we randomly sample a vector µ with one entry for each input or hidden unit in the network. The entries of µ are binary and are sampled independently from each other. The probability of each entry being 1 is a hyperparameter, usually 0.5 for the hidden layers and 0.8 for the input. Each unit in the network is multiplied by the corresponding mask, and then forward propagation continues through the rest of the network as usual. This is equivalent to randomly selecting one of the subnetworks from figure 7.6 and running forward propagation through it.
p^(i)(y | x). The prediction of the ensemble is given by the arithmetic mean of all these distributions,

(1/k) Σ_{i=1}^{k} p^(i)(y | x).    (7.52)
In the case of dropout, each submodel defined by mask vector µ defines a probability distribution p(y | x, µ). The arithmetic mean over all masks is given by

Σ_µ p(µ) p(y | x, µ),    (7.53)

where p(µ) is the probability distribution that was used to sample µ at training time.
Because this sum includes an exponential number of terms, it is intractable to
evaluate except when the structure of the model permits some form of simplification.
So far, deep neural nets are not known to permit any tractable simplification.
Instead, we can approximate the inference with sampling, by averaging together
the output from many masks. Even 10–20 masks are often sufficient to obtain
good performance.
An even better approach, however, allows us to obtain a good approximation to
the predictions of the entire ensemble, at the cost of only one forward propagation.
To do so, we change to using the geometric mean rather than the arithmetic mean of
the ensemble members’ predicted distributions. Warde-Farley et al. (2014) present
arguments and empirical evidence that the geometric mean performs comparably
to the arithmetic mean in this context.
The geometric mean of multiple probability distributions is not guaranteed to be
a probability distribution. To guarantee that the result is a probability distribution,
we impose the requirement that none of the submodels assigns probability 0 to any
event, and we renormalize the resulting distribution. The unnormalized probability
distribution defined directly by the geometric mean is given by

p̃_ensemble(y | x) = [ ∏_µ p(y | x, µ) ]^(1/2^d),    (7.54)

where d is the number of units that may be dropped. Here we use a uniform distribution over µ to simplify the presentation, but nonuniform distributions are also possible. To make predictions we must renormalize the ensemble:
p_ensemble(y | x) = p̃_ensemble(y | x) / Σ_{y′} p̃_ensemble(y′ | x).    (7.55)
A key insight (Hinton et al., 2012c) involved in dropout is that we can approximate p_ensemble by evaluating p(y | x) in one model: the model with all units, but with the weights going out of unit i multiplied by the probability of including unit i. The motivation for this modification is to capture the right expected value of the output from that unit. We call this approach the weight scaling inference rule.
There is not yet any theoretical argument for the accuracy of this approximate inference rule in deep nonlinear networks, but empirically it performs very well. Because we usually use an inclusion probability of 1/2, the weight scaling rule usually amounts to dividing the weights by 2 at the end of training, and then using the model as usual. Another way to achieve the same result is to multiply the states of the units by 2 during training. Either way, the goal is to make sure that the expected total input to a unit at test time is roughly the same as the expected total input to that unit at train time, even though half the units at train time are missing on average.
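The sketch below (an illustration added here with assumed names, not the book's code) contrasts weight scaling inference with Monte Carlo mask averaging for a small network with a linear output layer; for such a network the two coincide in expectation, while for deep nonlinear networks weight scaling is only an approximation.

```python
import numpy as np

def predict_weight_scaling(x, W1, W2, p_hidden=0.5):
    """One forward pass approximating the ensemble: scale the hidden activations."""
    h = np.maximum(0.0, W1 @ x)
    return W2 @ (p_hidden * h)            # same as multiplying the outgoing weights W2

def predict_monte_carlo(x, W1, W2, n_masks=20, p_hidden=0.5, seed=0):
    """Average the outputs over sampled dropout masks on the hidden layer."""
    rng = np.random.default_rng(seed)
    h = np.maximum(0.0, W1 @ x)
    outs = [W2 @ (rng.binomial(1, p_hidden, size=h.shape) * h) for _ in range(n_masks)]
    return np.mean(outs, axis=0)

rng = np.random.default_rng(1)
x, W1, W2 = rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
print(predict_weight_scaling(x, W1, W2))
print(predict_monte_carlo(x, W1, W2, n_masks=200))
```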
For many classes of models that do not have nonlinear hidden units, the weight
scaling inference rule is exact. For a simple example, consider a softmax regression
classifier with n input variables represented by the vector v:
P(y = y | v) = softmax(W^⊤ v + b)_y.    (7.56)

We can index into the family of submodels by element-wise multiplication of the input with a binary vector d:

P(y = y | v; d) = softmax(W^⊤ (d ⊙ v) + b)_y.    (7.57)
The ensemble predictor is defined by renormalizing the geometric mean over all ensemble members' predictions:

P_ensemble(y = y | v) = P̃_ensemble(y = y | v) / Σ_{y′} P̃_ensemble(y = y′ | v),    (7.58)

where

P̃_ensemble(y = y | v) = [ ∏_{d∈{0,1}^n} P(y = y | v; d) ]^(1/2^n).    (7.59)
To see that the weight scaling rule is exact, we can simplify P̃_ensemble:

P̃_ensemble(y = y | v) = [ ∏_{d∈{0,1}^n} P(y = y | v; d) ]^(1/2^n)    (7.60)
                       = [ ∏_{d∈{0,1}^n} softmax(W^⊤ (d ⊙ v) + b)_y ]^(1/2^n)    (7.61)
                       = [ ∏_{d∈{0,1}^n} exp(W_{y,:}^⊤ (d ⊙ v) + b_y) / Σ_{y′} exp(W_{y′,:}^⊤ (d ⊙ v) + b_{y′}) ]^(1/2^n)    (7.62)
                       = [ ∏_{d∈{0,1}^n} exp(W_{y,:}^⊤ (d ⊙ v) + b_y) ]^(1/2^n) / [ ∏_{d∈{0,1}^n} Σ_{y′} exp(W_{y′,:}^⊤ (d ⊙ v) + b_{y′}) ]^(1/2^n)    (7.63)
Because P̃ will be normalized, we can safely ignore multiplication by factors that are constant with respect to y:

P̃_ensemble(y = y | v) ∝ [ ∏_{d∈{0,1}^n} exp(W_{y,:}^⊤ (d ⊙ v) + b_y) ]^(1/2^n)    (7.64)
                       = exp( (1/2^n) Σ_{d∈{0,1}^n} W_{y,:}^⊤ (d ⊙ v) + b_y )    (7.65)
                       = exp( (1/2) W_{y,:}^⊤ v + b_y ).    (7.66)
Substituting this back into equation 7.58, we obtain a softmax classifier with weights (1/2)W.
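A small numerical check added here (not part of the text) that, for softmax regression, the renormalized geometric mean over all 2^n masks equals a single softmax evaluation with weights W/2:

```python
import numpy as np
from itertools import product

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, n_classes = 4, 3
W, b = rng.normal(size=(n, n_classes)), rng.normal(size=n_classes)
v = rng.normal(size=n)

# Renormalized geometric mean of softmax(W^T (d*v) + b) over all 2^n binary masks d.
log_probs = [np.log(softmax(W.T @ (np.array(d) * v) + b))
             for d in product([0, 1], repeat=n)]
geo = np.exp(np.mean(log_probs, axis=0))
ensemble = geo / geo.sum()

# Weight scaling rule: a single evaluation with weights W / 2.
scaled = softmax((W / 2).T @ v + b)

print(np.allclose(ensemble, scaled))   # True
```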
The weight scaling rule is also exact in other settings, including regression
networks with conditionally normal outputs as well as deep networks that have
hidden layers without nonlinearities. However, the weight scaling rule is only an
approximation for deep models that have nonlinearities. Though the approximation
has not been theoretically characterized, it often works well, empirically. Goodfellow
et al. (2013a) found experimentally that the weight scaling approximation can work
better (in terms of classification accuracy) than Monte Carlo approximations to
the ensemble predictor. This held true even when the Monte Carlo approximation
was allowed to sample up to 1,000 subnetworks. Gal and Ghahramani (2015) found
that some models obtain better classification accuracy using twenty samples and
the Monte Carlo approximation. It appears that the optimal choice of inference
approximation is problem dependent.
Srivastava et al. (2014) showed that dropout is more effective than other
standard computationally inexpensive regularizers, such as weight decay, filter
norm constraints, and sparse activity regularization. Dropout may also be combined
with other forms of regularization to yield a further improvement.
One advantage of dropout is that it is very computationally cheap. Using dropout during training requires only O(n) computation per example per update, to generate n random binary numbers and multiply them by the state. Depending on the implementation, it may also require O(n) memory to store these binary numbers until the back-propagation stage. Running inference in the trained model has the same cost per example as if dropout were not used, though we must pay the cost of dividing the weights by 2 once before beginning to run inference on examples.
Another significant advantage of dropout is that it does not significantly limit
the type of model or training procedure that can be used. It works well with nearly
any model that uses a distributed representation and can be trained with stochastic
gradient descent. This includes feedforward neural networks, probabilistic models
such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent
neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a). Many other
regularization strategies of comparable power impose more severe restrictions on
the architecture of the model.
Though the cost per step of applying dropout to a specific model is negligible,
the cost of using dropout in a complete system can be significant. Because dropout
is a regularization technique, it reduces the effective capacity of a model. To offset
this effect, we must increase the size of the model. Typically the optimal validation
set error is much lower when using dropout, but this comes at the cost of a much
larger model and many more iterations of the training algorithm. For very large
datasets, regularization confers little reduction in generalization error. In these
cases, the computational cost of using dropout and larger models may outweigh
the benefit of regularization.
When extremely few labeled training examples are available, dropout is less
effective. Bayesian neural networks (Neal, 1996) outperform dropout on the
Alternative Splicing Dataset (Xiong et al., 2011), where fewer than 5,000 examples
are available (Srivastava et al., 2014). When additional unlabeled data is available,
unsupervised feature learning can gain an advantage over dropout.
Wager et al. (2013) showed that, when applied to linear regression, dropout is equivalent to L^2 weight decay, with a different weight decay coefficient for each input feature. The magnitude of each feature's weight decay coefficient is determined by its variance. Similar results hold for other linear models. For deep models, dropout is not equivalent to weight decay.
The stochasticity used while training with dropout is not necessary for the
approach’s success. It is just a means of approximating the sum over all submodels.
Wang and Manning (2013) derived analytical approximations to this marginaliza-
tion. Their approximation, known as
fast dropout
, resulted in faster convergence
time due to the reduced stochasticity in the computation of the gradient. This
method can also be applied at test time, as a more principled (but also more
computationally expensive) approximation to the average over all sub-networks
than the weight scaling approximation. Fast dropout has been used to nearly
match the performance of standard dropout on small neural network problems, but
has not yet yielded a significant improvement or been applied to a large problem.
Just as stochasticity is not necessary to achieve the regularizing effect of
dropout, it is also not sufficient. To demonstrate this, Warde-Farley et al. (2014)
designed control experiments using a method called
dropout boosting
, which
they designed to use exactly the same mask noise as traditional dropout but lack
its regularizing effect. Dropout boosting trains the entire ensemble to jointly
maximize the log-likelihood on the training set. In the same sense that traditional
dropout is analogous to bagging, this approach is analogous to boosting. As
intended, experiments with dropout boosting show almost no regularization effect
compared to training the entire network as a single model. This demonstrates that
the interpretation of dropout as bagging has value beyond the interpretation of
dropout as robustness to noise. The regularization effect of the bagged ensemble is
only achieved when the stochastically sampled ensemble members are trained to
perform well independently of each other.
Dropout has inspired other stochastic approaches to training exponentially
large ensembles of models that share weights. DropConnect is a special case of
dropout where each product between a single scalar weight and a single hidden
unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic
pooling is a form of randomized pooling (see section 9.3) for building ensembles
of convolutional networks, with each convolutional network attending to different
spatial locations of each feature map. So far, dropout remains the most widely
used implicit ensemble method.
One of the key insights of dropout is that training a network with stochastic
behavior and making predictions by averaging over multiple stochastic decisions
implements a form of bagging with parameter sharing. Earlier, we described
dropout as bagging an ensemble of models formed by including or excluding units.
Yet this model averaging strategy does not need to be based on inclusion and
exclusion. In principle, any kind of random modification is admissible. In practice,
we must choose modification families that neural networks are able to learn to
resist. Ideally, we should also use model families that allow a fast approximate
inference rule. We can think of any form of modification parametrized by a vector µ as training an ensemble consisting of p(y | x, µ) for all possible values of µ. There is no requirement that µ have a finite number of values. For example, µ can be real valued. Srivastava et al. (2014) showed that multiplying the weights by µ ∼ N(1, I) can outperform dropout based on binary masks. Because E[µ] = 1, the standard network automatically implements approximate inference in the ensemble, without needing any weight scaling.
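A brief sketch added here (an illustration, applied to hidden activations rather than to individual weights for simplicity) of this multiplicative Gaussian-noise variant; because E[µ] = 1, the unmodified network is used directly at test time:

```python
import numpy as np

def gaussian_dropout(h, rng, sigma=0.5, train=True):
    """Multiply activations by mu ~ N(1, sigma^2) at train time; identity at test time."""
    if not train:
        return h                        # E[mu] = 1, so no rescaling is needed
    mu = rng.normal(loc=1.0, scale=sigma, size=h.shape)
    return mu * h

rng = np.random.default_rng(0)
h = np.array([1.0, 2.0, 3.0])
print(gaussian_dropout(h, rng))         # noisy activations during training
print(gaussian_dropout(h, rng, train=False))
```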
So far we have described dropout purely as a means of performing efficient,
approximate bagging. Another view of dropout goes further than this. Dropout
trains not just a bagged ensemble of models, but an ensemble of models that share
hidden units. This means each hidden unit must be able to perform well regardless
of which other hidden units are in the model. Hidden units must be prepared to be
swapped and interchanged between models. Hinton et al. (2012c) were inspired by
an idea from biology: sexual reproduction, which involves swapping genes between
two different organisms, creates evolutionary pressure for genes to become not
just good but readily swapped between different organisms. Such genes and such
features are robust to changes in their environment because they are not able to
incorrectly adapt to unusual features of any one organism or model. Dropout
thus regularizes each hidden unit to be not merely a good feature but a feature
that is good in many contexts. Warde-Farley et al. (2014) compared dropout
training to training of large ensembles and concluded that dropout offers additional
improvements to generalization error beyond those obtained by ensembles of
independent models.
It is important to understand that a large portion of the power of dropout
arises from the fact that the masking noise is applied to the hidden units. This
can be seen as a form of highly intelligent, adaptive destruction of the information
content of the input rather than destruction of the raw values of the input. For
example, if the model learns a hidden unit h_i that detects a face by finding the nose, then dropping h_i corresponds to erasing the information that there is a nose in the image. The model must learn another h_i that either redundantly encodes the presence of a nose or detects the face by another feature, such as the mouth.
Traditional noise injection techniques that add unstructured noise at the input are
not able to randomly erase the information about a nose from an image of a face
unless the magnitude of the noise is so great that nearly all the information in
the image is removed. Destroying extracted features rather than original values
allows the destruction process to make use of all the knowledge about the input
distribution that the model has acquired so far.
Another important aspect of dropout is that the noise is multiplicative. If the noise were additive with fixed scale, then a rectified linear hidden unit h_i with added noise ε could simply learn to have h_i become very large in order to make the added noise ε insignificant by comparison. Multiplicative noise does not allow such a pathological solution to the noise robustness problem.
Another deep learning algorithm, batch normalization, reparametrizes the model
in a way that introduces both additive and multiplicative noise on the hidden
units at training time. The primary purpose of batch normalization is to improve
optimization, but the noise can have a regularizing effect, and sometimes makes
dropout unnecessary. Batch normalization is described further in section 8.7.1.
7.13 Adversarial Training
In many cases, neural networks have begun to reach human performance when
evaluated on an i.i.d. test set. It is natural therefore to wonder whether these
models have obtained a true human-level understanding of these tasks. To probe
the level of understanding a network has of the underlying task, we can search
for examples that the model misclassifies. Szegedy et al. (2014b) found that even
neural networks that perform at human level accuracy have a nearly 100 percent
error rate on examples that are intentionally constructed by using an optimization
procedure to search for an input x′ near a data point x such that the model output is very different at x′. In many cases, x′ can be so similar to x that a
[Figure 7.8 panels: x, classified "panda" with 57.7% confidence; sign(∇_x J(θ, x, y)), classified "nematode" with 8.2% confidence; x + .007 × sign(∇_x J(θ, x, y)), classified "gibbon" with 99.3% confidence.]
Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet
(Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose
elements are equal to the sign of the elements of the gradient of the cost function with
respect to the input, we can change GoogLeNet’s classification of the image. Reproduced
with permission from Goodfellow et al. (2014b).
human observer cannot tell the difference between the original example and the adversarial example, but the network can make highly different predictions. See figure 7.8 for an example.
Adversarial examples have many implications, for example, in computer security,
that are beyond the scope of this chapter. They are interesting in the context of
regularization, however, because one can reduce the error rate on the original i.i.d.
test set via
adversarial training
—training on adversarially perturbed examples
from the training set (Szegedy et al., 2014b; Goodfellow et al., 2014b).
Goodfellow et al. (2014b) showed that one of the primary causes of these
adversarial examples is excessive linearity. Neural networks are built out of
primarily linear building blocks. In some experiments the overall function they
implement proves to be highly linear as a result. These linear functions are easy
to optimize. Unfortunately, the value of a linear function can change very rapidly if it has numerous inputs. If we change each input by ε, then a linear function with weights w can change by as much as ε‖w‖_1, which can be a very large amount if w is high dimensional. Adversarial training discourages this highly sensitive locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data. This can be seen as a way of explicitly introducing a local constancy prior into supervised neural nets.
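A minimal sketch added here (illustrative assumptions throughout; shown for logistic regression, where the input gradient has a simple closed form) of generating a perturbation in the fast-gradient-sign style of figure 7.8:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps=0.007):
    """Perturb x by eps * sign of the gradient of the logistic loss w.r.t. x."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w                 # gradient of -[y log p + (1-y) log(1-p)]
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.normal(size=10), 0.0
x, y = rng.normal(size=10), 1.0
x_adv = fgsm_perturb(x, y, w, b)
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))   # the prediction moves away from y
```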
Adversarial training helps to illustrate the power of using a large function
family in combination with aggressive regularization. Purely linear models, like
logistic regression, are not able to resist adversarial examples because they are
forced to be linear. Neural networks are able to represent functions that can range
from nearly linear to nearly locally constant and thus have the flexibility to capture
linear trends in the training data while still learning to resist local perturbation.
Adversarial examples also provide a means of accomplishing semi-supervised learning. At a point x that is not associated with a label in the dataset, the model itself assigns some label ŷ. The model's label ŷ may not be the true label, but if the model is high quality, then ŷ has a high probability of providing the true label. We can seek an adversarial example x′ that causes the classifier to output a label y′ with y′ ≠ ŷ. Adversarial examples generated using not the true label but a label provided by a trained model are called virtual adversarial examples (Miyato et al., 2015). The classifier may then be trained to assign the same label to x and x′. This encourages the classifier to learn a function that is robust to small changes anywhere along the manifold where the unlabeled data lie.
robust to small changes anywhere along the manifold where the unlabeled data lie.
The assumption motivating this approach is that different classes usually lie on
disconnected manifolds, and a small perturbation should not be able to jump from
one class manifold to another class manifold.
7.14 Tangent Distance, Tangent Prop and Manifold
Tangent Classifier
Many machine learning algorithms aim to overcome the curse of dimensionality
by assuming that the data lies near a low-dimensional manifold, as described in
section 5.11.3.
One of the early attempts to take advantage of the manifold hypothesis is the
tangent distance
algorithm (Simard et al., 1993, 1998). It is a nonparametric
nearest neighbor algorithm in which the metric used is not the generic Euclidean
distance but one that is derived from knowledge of the manifolds near which
probability concentrates. It is assumed that we are trying to classify examples, and
that examples on the same manifold share the same category. Since the classifier
should be invariant to the local factors of variation that correspond to movement
on the manifold, it would make sense to use as nearest neighbor distance between points x_1 and x_2 the distance between the manifolds M_1 and M_2 to which they respectively belong. Although that may be computationally difficult (it would require solving an optimization problem, to find the nearest pair of points on M_1 and M_2), a cheap alternative that makes sense locally is to approximate M_i by its tangent plane at x_i and measure the distance between the two tangents, or between a tangent plane and a point. That can be achieved by solving a low-dimensional linear system (in the dimension of the manifolds). Of course, this algorithm requires one to specify the tangent vectors.
In a related spirit, the tangent prop algorithm (Simard et al., 1992) (figure 7.9) trains a neural net classifier with an extra penalty to make each output f(x) of the neural net locally invariant to known factors of variation. These factors of variation correspond to movement along the manifold near which examples of the same class concentrate. Local invariance is achieved by requiring ∇_x f(x) to be orthogonal to the known manifold tangent vectors v^(i) at x, or equivalently that the directional derivative of f at x in the directions v^(i) be small by adding a regularization penalty Ω:

Ω(f) = Σ_i ( (∇_x f(x))^⊤ v^(i) )².    (7.67)
This regularizer can of course be scaled by an appropriate hyperparameter, and for
most neural networks, we would need to sum over many outputs rather than the lone
output
f
(
x
) described here for simplicity. As with the tangent distance algorithm,
the tangent vectors are derived a priori, usually from the formal knowledge of
the effect of transformations, such as translation, rotation, and scaling in images.
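A minimal sketch added here (not the book's code; a finite-difference estimate stands in for the analytic gradient) of the tangent prop penalty in equation 7.67:

```python
import numpy as np

def tangent_prop_penalty(f, x, tangents, delta=1e-4):
    """Sum of squared directional derivatives of f along the given tangent vectors."""
    penalty = 0.0
    for v in tangents:
        directional = (f(x + delta * v) - f(x - delta * v)) / (2 * delta)
        penalty += directional ** 2       # ((grad_x f(x))^T v)^2, estimated numerically
    return penalty

# Hypothetical usage with a scalar-output model f and one tangent direction.
f = lambda x: np.tanh(x @ np.array([0.5, -1.0]))
print(tangent_prop_penalty(f, np.array([1.0, 2.0]), [np.array([1.0, 1.0])]))
```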
[Figure 7.9: a plot in the (x_1, x_2) plane showing a tangent vector and a normal vector at a point on a class manifold.]
Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al.,
1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the
classifier output function
f
(
x
). Each curve represents the manifold for a different class,
illustrated here as a one-dimensional manifold embedded in a two-dimensional space.
On one curve, we have chosen a single point and drawn a vector that is tangent to the
class manifold (parallel to and touching the manifold) and a vector that is normal to the
class manifold (orthogonal to the manifold). In multiple dimensions there may be many
tangent directions and many normal directions. We expect the classification function to
change rapidly as it moves in the direction normal to the manifold, and not to change as
it moves along the class manifold. Both tangent propagation and the manifold tangent
classifier regularize
f
(
x
) to not change very much as
x
moves along the manifold. Tangent
propagation requires the user to manually specify functions that compute the tangent
directions (such as specifying that small translations of images remain in the same class
manifold), while the manifold tangent classifier estimates the manifold tangent directions
by training an autoencoder to fit the training data. The use of autoencoders to estimate
manifolds is described in chapter 14.
Tangent prop has been used not just for supervised learning (Simard et al., 1992)
but also in the context of reinforcement learning (Thrun, 1995).
Tangent propagation is closely related to dataset augmentation. In both
cases, the user of the algorithm encodes his or her prior knowledge of the task
by specifying a set of transformations that should not alter the output of the
network. The difference is that in the case of dataset augmentation, the network is
explicitly trained to correctly classify distinct inputs that were created by applying
more than an infinitesimal amount of these transformations. Tangent propagation
does not require explicitly visiting a new input point. Instead, it analytically
regularizes the model to resist perturbation in the directions corresponding to
the specified transformation. While this analytical approach is intellectually
elegant, it has two major drawbacks. First, it only regularizes the model to resist
infinitesimal perturbation. Explicit dataset augmentation confers resistance to
larger perturbations. Second, the infinitesimal approach poses difficulties for models
based on rectified linear units. These models can only shrink their derivatives
by turning units off or shrinking their weights. They are not able to shrink their
derivatives by saturating at a high value with large weights, as sigmoid or
tanh
units can. Dataset augmentation works well with rectified linear units because
different subsets of rectified units can activate for different transformed versions of
each original input.
Tangent propagation is also related to
double backprop
(Drucker and LeCun,
1992) and adversarial training (Szegedy et al., 2014b; Goodfellow et al., 2014b).
Double backprop regularizes the Jacobian to be small, while adversarial training
finds inputs near the original inputs and trains the model to produce the same
output on these as on the original inputs. Tangent propagation and dataset
augmentation using manually specified transformations both require that the
model be invariant to certain specified directions of change in the input. Double
backprop and adversarial training both require that the model should be invariant
to all directions of change in the input as long as the change is small. Just as dataset
augmentation is the non-infinitesimal version of tangent propagation, adversarial
training is the non-infinitesimal version of double backprop.
The manifold tangent classifier (Rifai et al., 2011c) eliminates the need to
know the tangent vectors a priori. As we will see in chapter 14, autoencoders can
estimate the manifold tangent vectors. The manifold tangent classifier makes use
of this technique to avoid needing user-specified tangent vectors. As illustrated
in figure 14.10, these estimated tangent vectors go beyond the classical invariants
that arise out of the geometry of images (such as translation, rotation, and scaling)
and include factors that must be learned because they are object-specific (such as
moving body parts). The algorithm proposed with the manifold tangent classifier
is therefore simple: (1) use an autoencoder to learn the manifold structure by
unsupervised learning, and (2) use these tangents to regularize a neural net classifier
as in tangent prop (equation 7.67).
In this chapter, we have described most of the general strategies used to
regularize neural networks. Regularization is a central theme of machine learning
and as such will be revisited periodically in most of the remaining chapters. Another
central theme of machine learning is optimization, described next.