Chapter 6
Deep Feedforward Networks
Deep feedforward networks, also called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.
These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, as presented in chapter 10.
Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For
example, the convolutional networks used for object recognition from photos are a
specialized kind of feedforward network. Feedforward networks are a conceptual
stepping stone on the path to recurrent networks, which power many natural
language applications.
Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model. The name “deep learning” arose from this terminology. The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x). The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data do not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*. Because the training data do not show the desired output for each of these layers, they are called hidden layers.
Finally, these networks are called neural because they are loosely inspired by
neuroscience. Each hidden layer of the network is typically vector valued. The
dimensionality of these hidden layers determines the width of the model. Each
element of the vector may be interpreted as playing a role analogous to a neuron.
Rather than thinking of the layer as representing a single vector-to-vector function,
we can also think of the layer as consisting of many units that act in parallel,
each representing a vector-to-scalar function. Each unit resembles a neuron in
the sense that it receives input from many other units and computes its own
activation value. The idea of using many layers of vector-valued representations
is drawn from neuroscience. The choice of the functions f^(i)(x) used to compute
these representations is also loosely guided by neuroscientific observations about
the functions that biological neurons compute. Modern neural network research,
however, is guided by many mathematical and engineering disciplines, and the
goal of neural networks is not to perfectly model the brain. It is best to think of
feedforward networks as function approximation machines that are designed to
achieve statistical generalization, occasionally drawing some insights from what we
know about the brain, rather than as models of brain function.
One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they can be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. Equivalently, we can apply the kernel trick described in section 5.7.2, to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.
1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.
2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. It requires decades of human effort for each separate task, with practitioners specializing in different domains, such as speech recognition or computer vision, and with little transfer between domains.
3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)⊤w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer (a short sketch follows this list). This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation. If we wish, this approach can capture the benefit of the first approach by being highly generic: we do so by using a very broad family φ(x; θ). Deep learning can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.
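To make the third option concrete, the following small sketch (an illustration under assumed shapes and names, not code from the text) builds y = f(x; θ, w) = φ(x; θ)⊤w with φ implemented as a single rectified hidden layer; the parameters θ = {W, c} define the representation, and w maps it to the output.

    import numpy as np

    # Illustrative sketch of a learned representation phi(x; theta); the shapes and names
    # are assumptions chosen for this example, not taken from the text.
    def phi(x, W, c):
        # phi(x; theta) with theta = {W, c}: an affine map followed by a fixed nonlinearity
        return np.maximum(0, x @ W + c)

    def f(x, W, c, w, b):
        # y = f(x; theta, w) = phi(x; theta)^T w + b
        return phi(x, W, c) @ w + b

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 3))                   # five examples with three input features
    W, c = rng.normal(size=(3, 8)), np.zeros(8)   # parameters theta of the representation
    w, b = rng.normal(size=8), 0.0                # parameters of the linear output layer
    print(f(x, W, c, w, b).shape)                 # (5,)

Training would adjust θ and w jointly by gradient descent; the point here is only the factorization of the model into a learned φ and a linear map on top of it.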
This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic mappings from x to y that lack feedback connections. Other models, presented later, apply these principles to learning stochastic mappings, functions with feedback, and probability distributions over a single vector.
We begin this chapter with a simple example of a feedforward network. Next,
we address each of the design decisions needed to deploy a feedforward network.
First, training a feedforward network requires making many of the same design
decisions as are necessary for a linear model: choosing the optimizer, the cost
function, and the form of the output units. We review these basics of gradient-based
learning, then proceed to confront some of the design decisions that are unique
to feedforward networks. Feedforward networks have introduced the concept of a
hidden layer, and this requires us to choose the activation functions that will
be used to compute the hidden layer values. We must also design the architecture
of the network, including how many layers the network should contain, how these
layers should be connected to each other, and how many units should be in
each layer. Learning in deep neural networks requires computing the gradients
of complicated functions. We present the back-propagation algorithm and its
modern generalizations, which can be used to efficiently compute these gradients.
Finally, we close with some historical perspective.
6.1 Example: Learning XOR
To make the idea of a feedforward network more concrete, we begin with an
example of a fully functioning feedforward network on a very simple task: learning
the XOR function.
The XOR function (“exclusive or”) is an operation on two binary values, x_1 and x_2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ), and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.
In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, and [1, 1]⊤}. We will train the network on all four of these points. The only challenge is to fit the training set.
We can treat this problem as a regression problem and use a mean squared
error loss function. We have chosen this loss function to simplify the math for
this example as much as possible. In practical applications, MSE is usually not an
appropriate cost function for modeling binary data. More appropriate approaches
are described in section 6.2.2.2.
Evaluated on our whole training set, the MSE loss function is

J(θ) = (1/4) ∑_{x∈X} (f*(x) − f(x; θ))².   (6.1)
Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

f(x; w, b) = x⊤w + b.   (6.2)
We can minimize J(θ) in closed form with respect to w and b using the normal equations.
After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? Figure 6.1 shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.
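The closed-form result quoted above can be checked numerically. The sketch below (illustrative, not from the text) fits the linear model f(x; w, b) = x⊤w + b to the four XOR points by least squares, which solves the normal equations:

    import numpy as np

    # Fit f(x; w, b) = x^T w + b to XOR by minimizing MSE in closed form.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([0., 1., 1., 0.])

    X_aug = np.hstack([X, np.ones((4, 1))])             # append a column of ones for the bias b
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)   # least-squares / normal-equations solution
    w, b = theta[:2], theta[2]
    print(w, b)   # w is approximately [0, 0] and b is 0.5: the model outputs 0.5 everywhere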
Specifically, we will introduce a simple feedforward network with one hidden layer containing two hidden units. See figure 6.2 for an illustration of this model. This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together, h = f^(1)(x; W, c) and y = f^(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).
What function should f^(1) compute? Linear models have served us well so far, and it may be tempting to make f^(1) linear as well. Unfortunately, if f^(1) were linear, then the feedforward network as a whole would remain a linear function of its input. Ignoring the intercept terms for the moment, suppose f^(1)(x) = W⊤x and f^(2)(h) = h⊤w. Then f(x) = x⊤W w. We could represent this function as f(x) = x⊤w′ where w′ = W w.
Clearly, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. We use that strategy here, by defining h = g(W⊤x + c), where W provides the weights of a linear transformation and c the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an
[Figure 6.1 consists of two panels: the original x space, with axes x_1 and x_2, and the learned h space, with axes h_1 and h_2.]
Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x_1 = 0, the model’s output must increase as x_2 increases. When x_1 = 1, the model’s output must decrease as x_2 increases. A linear model must apply a fixed coefficient w_2 to x_2. The linear model therefore cannot use the value of x_1 to change the coefficient on x_2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]⊤ and x = [0, 1]⊤ to a single point in feature space, h = [1, 0]⊤. The linear model can now describe the function as increasing in h_1 and decreasing in h_2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
[Figure 6.2 shows the same network drawn in two styles: on the left, one node per unit (x_1, x_2, h_1, h_2, y); on the right, one node per layer (x, h, y), with edges labeled W and w.]
Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is explicit and unambiguous, but for networks larger than this example, it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer’s activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.
[Figure 6.3 plots the rectified linear activation g(z) = max{0, z} as a function of z.]
Figure 6.3: The rectified linear activation function. This activation function is the default
activation function recommended for use with most feedforward neural networks. Applying
this function to the output of a linear transformation yields a nonlinear transformation.
The function remains very close to linear, however, in the sense that it is a piecewise
linear function with two linear pieces. Because rectified linear units are nearly linear,
they preserve many of the properties that make linear models easy to optimize with
gradient-based methods. They also preserve many of the properties that make linear
models generalize well. A common principle throughout computer science is that we
can build complicated systems from minimal components. Much as a Turing machine’s
memory needs only to be able to store 0 or 1 states, we can build a universal function
approximator from rectified linear functions.
affine transformation from an input vector to an output scalar. Now, we describe
an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x⊤W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in figure 6.3.
We can now specify our complete network as

f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.   (6.3)
We can then specify a solution to the XOR problem. Let

W = [ 1  1
      1  1 ],   (6.4)

c = [  0
      −1 ],   (6.5)

w = [  1
      −2 ],   (6.6)

and b = 0.
We can now walk through how the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

X = [ 0  0
      0  1
      1  0
      1  1 ].   (6.7)
The first step in the neural network is to multiply the input matrix by the first layer’s weight matrix:

XW = [ 0  0
       1  1
       1  1
       2  2 ].   (6.8)
Next, we add the bias vector c, to obtain

[ 0  −1
  1   0
  1   0
  2   1 ].   (6.9)
In this space, all the examples lie along a line with slope 1. As we move along this
line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A
linear model cannot implement such a function. To finish computing the value of
h for each example, we apply the rectified linear transformation:
[ 0  0
  1  0
  1  0
  2  1 ].   (6.10)
This transformation has changed the relationship between the examples. They no
longer lie on a single line. As shown in figure 6.1, they now lie in a space where a
linear model can solve the problem.
We finish by multiplying by the weight vector w:

[ 0
  1
  1
  0 ].   (6.11)
The neural network has obtained the correct answer for every example in the batch.
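The batch computation above can be reproduced in a few lines. The following sketch uses the parameter values from equations 6.4 to 6.6; the array layout is the only assumption:

    import numpy as np

    # Forward pass of the XOR network with the parameters of equations 6.4-6.6.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # design matrix, equation 6.7
    W = np.array([[1., 1.], [1., 1.]])
    c = np.array([0., -1.])
    w = np.array([1., -2.])
    b = 0.

    H = np.maximum(0, X @ W + c)   # hidden activations, equation 6.10
    y_hat = H @ w + b              # output layer
    print(y_hat)                   # [0. 1. 1. 0.], the XOR of each row of X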
In this example, we simply specified the solution, then showed that it obtained
zero error. In a real situation, there might be billions of model parameters and
billions of training examples, so one cannot simply guess the solution as we did
here. Instead, a gradient-based optimization algorithm can find parameters that
produce very little error. The solution we described to the XOR problem is at a
global minimum of the loss function, so gradient descent could converge to this
point. There are other equivalent solutions to the XOR problem that gradient
descent could also find. The convergence point of gradient descent depends on the
initial values of the parameters. In practice, gradient descent would usually not
find clean, easily understood, integer-valued solutions like the one we presented
here.
6.2 Gradient-Based Learning
Designing and training a neural network is not much different from training any
other machine learning model with gradient descent. In section 5.10, we described
how to build a machine learning algorithm by specifying an optimization procedure,
a cost function, and a model family.
The largest difference between the linear models we have seen so far and neural
networks is that the nonlinearity of a neural network causes most interesting loss
functions to become nonconvex. This means that neural networks are usually
trained by using iterative, gradient-based optimizers that merely drive the cost
function to a very low value, rather than the linear equation solvers used to train
linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. Convex optimization
converges starting from any initial parameters (in theory—in practice it is robust
but can encounter numerical problems). Stochastic gradient descent applied to
nonconvex loss functions has no such convergence guarantee and is sensitive to the
values of the initial parameters. For feedforward neural networks, it is important to
initialize all weights to small random values. The biases may be initialized to zero
or to small positive values. The iterative gradient-based optimization algorithms
used to train feedforward networks and almost all other deep models are described
in detail in chapter 8, with parameter initialization in particular discussed in
section 8.4. For the moment, it suffices to understand that the training algorithm
is almost always based on using the gradient to descend the cost function in one
way or another. The specific algorithms are improvements and refinements on
the ideas of gradient descent, introduced in section 4.3, and, more specifically, are
most often improvements of the stochastic gradient descent algorithm, introduced
in section 5.9.
We can of course train models such as linear regression and support vector
machines with gradient descent too, and in fact this is common when the training
set is extremely large. From this point of view, training a neural network is not
much different from training any other model. Computing the gradient is slightly
more complicated for a neural network but can still be done efficiently and exactly.
In section 6.5 we describe how to obtain the gradient using the back-propagation
algorithm and modern generalizations of the back-propagation algorithm.
As with other machine learning models, to apply gradient-based learning we
must choose a cost function, and we must choose how to represent the output of
the model. We now revisit these design considerations with special emphasis on
the neural networks scenario.
6.2.1 Cost Functions
An important aspect of the design of a deep neural network is the choice of the
cost function. Fortunately, the cost functions for neural networks are more or less
the same as those for other parametric models, such as linear models.
In most cases, our parametric model defines a distribution p(y | x; θ) and
we simply use the principle of maximum likelihood. This means we use the
cross-entropy between the training data and the model’s predictions as the cost
function.
Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions enable us to train a predictor of these estimates.
The total cost function used to train a neural network will often combine one
of the primary cost functions described here with a regularization term. We have
already seen some simple examples of regularization applied to linear models in
section 5.2.2. The weight decay approach used for linear models is also directly
applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks are
described in chapter 7.
6.2.1.1 Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood. This means
that the cost function is simply the negative log-likelihood, equivalently described
as the cross-entropy between the training data and the model distribution. This
cost function is given by
J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).   (6.12)
The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,
J(θ) = (1/2) E_{x,y∼p̂_data} ‖y − f(x; θ)‖² + const,   (6.13)
up to a scaling factor of 1/2 and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of
mean squared error holds for a linear model, but in fact, the equivalence holds
regardless of the f(x; θ) used to predict the mean of the Gaussian.
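A small numerical check of this equivalence (an illustrative sketch, not from the text): for p_model(y | x) = N(y; f(x; θ), I), the negative log-likelihood of one example differs from half the squared error only by a constant that does not depend on the prediction.

    import numpy as np

    def gaussian_nll(y, mean):
        # -log N(y; mean, I) for a d-dimensional y
        d = y.shape[-1]
        return 0.5 * np.sum((y - mean) ** 2) + 0.5 * d * np.log(2 * np.pi)

    y = np.array([0.3, -1.2])
    mean = np.array([0.1, -0.9])                 # stands in for f(x; theta)
    half_mse = 0.5 * np.sum((y - mean) ** 2)
    print(gaussian_nll(y, mean) - half_mse)      # 0.5 * d * log(2*pi), independent of the mean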
An advantage of this approach of deriving the cost function from maximum
likelihood is that it removes the burden of designing cost functions for each model.
Specifying a model p(y | x) automatically determines a cost function −log p(y | x).
One recurring theme throughout neural network design is that the gradient of
the cost function must be large and predictable enough to serve as a good guide
for the learning algorithm. Functions that saturate (become very flat) undermine
this objective because they make the gradient become very small. In many cases
this happens because the activation functions used to produce the output of the
hidden units or the output units saturate. The negative log-likelihood helps to
avoid this problem for many models. Several output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units. We will
discuss the interaction between the cost function and the choice of output unit in
section 6.2.2.
One unusual property of the cross-entropy cost used to perform maximum
likelihood estimation is that it usually does not have a minimum value when applied
to the models commonly used in practice. For discrete output variables, most
models are parametrized in such a way that they cannot represent a probability
of zero or one, but can come arbitrarily close to doing so. Logistic regression
is an example of such a model. For real-valued output variables, if the model
can control the density of the output distribution (for example, by learning the
variance parameter of a Gaussian output distribution) then it becomes possible
to assign extremely high density to the correct training set outputs, resulting in
cross-entropy approaching negative infinity. Regularization techniques described
in chapter 7 provide several different ways of modifying the learning problem so
that the model cannot reap unlimited reward in this way.
6.2.1.2 Learning Conditional Statistics
Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x.
For example, we may have a predictor f(x; θ) that we wish to employ to predict the mean of y.
If we use a sufficiently powerful neural network, we can think of the neural
network as being able to represent any function
f
from a wide class of functions,
with this class being limited only by features such as continuity and boundedness
rather than by having a specific parametric form. From this point of view, we
can view the cost function as being a functional rather than just a function. A
functional is a mapping from functions to real numbers. We can thus think of
learning as choosing a function rather than merely choosing a set of parameters.
We can design our cost functional to have its minimum occur at some specific
function we desire. For example, we can design the cost functional to have its
minimum lie on the function that maps x to the expected value of y given x.
Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations, described in section 19.4.2. It is not necessary
to understand calculus of variations to understand the content of this chapter. At
the moment, it is only necessary to understand that calculus of variations may be
used to derive the following two results.
Our first result derived using calculus of variations is that solving the optimization problem

f* = argmin_f E_{x,y∼p_data} ‖y − f(x)‖²   (6.14)

yields

f*(x) = E_{y∼p_data(y|x)} [y],   (6.15)

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data generating distribution, minimizing the mean squared error cost function would give a function that predicts the mean of y for each value of x.
Different cost functions give different statistics. A second result derived using calculus of variations is that

f* = argmin_f E_{x,y∼p_data} ‖y − f(x)‖_1   (6.16)

yields a function that predicts the median value of y for each x, as long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.
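These two results can be illustrated empirically. The sketch below (illustrative assumptions: a skewed distribution and a grid of constant predictors) shows that the constant minimizing squared error sits near the sample mean while the constant minimizing absolute error sits near the sample median:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.exponential(size=100_000)            # skewed, so the mean and median differ

    candidates = np.linspace(0.0, 3.0, 601)      # constant predictions f(x) = c
    mse = [np.mean((y - c) ** 2) for c in candidates]
    mae = [np.mean(np.abs(y - c)) for c in candidates]

    print(candidates[np.argmin(mse)], np.mean(y))     # the MSE minimizer tracks the mean
    print(candidates[np.argmin(mae)], np.median(y))   # the MAE minimizer tracks the median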
Unfortunately, mean squared error and mean absolute error often lead to poor
results when used with gradient-based optimization. Some output units that
saturate produce very small gradients when combined with these cost functions.
This is one reason that the cross-entropy cost function is more popular than mean
squared error or mean absolute error, even when it is not necessary to estimate an
entire distribution p(y | x).
6.2.2 Output Units
The choice of cost function is tightly coupled with the choice of output unit. Most
of the time, we simply use the cross-entropy between the data distribution and the
model distribution. The choice of how to represent the output then determines
the form of the cross-entropy function.
Any kind of neural network unit that may be used as an output can also be
used as a hidden unit. Here, we focus on the use of these units as outputs of the
model, but in principle they can be used internally as well. We revisit these units
with additional detail about their use as hidden units in section 6.3.
Throughout this section, we suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then
to provide some additional transformation from the features to complete the task
that the network must perform.
6.2.2.1 Linear Units for Gaussian Output Distributions
One simple kind of output unit is based on an affine transformation with no
nonlinearity. These are often just called linear units.
Given features h, a layer of linear output units produces a vector ŷ = W⊤h + b.
Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

p(y | x) = N(y; ŷ, I).   (6.17)

Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.
The maximum likelihood framework makes it straightforward to learn the
covariance of the Gaussian too, or to make the covariance of the Gaussian be a
function of the input. However, the covariance must be constrained to be a positive
definite matrix for all inputs. It is difficult to satisfy such constraints with a linear
output layer, so typically other output units are used to parametrize the covariance.
Approaches to modeling the covariance are described shortly, in section 6.2.2.4.
Because linear units do not saturate, they pose little difficulty for gradient-
based optimization algorithms and may be used with a wide variety of optimization
algorithms.
6.2.2.2 Sigmoid Units for Bernoulli Output Distributions
Many tasks require predicting the value of a binary variable y. Classification
problems with two classes can be cast in this form.
The maximum likelihood approach is to define a Bernoulli distribution over y conditioned on x.
A Bernoulli distribution is defined by just a single number. The neural net
needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].
Satisfying this constraint requires some careful design effort. Suppose we were
to use a linear unit and threshold its value to obtain a valid probability:
P(y = 1 | x) = max{0, min{1, w⊤h + b}}.   (6.18)
This would indeed define a valid conditional distribution, but we would not be able
to train it very effectively with gradient descent. Any time that w⊤h + b strayed
outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0 is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters.
Instead, it is better to use a different approach that ensures there is always a
strong gradient whenever the model has the wrong answer. This approach is based
on using sigmoid output units combined with maximum likelihood.
A sigmoid output unit is defined by

ŷ = σ(w⊤h + b),   (6.19)

where σ is the logistic sigmoid function described in section 3.10.
We can think of the sigmoid output unit as having two components. First, it
uses a linear layer to compute z = w⊤h + b. Next, it uses the sigmoid activation function to convert z into a probability.
We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:
log P̃(y) = yz,   (6.20)
P̃(y) = exp(yz),   (6.21)
P(y) = exp(yz) / ∑_{y′=0}^{1} exp(y′z),   (6.22)
P(y) = σ((2y − 1)z).   (6.23)
Probability distributions based on exponentiation and normalization are common
throughout the statistical modeling literature. The z variable defining such a distribution over binary variables is called a logit.
This approach to predicting the probabilities in log space is natural to use
with maximum likelihood learning. Because the cost function used with maximum
likelihood is −log P(y | x), the log in the cost function undoes the exp of the
sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-
based learning from making good progress. The loss function for maximum
likelihood learning of a Bernoulli parametrized by a sigmoid is

J(θ) = −log P(y | x)   (6.24)
     = −log σ((2y − 1)z)   (6.25)
     = ζ((1 − 2y)z).   (6.26)
This derivation makes use of some properties from section 3.10. By rewriting the loss in terms of the softplus function, we can see that it saturates only when (1 − 2y)z is very negative. Saturation thus occurs only when the model already has the right answer: when y = 1 and z is very positive, or y = 0 and z is very negative. When z has the wrong sign, the argument to the softplus function, (1 − 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|. The derivative with respect to z asymptotes to sign(z), so, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is useful because it means that gradient-based learning can act to quickly correct a mistaken z.
When we use other loss functions, such as mean squared error, the loss can saturate anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive.
The gradient can shrink too small to be useful for learning when this happens,
whether the model has the correct answer or the incorrect answer. For this reason,
maximum likelihood is almost always the preferred approach to training sigmoid
output units.
Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z). If the sigmoid function underflows to zero, then taking the logarithm of ŷ yields negative infinity.
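A minimal sketch of writing the loss as a function of z rather than of ŷ = σ(z) (illustrative; the helper names are assumptions). The softplus form −log P(y | x) = ζ((1 − 2y)z) stays finite even when a naive implementation of the sigmoid saturates:

    import numpy as np

    def softplus(a):
        # zeta(a) = log(1 + exp(a)), computed without overflowing for large a
        return np.maximum(a, 0) + np.log1p(np.exp(-np.abs(a)))

    def bernoulli_nll_from_logit(y, z):
        # -log P(y | x) = zeta((1 - 2y) z), with y in {0, 1} and z = w^T h + b
        return softplus((1 - 2 * y) * z)

    def naive_nll(y, z):
        # the unstable version: compute sigma(z) first, then take its logarithm
        p = 1 / (1 + np.exp(-z))
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    print(bernoulli_nll_from_logit(1, -800.0))   # about 800: large but finite and informative
    print(naive_nll(1, -800.0))                  # inf, with warnings: the sigmoid evaluated to 0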
6.2.2.3 Softmax Units for Multinoulli Output Distributions
Any time we wish to represent a probability distribution over a discrete variable
with n possible values, we may use the softmax function. This can be seen as a
generalization of the sigmoid function, which was used to represent a probability
distribution over a binary variable.
Softmax functions are most often used as the output of a classifier, to represent
the probability distribution over n different classes. More rarely, softmax functions
can be used inside the model itself, if we wish the model to choose between one of
n different options for some internal variable.
In the case of binary variables, we wished to produce a single number

ŷ = P(y = 1 | x).   (6.27)

Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number z = log P̃(y = 1 | x). Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid function.
To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷ_i = P(y = i | x). We require not only that each element ŷ_i be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear layer predicts unnormalized log probabilities:

z = W⊤h + b,   (6.28)
where z_i = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

softmax(z)_i = exp(z_i) / ∑_j exp(z_j).   (6.29)
As with the logistic sigmoid, the use of the exp function works well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize log P(y = i; z) = log softmax(z)_i. Defining the softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

log softmax(z)_i = z_i − log ∑_j exp(z_j).   (6.30)
The first term of equation 6.30 shows that the input z_i always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of z_i to the second term of equation 6.30 becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down. To gain some intuition for the second term, log ∑_j exp(z_j), observe
that this term can be roughly approximated by max_j z_j. This approximation is based on the idea that exp(z_k) is insignificant for any z_k that is noticeably less than max_j z_j. The intuition we can gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the −z_i term and the log ∑_j exp(z_j) ≈ max_j z_j = z_i terms will roughly cancel. This example will then contribute little to the overall training cost, which will be dominated by other examples that are not yet correctly classified.
So far we have discussed only a single example. Overall, unregularized maximum likelihood will drive the model to learn parameters that drive the softmax to predict the fraction of counts of each outcome observed in the training set:

softmax(z(x; θ))_i ≈ ∑_{j=1}^{m} 1_{y^(j)=i, x^(j)=x} / ∑_{j=1}^{m} 1_{x^(j)=x}.   (6.31)
Because maximum likelihood is a consistent estimator, this is guaranteed to happen
as long as the model family is capable of representing the training distribution. In
practice, limited model capacity and imperfect optimization will mean that the
model is only able to approximate these fractions.
Many objective functions other than the log-likelihood do not work as well
with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes
very negative, causing the gradient to vanish. In particular, squared error is a poor
loss function for softmax units and can fail to train the model to change its output,
even when the model makes highly confident incorrect predictions (Bridle, 1990).
To understand why these other loss functions can fail, we need to examine the
softmax function itself.
Like the sigmoid, the softmax activation can saturate. The sigmoid function
has a single output that saturates when its input is extremely negative or extremely
positive. The softmax has multiple output values. These output values can saturate
when the differences between input values become extreme. When the softmax
saturates, many cost functions based on the softmax also saturate, unless they are
able to invert the saturating activation function.
To see that the softmax function responds to the difference between its inputs,
observe that the softmax output is invariant to adding the same scalar to all its
inputs:
softmax(z) = softmax(z + c). (6.32)
Using this property, we can derive a numerically stable variant of the softmax:

softmax(z) = softmax(z − max_i z_i).   (6.33)
The reformulated version enables us to evaluate softmax with only small numerical errors, even when z contains extremely large or extremely negative numbers. Examining the numerically stable variant, we see that the softmax function is driven by the amount that its arguments deviate from max_i z_i.
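Equation 6.33 translates directly into code. The sketch below (illustrative) compares the stabilized softmax with a naive implementation that overflows for large arguments; the same shift also gives a stable log softmax, matching equation 6.30:

    import numpy as np

    def softmax(z):
        shifted = z - np.max(z)                  # softmax(z) = softmax(z - max_i z_i)
        e = np.exp(shifted)                      # every exponent is now <= 0
        return e / np.sum(e)

    def log_softmax(z):
        shifted = z - np.max(z)                  # equation 6.30 evaluated on the shifted argument
        return shifted - np.log(np.sum(np.exp(shifted)))

    z = np.array([1000.0, 1001.0, 1002.0])
    print(softmax(z))                            # about [0.09, 0.24, 0.67]
    print(np.exp(z) / np.sum(np.exp(z)))         # naive version: nan, because exp(1000) overflows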
An output softmax(z)_i saturates to 1 when the corresponding input is maximal (z_i = max_i z_i) and z_i is much greater than all the other inputs. The output softmax(z)_i can also saturate to 0 when z_i is not maximal and the maximum is much greater. This is a generalization of the way that sigmoid units saturate and can cause similar difficulties for learning if the loss function is not designed to compensate for it.
The argument z to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = W⊤h + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n − 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 | x) = σ(z) is equivalent to defining P(y = 1 | x) = softmax(z)_1 with a two-dimensional z and z_1 = 0. Both the n − 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.
From a neuroscientific point of view, it is interesting to think of the softmax as
a way to create a form of competition between the units that participate in it: the
softmax outputs always sum to 1 so an increase in the value of one unit necessarily
corresponds to a decrease in the value of others. This is analogous to the lateral
inhibition that is believed to exist between nearby neurons in the cortex. At the
extreme (when the difference between the maximal a_i and the others is large in magnitude) it becomes a form of winner-take-all (one of the outputs is nearly 1,
and the others are nearly 0).
The name “softmax” can be somewhat confusing. The function is more closely related to the argmax function than the max function. The term “soft” derives from the fact that the softmax function is continuous and differentiable. The argmax function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a “softened” version of the argmax. The corresponding soft version of the maximum function is softmax(z)⊤z.
It would perhaps be better to call the softmax function “softargmax,” but the
current name is an entrenched convention.
6.2.2.4 Other Output Types
The linear, sigmoid, and softmax output units described above are the most
common. Neural networks can generalize to almost any kind of output layer that
we wish. The principle of maximum likelihood provides a guide for how to design
a good cost function for nearly any kind of output layer.
In general, if we define a conditional distribution p(y | x; θ), the principle of maximum likelihood suggests we use −log p(y | x; θ) as our cost function.
In general, we can think of the neural network as representing a function f(x; θ). The outputs of this function are not direct predictions of the value y. Instead, f(x; θ) = ω provides the parameters for a distribution over y. Our loss function can then be interpreted as −log p(y; ω(x)).
For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance σ² is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value. A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution p(y | x) that is controlled by ω = f(x; θ). The negative log-likelihood −log p(y; ω(x)) will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance. In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into ω. This new parameter might be σ itself, or could be a parameter v representing σ², or it could be a parameter β representing 1/σ², depending on how we choose to parametrize
the distribution. We may wish our model to predict a different amount of variance
in y for different values of x. This is called a heteroscedastic model. In the heteroscedastic case, we simply make the specification of the variance be one of the values output by f(x; θ). A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance, as described in equation 3.22. In the multivariate case, it is most common to use a diagonal precision matrix
diag(β). (6.34)
This formulation works well with gradient descent because the formula for the
log-likelihood of the Gaussian distribution parametrized by β involves only multiplication by β_i and addition of log β_i. The gradient of multiplication, addition,
and logarithm operations is well behaved. By comparison, if we parametrized the
output in terms of variance, we would need to use division. The division function
becomes arbitrarily steep near zero. While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the output in
terms of standard deviation, the log-likelihood would still involve division as well
as squaring. The gradient through the squaring operation can vanish near zero,
making it difficult to learn parameters that are squared. Regardless of whether we
use standard deviation, variance, or precision, we must ensure that the covariance
matrix of the Gaussian is positive definite. Because the eigenvalues of the precision
matrix are the reciprocals of the eigenvalues of the covariance matrix, this is
equivalent to ensuring that the precision matrix is positive definite. If we use a
diagonal matrix, or a scalar times the diagonal matrix, then the only condition
we need to enforce on the output of the model is positivity. If we suppose that a
is the raw activation of the model used to determine the diagonal precision, we
can use the softplus function to obtain a positive precision vector: β = ζ(a). This
same strategy applies equally if using variance or standard deviation rather than
precision or if using a scalar times identity rather than diagonal matrix.
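As a concrete illustration of this parametrization (a sketch under assumed shapes and helper names, not code from the text), suppose the network emits one head for the mean and one raw activation a for the diagonal precision. Applying the softplus ζ gives β > 0, and the resulting negative log-likelihood involves only multiplication by β_i and addition of log β_i:

    import numpy as np

    def softplus(a):
        # zeta(a) = log(1 + exp(a)) > 0, computed stably
        return np.maximum(a, 0) + np.log1p(np.exp(-np.abs(a)))

    def gaussian_nll_with_precision(y, mean, raw_precision):
        beta = softplus(raw_precision)               # diagonal precision beta = zeta(a)
        # -log N(y; mean, diag(beta)^-1), including the usual constant
        return 0.5 * np.sum(beta * (y - mean) ** 2 - np.log(beta) + np.log(2 * np.pi))

    y = np.array([0.5, -1.0])
    mean = np.array([0.3, -0.8])            # one output head of f(x; theta)
    raw_precision = np.array([0.2, 1.5])    # another head; any real value is allowed here
    print(gaussian_nll_with_precision(y, mean, raw_precision))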
It is rare to learn a covariance or precision matrix with richer structure than
diagonal. If the covariance is full and conditional, then a parametrization must
be chosen that guarantees positive definiteness of the predicted covariance matrix.
This can be achieved by writing Σ(x) = B(x)B⊤(x), where B is an unconstrained square matrix. One practical issue if the matrix is full rank is that computing the likelihood is expensive, with a d × d matrix requiring O(d³) computation for the determinant and inverse of Σ(x) (or equivalently, and more commonly done, its eigendecomposition or that of B(x)).
We often want to perform multimodal regression, that is, to predict real values from a conditional distribution p(y | x) that can have several different peaks in y space for the same value of x. In this case, a Gaussian mixture is a natural representation for the output (Jacobs et al., 1991; Bishop, 1994). Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with n components is defined by the conditional probability distribution

p(y | x) = ∑_{i=1}^{n} p(c = i | x) N(y; μ^(i)(x), Σ^(i)(x)).   (6.35)
The neural network must have three outputs: a vector defining p(c = i | x), a matrix providing μ^(i)(x) for all i, and a tensor providing Σ^(i)(x) for all i. These outputs must satisfy different constraints:
1. Mixture components p(c = i | x): these form a multinoulli distribution over the n different components associated with latent variable¹ c, and can typically be obtained by a softmax over an n-dimensional vector, to guarantee that these outputs are positive and sum to 1.
2. Means μ^(i)(x): these indicate the center or mean associated with the i-th Gaussian component and are unconstrained (typically with no nonlinearity at all for these output units). If y is a d-vector, then the network must output an n × d matrix containing all n of these d-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation. The expression for the negative log-likelihood naturally weights each example’s contribution to the loss for each component by the probability that the component produced the example.
3. Covariances Σ^(i)(x): these specify the covariance matrix for each component i. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model (a short sketch follows this list).
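A small sketch of the corresponding loss (illustrative; the head shapes, diagonal covariances, and helper names are assumptions, not from the text). The mixture weights come from a softmax over an n-dimensional vector, and the log-sum-exp over components is exactly where maximum likelihood assigns partial responsibility to each component:

    import numpy as np

    def log_softmax(z):
        z = z - np.max(z)
        return z - np.log(np.sum(np.exp(z)))

    def mdn_nll(y, logits, means, log_vars):
        # y: (d,)  logits: (n,)  means: (n, d)  log_vars: (n, d) -- diagonal covariances
        log_weights = log_softmax(logits)            # log p(c = i | x)
        log_gauss = -0.5 * np.sum(
            (y - means) ** 2 / np.exp(log_vars) + log_vars + np.log(2 * np.pi), axis=1)
        joint = log_weights + log_gauss              # log [ p(c = i | x) N(y; mu_i, Sigma_i) ]
        m = np.max(joint)
        return -(m + np.log(np.sum(np.exp(joint - m))))   # stable negative log of the sum over i

    y = np.array([0.2, -0.4])
    logits, means, log_vars = np.array([0.1, -0.3, 0.5]), np.zeros((3, 2)), np.zeros((3, 2))
    print(mdn_nll(y, logits, means, log_vars))       # n = 3 components, d = 2 output dimensions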
It has been reported that gradient-based optimization of conditional Gaussian
mixtures (on the output of neural networks) can be unreliable, in part because one
gets divisions (by the variance) which can be numerically unstable (when some
variance gets to be small for a particular example, yielding very large gradients).
One solution is to clip gradients (see section 10.11.1), while another is to scale
the gradients heuristically (Uria et al., 2014).
Gaussian mixture outputs are particularly effective in generative models of
speech (Schuster, 1999) and movements of physical objects (Graves, 2013). The
mixture density strategy gives a way for the network to represent multiple output
modes and to control the variance of its output, which is crucial for obtaining
¹ We consider c to be latent because we do not observe it in the data: given input x and target y, it is not possible to know with certainty which Gaussian component was responsible for y, but we can imagine that y was generated by picking one of them, and we can make that unobserved choice a random variable.
Figure 6.4: Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution, and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.
a high degree of quality in these real-valued domains. An example of a mixture
density network is shown in figure 6.4.
In general, we may wish to continue to model larger vectors y containing more variables, and to impose richer and richer structures on these output variables. For example, if we want our neural network to output a sequence of characters that forms a sentence, we might continue to use the principle of maximum likelihood applied to our model p(y; ω(x)). In this case, the model we use to describe y would become complex enough to be beyond the scope of this chapter. In chapter 10 we describe how to use recurrent neural networks to define such models over sequences, and in part III we describe advanced techniques for modeling arbitrary probability distributions.
6.3 Hidden Units
So far we have focused our discussion on design choices for neural networks that
are common to most parametric machine learning models trained with gradient-
based optimization. Now we turn to an issue that is unique to feedforward neural
networks: how to choose the type of hidden unit to use in the hidden layers of the
model.
The design of hidden units is an extremely active area of research and does not
yet have many definitive guiding theoretical principles.
Rectified linear units are an excellent default choice of hidden unit. Many other
types of hidden units are available. It can be difficult to determine when to use
which kind (though rectified linear units are usually an acceptable choice). We
describe here some of the basic intuitions motivating each type of hidden unit.
These intuitions can help decide when to try out which unit. Predicting in advance
which will work best is usually impossible. The design process consists of trial
and error, intuiting that a kind of hidden unit may work well, and then training
a network with that kind of hidden unit and evaluating its performance on a
validation set.
Some of the hidden units included in this list are not actually differentiable
at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with
a gradient-based learning algorithm. In practice, gradient descent still performs
well enough for these models to be used for machine learning tasks. This is in
part because neural network training algorithms do not usually arrive at a local
minimum of the cost function, but instead merely reduce its value significantly, as
shown in figure 4.3. (These ideas are described further in chapter 8.) Because we do
not expect training to actually reach a point where the gradient is 0, it is acceptable
for the minima of the cost function to correspond to points with undefined gradient.
Hidden units that are not differentiable are usually nondifferentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other. The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives. In the case of g(z) = max{0, z}, the left derivative at z = 0 is 0, and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value ε that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but
these usually do not apply to neural network training. The important point is
that in practice one can safely disregard the nondifferentiability of the hidden unit
activation functions described below.
Unless indicated otherwise, most hidden units can be described as accepting a
vector of inputs x, computing an affine transformation z = W⊤x + b, and then
applying an element-wise nonlinear function g(z). Most hidden units are
distinguished from each other only by the choice of the form of the activation
function g(z).
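To make this concrete, the following minimal sketch (illustrative Python/NumPy,
with hypothetical names and shapes) computes a hidden layer as an affine
transformation followed by an element-wise nonlinearity:

import numpy as np

def hidden_layer(x, W, b, g):
    # Affine transformation z = W^T x + b, then element-wise nonlinearity g.
    z = W.T @ x + b
    return g(z)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # 3 inputs, 4 hidden units
b = np.zeros(4)
x = rng.standard_normal(3)
h = hidden_layer(x, W, b, lambda z: np.maximum(0.0, z))   # rectified linear units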
6.3.1 Rectified Linear Units and Their Generalizations
Rectified linear units use the activation function g(z) = max{0, z}.
These units are easy to optimize because they are so similar to linear units. The
only difference between a linear unit and a rectified linear unit is that a rectified
linear unit outputs zero across half its domain. This makes the derivatives through
a rectified linear unit remain large whenever the unit is active. The gradients
are not only large but also consistent. The second derivative of the rectifying
operation is 0 almost everywhere, and the derivative of the rectifying operation is
1 everywhere that the unit is active. This means that the gradient direction is far
more useful for learning than it would be with activation functions that introduce
second-order effects.
Rectified linear units are typically used on top of an affine transformation:

h = g(W⊤x + b).    (6.36)

When initializing the parameters of the affine transformation, it can be a good
practice to set all elements of b to a small positive value, such as 0.1. Doing so
makes it very likely that the rectified linear units will be initially active for most
inputs in the training set and allow the derivatives to pass through.
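As a small sketch of this practice (illustrative sizes; the helper names are
assumptions):

import numpy as np

def init_relu_layer(n_in, n_out, rng):
    # Small random weights; biases set to a small positive value (0.1) so that
    # most units start out active and let derivatives pass through.
    W = 0.01 * rng.standard_normal((n_in, n_out))
    b = np.full(n_out, 0.1)
    return W, b

def relu_forward(x, W, b):
    return np.maximum(0.0, W.T @ x + b)   # h = g(W^T x + b), as in equation 6.36

rng = np.random.default_rng(0)
W, b = init_relu_layer(5, 8, rng)
h = relu_forward(rng.standard_normal(5), W, b)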
Several generalizations of rectified linear units exist. Most of these general-
izations perform comparably to rectified linear units and occasionally perform
better.
One drawback to rectified linear units is that they cannot learn via gradient-
based methods on examples for which their activation is zero. Various generaliza-
tions of rectified linear units guarantee that they receive gradient everywhere.
Three generalizations of rectified linear units are based on using a nonzero
slope α_i when z_i < 0:

h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i).

Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. It is used for
object recognition
from images (Jarrett et al., 2009), where it makes sense to seek features that are
invariant under a polarity reversal of the input illumination. Other generalizations
of rectified linear units are more broadly applicable. A leaky ReLU (Maas et al.,
2013) fixes α_i to a small value like 0.01, while a parametric ReLU, or PReLU,
treats α_i as a learnable parameter (He et al., 2015).
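These three variants differ only in how α_i is chosen; a minimal sketch of the
shared form max(0, z_i) + α_i min(0, z_i) (illustrative values, not from the text):

import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
abs_rectification = generalized_relu(z, -1.0)   # alpha_i = -1 gives |z|
leaky_relu = generalized_relu(z, 0.01)          # fixed small slope
alpha = np.full_like(z, 0.25)                   # PReLU: alpha learned during training
prelu = generalized_relu(z, alpha)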
Maxout units (Goodfellow et al., 2013a) generalize rectified linear units
further. Instead of applying an element-wise function g(z), maxout units divide z
into groups of k values. Each maxout unit then outputs the maximum element of
one of these groups:

g(z)_i = max_{j ∈ G^(i)} z_j,    (6.37)

where G^(i) is the set of indices into the inputs for group i, {(i − 1)k + 1, . . . , ik}.
This provides a way of learning a piecewise linear function that responds to
multiple directions in the input x space.
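A minimal NumPy sketch of a maxout layer (illustrative shapes and names): the
pre-activations are reshaped into groups of k and the maximum of each group is
taken.

import numpy as np

def maxout(z, k):
    # Unit i outputs the maximum over its group G^(i) = {(i-1)k + 1, ..., ik}.
    n_units = z.shape[0] // k
    return z.reshape(n_units, k).max(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
W = rng.standard_normal((6, 4 * 3))   # 4 maxout units, k = 3 filters each
b = np.zeros(4 * 3)
h = maxout(W.T @ x + b, k=3)          # h has 4 entries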
A maxout unit can learn a piecewise linear, convex function with up to k pieces.
Maxout units can thus be seen as learning the activation function itself rather
than just the relationship between units. With large enough k, a maxout unit can
learn to approximate any convex function with arbitrary fidelity. In particular,
a maxout layer with two pieces can learn to implement the same function of the
input x as a traditional layer using the rectified linear activation function, the
absolute value rectification function, or the leaky or parametric ReLU, or it can
learn to implement a totally different function altogether. The maxout layer will
of course be parametrized differently from any of these other layer types, so the
learning dynamics will be different even in the cases where maxout learns to
implement the same function of x as one of the other layer types.
Each maxout unit is now parametrized by k weight vectors instead of just one, so
maxout units typically need more regularization than rectified linear units. They
can work well without regularization if the training set is large and the number of
pieces per unit is kept low (Cai et al., 2013).
Maxout units have a few other benefits. In some cases, one can gain some
statistical and computational advantages by requiring fewer parameters.
Specifically, if the features captured by n different linear filters can be
summarized without losing information by taking the max over each group of k
features, then the next layer can get by with k times fewer weights.
Because each unit is driven by multiple filters, maxout units have some
redundancy that helps them resist a phenomenon called catastrophic forgetting,
in which neural networks forget how to perform tasks that they were trained on
in the past (Goodfellow et al., 2014a).
Rectified linear units and all these generalizations of them are based on the
principle that models are easier to optimize if their behavior is closer to linear.
This same general principle of using linear behavior to obtain easier optimization
also applies in other contexts besides deep linear networks. Recurrent networks can
learn from sequences and produce a sequence of states and outputs. When training
them, one needs to propagate information through several time steps, which is much
easier when some linear computations (with some directional derivatives being of
magnitude near 1) are involved. One of the best-performing recurrent network
architectures, the LSTM, propagates information through time via summation, a
particularly straightforward kind of linear activation. This is discussed further in
section 10.10.
6.3.2 Logistic Sigmoid and Hyperbolic Tangent
Prior to the introduction of rectified linear units, most neural networks used the
logistic sigmoid activation function
g(z) = σ(z) (6.38)
or the hyperbolic tangent activation function
g(z) = tanh(z). (6.39)
These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
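A quick numeric check of this identity (an illustrative snippet, not from the text):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
# tanh(z) should equal 2*sigma(2z) - 1 up to floating-point precision.
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)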
We have already seen sigmoid units as output units, used to predict the
probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal
units saturate across most of their domain. They saturate to a high value when
z is very positive, saturate to a low value when z is very negative, and are only
strongly sensitive to their input when z is near 0. The widespread saturation of
sigmoidal units can make gradient-based learning very difficult. For this reason,
their use as hidden units in feedforward networks is now discouraged. Their use
as output units is compatible with the use of gradient-based learning when an
appropriate cost function can undo the saturation of the sigmoid in the output
layer.
When a sigmoidal activation function must be used, the hyperbolic tangent
activation function typically performs better than the logistic sigmoid. It resembles
the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2.
Because tanh is similar to the identity function near 0, training a deep neural
network ŷ = w⊤ tanh(U⊤ tanh(V⊤x)) resembles training a linear model
ŷ = w⊤U⊤V⊤x as long as the activations of the network can be kept small. This
makes training the tanh network easier.
Sigmoidal activation functions are more common in settings other than feed-
forward networks. Recurrent networks, many probabilistic models, and some
autoencoders have additional requirements that rule out the use of piecewise
linear activation functions and make sigmoidal units more appealing despite the
drawbacks of saturation.
6.3.3 Other Hidden Units
Many other types of hidden units are possible but are used less frequently.
In general, a wide variety of differentiable functions perform perfectly well.
Many unpublished activation functions perform just as well as the popular ones. To
provide a concrete example, we tested a feedforward network using h = cos(Wx + b)
on the MNIST dataset and obtained an error rate of less than 1 percent, which is
competitive with results obtained using more conventional activation functions.
During research and development of new techniques, it is common to test many
different activation functions and find that several variations on standard practice
perform comparably. This means that usually new hidden unit types are published
only if they are clearly demonstrated to provide a significant improvement. New
hidden unit types that perform roughly comparably to known types are so common
as to be uninteresting.
It would be impractical to list all the hidden unit types that have appeared in
the literature. We highlight a few especially useful and distinctive ones.
One possibility is to not have an activation g(z) at all. One can also think of
this as using the identity function as the activation function. We have already
seen that a linear unit can be useful as the output of a neural network. It may
also be used as a hidden unit. If every layer of the neural network consists of only
linear transformations, then the network as a whole will be linear. However, it
is acceptable for some layers of the neural network to be purely linear. Consider
a neural network layer with n inputs and p outputs, h = g(W⊤x + b). We may
replace this with two layers, with one layer using weight matrix U and the other
using weight matrix V. If the first layer has no activation function, then we have
essentially factored the weight matrix of the original layer based on W. The
factored approach is to compute h = g(V⊤U⊤x + b). If U produces q outputs,
then U and V together contain only (n + p)q parameters, while W contains np
parameters. For small q, this can be a considerable saving in parameters. It
comes at the cost of constraining the linear transformation to be low rank, but
these low-rank relationships are often sufficient. Linear hidden units thus offer an
effective way of reducing the number of parameters in a network.
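A short sketch (illustrative sizes, not from the text) comparing the parameter
counts and computing the factored layer:

import numpy as np

n, p, q = 1000, 1000, 50
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
b = np.zeros(p)

U = rng.standard_normal((n, q))              # first, purely linear layer: q outputs
V = rng.standard_normal((q, p))              # second layer
h = np.maximum(0.0, V.T @ (U.T @ x) + b)     # h = g(V^T U^T x + b)

print("full W parameters:   ", n * p)         # np = 1,000,000
print("factored parameters: ", (n + p) * q)   # (n + p)q = 100,000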
Softmax units are another kind of unit that is usually used as an output (as
described in section 6.2.2.3) but may sometimes be used as a hidden unit. Softmax
units naturally represent a probability distribution over a discrete variable with k
possible values, so they may be used as a kind of switch. These kinds of hidden
units are usually only used in more advanced architectures that explicitly learn to
manipulate memory, as described in section 10.12.
A few other reasonably common hidden unit types include the following.

Radial basis function (RBF) unit: h_i = exp(−(1/σ_i²) ||W_{:,i} − x||²). This
function becomes more active as x approaches a template W_{:,i}. Because it
saturates to 0 for most x, it can be difficult to optimize.
Softplus: g(a) = ζ(a) = log(1 + e^a). This is a smooth version of the rectifier,
introduced by Dugas et al. (2001) for function approximation and by Nair
and Hinton (2010) for the conditional distributions of undirected probabilistic
models. Glorot et al. (2011a) compared the softplus and rectifier and found
better results with the latter. The use of the softplus is generally discouraged.
The softplus demonstrates that the performance of hidden unit types can
be very counterintuitive—one might expect it to have an advantage over
the rectifier due to being differentiable everywhere or due to saturating less
completely, but empirically it does not.
Hard tanh. This is shaped similarly to the tanh and the rectifier, but unlike
the latter, it is bounded, g(a) = max(−1, min(1, a)). It was introduced by
Collobert (2004).
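Illustrative NumPy definitions of these units (a sketch with assumed shapes, where
the RBF unit takes per-unit templates W[:, i] and scales sigma):

import numpy as np

def rbf_unit(x, W, sigma):
    # h_i = exp(-(1/sigma_i^2) * ||W[:, i] - x||^2); most active when x is near a template.
    d2 = np.sum((W - x[:, None]) ** 2, axis=0)
    return np.exp(-d2 / sigma ** 2)

def softplus(a):
    # zeta(a) = log(1 + e^a), a smooth version of the rectifier.
    return np.log1p(np.exp(a))

def hard_tanh(a):
    # g(a) = max(-1, min(1, a))
    return np.clip(a, -1.0, 1.0)

x = np.array([0.5, -1.0])
W = np.array([[0.5, 2.0], [-1.0, 0.0]])   # two templates stored as columns
h = rbf_unit(x, W, sigma=np.array([1.0, 1.0]))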
Hidden unit design remains an active area of research, and many useful hidden
unit types remain to be discovered.
6.4 Architecture Design
Another key design consideration for neural networks is determining the architecture.
The word
architecture
refers to the overall structure of the network: how many
units it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers. Most
neural network architectures arrange these layers in a chain structure, with each
layer being a function of the layer that preceded it. In this structure, the first layer
is given by
h^(1) = g^(1)(W^(1)⊤ x + b^(1));    (6.40)
the second layer is given by
h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2));    (6.41)
and so on.
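A sketch of this chain structure (illustrative Python/NumPy; layer sizes and the
choice of tanh are assumptions):

import numpy as np

def layer(h_prev, W, b, g):
    return g(W.T @ h_prev + b)        # h^(k) = g^(k)(W^(k)T h^(k-1) + b^(k))

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
W1, b1 = rng.standard_normal((10, 20)), np.zeros(20)
W2, b2 = rng.standard_normal((20, 5)), np.zeros(5)

h1 = layer(x, W1, b1, np.tanh)        # first layer, equation 6.40
h2 = layer(h1, W2, b2, np.tanh)       # second layer, equation 6.41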
In these chain-based architectures, the main architectural considerations are
choosing the depth of the network and the width of each layer. As we will
see, a network with even one hidden layer is sufficient to fit the training set.
Deeper networks are often able to use far fewer units per layer and far fewer
parameters, as well as frequently generalizing to the test set, but they also tend to
be harder to optimize. The ideal network architecture for a task must be found
via experimentation guided by monitoring the validation set error.
6.4.1 Universal Approximation Properties and Depth
A linear model, mapping from features to outputs via matrix multiplication, can
by definition represent only linear functions. It has the advantage of being easy to
train because many loss functions result in convex optimization problems when
applied to linear models. Unfortunately, we often want our systems to learn
nonlinear functions.
At first glance, we might presume that learning a nonlinear function requires
designing a specialized model family for the kind of nonlinearity we want to learn.
Fortunately, feedforward networks with hidden layers provide a universal approxi-
mation framework. Specifically, the
universal approximation theorem
(Hornik
et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output
layer and at least one hidden layer with any “squashing” activation function (such
as the logistic sigmoid activation function) can approximate any Borel measurable
function from one finite-dimensional space to another with any desired nonzero
amount of error, provided that the network is given enough hidden units. The
derivatives of the feedforward network can also approximate the derivatives of the
function arbitrarily well (Hornik et al., 1990). The concept of Borel measurability
is beyond the scope of this book; for our purposes it suffices to say that any
continuous function on a closed and bounded subset of ℝ^n is Borel measurable
and therefore may be approximated by a neural network. A neural network may
also approximate any function mapping from any finite dimensional discrete space
to another. While the original theorems were first stated in terms of units with
activation functions that saturate for both very negative and very positive argu-
ments, universal approximation theorems have also been proved for a wider class
of activation functions, which includes the now commonly used rectified linear
unit (Leshno et al., 1993).
The universal approximation theorem means that regardless of what function
we are trying to learn, we know that a large MLP will be able to represent this
function. We are not guaranteed, however, that the training algorithm will be
able to learn that function. Even if the MLP is able to represent the function,
learning can fail for two different reasons. First, the optimization algorithm used
for training may not be able to find the value of the parameters that corresponds
to the desired function. Second, the training algorithm might choose the wrong
function as a result of overfitting. Recall from section 5.2.1 that the no free lunch
theorem shows that there is no universally superior machine learning algorithm.
Feedforward networks provide a universal system for representing functions in the
sense that, given a function, there exists a feedforward network that approximates
the function. There is no universal procedure for examining a training set of
specific examples and choosing a function that will generalize to points not in the
training set.
According to the universal approximation theorem, there exists a network large
enough to achieve any degree of accuracy we desire, but the theorem does not
say how large this network will be. Barron (1993) provides some bounds on the
size of a single-layer network needed to approximate a broad class of functions.
Unfortunately, in the worst case, an exponential number of hidden units (possibly
with one hidden unit corresponding to each input configuration that needs to be
distinguished) may be required. This is easiest to see in the binary case: the
number of possible binary functions on vectors v ∈ {0, 1}^n is 2^(2^n), and
selecting one such function requires 2^n bits, which will in general require O(2^n)
degrees of freedom.
In summary, a feedforward network with a single layer is sufficient to represent
any function, but the layer may be infeasibly large and may fail to learn and
generalize correctly. In many circumstances, using deeper models can reduce the
number of units required to represent the desired function and can reduce the
amount of generalization error.
Various families of functions can be approximated efficiently by an architecture
with depth greater than some value d, but they require a much larger model if
depth is restricted to be less than or equal to d. In many cases, the number of
hidden units required by the shallow model is exponential in n.
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper
rectifier networks, shown formally by Montufar et al. (2014). (Left) An absolute value rectification
unit has the same output for every pair of mirror points in its input. The mirror axis
of symmetry is given by the hyperplane defined by the weights and bias of the unit. A
function computed on top of that unit (the green decision surface) will be a mirror image
of a simpler pattern across that axis of symmetry. (Center) The function can be obtained
by folding the space around the axis of symmetry. (Right) Another repeating pattern can
be folded on top of the first (by another downstream unit) to obtain another symmetry
(which is now repeated four times, with two hidden layers). Figure reproduced with
permission from Montufar et al. (2014).
Such results were first proved for models that do not resemble the continuous, differentiable
neural networks used for machine learning but have since been extended to these
models. The first results were for circuits of logic gates (Håstad, 1986). Later
work extended these results to linear threshold units with nonnegative weights
(Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with
continuous-valued activations (Maass, 1992; Maass et al., 1994). Many modern
neural networks use rectified linear units. Leshno et al. (1993) demonstrated
that shallow networks with a broad family of non-polynomial activation functions,
including rectified linear units, have universal approximation properties, but these
results do not address the questions of depth or efficiency—they specify only that
a sufficiently wide rectifier network could represent any function. Montufar et al.
(2014) showed that functions representable with a deep rectifier net can require
an exponential number of hidden units with a shallow (one hidden layer) network.
More precisely, they showed that piecewise linear networks (which can be obtained
from rectifier nonlinearities or maxout units) can represent functions with a number
of regions that is exponential in the depth of the network. Figure 6.5 illustrates how
a network with absolute value rectification creates mirror images of the function
computed on top of some hidden unit, with respect to the input of that hidden
unit. Each hidden unit specifies where to fold the input space in order to create
mirror responses (on both sides of the absolute value nonlinearity). By composing
these folding operations, we obtain an exponentially large number of piecewise
linear regions that can capture all kinds of regular (e.g., repeating) patterns.
The main theorem in Montufar et al. (2014) states that the number of linear
regions carved out by a deep rectifier network with d inputs, depth l, and n units
per hidden layer is

O( (n choose d)^{d(l−1)} n^d ),    (6.42)

that is, exponential in the depth l. In the case of maxout networks with k filters
per unit, the number of linear regions is

O( k^{(l−1)+d} ).    (6.43)
Of course, there is no guarantee that the kinds of functions we want to learn in
applications of machine learning (and in particular for AI) share such a property.
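As a rough illustration of how quickly these counts grow (a sketch that simply
evaluates the bounds stated above for small example values):

from math import comb

def rectifier_regions_bound(n, d, l):
    # O((n choose d)^(d(l-1)) * n^d) linear regions for a deep rectifier network
    return comb(n, d) ** (d * (l - 1)) * n ** d

def maxout_regions_bound(k, d, l):
    # O(k^((l-1)+d)) linear regions for a maxout network with k filters per unit
    return k ** ((l - 1) + d)

for l in (2, 3, 4):
    print(l, rectifier_regions_bound(n=4, d=2, l=l), maxout_regions_bound(k=3, d=2, l=l))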
We may also want to choose a deep model for statistical reasons. Any time
we choose a specific machine learning algorithm, we are implicitly stating some
set of prior beliefs we have about what kind of function the algorithm should
learn. Choosing a deep model encodes a very general belief that the function we
want to learn should involve composition of several simpler functions. This can be
interpreted from a representation learning point of view as saying that we believe
the learning problem consists of discovering a set of underlying factors of variation
that can in turn be described in terms of other, simpler underlying factors of
variation. Alternately, we can interpret the use of a deep architecture as expressing
a belief that the function we want to learn is a computer program consisting of
multiple steps, where each step makes use of the previous step’s output. These
intermediate outputs are not necessarily factors of variation but can instead be
analogous to counters or pointers that the network uses to organize its internal
processing. Empirically, greater depth does seem to result in better generalization
for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009;
Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al.,
2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow
et al., 2014d; Szegedy et al., 2014a). See figure 6.6 and figure 6.7 for examples of
some of these empirical results. These results suggest that using deep architectures
does indeed express a useful prior over the space of functions the model learns.
6.4.2 Other Architectural Considerations
So far we have described neural networks as being simple chains of layers, with the
main considerations being the depth of the network and the width of each layer.
In practice, neural networks show considerably more diversity.
(Plot: test accuracy in percent, from about 92 to 96.5, versus number of hidden
layers, from 3 to 11.)
Figure 6.6: Effect of depth. Empirical results showing that deeper networks generalize
better when used to transcribe multidigit numbers from photographs of addresses. Data
from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing
depth. See figure 6.7 for a control experiment demonstrating that other increases to the
model size do not yield the same effect.
Many neural network architectures have been developed for specific tasks.
Specialized architectures for computer vision called convolutional networks are
described in chapter 9. Feedforward networks may also be generalized to the
recurrent neural networks for sequence processing, described in chapter 10, which
have their own architectural considerations.
In general, the layers need not be connected in a chain, even though this is the
most common practice. Many architectures build a main chain but then add extra
architectural features to it, such as skip connections going from layer i to layer
i + 2 or higher. These skip connections make it easier for the gradient to flow from
output layers to layers nearer the input.
Another key consideration of architecture design is exactly how to connect a
pair of layers to each other. In the default neural network layer described by a linear
transformation via a matrix W, every input unit is connected to every output
unit. Many specialized networks in the chapters ahead have fewer connections, so
that each unit in the input layer is connected to only a small subset of units in the
output layer. These strategies for decreasing the number of connections reduce
the number of parameters and the amount of computation required to evaluate
the network but are often highly problem dependent. For example, convolutional
(Plot: test accuracy in percent versus number of parameters (×10^8), with one
curve for a 3-layer convolutional model, one for a 3-layer fully connected model,
and one for an 11-layer convolutional model.)
Figure 6.7: Effect of number of parameters. Deeper models tend to perform better.
This is not merely because the model is larger. This experiment from Goodfellow et al.
(2014d) shows that increasing the number of parameters in layers of convolutional networks
without increasing their depth is not nearly as effective at increasing test set performance,
as illustrated in this figure. The legend indicates the depth of network used to make
each curve and whether the curve represents variation in the size of the convolutional
or the fully connected layers. We observe that shallow models in this context overfit at
around 20 million parameters while deep ones can benefit from having over 60 million.
This suggests that using a deep model expresses a useful preference over the space of
functions the model can learn. Specifically, it expresses a belief that the function should
consist of many simpler functions composed together. This could result either in learning
a representation that is composed in turn of simpler representations (e.g., corners defined
in terms of edges) or in learning a program with sequentially dependent steps (e.g., first
locate a set of objects, then segment them from each other, then recognize them).
networks, described in chapter 9, use specialized patterns of sparse connections
that are very effective for computer vision problems. In this chapter, it is difficult
to give more specific advice concerning the architecture of a generic neural network.
In subsequent chapters we develop the particular architectural strategies that have
been found to work well for different application domains.
6.5 Back-Propagation and Other Differentiation
Algorithms
When we use a feedforward neural network to accept an input x and produce an
output ŷ, information flows forward through the network. The input x provides
the initial information that then propagates up to the hidden units at each layer
and finally produces ŷ. This is called forward propagation. During training,
forward propagation can continue onward until it produces a scalar cost J(θ).
The back-propagation algorithm (Rumelhart et al., 1986a), often simply called
backprop, allows the information from the cost to then flow backward through
the network in order to compute the gradient.
Computing an analytical expression for the gradient is straightforward, but
numerically evaluating such an expression can be computationally expensive. The
back-propagation algorithm does so using a simple and inexpensive procedure.
The term back-propagation is often misunderstood as meaning the whole
learning algorithm for multilayer neural networks. Actually, back-propagation
refers only to the method for computing the gradient, while another algorithm,
such as stochastic gradient descent, is used to perform learning using this gradient.
Furthermore, back-propagation is often misunderstood as being specific to multi-
layer neural networks, but in principle it can compute derivatives of any function
(for some functions, the correct response is to report that the derivative of the
function is undefined). Specifically, we will describe how to compute the gradient
∇_x f(x, y) for an arbitrary function f, where x is a set of variables whose
derivatives are desired, and y is an additional set of variables that are inputs
to the function but whose derivatives are not required. In learning algorithms,
the gradient we most often require is the gradient of the cost function with
respect to the parameters, ∇_θ J(θ). Many machine learning tasks involve
computing other derivatives, either
as part of the learning process, or to analyze the learned model. The back-
propagation algorithm can be applied to these tasks as well and is not restricted to
computing the gradient of the cost function with respect to the parameters. The
idea of computing derivatives by propagating information through a network is
very general and can be used to compute values such as the Jacobian of a function
f
with multiple outputs. We restrict our description here to the most commonly
used case, where f has a single output.
6.5.1 Computational Graphs
So far we have discussed neural networks with a relatively informal graph language.
To describe the back-propagation algorithm more precisely, it is helpful to have a
more precise computational graph language.
Many ways of formalizing computation as graphs are possible.
Here, we use each node in the graph to indicate a variable. The variable may
be a scalar, vector, matrix, tensor, or even a variable of another type.
To formalize our graphs, we also need to introduce the idea of an
operation
.
An operation is a simple function of one or more variables. Our graph language
is accompanied by a set of allowable operations. Functions more complicated
than the operations in this set may be described by composing many operations
together.
Without loss of generality, we define an operation to return only a single
output variable. This does not lose generality because the output variable can have
multiple entries, such as a vector. Software implementations of back-propagation
usually support operations with multiple outputs, but we avoid this case in our
description because it introduces many extra details that are not important to
conceptual understanding.
If a variable y is computed by applying an operation to a variable x, then we
draw a directed edge from x to y. We sometimes annotate the output node with
the name of the operation applied, and other times omit this label when the
operation is clear from context.
Examples of computational graphs are shown in figure 6.8.
6.5.2 Chain Rule of Calculus
The chain rule of calculus (not to be confused with the chain rule of probability) is
used to compute the derivatives of functions formed by composing other functions
whose derivatives are known. Back-propagation is an algorithm that computes the
chain rule, with a specific order of operations that is highly efficient.
Let x be a real number, and let f and g both be functions mapping from a real
number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then
Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to
compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x⊤w + b).
Some of the intermediate expressions do not have names in the algebraic expression
but need names in the graph. We simply name the i-th such variable u^(i). (c) The
computational graph for the expression H = max{0, XW + b}, which computes a design
matrix of rectified linear unit activations H given a design matrix containing a minibatch
of inputs X. (d) Examples a–c applied at most one operation to each variable, but it
is possible to apply more than one operation. Here we show a computation graph that
applies more than one operation to the weights w of a linear regression model. The
weights are used to make both the prediction ŷ and the weight decay penalty λ Σ_i w_i².
the chain rule states that

dz/dx = (dz/dy)(dy/dx).    (6.44)

We can generalize this beyond the scalar case. Suppose that x ∈ ℝ^m, y ∈ ℝ^n,
g maps from ℝ^m to ℝ^n, and f maps from ℝ^n to ℝ. If y = g(x) and z = f(y), then

∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i).    (6.45)

In vector notation, this may be equivalently written as

∇_x z = (∂y/∂x)⊤ ∇_y z,    (6.46)

where ∂y/∂x is the n × m Jacobian matrix of g.
From this we see that the gradient of a variable x can be obtained by multiplying
a Jacobian matrix ∂y/∂x by a gradient ∇_y z. The back-propagation algorithm
consists of performing such a Jacobian-gradient product for each operation in the
graph.
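A small numeric sketch of equation 6.46 (illustrative, with an affine g so that the
Jacobian is simply a matrix A):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # y = g(x) = A x maps R^2 to R^3
x = np.array([0.5, -1.0])
y = A @ x

grad_y_z = 2.0 * y              # for z = f(y) = sum(y**2), the gradient w.r.t. y
grad_x_z = A.T @ grad_y_z       # (dy/dx)^T grad_y z, equation 6.46

# Direct check: z = x^T A^T A x, so grad_x z = 2 A^T A x.
assert np.allclose(grad_x_z, 2.0 * A.T @ A @ x)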
Usually we apply the back-propagation algorithm to tensors of arbitrary di-
mensionality, not merely to vectors. Conceptually, this is exactly the same as
back-propagation with vectors. The only difference is how the numbers are ar-
ranged in a grid to form a tensor. We could imagine flattening each tensor into
a vector before we run back-propagation, computing a vector-valued gradient,
and then reshaping the gradient back into a tensor. In this rearranged view,
back-propagation is still just multiplying Jacobians by gradients.
To denote the gradient of a value z with respect to a tensor X, we write ∇_X z,
just as if X were a vector. The indices into X now have multiple coordinates; for
example, a 3-D tensor is indexed by three coordinates. We can abstract this away
by using a single variable i to represent the complete tuple of indices. For all
possible index tuples i, (∇_X z)_i gives ∂z/∂X_i. This is exactly the same as how
for all possible integer indices i into a vector, (∇_x z)_i gives ∂z/∂x_i. Using this
notation, we can write the chain rule as it applies to tensors. If Y = g(X) and
z = f(Y), then

∇_X z = Σ_j (∇_X Y_j) ∂z/∂Y_j.    (6.47)
6.5.3 Recursively Applying the Chain Rule to Obtain Backprop
Using the chain rule, it is straightforward to write down an algebraic expression for
the gradient of a scalar with respect to any node in the computational graph that
produced that scalar. Actually evaluating that expression in a computer, however,
introduces some extra considerations.
Specifically, many subexpressions may be repeated several times within the
overall expression for the gradient. Any procedure that computes the gradient
will need to choose whether to store these subexpressions or to recompute them
several times. An example of how these repeated subexpressions arise is given in
figure 6.9. In some cases, computing the same subexpression twice would simply
be wasteful. For complicated graphs, there can be exponentially many of these
wasted computations, making a naive implementation of the chain rule infeasible.
In other cases, computing the same subexpression twice could be a valid way to
reduce memory consumption at the cost of higher runtime.
We begin with a version of the back-propagation algorithm that specifies the
actual gradient computation directly (algorithm 6.2 along with algorithm 6.1 for the
associated forward computation), in the order it will actually be done and according
to the recursive application of the chain rule. One could either directly perform these
computations or view the description of the algorithm as a symbolic specification
of the computational graph for computing the back-propagation. However, this
formulation does not make explicit the manipulation and the construction of the
symbolic graph that performs the gradient computation. Such a formulation is
presented in section 6.5.6, with algorithm 6.5, where we also generalize to nodes
that contain arbitrary tensors.
First consider a computational graph describing how to compute a single scalar
u^(n) (say, the loss on a training example). This scalar is the quantity whose
gradient we want to obtain, with respect to the n_i input nodes u^(1) to u^(n_i).
In other words, we wish to compute ∂u^(n)/∂u^(i) for all i ∈ {1, 2, . . . , n_i}. In
the application of back-propagation to computing gradients for gradient descent
over parameters, u^(n) will be the cost associated with an example or a minibatch,
while u^(1) to u^(n_i) correspond to the parameters of the model.
We will assume that the nodes of the graph have been ordered in such a way
that we can compute their output one after the other, starting at u^(n_i + 1) and
going up to u^(n). As defined in algorithm 6.1, each node u^(i) is associated with
an operation f^(i) and is computed by evaluating the function

u^(i) = f(A^(i)),    (6.48)

where A^(i) is the set of all nodes that are parents of u^(i).
Figure 6.9: A computational graph that results in repeated subexpressions when computing
the gradient. Let w ∈ ℝ be the input to the graph. We use the same function f : ℝ → ℝ
as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y).
To compute ∂z/∂w, we apply equation 6.44 and obtain:

∂z/∂w    (6.49)
= (∂z/∂y)(∂y/∂x)(∂x/∂w)    (6.50)
= f′(y) f′(x) f′(w)    (6.51)
= f′(f(f(w))) f′(f(w)) f′(w).    (6.52)
Equation 6.51 suggests an implementation in which we compute the value of f(w) only
once and store it in the variable x. This is the approach taken by the back-propagation
algorithm. An alternative approach is suggested by equation 6.52, where the subexpression
f(w) appears more than once. In the alternative approach, f(w) is recomputed each time
it is needed. When the memory required to store the value of these expressions is low, the
back-propagation approach of equation 6.51 is clearly preferable because of its reduced
runtime. However, equation 6.52 is also a valid implementation of the chain rule and is
useful when memory is limited.
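A toy sketch contrasting the two implementations for the chain z = f(f(f(w)))
(illustrative; f is chosen arbitrarily):

import math

def f(a):
    return math.sin(a)

def df(a):
    return math.cos(a)

def grad_stored(w):
    # Backprop-style: compute and store the intermediate values x and y once.
    x = f(w)
    y = f(x)
    return df(y) * df(x) * df(w)            # equation 6.51

def grad_recomputed(w):
    # Memory-saving alternative: recompute f(w) wherever it appears.
    return df(f(f(w))) * df(f(w)) * df(w)   # equation 6.52

assert abs(grad_stored(0.3) - grad_recomputed(0.3)) < 1e-12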
That algorithm specifies the forward propagation computation, which we could
put in a graph G. To perform back-propagation, we can construct a computational
graph that depends on G and adds to it an extra set of nodes. These form a
subgraph B with one node per node of G. Computation in B proceeds in exactly
the reverse of the order of computation in G, and each node of B computes the
derivative ∂u^(n)/∂u^(i) associated with the forward graph node u^(i). This is done
using the chain rule with respect to the scalar output u^(n):
∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j))    (6.53)
as specified by algorithm 6.2. The subgraph B contains exactly one edge for each
edge from node u^(j) to node u^(i) of G. The edge from u^(j) to u^(i) is associated
with the computation of ∂u^(i)/∂u^(j). In addition, a dot product is performed for
each node, between the gradient already computed with respect to nodes u^(i) that
are children of u^(j) and the vector containing the partial derivatives ∂u^(i)/∂u^(j)
for the same children nodes u^(i). To summarize, the amount of computation
required for performing the back-propagation scales linearly with the number of
edges in G, where the computation for each edge corresponds to computing a
partial derivative (of one node with respect to one of its parents) as well as
performing one multiplication and one addition. Below, we generalize this analysis
to tensor-valued nodes, which is just a way to group multiple scalar values in the
same node and enable more
Algorithm 6.1 A procedure that performs the computations mapping n_i inputs
u^(1) to u^(n_i) to an output u^(n). This defines a computational graph where each
node computes numerical value u^(i) by applying a function f^(i) to the set of
arguments A^(i) that comprises the values of previous nodes u^(j), j < i, with
j ∈ Pa(u^(i)). The input to the computational graph is the vector x, and is set
into the first n_i nodes u^(1) to u^(n_i). The output of the computational graph is
read off the last (output) node u^(n).

for i = 1, . . . , n_i do
    u^(i) ← x_i
end for
for i = n_i + 1, . . . , n do
    A^(i) ← {u^(j) | j ∈ Pa(u^(i))}
    u^(i) ← f^(i)(A^(i))
end for
return u^(n)
efficient implementations.
The back-propagation algorithm is designed to reduce the number of common
subexpressions without regard to memory. Specifically, it performs on the order
of one Jacobian product per node in the graph. This can be seen from the fact
that backprop (algorithm 6.2) visits each edge from node u^(j) to node u^(i) of
the graph exactly once in order to obtain the associated partial derivative
∂u^(i)/∂u^(j). Back-propagation thus avoids the exponential explosion in repeated
subexpressions. Other algorithms may be able to avoid more subexpressions by
performing simplifications on the computational graph, or may be able to conserve
memory by recomputing rather than storing some subexpressions. We revisit these
ideas after describing the back-propagation algorithm itself.
Algorithm 6.2 Simplified version of the back-propagation algorithm for computing
the derivatives of u^(n) with respect to the variables in the graph. This example is
intended to further understanding by showing a simplified case where all variables
are scalars, and we wish to compute the derivatives with respect to u^(1), . . . , u^(n_i).
This simplified version computes the derivatives of all nodes in the graph. The
computational cost of this algorithm is proportional to the number of edges in
the graph, assuming that the partial derivative associated with each edge requires
a constant time. This is of the same order as the number of computations for
the forward propagation. Each ∂u^(i)/∂u^(j) is a function of the parents u^(j) of u^(i),
thus linking the nodes of the forward graph to those added for the back-propagation
graph.

Run forward propagation (algorithm 6.1 for this example) to obtain the activations
of the network.
Initialize grad_table, a data structure that will store the derivatives that have
been computed. The entry grad_table[u^(i)] will store the computed value of
∂u^(n)/∂u^(i).
grad_table[u^(n)] ← 1
for j = n − 1 down to 1 do
    The next line computes ∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j))
    using stored values:
    grad_table[u^(j)] ← Σ_{i : j ∈ Pa(u^(i))} grad_table[u^(i)] (∂u^(i)/∂u^(j))
end for
return {grad_table[u^(i)] | i = 1, . . . , n_i}
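A compact sketch of algorithms 6.1 and 6.2 for a small scalar graph (illustrative
Python; the example chain u^(2) = u^(1)², u^(3) = sin(u^(2)) and the data
structures are assumptions):

import math

# Node i is described by its parents and by functions giving u^(i) and the local
# partial derivatives du^(i)/du^(j) from the parent values (nodes are 1-indexed).
parents = {2: [1], 3: [2]}
f = {2: lambda u1: u1 ** 2, 3: lambda u2: math.sin(u2)}
dpart = {(2, 1): lambda u1: 2 * u1, (3, 2): lambda u2: math.cos(u2)}

def forward(x, n_i=1, n=3):                  # algorithm 6.1
    u = {1: x}
    for i in range(n_i + 1, n + 1):
        u[i] = f[i](*[u[j] for j in parents[i]])
    return u

def backprop(u, n_i=1, n=3):                 # algorithm 6.2
    grad_table = {n: 1.0}
    for j in range(n - 1, 0, -1):
        grad_table[j] = sum(
            grad_table[i] * dpart[(i, j)](*[u[p] for p in parents[i]])
            for i in parents if j in parents[i])
    return {i: grad_table[i] for i in range(1, n_i + 1)}

u = forward(0.7)
g = backprop(u)   # du^(3)/du^(1) = cos(0.7**2) * 2 * 0.7
assert abs(g[1] - math.cos(0.7 ** 2) * 2 * 0.7) < 1e-12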
6.5.4 Back-Propagation Computation in Fully Connected MLP
To clarify the above definition of the back-propagation computation, let us consider
the specific graph associated with a fully connected multilayer MLP.
Algorithm 6.3 first shows the forward propagation, which maps parameters to
the supervised loss L(ŷ, y) associated with a single (input, target) training example
(x, y), with ŷ the output of the neural network when x is provided as input.
Algorithm 6.4 then shows the corresponding computation to be done for
applying the back-propagation algorithm to this graph.
Algorithms 6.3 and 6.4 are demonstrations chosen to be simple and straightfor-
ward to understand. However, they are specialized to one specific problem.
Modern software implementations are based on the generalized form of back-
propagation described in section 6.5.6 below, which can accommodate any compu-
tational graph by explicitly manipulating a data structure for representing symbolic
computation.
Algorithm 6.3 Forward propagation through a typical deep neural network and
the computation of the cost function. The loss L(ŷ, y) depends on the output ŷ
and on the target y (see section 6.2.1.1 for examples of loss functions). To obtain
the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains
all the parameters (weights and biases). Algorithm 6.4 shows how to compute
gradients of J with respect to parameters W and b. For simplicity, this
demonstration uses only a single input example x. Practical applications should
use a minibatch. See section 6.5.7 for a more realistic demonstration.

Require: Network depth, l
Require: W^(i), i ∈ {1, . . . , l}, the weight matrices of the model
Require: b^(i), i ∈ {1, . . . , l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
h^(0) = x
for k = 1, . . . , l do
    a^(k) = b^(k) + W^(k) h^(k−1)
    h^(k) = f(a^(k))
end for
ŷ = h^(l)
J = L(ŷ, y) + λΩ(θ)
Algorithm 6.4 Backward computation for the deep neural network of algorithm 6.3,
which uses, in addition to the input x, a target y. This computation yields the
gradients on the activations a^(k) for each layer k, starting from the output layer
and going backwards to the first hidden layer. From these gradients, which can
be interpreted as an indication of how each layer's output should change to reduce
error, one can obtain the gradient on the parameters of each layer. The gradients
on weights and biases can be immediately used as part of a stochastic gradient
update (performing the update right after the gradients have been computed) or
used with other gradient-based optimization methods.

After the forward computation, compute the gradient on the output layer:
g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
for k = l, l − 1, . . . , 1 do
    Convert the gradient on the layer's output into a gradient on the pre-
    nonlinearity activation (element-wise multiplication if f is element-wise):
    g ← ∇_{a^(k)} J = g ⊙ f′(a^(k))
    Compute gradients on weights and biases (including the regularization term,
    where needed):
    ∇_{b^(k)} J = g + λ ∇_{b^(k)} Ω(θ)
    ∇_{W^(k)} J = g h^{(k−1)⊤} + λ ∇_{W^(k)} Ω(θ)
    Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
    g ← ∇_{h^(k−1)} J = W^{(k)⊤} g
end for
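A minimal NumPy sketch of algorithms 6.3 and 6.4 for a two-layer network with
tanh units and a squared-error loss (an illustration under those assumptions; the
regularizer Ω is omitted):

import numpy as np

def forward(x, Ws, bs, f):                       # algorithm 6.3 (without regularizer)
    h, hs, activations = x, [x], []
    for W, b in zip(Ws, bs):
        a = b + W @ h
        h = f(a)
        activations.append(a)
        hs.append(h)
    return hs, activations

def backward(hs, activations, y, Ws, f_prime):   # algorithm 6.4
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    g = hs[-1] - y                               # gradient of 0.5*||y_hat - y||^2 w.r.t. y_hat
    for k in reversed(range(len(Ws))):
        g = g * f_prime(activations[k])          # gradient on the pre-activation a^(k)
        grads_b[k] = g
        grads_W[k] = np.outer(g, hs[k])          # g h^(k-1)T
        g = Ws[k].T @ g                          # propagate to the layer below
    return grads_W, grads_b

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
x, y = rng.standard_normal(3), rng.standard_normal(2)

hs, acts = forward(x, Ws, bs, np.tanh)
gW, gb = backward(hs, acts, y, Ws, lambda a: 1.0 - np.tanh(a) ** 2)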
6.5.5 Symbol-to-Symbol Derivatives
Algebraic expressions and computational graphs both operate on
symbols
, or
variables that do not have specific values. These algebraic and graph-based
representations are called
symbolic representations
. When we actually use
or train a neural network, we must assign specific values to these symbols. We
replace a symbolic input to the network x with a specific numeric value, such as
[1.2, 3.765, −1.8].
Some approaches to back-propagation take a computational graph and a set of
numerical values for the inputs to the graph, then return a set of numerical values
describing the gradient at those input values. We call this approach
symbol-to-
number differentiation
. This is the approach used by libraries such as Torch
(Collobert et al., 2011b) and Caffe (Jia, 2013).
Another approach is to take a computational graph and add additional nodes
Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In
this approach, the back-propagation algorithm does not need to ever access any actual
specific numeric values. Instead, it adds nodes to a computational graph describing how
to compute these derivatives. A generic graph evaluation engine can later compute the
derivatives for any specific numeric values. (Left) In this example, we begin with a graph
representing z = f(f(f(w))). (Right) We run the back-propagation algorithm,
instructing it to construct the graph for the expression corresponding to dz/dw.
In this example, we do
not explain how the back-propagation algorithm works. The purpose is only to illustrate
what the desired result is: a computational graph with a symbolic description of the
derivative.
to the graph that provide a symbolic description of the desired derivatives. This
is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and
TensorFlow (Abadi et al., 2015). An example of how it works is illustrated in
figure 6.10. The primary advantage of this approach is that the derivatives are
described in the same language as the original expression. Because the derivatives
are just another computational graph, it is possible to run back-propagation
again, differentiating the derivatives to obtain higher derivatives. (Computation of
higher-order derivatives is described in section 6.5.10.)
We will use the latter approach and describe the back-propagation algorithm in
terms of constructing a computational graph for the derivatives. Any subset of the
graph may then be evaluated using specific numerical values at a later time. This
allows us to avoid specifying exactly when each operation should be computed.
Instead, a generic graph evaluation engine can evaluate every node as soon as its
parents’ values are available.
The description of the symbol-to-symbol based approach subsumes the symbol-
to-number approach. The symbol-to-number approach can be understood as
performing exactly the same computations as are done in the graph built by the
symbol-to-symbol approach. The key difference is that the symbol-to-number
approach does not expose the graph.
6.5.6 General Back-Propagation
The back-propagation algorithm is very simple. To compute the gradient of some
scalar z with respect to one of its ancestors x in the graph, we begin by observing
that the gradient with respect to z is given by dz/dz = 1. We can then compute
the gradient with respect to each parent of z in the graph by multiplying the
current gradient by the Jacobian of the operation that produced z. We continue
multiplying by Jacobians, traveling backward through the graph in this way until
we reach x. For any node that may be reached by going backward from z through
two or more paths, we simply sum the gradients arriving from different paths at
that node.

More formally, each node in the graph G corresponds to a variable. To achieve
maximum generality, we describe this variable as being a tensor V. Tensors in
general can have any number of dimensions. They subsume scalars, vectors, and
matrices.
We assume that each variable V is associated with the following subroutines:
get_operation(V): This returns the operation that computes V, represented by
the edges coming into V in the computational graph. For example, there may be
a Python or C++ class representing the matrix multiplication operation, and the
get_operation function. Suppose we have a variable that is created by matrix
multiplication, C = AB. Then get_operation(V) returns a pointer to an instance
of the corresponding C++ class.

get_consumers(V, G): This returns the list of variables that are children of V in
the computational graph G.

get_inputs(V, G): This returns the list of variables that are parents of V in the
computational graph G.
Each operation op is also associated with a bprop operation. This bprop
operation can compute a Jacobian-vector product as described by equation 6.47.
This is how the back-propagation algorithm is able to achieve great generality.
Each operation is responsible for knowing how to back-propagate through the
edges in the graph that it participates in. For example, we might use a matrix
multiplication operation to create a variable C = AB. Suppose that the gradient
of a scalar z with respect to C is given by G. The matrix multiplication operation
is responsible for defining two back-propagation rules, one for each of its input
arguments. If we call the bprop method to request the gradient with respect to
A given that the gradient on the output is G, then the bprop method of the
matrix multiplication operation must state that the gradient with respect to A
is given by GB⊤. Likewise, if we call the bprop method to request the gradient
with respect to B, then the matrix operation is responsible for implementing the
bprop method and specifying that the desired gradient is given by A⊤G. The
back-propagation algorithm itself does not need to know any differentiation rules. It
only needs to call each operation's bprop rules with the right arguments. Formally,
op.bprop(inputs, X, G) must return

Σ_i (∇_X op.f(inputs)_i) G_i,    (6.54)

which is just an implementation of the chain rule as expressed in equation 6.47.
Here, inputs is a list of inputs that are supplied to the operation, op.f is the
mathematical function that the operation implements, X is the input whose gradient
we wish to compute, and G is the gradient on the output of the operation.
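An illustrative sketch of how a matrix multiplication operation might expose bprop
rules of this form (the class layout is an assumption, not the book's code):

import numpy as np

class MatMul:
    # Computes C = A @ B and knows how to back-propagate through both inputs.
    def f(self, A, B):
        return A @ B

    def bprop(self, inputs, X, G):
        A, B = inputs
        if X is A:               # gradient of the scalar z with respect to A is G B^T
            return G @ B.T
        if X is B:               # gradient of the scalar z with respect to B is A^T G
            return A.T @ G
        raise ValueError("X is not an input of this operation")

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 3)), rng.standard_normal((3, 4))
op = MatMul()
C = op.f(A, B)
G = np.ones_like(C)              # gradient of z = sum(C) with respect to C
grad_A = op.bprop([A, B], A, G)  # equals G @ B.T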
The op.bprop method should always pretend that all its inputs are distinct
from each other, even if they are not. For example, if the mul operator is passed
two copies of x to compute x², the op.bprop method should still return x as the
derivative with respect to both inputs. The back-propagation algorithm will later
add both of these arguments together to obtain 2x, which is the correct total
derivative on x.
Software implementations of back-propagation usually provide both the opera-
tions and their
bprop
methods, so that users of deep learning software libraries are
able to back-propagate through graphs built using common operations like matrix
multiplication, exponents, logarithms, and so on. Software engineers who build a
new implementation of back-propagation or advanced users who need to add their
own operation to an existing library must usually derive the
op.bprop
method for
any new operations manually.
The back-propagation algorithm is formally described in algorithm 6.5.
In section 6.5.2, we explained that back-propagation was developed in order to
avoid computing the same subexpression in the chain rule multiple times. The naive
algorithm could have exponential runtime due to these repeated subexpressions.
Now that we have specified the back-propagation algorithm, we can understand its
computational cost. If we assume that each operation evaluation has roughly the
Algorithm 6.5 The outermost skeleton of the back-propagation algorithm. This
portion does simple setup and cleanup work. Most of the important work happens
in the build_grad subroutine of algorithm 6.6.

Require: T, the target set of variables whose gradients must be computed
Require: G, the computational graph
Require: z, the variable to be differentiated
Let G′ be G pruned to contain only nodes that are ancestors of z and descendants
of nodes in T.
Initialize grad_table, a data structure associating tensors to their gradients
grad_table[z] ← 1
for V in T do
    build_grad(V, G, G′, grad_table)
end for
Return grad_table restricted to T
Algorithm 6.6 The inner loop subroutine build_grad(V, G, G′, grad_table) of
the back-propagation algorithm, called by the back-propagation algorithm defined
in algorithm 6.5.

Require: V, the variable whose gradient should be added to the graph G and to grad_table
Require: G, the graph to modify
Require: G′, the restriction of G to nodes that participate in the gradient
Require: grad_table, a data structure mapping nodes to their gradients
if V is in grad_table then
    Return grad_table[V]
end if
i ← 1
for C in get_consumers(V, G′) do
    op ← get_operation(C)
    D ← build_grad(C, G, G′, grad_table)
    G^(i) ← op.bprop(get_inputs(C, G′), V, D)
    i ← i + 1
end for
G ← Σ_i G^(i)
grad_table[V] = G
Insert G and the operations creating it into the graph
Return G
same cost, then we may analyze the computational cost in terms of the number
of operations executed. Keep in mind here that we refer to an operation as the
fundamental unit of our computational graph, which might actually consist of
several arithmetic operations (for example, we might have a graph that treats matrix
multiplication as a single operation). Computing a gradient in a graph with n nodes
will never execute more than O(n²) operations or store the output of more than
O(n²) operations. Here we are counting operations in the computational graph, not
individual operations executed by the underlying hardware, so it is important to
remember that the runtime of each operation may be highly variable. For example,
multiplying two matrices that each contain millions of entries might correspond to
a single operation in the graph. We can see that computing the gradient requires at
most O(n²) operations because the forward propagation stage will at worst execute
all n nodes in the original graph (depending on which values we want to compute,
we may not need to execute the entire graph). The back-propagation algorithm
adds one Jacobian-vector product, which should be expressed with O(1) nodes, per
edge in the original graph. Because the computational graph is a directed acyclic
graph, it has at most O(n²) edges. For the kinds of graphs that are commonly used
in practice, the situation is even better. Most neural network cost functions are
roughly chain-structured, causing back-propagation to have O(n) cost. This is far
better than the naive approach, which might need to execute exponentially many
nodes. This potentially exponential cost can be seen by expanding and rewriting
the recursive chain rule (equation 6.53) nonrecursively:
$$\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{\substack{\text{path } (u^{(\pi_1)}, u^{(\pi_2)}, \ldots, u^{(\pi_t)}), \\ \text{from } \pi_1 = j \text{ to } \pi_t = n}} \; \prod_{k=2}^{t} \frac{\partial u^{(\pi_k)}}{\partial u^{(\pi_{k-1})}}. \tag{6.55}$$
Since the number of paths from node $j$ to node $n$ can grow exponentially in the
length of these paths, the number of terms in the above sum, which is the number
of such paths, can grow exponentially with the depth of the forward propagation
graph. This large cost would be incurred because the same computation for $\partial u^{(i)} / \partial u^{(j)}$ would be redone many times. To avoid such recomputation, we can think of back-propagation as a table-filling algorithm that takes advantage of storing intermediate results $\partial u^{(n)} / \partial u^{(i)}$. Each node in the graph has a corresponding slot in a
table to store the gradient for that node. By filling in these table entries in order,
back-propagation avoids repeating many common subexpressions. This table-filling
strategy is sometimes called dynamic programming.
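The contrast can be made concrete with a toy count (an illustration, not from the text). In the synthetic layered graph below, every node feeds both nodes of the next layer, so the number of distinct input-to-output paths, and hence the number of terms in equation 6.55, grows as $2^L$ with the number of layers $L$, while the gradient table used by back-propagation needs only one entry per node.

```python
# Count the paths that the nonrecursive chain rule of equation 6.55 would sum
# over, versus the number of gradient-table entries back-propagation stores.
def build_layered_dag(num_layers, width=2):
    """Return consumers[node] for a DAG: 'in' -> layer 0 -> ... -> 'out'."""
    consumers = {"in": [(0, j) for j in range(width)]}
    for k in range(num_layers):
        for j in range(width):
            nxt = [(k + 1, j2) for j2 in range(width)] if k + 1 < num_layers else ["out"]
            consumers[(k, j)] = nxt
    consumers["out"] = []
    return consumers

def count_paths(node, consumers):
    if node == "out":
        return 1
    return sum(count_paths(c, consumers) for c in consumers[node])

for L in (2, 4, 8, 16):
    g = build_layered_dag(L)
    # paths grow as 2**L; the gradient table has only len(g) = 2L + 2 entries
    print(L, count_paths("in", g), len(g))
```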
6.5.7 Example: Back-Propagation for MLP Training
As an example, we walk through the back-propagation algorithm as it is used to
train a multilayer perceptron.
Here we develop a very simple multilayer perceptron with a single hidden
layer. To train this model, we will use minibatch stochastic gradient descent.
The back-propagation algorithm is used to compute the gradient of the cost on a
single minibatch. Specifically, we use a minibatch of examples from the training set formatted as a design matrix $X$ and a vector of associated class labels $y$. The network computes a layer of hidden features $H = \max\{0, XW^{(1)}\}$. To simplify the presentation we do not use biases in this model. We assume that our graph language includes a relu operation that can compute $\max\{0, Z\}$ element-wise. The predictions of the unnormalized log probabilities over classes are then given by $HW^{(2)}$. We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets $y$ and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost $J_{\text{MLE}}$. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier. However, to make this example more realistic, we also include a regularization term. The total cost
$$J = J_{\text{MLE}} + \lambda \left( \sum_{i,j} \left( W^{(1)}_{i,j} \right)^2 + \sum_{i,j} \left( W^{(2)}_{i,j} \right)^2 \right) \tag{6.56}$$
consists of the cross-entropy and a weight decay term with coefficient $\lambda$. The
computational graph is illustrated in figure 6.11.
The computational graph for the gradient of this example is large enough that
it would be tedious to draw or to read. This demonstrates one of the benefits
of the back-propagation algorithm, which is that it can automatically generate
gradients that would be straightforward but tedious for a software engineer to
derive manually.
We can roughly trace out the behavior of the back-propagation algorithm
by looking at the forward propagation graph in figure 6.11. To train, we wish to compute both $\nabla_{W^{(1)}} J$ and $\nabla_{W^{(2)}} J$. There are two different paths leading backward from $J$ to the weights: one through the cross-entropy cost, and one through the weight decay cost. The weight decay cost is relatively simple; it will always contribute $2\lambda W^{(i)}$ to the gradient on $W^{(i)}$.
The other path through the cross-entropy cost is slightly more complicated.
Let $G$ be the gradient on the unnormalized log probabilities $U^{(2)}$ provided by the cross_entropy operation.

Figure 6.11: The computational graph used to compute the cost to train our example of a single-layer MLP using the cross-entropy loss and weight decay.

The back-propagation algorithm now needs to
explore two different branches. On the shorter branch, it adds $H^\top G$ to the gradient on $W^{(2)}$, using the back-propagation rule for the second argument to the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes $\nabla_H J = G W^{(2)\top}$ using the back-propagation rule for the first argument to the matrix multiplication operation. Next, the relu operation uses its back-propagation rule to zero out components of the gradient corresponding to entries of $U^{(1)}$ that are less than 0. Let the result be called $G'$. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add $X^\top G'$ to the gradient on $W^{(1)}$.
After these gradients have been computed, the gradient descent algorithm, or
another optimization algorithm, uses these gradients to update the parameters.
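The walk-through above can be written out directly in NumPy. The sketch below is a minimal rendering of this example, with the softmax and one-hot details filled in as assumptions; it computes the two gradients traced above and checks one entry against a finite difference.

```python
import numpy as np

# Single hidden layer with relu, softmax cross-entropy, no biases, weight decay.
rng = np.random.default_rng(0)
m, n_in, n_h, n_out = 5, 4, 8, 3
X = rng.normal(size=(m, n_in))
y = rng.integers(n_out, size=m)
W1 = 0.1 * rng.normal(size=(n_in, n_h))
W2 = 0.1 * rng.normal(size=(n_h, n_out))
lam = 0.01

# Forward propagation.
U1 = X @ W1                                # pre-activations
H = np.maximum(0.0, U1)                    # H = max{0, X W1}
U2 = H @ W2                                # unnormalized log probabilities
P = np.exp(U2 - U2.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)          # softmax probabilities
J_mle = -np.mean(np.log(P[np.arange(m), y]))
J = J_mle + lam * (np.sum(W1**2) + np.sum(W2**2))

# Backward propagation, following the two branches described in the text.
G = P.copy()
G[np.arange(m), y] -= 1.0
G /= m                                     # gradient on U2 from cross_entropy
grad_W2 = H.T @ G + 2 * lam * W2           # shorter branch plus weight decay
G_H = G @ W2.T                             # gradient on H
G1 = G_H * (U1 > 0)                        # relu bprop: zero where U1 <= 0
grad_W1 = X.T @ G1 + 2 * lam * W1

# Finite-difference check of one entry of W1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Hp = np.maximum(0.0, X @ W1p)
Pp = np.exp(Hp @ W2); Pp /= Pp.sum(axis=1, keepdims=True)
Jp = -np.mean(np.log(Pp[np.arange(m), y])) + lam * (np.sum(W1p**2) + np.sum(W2**2))
print(np.isclose(grad_W1[0, 0], (Jp - J) / eps, atol=1e-4))   # True
```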
For the MLP, the computational cost is dominated by the cost of matrix
multiplication. During the forward propagation stage, we multiply by each weight
matrix, resulting in $O(w)$ multiply-adds, where $w$ is the number of weights. During
the backward propagation stage, we multiply by the transpose of each weight
matrix, which has the same computational cost. The main memory cost of the
algorithm is that we need to store the input to the nonlinearity of the hidden layer.
This value is stored from the time it is computed until the backward pass has
returned to the same point. The memory cost is thus $O(m n_h)$, where $m$ is the number of examples in the minibatch and $n_h$ is the number of hidden units.
6.5.8 Complications
Our description of the back-propagation algorithm here is simpler than the imple-
mentations actually used in practice.
As noted above, we have restricted the definition of an operation to be a
function that returns a single tensor. Most software implementations need to
support operations that can return more than one tensor. For example, if we wish
to compute both the maximum value in a tensor and the index of that value, it is
best to compute both in a single pass through memory, so it is most efficient to
implement this procedure as a single operation with two outputs.
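For instance, a two-output operation for the max/argmax case might look like the following sketch (illustrative names, not a library API); only the max output receives a gradient.

```python
import numpy as np

def max_and_argmax(x):
    # Find the maximum and its position together, rather than as two separate
    # operations that would each scan the tensor.
    idx = int(np.argmax(x))
    return x[idx], idx

def max_and_argmax_bprop(x, idx, grad_max):
    # Only the max output is differentiable; its gradient flows to the single
    # entry that achieved the maximum.  The index output has no gradient.
    g = np.zeros_like(x)
    g[idx] = grad_max
    return g

x = np.array([0.2, 1.5, -0.3])
m, i = max_and_argmax(x)
print(m, i, max_and_argmax_bprop(x, i, 1.0))   # 1.5 1 [0. 1. 0.]
```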
We have not described how to control the memory consumption of back-
propagation. Back-propagation often involves summation of many tensors together.
In the naive approach, each of these tensors would be computed separately, then
all of them would be added in a second step. The naive approach has an overly
high memory bottleneck that can be avoided by maintaining a single buffer and
adding each value to that buffer as it is computed.
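A sketch of the difference, assuming the individual gradient terms are produced one at a time by some callable: rather than materializing every term and summing afterwards, keep one buffer and accumulate each term into it as soon as it is produced.

```python
import numpy as np

def sum_naive(terms):
    tensors = [t() for t in terms]          # all terms live in memory at once
    return sum(tensors)

def sum_buffered(terms, shape):
    buf = np.zeros(shape)                   # single buffer
    for t in terms:
        np.add(buf, t(), out=buf)           # accumulate in place, then discard
    return buf

shape = (1000,)
terms = [lambda i=i: np.full(shape, float(i)) for i in range(5)]
print(np.allclose(sum_naive(terms), sum_buffered(terms, shape)))   # True
```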
Real-world implementations of back-propagation also need to handle various
data types, such as 32-bit floating point, 64-bit floating point, and integer values.
Designing the policy for handling each of these types requires special care.
Some operations have undefined gradients, and it is important to track these
cases and determine whether the gradient requested by the user is undefined.
Various other technicalities make real-world differentiation more complicated.
These technicalities are not insurmountable, and this chapter has described the key
intellectual tools needed to compute derivatives, but it is important to be aware
that many more subtleties exist.
6.5.9 Differentiation outside the Deep Learning Community
The deep learning community has been somewhat isolated from the broader
computer science community and has largely developed its own cultural attitudes
concerning how to perform differentiation. More generally, the field of
automatic
differentiation
is concerned with how to compute derivatives algorithmically.
The back-propagation algorithm described here is only one approach to automatic
differentiation. It is a special case of a broader class of techniques called reverse mode accumulation. Other approaches evaluate the subexpressions of the chain
rule in different orders. In general, determining the order of evaluation that
results in the lowest computational cost is a difficult problem. Finding the optimal
sequence of operations to compute the gradient is NP-complete (Naumann, 2008),
in the sense that it may require simplifying algebraic expressions into their least
expensive form.
For example, suppose we have variables $p_1, p_2, \ldots, p_n$ representing probabilities, and variables $z_1, z_2, \ldots, z_n$ representing unnormalized log probabilities. Suppose we define
$$q_i = \frac{\exp(z_i)}{\sum_i \exp(z_i)}, \tag{6.57}$$
where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss $J = -\sum_i p_i \log q_i$. A human mathematician can observe that the derivative of $J$ with respect to $z_i$ takes a very simple form: $q_i - p_i$. The back-propagation algorithm is not capable of simplifying
the gradient this way and will instead explicitly propagate gradients through all
the logarithm and exponentiation operations in the original graph. Some software
libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012) are able to
perform some kinds of algebraic substitution to improve over the graph proposed
by the pure back-propagation algorithm.
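The algebra can be checked numerically. The sketch below computes $\partial J / \partial z$ once by chaining gradients through the exp, sum, division, and log operations, the way a pure back-propagation graph would, and once with the simplified form $q - p$; for $p$ summing to one, the two agree. This is a check of the algebra, not any library's actual graph.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)
p = rng.random(5); p /= p.sum()

e = np.exp(z)
s = e.sum()
q = e / s
J = -np.sum(p * np.log(q))

# Chain-rule route: dJ/dq -> dJ/de (through q = e / s, with s = sum(e)) -> dJ/dz.
dJ_dq = -p / q
dJ_de = dJ_dq / s - np.sum(dJ_dq * e) / s**2   # quotient rule through the shared sum
dJ_dz = dJ_de * e                              # exp rule: de/dz = e

print(np.allclose(dJ_dz, q - p))               # True: the simplified form
```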
When the forward graph $G$ has a single output node and each partial derivative $\partial u^{(i)} / \partial u^{(j)}$ can be computed with a constant amount of computation, back-propagation guarantees that the number of computations for the gradient computation is of the same order as the number of computations for the forward computation: this can be seen in algorithm 6.2, because each local partial derivative $\partial u^{(i)} / \partial u^{(j)}$ needs to
be computed only once along with an associated multiplication and addition for
the recursive chain-rule formulation (equation 6.53). The overall computation is
therefore $O(\#\text{edges})$. It can potentially be reduced, however, by simplifying the
computational graph constructed by back-propagation, and this is an NP-complete
task. Implementations such as Theano and TensorFlow use heuristics based on
matching known simplification patterns to iteratively attempt to simplify the graph.
We defined back-propagation only for computing a scalar output gradient, but
back-propagation can be extended to compute a Jacobian (either of $k$ different scalar nodes in the graph, or of a tensor-valued node containing $k$ values). A naive implementation may then need $k$ times more computation: for each scalar internal node in the original forward graph, the naive implementation computes $k$ gradients instead of a single gradient. When the number of outputs of the graph is larger than the number of inputs, it is sometimes preferable to use another form of automatic differentiation called forward mode accumulation. Forward mode
accumulation has been proposed for obtaining real-time computation of gradients
in recurrent networks, for example (Williams and Zipser, 1989). This approach also
avoids the need to store the values and gradients for the whole graph, trading off
computational efficiency for memory. The relationship between forward mode and
backward mode is analogous to the relationship between left-multiplying versus
right-multiplying a sequence of matrices, such as
$$ABCD, \tag{6.58}$$
where the matrices can be thought of as Jacobian matrices. For example, if $D$ is a column vector while $A$ has many rows, the graph will have a single output and many inputs, and starting the multiplications from the end and going backward requires only matrix-vector products. This order corresponds to the backward mode. Instead, starting to multiply from the left would involve a series of matrix-matrix products, which makes the whole computation much more expensive. If $A$ has fewer rows than $D$ has columns, however, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode.
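The cost difference is easy to see numerically. In the sketch below, multiplying right-to-left keeps every intermediate a column vector, while multiplying left-to-right forms large matrix-matrix intermediates; both orders give the same result at very different cost.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 500))
B = rng.normal(size=(500, 500))
C = rng.normal(size=(500, 500))
D = rng.normal(size=(500, 1))

right_to_left = A @ (B @ (C @ D))     # matrix-vector products only
left_to_right = ((A @ B) @ C) @ D     # large matrix-matrix intermediates
print(np.allclose(right_to_left, left_to_right))   # True; same value, very different cost

# Rough multiply-add counts:
#   right-to-left ~ 500*500 + 500*500 + 1000*500            = 1.0e6
#   left-to-right ~ 1000*500*500 + 1000*500*500 + 1000*500  = 5.0e8
```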
In many communities outside machine learning, it is more common to implement
differentiation software that acts directly on traditional programming language
code, such as Python or C code, and automatically generates programs that
differentiate functions written in these languages. In the deep learning community,
computational graphs are usually represented by explicit data structures created
by specialized libraries. The specialized approach has the drawback of requiring
the library developer to define the
bprop
methods for every operation and limiting
the user of the library to only those operations that have been defined. Yet the
specialized approach also has the benefit of allowing customized back-propagation
rules to be developed for each operation, enabling the developer to improve speed
or stability in nonobvious ways that an automatic procedure would presumably be
unable to replicate.
Back-propagation is therefore not the only way or the optimal way of computing
the gradient, but it is a practical method that continues to serve the deep learning
community well. In the future, differentiation technology for deep networks may
improve as deep learning practitioners become more aware of advances in the
broader field of automatic differentiation.
6.5.10 Higher-Order Derivatives
Some software frameworks support the use of higher-order derivatives. Among the
deep learning software frameworks, this includes at least Theano and TensorFlow.
These libraries use the same kind of data structure to describe the expressions for
derivatives as they use to describe the original function being differentiated. This
means that the symbolic differentiation machinery can be applied to derivatives.
In the context of deep learning, it is rare to compute a single second derivative
of a scalar function. Instead, we are usually interested in properties of the Hessian
matrix. If we have a function $f : \mathbb{R}^n \to \mathbb{R}$, then the Hessian matrix is of size $n \times n$. In typical deep learning applications, $n$ will be the number of parameters in the
model, which could easily number in the billions. The entire Hessian matrix is
thus infeasible to even represent.
Instead of explicitly computing the Hessian, the typical deep learning approach
is to use
Krylov methods
. Krylov methods are a set of iterative techniques for
performing various operations, such as approximately inverting a matrix or finding
approximations to its eigenvectors or eigenvalues, without using any operation
other than matrix-vector products.
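As a toy illustration of the "matrix-vector products only" idea, power iteration, one of the simplest such methods (not the particular Krylov algorithms, such as Lanczos or conjugate gradients, used in practice), estimates the largest eigenvalue of $H$ while touching $H$ only through products $Hv$.

```python
import numpy as np

def power_iteration(matvec, n, num_iters=200, seed=0):
    v = np.random.default_rng(seed).normal(size=n)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = matvec(v)                    # the only way H is ever accessed
        v = w / np.linalg.norm(w)
    return v @ matvec(v)                 # Rayleigh quotient ~ top eigenvalue

H = np.diag([5.0, 2.0, 1.0])             # stand-in for a Hessian we never form in practice
print(power_iteration(lambda v: H @ v, 3))   # ~5.0
```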
To use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix $H$ and an arbitrary vector $v$. A straightforward technique (Christianson, 1992) for doing so is to compute
$$H v = \nabla_x \left[ \left( \nabla_x f(x) \right)^\top v \right]. \tag{6.59}$$
Both gradient computations in this expression may be computed automatically by
the appropriate software library. Note that the outer gradient expression takes the
gradient of a function of the inner gradient expression.
If $v$ is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced $v$.
While computing the entire Hessian is usually not advisable, it is possible to do so with Hessian-vector products: one simply computes $H e^{(i)}$ for all $i = 1, \ldots, n$, where $e^{(i)}$ is the one-hot vector with $e^{(i)}_i = 1$ and all other entries equal to 0.
6.6 Historical Notes
Feedforward networks can be seen as efficient nonlinear function approximators
based on using gradient descent to minimize the error in a function approximation.
From this point of view, the modern feedforward network is the culmination of
centuries of progress on the general function approximation task.
The chain rule that underlies the back-propagation algorithm was invented in
the seventeenth century (Leibniz, 1676; L’Hôpital, 1696). Calculus and algebra
have long been used to solve optimization problems in closed form, but gradient
descent was not introduced as a technique for iteratively approximating the solution
to optimization problems until the nineteenth century (Cauchy, 1847).
Beginning in the 1940s, these function approximation techniques were used to
motivate machine learning models such as the perceptron. However, the earliest
models were based on linear models. Critics including Marvin Minsky pointed out
several of the flaws of the linear model family, such as its inability to learn the
XOR function, which led to a backlash against the entire neural network approach.
Learning nonlinear functions required the development of a multilayer per-
ceptron and a means of computing the gradient through such a model. Efficient
applications of the chain rule based on dynamic programming began to appear
in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and
Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for
sensitivity analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these
techniques to training artificial neural networks. The idea was finally developed
in practice after being independently rediscovered in different ways (LeCun, 1985;
Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented the results of some of the first successful experiments with
back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly
to the popularization of back-propagation and initiated a very active period of re-
search in multilayer neural networks. The ideas put forward by the authors of that
book, particularly by Rumelhart and Hinton, go much beyond back-propagation.
They include crucial ideas about the possible computational implementation of
several central aspects of cognition and learning, which came under the name
“connectionism” because of the importance this school of thought places on the
connections between neurons as the locus of learning and memory. In particular,
these ideas include the notion of distributed representation (Hinton et al., 1986).
Following the success of back-propagation, neural network research gained pop-
ularity and reached a peak in the early 1990s. Afterwards, other machine learning
techniques became more popular until the modern deep learning renaissance that
began in 2006.
The core ideas behind modern feedforward networks have not changed sub-
stantially since the 1980s. The same back-propagation algorithm and the same
approaches to gradient descent are still in use. Most of the improvement in neural
network performance from 1986 to 2015 can be attributed to two factors. First,
larger datasets have reduced the degree to which statistical generalization is a
challenge for neural networks. Second, neural networks have become much larger,
because of more powerful computers and better software infrastructure. A small
number of algorithmic changes have also improved the performance of neural
networks noticeably.
One of these algorithmic changes was the replacement of mean squared error
with the cross-entropy family of loss functions. Mean squared error was popular in
the 1980s and 1990s but was gradually replaced by cross-entropy losses and the
principle of maximum likelihood as ideas spread between the statistics community
and the machine learning community. The use of cross-entropy losses greatly
improved the performance of models with sigmoid and softmax outputs, which
had previously suffered from saturation and slow learning when using the mean
squared error loss.
The other major algorithmic change that has greatly improved the performance
of feedforward networks was the replacement of sigmoid hidden units with piecewise
linear hidden units, such as rectified linear units. Rectification using the $\max\{0, z\}$ function was introduced in early neural network models and dates back at least as far
as the cognitron and neocognitron (Fukushima, 1975, 1980). These early models did
not use rectified linear units but instead applied rectification to nonlinear functions.
Despite the early popularity of rectification, it was largely replaced by sigmoids
in the 1980s, perhaps because sigmoids perform better when neural networks are
very small. As of the early 2000s, rectified linear units were avoided because of
a somewhat superstitious belief that activation functions with nondifferentiable
points must be avoided. This began to change in about 2009. Jarrett et al. (2009)
observed that “using a rectifying nonlinearity is the single most important factor
in improving the performance of a recognition system,” among several different
factors of neural network architecture design.
For small datasets, Jarrett et al. (2009) observed that using rectifying non-
linearities is even more important than learning the weights of the hidden layers.
Random weights are sufficient to propagate useful information through a rectified
linear network, enabling the classifier layer at the top to learn how to map different
feature vectors to class identities.
When more data is available, learning begins to extract enough useful knowledge
to exceed the performance of randomly chosen parameters. Glorot et al. (2011a)
showed that learning is far easier in deep rectified linear networks than in deep
networks that have curvature or two-sided saturation in their activation functions.
Rectified linear units are also of historical interest because they show that
neuroscience has continued to have an influence on the development of deep
learning algorithms. Glorot et al. (2011a) motivated rectified linear units from
biological considerations. The half-rectifying nonlinearity was intended to capture
these properties of biological neurons: (1) For some inputs, biological neurons
are completely inactive. (2) For some inputs, a biological neuron’s output is
proportional to its input. (3) Most of the time, biological neurons operate in the
regime where they are inactive (i.e., they should have sparse activations).
When the modern resurgence of deep learning began in 2006, feedforward
networks continued to have a bad reputation. From about 2006 to 2012, it was
widely believed that feedforward networks would not perform well unless they were
assisted by other models, such as probabilistic models. Today, it is now known
that with the right resources and engineering practices, feedforward networks
perform very well. Today, gradient-based learning in feedforward networks is used
as a tool to develop probabilistic models, such as the variational autoencoder
and generative adversarial networks, described in chapter 20. Rather than being
viewed as an unreliable technology that must be supported by other techniques,
gradient-based learning in feedforward networks has been viewed since 2012 as a
powerful technology that can be applied to many other machine learning tasks. In
2006, the community used unsupervised learning to support supervised learning,
and now, ironically, it is more common to use supervised learning to support
unsupervised learning.
Feedforward networks continue to have unfulfilled potential. In the future, we
expect they will be applied to many more tasks, and that advances in optimization
algorithms and model design will improve their performance even further. This
chapter has primarily described the neural network family of models. In the
subsequent chapters, we turn to how to use these models—how to regularize and
train them.