CHAPTER 11. PRACTICAL METHODOLOGY
because increasing the number of hidden units increases the capacity of the model.
For some hyperparameters, overfitting occurs when the value of the hyperparameter
is small. For example, the smallest allowable weight decay coefficient of zero
corresponds to the greatest effective capacity of the learning algorithm.
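The relationship between weight decay and capacity can be illustrated with a small sketch. It assumes a linear model trained by ridge regression (L2 weight decay with a closed-form solution) on synthetic data; the helper names `ridge_fit` and `training_mse`, the data, and the list of coefficients are all hypothetical choices made for illustration.

```python
import numpy as np

def ridge_fit(X, y, weight_decay):
    """Closed-form minimizer of ||Xw - y||^2 + weight_decay * ||w||^2."""
    n_features = X.shape[1]
    A = X.T @ X + weight_decay * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

def training_mse(X, y, w):
    """Mean squared error of the linear model w on the training data."""
    return float(np.mean((X @ w - y) ** 2))

# Synthetic regression problem (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

decays = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = [training_mse(X, y, ridge_fit(X, y, lam)) for lam in decays]
for lam, err in zip(decays, errors):
    print(f"weight decay {lam:6.1f}: training MSE {err:.4f}")
```

In this setting the training error is lowest at a weight decay of zero and does not decrease as the coefficient grows, which is what it means for zero to correspond to the greatest effective capacity.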
Not every hyperparameter will be able to explore the entire U-shaped curve.
Many hyperparameters are discrete, such as the number of units in a layer or the
number of linear pieces in a maxout unit, so it is only possible to visit a few points
along the curve. Some hyperparameters are binary. Usually these hyperparameters
are switches that specify whether or not to use some optional component of
the learning algorithm, such as a preprocessing step that normalizes the input
features by subtracting their mean and dividing by their standard deviation. These
hyperparameters can only explore two points on the curve. Other hyperparameters
have some minimum or maximum value that prevents them from exploring some
part of the curve. For example, the minimum weight decay coefficient is zero. This
means that if the model is underfitting when weight decay is zero, we cannot enter
the overfitting region by modifying the weight decay coefficient. In other words,
some hyperparameters can only subtract capacity.
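A binary hyperparameter of the kind described above can be sketched as a boolean switch on a preprocessing function. The function name `preprocess`, the flag name `standardize`, and the guard for constant features are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def preprocess(X, standardize):
    """Optionally normalize each feature (column) to zero mean and unit std."""
    if not standardize:
        return X
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Assumption: leave constant features unscaled instead of dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
for flag in (False, True):
    Xp = preprocess(X, flag)
    print(f"standardize={flag}: column means {Xp.mean(axis=0)}, column stds {Xp.std(axis=0)}")
```

Because the switch has only two settings, a search over it evaluates exactly two configurations, one with the preprocessing step and one without.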
The learning rate is perhaps the most important hyperparameter. If you
have time to tune only one hyperparameter, tune the learning rate. It controls
the effective capacity of the model in a more complicated way than other
hyperparameters: the effective capacity of the model is highest when the learning
rate is correct for the optimization problem, not when the learning rate is especially
large or especially small. The learning rate has a U-shaped curve for training error,
illustrated in figure 11.1. When the learning rate is too large, gradient descent
can inadvertently increase rather than decrease the training error. In the idealized
quadratic case, this occurs if the learning rate is at least twice as large as its
optimal value (LeCun et al., 1998a). When the learning rate is too small, training
is not only slower, but may become permanently stuck with a high training error.
This effect is poorly understood (it would not happen for a convex loss function).
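The idealized quadratic case can be checked numerically. The sketch below assumes a one-dimensional loss f(x) = 0.5 * h * x^2 with curvature h, for which the optimal learning rate is 1/h; the function name `run_gd` and the particular values of the curvature, starting point, and step count are hypothetical.

```python
def run_gd(lr, curvature=4.0, x0=1.0, steps=20):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2; return the final loss."""
    x = x0
    for _ in range(steps):
        grad = curvature * x      # derivative of the quadratic loss
        x = x - lr * grad         # each step multiplies x by (1 - lr * curvature)
    return 0.5 * curvature * x ** 2

curvature = 4.0
optimal_lr = 1.0 / curvature      # reaches the minimum in a single step
initial_loss = 0.5 * curvature * 1.0 ** 2

for factor in (0.1, 1.0, 1.9, 2.0, 2.1):
    loss = run_gd(factor * optimal_lr, curvature)
    print(f"lr = {factor:.1f} x optimal: final loss {loss:.6f} (initial {initial_loss:.6f})")
```

Each update multiplies x by (1 - lr * h). Below twice the optimal rate this factor has magnitude less than one and the loss shrinks; at exactly twice the optimal rate the iterate flips sign on every step and the loss never decreases; above it the loss grows.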
Tuning the hyperparameters other than the learning rate requires monitoring both
training and test error to diagnose whether your model is overfitting or underfitting,
then adjusting its capacity appropriately.
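This diagnosis can be written as a simple decision rule. The sketch below is a deliberate simplification: the function name `diagnose`, its return labels, and the choice to compare each error directly against the target are assumptions made for illustration, not a prescribed procedure.

```python
def diagnose(train_error, test_error, target_error):
    """Classify a trained model by comparing its errors to a target error rate."""
    if train_error > target_error:
        # The model cannot even fit the training set well enough.
        return "underfitting"
    if test_error > target_error:
        # Training error is acceptable, but it does not carry over to test data.
        return "overfitting"
    return "on target"

# Hypothetical error rates, for illustration only.
print(diagnose(train_error=0.20, test_error=0.22, target_error=0.10))
print(diagnose(train_error=0.02, test_error=0.18, target_error=0.10))
print(diagnose(train_error=0.05, test_error=0.08, target_error=0.10))
```

The training error is checked first because a model that misses the target on its own training data cannot be improved by reducing capacity.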
If your error on the training set is higher than your target error rate, you have
no choice but to increase capacity. If you are not using regularization and you are
confident that your optimization algorithm is performing correctly, then you must
add more layers to your network or add more hidden units. Unfortunately, this
increases the computational costs associated with the model.
If your error on the test set is higher than your target error rate, you can