
neural nets (Bourlard and Wellekens, 1989; Waibel et al., 1989; Robinson and
Fallside, 1991; Bengio et al., 1991, 1992; Konig et al., 1996). At the time, the
performance of ASR based on neural nets approximately matched the performance
of GMM-HMM systems. For example, Robinson and Fallside (1991) achieved a 26
percent phoneme error rate on the TIMIT (Garofolo et al., 1993) corpus (with
39 phonemes to discriminate among), which was better than or comparable to
HMM-based systems. Since then, TIMIT has been a benchmark for phoneme
recognition, playing a role similar to the role MNIST plays for object recognition.
Nonetheless, because of the complex engineering involved in software systems for
speech recognition and the effort that had been invested in building these systems
on the basis of GMM-HMMs, the industry did not see a compelling argument
for switching to neural networks. As a consequence, until the late 2000s, both
academic and industrial research in using neural nets for speech recognition mostly
focused on using neural nets to learn extra features for GMM-HMM systems.
Later, with much larger and deeper models and much larger datasets, recognition
accuracy was dramatically improved by using neural networks to replace GMMs
for the task of associating acoustic features to phonemes (or subphonemic states).
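To make this hybrid recipe concrete, the sketch below shows, in PyTorch (which
these systems predate), a feedforward acoustic model that maps a fixed-size
window of spectral frames to conditional probabilities of HMM states for the
window's center frame, as described in more detail below. The layer sizes, the
40-coefficient filterbank input, and the 183-state output (61 TIMIT phones
times 3 subphonemic states) are illustrative assumptions rather than the
configuration of any published system.

    import torch
    import torch.nn as nn

    N_FILTERBANKS = 40   # spectral coefficients per frame (assumed)
    CONTEXT = 5          # context frames on each side of the center frame
    N_HMM_STATES = 183   # e.g., 61 TIMIT phones x 3 subphonemic states

    class AcousticModel(nn.Module):
        def __init__(self):
            super().__init__()
            in_dim = N_FILTERBANKS * (2 * CONTEXT + 1)
            self.net = nn.Sequential(
                nn.Linear(in_dim, 1024), nn.Sigmoid(),
                nn.Linear(1024, 1024), nn.Sigmoid(),
                nn.Linear(1024, N_HMM_STATES),
            )

        def forward(self, window):
            # window: (batch, 2*CONTEXT+1, N_FILTERBANKS); flatten the
            # whole window into one input vector
            logits = self.net(window.flatten(start_dim=1))
            # log conditional probabilities of HMM states for the center
            # frame; at decoding time these posteriors are rescaled into
            # likelihoods for the HMM
            return torch.log_softmax(logits, dim=-1)

    model = AcousticModel()
    windows = torch.randn(8, 2 * CONTEXT + 1, N_FILTERBANKS)
    state_log_probs = model(windows)  # shape: (8, N_HMM_STATES)

The sigmoid hidden units reflect the practice of that era; as noted below,
later systems moved to rectified linear units.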
Starting in 2009, speech researchers applied a form of deep learning based on
unsupervised learning to speech recognition. This approach to deep learning was
based on training undirected probabilistic models called restricted Boltzmann
machines (RBMs) to model the input data. RBMs are described in part III.
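As a rough illustration of this building block, the following sketch shows a
single contrastive-divergence (CD-1) update for a binary-binary RBM; the
function name, sizes, and learning rate are hypothetical, and the
Gaussian-visible-unit variant needed for real-valued acoustic input is omitted
for brevity.

    import torch

    def cd1_step(v0, W, b, c, lr=0.01):
        # v0: (batch, n_visible) binary data; W: (n_visible, n_hidden);
        # b, c: visible and hidden biases (all hypothetical names)
        # Positive phase: hidden probabilities and samples given the data
        ph0 = torch.sigmoid(v0 @ W + c)
        h0 = torch.bernoulli(ph0)
        # Negative phase: one Gibbs step back down and up again
        pv1 = torch.sigmoid(h0 @ W.t() + b)
        v1 = torch.bernoulli(pv1)
        ph1 = torch.sigmoid(v1 @ W + c)
        # Approximate log-likelihood gradient: data statistics minus
        # reconstruction statistics
        W += lr * (v0.t() @ ph0 - v1.t() @ ph1) / v0.shape[0]
        b += lr * (v0 - v1).mean(dim=0)
        c += lr * (ph0 - ph1).mean(dim=0)

    n_visible, n_hidden = 440, 1024                # illustrative sizes
    W = 0.01 * torch.randn(n_visible, n_hidden)
    b = torch.zeros(n_visible)
    c = torch.zeros(n_hidden)
    data = torch.bernoulli(torch.rand(8, n_visible))  # stand-in binary data
    cd1_step(data, W, b, c)

Greedy layer-wise pretraining trains the next RBM on the hidden-unit
activations of the previous one; the resulting weights then initialize the
layers of the feedforward network described next.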
To solve speech recognition tasks, unsupervised pretraining was used to build
deep feedforward networks whose layers were each initialized by training an RBM.
These networks take spectral acoustic representations in a fixed-size input window
(around a center frame) and predict the conditional probabilities of HMM states
for that center frame. Training such deep networks helped to significantly improve
the recognition rate on TIMIT (Mohamed et al., 2009, 2012a), bringing down the
phoneme error rate from about 26 percent to 20.7 percent. See Mohamed et al.
(2012b) for an analysis of reasons for the success of these models. Extensions to the
basic phone recognition pipeline included the addition of speaker-adaptive features
(Mohamed et al., 2011) that further reduced the error rate. This was quickly
followed by work expanding the architecture from phoneme recognition (which
is what TIMIT is focused on) to large-vocabulary speech recognition (Dahl et al.,
2012), which requires recognizing not just phonemes but sequences of words
drawn from a large vocabulary. Deep networks for speech recognition eventually
shifted from being based on pretraining and Boltzmann machines to being based
on techniques such as rectified linear units and dropout (Zeiler et al., 2013; Dahl
et al., 2013). By that time, several of the major speech groups in industry had
started exploring deep learning in collaboration with academic researchers. Hinton