Time delay neural network

Time delay neural network classify patterns with shift-invariance, and 2) model context at each layer of the network.
Shift-invariant classification means that the classifier does not require explicit segmentation prior to classification. For the classification of a temporal pattern, the TDNN thus avoids having to determine the beginning and end points of sounds before classifying them.
For contextual modelling in a TDNN, each neural unit at each layer receives input not only from activations/features at the layer below, but from a pattern of unit output and its context. For time signals each unit receives as input the activation patterns over time from units below. Applied to two-dimensional classification, the TDNN can be trained with shift-invariance in the coordinate space and avoids precise segmentation in the coordinate space.

History

The TDNN was introduced in the late 1980s and applied to a task of phoneme classification for automatic speech recognition in speech signals where the automatic determination of precise segments or feature boundaries was difficult or impossible. Because the TDNN recognizes phonemes and their underlying acoustic/phonetic features, independent of position in time, it improved performance over static classification. It was also applied to two-dimensional signals.

Max pooling

In 1990, Yamaguchi et al. introduced the concept of max pooling. They did so by combining TDNNs with max pooling in order to realize a speaker independent isolated word recognition system.

Overview

The Time Delay Neural Network, like other neural networks, operates with multiple interconnected layers of perceptrons, and is implemented as a feedforward neural network. All neurons of a TDNN receive inputs from the outputs of neurons at the layer below but with two differences:

Unlike regular Multi-Layer perceptrons, all units in a TDNN, at each layer, obtain inputs from a contextual window of outputs from the layer below. For time varying signals, each unit has connections to the output from units below but also to the time-delayed outputs from these same units. This models the units' temporal pattern/trajectory. For two-dimensional signals, a 2-D context window is observed at each layer. Higher layers have inputs from wider context windows than lower layers and thus generally model coarser levels of abstraction.
Shift-invariance is achieved by explicitly removing position dependence during backpropagation training. This is done by making time-shifted copies of a network across the dimension of invariance. The error gradient is then computed by backpropagation through all these networks from an overall target vector, but before performing the weight update, the error gradients associated with shifted copies are averaged and thus shared and constraint to be equal. Thus, all position dependence from backpropagation training through the shifted copies is removed and the copied networks learn the most salient hidden features shift-invariantly, i.e. independent of their precise position in the input data. Shift-invariance is also readily extended to multiple dimensions by imposing similar weight-sharing across copies that are shifted along multiple dimensions.
Example

In the case of a speech signal, inputs are spectral coefficients over time.
In order to learn critical acoustic-phonetic features without first requiring precise localization, the TDNN is trained time-shift-invariantly. Time-shift invariance is achieved through weight sharing across time during training: Time shifted copies of the TDNN are made over the input range. Backpropagation is then performed from an overall classification target vector, resulting in gradients that will generally vary for each of the time-shifted network copies. Since such time-shifted networks are only copies, however, the position dependence is removed by weight sharing. In this example, this is done by averaging the gradients from each time-shifted copy before performing the weight update. In speech, time-shift invariant training was shown to learn weight matrices that are independent of precise positioning of the input. The weight matrices could also be shown to detect important acoustic-phonetic features that are known to be important for human speech perception, such as formant transitions, bursts, etc. TDNNs could also be combined or grown by way of pre-training.

Implementation

The precise architecture of TDNNs is mostly determined by the designer depending on the classification problem and the most useful context sizes. The delays or context windows are chosen specific to each application. Work has also been done to create adaptable time-delay TDNNs where this manual tuning is eliminated.

State of the art

TDNN-based phoneme recognizers compared favourably in early comparisons with HMM-based phone models. Modern deep TDNN architectures include many more hidden layers and sub-sample or pool connections over broader contexts at higher layers. They achieve up to 50% word error reduction over GMM-based acoustic models. While the different layers of TDNNs are intended to learn features of increasing context width, they do model local contexts. When longer-distance relationships and pattern sequences have to be processed, learning states and state-sequences is important and TDNNs can be combined with other modelling techniques.

Applications

Speech recognition

TDNNs used to solve problems in speech recognition that were introduced in 1987 and initially focused on shift-invariant phoneme recognition. Speech lends itself nicely to TDNNs as spoken sounds are rarely of uniform length and precise segmentation is difficult or impossible. By scanning a sound over past and future, the TDNN is able to construct a model for the key elements of that sound in a time-shift invariant manner. This is particularly useful as sounds are smeared out through reverberation. Large phonetic TDNNs can be constructed modularly through pre-training and combining smaller networks.

Large vocabulary speech recognition

Large vocabulary speech recognition requires recognizing sequences of phonemes that make up words subject to the constraints of a large pronunciation vocabulary. Integration of TDNNs into large vocabulary speech recognizers is possible by introducing state transitions and search between phonemes that make up a word. The resulting Multi-State Time-Delay Neural Network can be trained discriminative from the word level, thereby optimizing the entire arrangement toward word recognition instead of phoneme classification.

Speaker independence

Two-dimensional variants of the TDNNs were proposed for speaker independence. Here, shift-invariance is applied to the time as well as to the frequency axis in order to learn hidden features that are independent of precise location in time and in frequency.

Reverberation

One of the persistent problems in speech recognition is recognizing speech when it is corrupted by echo and reverberation. Reverberation can be viewed as corrupting speech with delayed versions of itself. In general, it is difficult, however, to de-reverberate a signal as the impulse response function is not known for any arbitrary space. The TDNN was shown to be effective to recognize speech robustly despite different levels of reverberation.

Lip-reading – audio-visual speech

TDNNs were also successfully used in early demonstrations of audio-visual speech, where the sounds of speech are complemented by visually reading lip movement. Here, TDNN-based recognizers used visual and acoustic features jointly to achieve improved recognition accuracy, particularly in the presence of noise, where complementary information from an alternate modality could be fused nicely in a neural net.

Handwriting recognition

TDNNs have been used effectively in compact and high-performance handwriting recognition systems. Shift-invariance was also adapted to spatial patterns in image offline handwriting recognition.

Video analysis

Video has a temporal dimension that makes a TDNN an ideal solution to analysing motion patterns. An example of this analysis is a combination of vehicle detection and recognizing pedestrians. When examining videos, subsequent images are fed into the TDNN as input where each image is the next frame in the video. The strength of the TDNN comes from its ability to examine objects shifted in time forward and backward to define an object detectable as the time is altered. If an object can be recognized in this manner, an application can plan on that object to be found in the future and perform an optimal action.

Image recognition

Two-dimensional TDNNs were later applied to other image-recognition tasks under the name of "Convolutional Neural Networks", where shift-invariant training is applied to the x/y axes of an image.

Common libraries

TDNNs can be implemented in virtually all machine-learning frameworks using one-dimensional convolutional neural networks, due to the equivalence of the methods.
Matlab: The neural network toolbox has explicit functionality designed to produce a time delay neural network give the step size of time delays and an optional training function. The default training algorithm is a Supervised Learning back-propagation algorithm that updates filter weights based on the Levenberg-Marquardt optimizations. The function is timedelaynet and returns a time-delay neural network architecture that a user can train and provide inputs to.
The Kaldi ASR Toolkit has an implementation of TDNNs with several optimizations for speech recognition.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...