Residual neural network


A residual neural network (ResNet) is an artificial neural network that builds on constructs known from pyramidal cells in the cerebral cortex. It does this by using skip connections, or shortcuts, to jump over some layers. Typical ResNet models are implemented with double- or triple-layer skips that contain nonlinearities and batch normalization in between. An additional weight matrix may be used to learn the skip weights; these models are known as HighwayNets. Models with several parallel skips are referred to as DenseNets. In the context of residual neural networks, a non-residual network may be described as a plain network.
One motivation for skipping over layers is to avoid the problem of vanishing gradients by reusing activations from a previous layer until the adjacent layer learns its weights. During training, the weights adapt to mute the upstream layer and amplify the previously skipped layer. In the simplest case, only the weights for the adjacent layer's connection are adapted, with no explicit weights for the upstream layer. This works best when a single nonlinear layer is stepped over, or when the intermediate layers are all linear. If not, then an explicit weight matrix should be learned for the skipped connection.
Skipping effectively simplifies the network, using fewer layers in the initial training stages. This speeds learning by reducing the impact of vanishing gradients, as there are fewer layers to propagate through. The network then gradually restores the skipped layers as it learns the feature space. Towards the end of training, when all layers are expanded, the network stays closer to the manifold and thus learns faster. A neural network without residual parts explores more of the feature space. This makes it more vulnerable to perturbations that cause it to leave the manifold, and it then needs extra training data to recover.
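The effect on gradients can be illustrated with a toy calculation. The sketch below is not taken from any particular ResNet implementation: it assumes scalar layers with a shared weight w and a tanh nonlinearity, and it assumes the skip is added after the nonlinearity (a common simplification that differs slightly from the formulation under Forward propagation). The function names are hypothetical. It compares the derivative of the output with respect to the input for a plain chain and for a chain with an identity skip around every layer.

```python
import math

def plain_chain_gradient(x, w, depth):
    """d(output)/d(input) for `depth` stacked layers a <- tanh(w * a)."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = math.tanh(w * a)
        # Chain rule: d tanh(w * a_prev) / d a_prev = w * (1 - tanh^2).
        grad *= w * (1.0 - a * a)
    return grad

def residual_chain_gradient(x, w, depth):
    """d(output)/d(input) for `depth` stacked layers a <- a + tanh(w * a)."""
    a, grad = x, 1.0
    for _ in range(depth):
        f = math.tanh(w * a)
        # The identity skip contributes the constant 1 to every factor.
        grad *= 1.0 + w * (1.0 - f * f)
        a = a + f
    return grad

# Assumed toy settings: input 0.5, weight 0.5, 30 layers.
plain_grad = plain_chain_gradient(0.5, 0.5, 30)
residual_grad = residual_chain_gradient(0.5, 0.5, 30)
print("plain chain gradient:   ", plain_grad)
print("residual chain gradient:", residual_grad)
```

In the plain chain every factor is at most w, so the product shrinks geometrically with depth. With the skip, every factor is at least 1, so under these assumptions the gradient cannot vanish.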

Biological analogue

The brain has structures similar to residual nets, as cortical layer VI neurons get input from layer I, skipping intermediary layers. In the figure this compares to signals from the apical dendrite skipping over layers, while the basal dendrite collects signals from the previous and/or same layer. Similar structures exist for other layers. It is not clear how many layers in the cerebral cortex compare to layers in an artificial neural network, nor whether every area in the cerebral cortex exhibits the same structure, but over large areas they appear similar.

Forward propagation

For single skips, the layers may be indexed either as $\ell-2$ to $\ell$ or as $\ell$ to $\ell+2$. The two indexing systems are convenient when describing skips as going backward or forward. As signal flows forward through the network it is easier to describe the skip as $\ell+k$ from a given layer, but as a learning rule it is easier to describe which activation layer you reuse as $\ell-k$, where $k-1$ is the skip number.
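Given a weight matrix $W^{\ell-1,\ell}$ for connection weights from layer $\ell-1$ to $\ell$, and a weight matrix $W^{\ell-2,\ell}$ for connection weights from layer $\ell-2$ to $\ell$, then the forward propagation through the activation function would be
$$a^{\ell} := g\left(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^{\ell} + W^{\ell-2,\ell} \cdot a^{\ell-2}\right)$$
where $a^{\ell}$ denotes the activations (outputs) of neurons in layer $\ell$, $g$ the activation function for layer $\ell$, and $b^{\ell}$ the bias of layer $\ell$.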
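Absent an explicit matrix $W^{\ell-2,\ell}$, forward propagation through the activation function simplifies to
$$a^{\ell} := g\left(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^{\ell} + a^{\ell-2}\right)$$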
Another way to formulate this is to substitute an identity matrix for $W^{\ell-2,\ell}$, but that is only valid when the dimensions match. This is somewhat confusingly called an identity block, which means that the activations from layer $\ell-2$ are passed to layer $\ell$ without weighting.
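The single-skip forward step can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the names matvec and forward_skip are hypothetical, tanh is an assumed choice for the activation function, and vectors and matrices are represented as nested lists to keep the example self-contained. Passing no skip matrix gives the identity skip, which is only accepted when the dimensions match.

```python
import math

def matvec(W, a):
    """Multiply matrix W (list of rows) by vector a."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) for row in W]

def forward_skip(a_prev2, a_prev1, W_direct, b, W_skip=None, g=math.tanh):
    """Activations of layer l from layers l-1 (direct) and l-2 (skip).

    W_skip=None means an identity skip: a_prev2 is added unweighted.
    """
    z = [zi + bi for zi, bi in zip(matvec(W_direct, a_prev1), b)]
    if W_skip is None:
        if len(a_prev2) != len(z):
            raise ValueError("identity skip requires matching dimensions")
        skip = a_prev2
    else:
        skip = matvec(W_skip, a_prev2)
    return [g(zi + si) for zi, si in zip(z, skip)]

# Assumed toy values for illustration only.
a_prev2 = [1.0, -1.0]                 # activations of layer l-2
a_prev1 = [0.5, 0.25]                 # activations of layer l-1
W_direct = [[1.0, 0.0], [0.0, 2.0]]   # weights from layer l-1 to l
b = [0.0, 0.0]

out_identity = forward_skip(a_prev2, a_prev1, W_direct, b)
out_explicit = forward_skip(a_prev2, a_prev1, W_direct, b,
                            W_skip=[[1.0, 0.0], [0.0, 1.0]])
print(out_identity)
print(out_explicit)
```

Both calls produce the same activations, since an explicit identity matrix and an unweighted skip are the same operation when the layer sizes agree.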
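In the cerebral cortex such forward skips are done for several layers. Usually all forward skips start from the same layer, and successively connect to later layers. In the general case this will be expressed as
$$a^{\ell} := g\left(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^{\ell} + \sum_{k=2}^{K} W^{\ell-k,\ell} \cdot a^{\ell-k}\right)$$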
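The general case with several skip paths can be sketched the same way. The function forward_dense_skips and its argument layout are assumptions made for this illustration: earlier activations are stored in a list so that activations[-k] is the output of the layer k steps back, and the skip matrices are kept in a dictionary keyed by k.

```python
import math

def matvec(W, a):
    """Multiply matrix W (list of rows) by vector a."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) for row in W]

def forward_dense_skips(activations, W_direct, b, skip_weights, g=math.tanh):
    """Activations of layer l from the direct path plus several skip paths.

    activations[-k] holds the activations of layer l-k;
    skip_weights maps each skip distance k >= 2 to its weight matrix.
    """
    total = [zi + bi for zi, bi in zip(matvec(W_direct, activations[-1]), b)]
    for k, W_k in skip_weights.items():
        contribution = matvec(W_k, activations[-k])
        total = [t + c for t, c in zip(total, contribution)]
    return [g(t) for t in total]

# Assumed toy values: three earlier layers with one neuron each.
activations = [[1.0], [2.0], [3.0]]          # layers l-3, l-2, l-1
W_direct = [[0.1]]                           # weights from layer l-1 to l
b = [0.0]
skip_weights = {2: [[0.2]], 3: [[0.3]]}      # weights from l-2 and l-3 to l

a_l = forward_dense_skips(activations, W_direct, b, skip_weights)
print(a_l)
```

Here the summed input to the activation function is 0.1 * 3 + 0.2 * 2 + 0.3 * 1, so each earlier layer contributes through its own weight matrix.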

Backward propagation

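During backpropagation learning for the normal path
$$\Delta W^{\ell-1,\ell} := -\eta \frac{\partial E^{\ell}}{\partial W^{\ell-1,\ell}} = -\eta \, a^{\ell-1} \cdot \delta^{\ell}$$
and for the skip paths
$$\Delta W^{\ell-2,\ell} := -\eta \frac{\partial E^{\ell}}{\partial W^{\ell-2,\ell}} = -\eta \, a^{\ell-2} \cdot \delta^{\ell}$$
In both cases $\eta$ is a learning rate ($\eta > 0$), $\delta^{\ell}$ the error signal of neurons at layer $\ell$, and $a^{\ell}$ the activation of neurons at layer $\ell$.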
If the skip path has fixed weights, then they are not updated. If they can be updated, the rule is an ordinary backpropagation update rule.
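In the general case there can be a skip path weight matrix $W^{\ell-k,\ell}$ for each $k = 2, \dots, K$, thus
$$\Delta W^{\ell-k,\ell} := -\eta \frac{\partial E^{\ell}}{\partial W^{\ell-k,\ell}} = -\eta \, a^{\ell-k} \cdot \delta^{\ell}$$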
As the learning rules are similar, the weight matrices can be merged and learned in the same step.
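The observation that the matrices can be merged can be checked with a small sketch. It assumes the error signal of the receiving layer has already been computed elsewhere, and it assumes the convention that a weight matrix has one row per receiving neuron and one column per source neuron, so each update is an outer product of the error signal with the source activations. The name weight_update and all numeric values are hypothetical.

```python
def weight_update(delta, a_source, eta):
    """Update for a weight matrix feeding layer l: -eta * delta_i * a_j.

    delta    : error signal of the receiving layer l (one entry per neuron)
    a_source : activations of the source layer (l-1 or l-k)
    eta      : learning rate, assumed positive
    """
    return [[-eta * d * x for x in a_source] for d in delta]

# Assumed toy values for illustration only.
delta = [0.1, -0.2]            # error signal at layer l
a_prev1 = [1.0, 2.0]           # activations of layer l-1 (normal path)
a_prev2 = [3.0, 4.0, 5.0]      # activations of layer l-2 (skip path)
eta = 0.5

dW_direct = weight_update(delta, a_prev1, eta)
dW_skip = weight_update(delta, a_prev2, eta)

# Merged form: concatenate the source activations and update one wide
# matrix whose columns are the direct weights followed by the skip weights.
dW_merged = weight_update(delta, a_prev1 + a_prev2, eta)
print(dW_merged)
```

Each row of dW_merged is the corresponding row of dW_direct followed by that of dW_skip, which is why a single update step over the concatenated input covers both paths.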