24.8. *Neural networks: Backpropagation#
As we have seen the final output of a feed-forward network can be expressed in terms of basic matrix-vector multiplications. The unknowwn quantities are our weights \(w_{ij}\) and we need to find an algorithm for changing them so that our errors are as small as possible. This leads us to the famous back propagation algorithm [RumelhartHintonWilliams86].
Deriving the back propagation code for a multilayer perceptron model#
The questions we want to ask are how do changes in the biases and the weights in our network change the cost function and how can we use the final output to modify the weights?
To derive these equations let us start with a plain regression problem and define our cost function as
where the \(t_i\)s are our \(n\) targets (the values we want to reproduce), while the outputs of the network after having propagated all inputs \(\boldsymbol{x}\) are given by \(y_i\). Other cost functions can also be considered.
Definitions#
With our definition of the targets \(\boldsymbol{t}\), the outputs of the network \(\boldsymbol{y}\) and the inputs \(\boldsymbol{x}\) we define now the activation \(z_j^l\) of node/neuron/unit \(j\) of the \(l\)-th layer as a function of the bias, the weights which add up from the previous layer \(l-1\) and the forward passes/outputs \(\boldsymbol{a}^{l-1}\) from the previous layer as
where \(b_j^l\) are the biases from layer \(l\). Here \(M_{l-1}\) represents the total number of nodes/neurons/units of layer \(l-1\).
Activation outputs
In this derivation we will denote the output of neuron \(j\) in layer \(l\) as a_j^{l}. Collectively, all outputs from layer \(l\) corresponds to the vector \(\boldsymbol{a}^l\).
Final outputs
We will reserve the output notation \(y\) exclusively for the final layer such that an \(L\)-layer network has final output \(\boldsymbol{y} \equiv \boldsymbol{a}^L\). Note that it is quite common to use a different activation function for the final outputs as compared with the inner layers.
We can rewrite this (figure to be inserted) in a more compact form as the matrix-vector products we discussed earlier,
With the activation values \(\boldsymbol{z}^l\) we can in turn define the output of layer \(l\) as \(\boldsymbol{a}^l = f(\boldsymbol{z}^l)\) where \(f\) is our activation function. In the examples here we will use the sigmoid function discussed in the logistic regression lecture. We will also use the same activation function \(f\) for all layers and their nodes. It means we have
Derivatives and the chain rule#
From the definition of the activation \(z_j^l\) we have
and
With our definition of the activation function we have (note that this function depends only on \(z_j^l\))
Derivative of the cost function#
With these definitions we can now compute the derivative of the cost function in terms of the weights.
Let us specialize to the output layer \(l=L\). Our cost function is
The derivative of this function with respect to the weights is
The last partial derivative can easily be computed and reads (by applying the chain rule)
Bringing it together, first back propagation equation#
We have thus
Defining
and using the Hadamard product of two vectors we can write this as
This is an important expression. The second term on the right handside measures how fast the cost function is changing as a function of the \(j\)th output activation. If, for example, the cost function doesnβt depend much on a particular output node \(j\), then \(\delta_j^L\) will be small, which is what we would expect. The first term on the right, measures how fast the activation function \(f\) is changing at a given activation value \(z_j^L\).
Notice that everything in the above equations is easily computed. In particular, we compute \(z_j^L\) while computing the behaviour of the network, and it is only a small additional overhead to compute \(f'(z^L_j)\). The exact form of the derivative with respect to the output depends on the form of the cost function. However, provided the cost function is known there should be little trouble in calculating
With the definition of \(\delta_j^L\) we have a more compact definition of the derivative of the cost function in terms of the weights, namely
Derivatives in terms of \(z_j^L\)#
It is also easy to see that our previous equation can be written as
which can also be interpreted as the partial derivative of the cost function with respect to the biases \(b_j^L\), namely
That is, the error \(\delta_j^L\) is exactly equal to the rate of change of the cost function as a function of the bias.
We have now three equations that are essential for the computations of the derivatives of the cost function at the output layer. These equations are needed to start the algorithm and they are
The starting equations.
and
and
An interesting consequence of the above equations is that when the activation \(a_k^{L-1}\) is small, the gradient term, that is the derivative of the cost function with respect to the weights, will also tend to be small. We say then that the weight learns slowly, meaning that it changes slowly when we minimize the weights via say gradient descent. In this case we say the system learns slowly.
Another interesting feature is that is when the activation function, represented by the sigmoid function here, is rather flat when we move towards its end values \(0\) and \(1\). In these cases, the derivatives of the activation function will also be close to zero, meaning again that the gradients will be small and the network learns slowly again.
We need a fourth equation and we are set. We are going to propagate backwards in order to determine the weights and biases. In order to do so we need to represent the error in the layer before the final one \(L-1\) in terms of the errors in the final output layer.
Final back-propagating equation#
We have that (replacing \(L\) with a general layer \(l\))
We want to express this in terms of the equations for layer \(l+1\). Using the chain rule and summing over all \(k\) entries we have
and recalling that
with \(M_l\) being the number of nodes in layer \(l\), we obtain
This is our final equation.
We are now ready to set up the algorithm for back propagation and learning the weights and biases.
Setting up the back-propagation algorithm#
The four equations provide us with a way of computing the gradient of the cost function. Let us write this out in the form of an algorithm.
Summary.
First, we set up the input data \(\boldsymbol{x}\) and the activations \(\boldsymbol{z}_1\) of the input layer and compute the activation function and the outputs \(\boldsymbol{a}^1\).
Secondly, perform the feed-forward until we reach the output layer. I.e., compute all activation functions and the pertinent outputs \(\boldsymbol{a}^l\) for \(l=2,3,\dots,L\).
Compute the ouput error \(\boldsymbol{\delta}^L\) by
Back-propagate the error for each \(l=L-1,L-2,\dots,2\) as
Finally, update the weights and the biases using gradient descent for each \(l=L-1,L-2,\dots,2\) and update the weights and biases according to the rules
The parameter \(\eta\) is the learning rate. Here it is convenient to use stochastic gradient descent with mini-batches and an outer loop that steps through multiple epochs of training.
Learning challenges#
The back-propagation algorithm works by going from the output layer to the input layer, propagating the error gradient. The learning algorithm uses these gradients to update each parameter with a Gradient Descent (GD) step.
Unfortunately, the gradients often get smaller and smaller as the algorithm progresses down to the first hidden layers. As a result, the GD update step leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution. This is known in the literature as the vanishing gradients problem.
In other cases, the opposite can happen, namely that the gradients grow bigger and bigger. The result is that many of the layers get large updates of the weights and the learning algorithm diverges. This is the exploding gradients problem, which is mostly encountered in recurrent neural networks. More generally, deep neural networks suffer from unstable gradients, different layers may learn at widely different speeds.