We now look at how we can train a neural network. While this analysis is important in its own right, it turns out that the same process can be used when we come to look at the more modern convolutional neural networks later.

We now wish to devise a training process for the neural network by seeking to minimize the error function we set up in the last lecture. We do that by making small adjustments to the set of weights such that those adjustments lead to a reduction in the error. The approach commonly taken is to modify a weight by subtracting from its current value a small amount of the first derivative of the error with respect to that weight, as shown in this slide. The idea is that by doing so, we will move down the error curve as indicated. This is called the gradient descent approach. There are other adjustment procedures too, but the simple gradient descent method is good for illustration.

The amount of adjustment is controlled by the parameter eta, which is called the learning rate. Too large a value of eta leads to greater adjustments, but may lead to instability, that is, oscillations between both sides of the error curve in the illustration. On the other hand, too small a value of eta means training time might be lengthened.

The adjustment here is shown for the weights that link the j and k layers. To work out a value for the adjustment to the weight, we need to perform the differentiation shown here. That is done by the chain rule, as seen in the center of the slide. We now have to get values for each of the derivatives in that chain rule expression. Here we show each of those three derivatives and the final result when they are combined in the chain rule. Choosing b equals one in the activation function, we end up with the correction increment shown at the bottom of the slide. Let's call that equation A for later convenience.

We now move to the front of the network and look to find the correction increments for the weights which link the i and j layers.
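Before doing so, the effect of the learning rate eta on the gradient descent step can be illustrated with a minimal sketch. It does not use the network error function from the lecture. Instead it assumes a toy one-weight error curve, E(w) = w squared, chosen only because its minimum (w = 0) and its derivative (2w) are known exactly. The function name `gradient_descent` and the particular values of eta are hypothetical choices for illustration.

```python
def gradient_descent(eta, w0=1.0, steps=20):
    """Repeatedly apply w <- w - eta * dE/dw to the toy error E(w) = w**2.

    Returns the list of weight values visited, starting with w0.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w       # dE/dw for the assumed toy error E(w) = w**2
        w = w - eta * grad   # subtract a small amount of the first derivative
        history.append(w)
    return history


too_small = gradient_descent(eta=0.01)  # creeps slowly towards the minimum
moderate = gradient_descent(eta=0.3)    # settles quickly near w = 0
too_large = gradient_descent(eta=1.1)   # jumps from side to side and grows

for name, hist in [("too small", too_small), ("moderate", moderate), ("too large", too_large)]:
    print(name, [round(w, 3) for w in hist[:5]])
```

On this toy curve each step multiplies the weight by (1 - 2 eta), so a value of eta above one makes the weight change sign at every step with growing magnitude, which is the oscillation between the two sides of the error curve mentioned above.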
For those i to j weights we use the same gradient descent procedure as before. But here we have a small problem. Since E is not directly a function of g_j, we cannot compute the derivative dE/dg_j simply. Instead, we need to use another chain rule expression, as shown at the bottom of the slide. To get a value for dE/dz_k in this expression, we again use the chain rule, with b equals one. We then have an expression for dE/dz_k in terms of the actual and target outputs. That leads to the expression for the correction increment for the i to j weights as shown, which is not only a function of the actual and target outputs but is also dependent on the j to k linking weights. From the previous step we know those, so that we now have a suitable and usable expression for the delta w_ji correction increments.

With the two sets of analyses, we can now formulate a training algorithm for the neural network. We can simplify our two previous equations if we define some new variables, delta_k and delta_j, which allow the correction increments for the two sets of linkages to be written simply, as shown at the bottom of the slide. While our equations are specifically focused on adjustments to the weights, the thresholds theta_j and theta_k in the network equations can be evaluated using the same expressions, just by making the corresponding inputs unity during training.

We now formulate the training strategy. The chosen network is initialized with an arbitrary set of weights, which allows outputs, although in error, to be generated by the presentation of training pixel vectors at the input layer. For each training pixel, the network output is computed from the set of network equations and initially, of course, that output will be in error. Correction to the weights is then performed using the equations of the previous slide. The value of delta_k is computed first, since it depends on the network outputs g_k compared with the target outputs t_k.
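Since the slides are not reproduced here, the equations referred to can be written out as a reconstruction. It assumes the squared error function and the logistic activation with parameter b that are standard for this derivation; neither is stated explicitly in this part of the lecture, so both should be checked against the previous one. The final two increments agree with the form eta delta_k g_j used in the lecture.

```latex
% Assumed definitions (not stated explicitly in this section):
%   E = \tfrac{1}{2}\sum_k (t_k - g_k)^2, \quad
%   g = f(z) = \frac{1}{1 + e^{-z/b}}, \quad
%   z_k = \sum_j w_{kj}\, g_j + \theta_k, \quad
%   z_j = \sum_i w_{ji}\, g_i + \theta_j
\begin{align*}
% Output (j to k) weights: gradient descent plus the chain rule
\Delta w_{kj} &= -\eta\,\frac{\partial E}{\partial w_{kj}}
  = -\eta\,\frac{\partial E}{\partial g_k}\,
           \frac{\partial g_k}{\partial z_k}\,
           \frac{\partial z_k}{\partial w_{kj}} \\
\frac{\partial E}{\partial g_k} &= -(t_k - g_k), \qquad
\frac{\partial g_k}{\partial z_k} = \frac{1}{b}\,(1 - g_k)\,g_k, \qquad
\frac{\partial z_k}{\partial w_{kj}} = g_j \\
\Delta w_{kj} &= \eta\,(t_k - g_k)(1 - g_k)\,g_k\,g_j
  \qquad (b = 1) \tag{A} \\[6pt]
% Hidden (i to j) weights: E depends on g_j only through every z_k
\Delta w_{ji} &= -\eta\,\frac{\partial E}{\partial g_j}\,
           \frac{\partial g_j}{\partial z_j}\,
           \frac{\partial z_j}{\partial w_{ji}}, \qquad
\frac{\partial E}{\partial g_j}
  = \sum_k \frac{\partial E}{\partial z_k}\,\frac{\partial z_k}{\partial g_j}
  = \sum_k \frac{\partial E}{\partial z_k}\, w_{kj} \\
\frac{\partial E}{\partial z_k} &= -(t_k - g_k)(1 - g_k)\,g_k
  \qquad (b = 1) \\
\Delta w_{ji} &= \eta\,(1 - g_j)\,g_j\,g_i \sum_k (t_k - g_k)(1 - g_k)\,g_k\,w_{kj} \\[6pt]
% Compact form using the new variables
\delta_k &= (t_k - g_k)(1 - g_k)\,g_k, \qquad
\delta_j = (1 - g_j)\,g_j \sum_k \delta_k\, w_{kj} \\
\Delta w_{kj} &= \eta\,\delta_k\,g_j, \qquad
\Delta w_{ji} = \eta\,\delta_j\,g_i
\end{align*}
```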
Then the result can be propagated back through the network, layer by layer if there is more than one hidden layer, using the other equations on the previous slide to generate corrections to the network weights. Specifically, delta w_kj, which is equal to eta delta_k g_j, can then be found, following which we get delta_j and then delta w_ji. When all the weights have been adjusted, the output of the network is computed again using those new weights. Hopefully, the g_k will now be closer to the target values t_k. New values for delta_k will then be generated and the process of weight adjustment is repeated. This process is iterated as often as needed to reduce the difference between the actual and target outputs, t_k minus g_k, to zero, or to a value acceptably close to zero. If it is zero, then delta_k will be zero, meaning that no further adjustments to the weights will occur with further iteration. The network is then fully trained.

In the terminology of neural networks, an iteration is also called an epoch. Because training involves working back from the outputs at each epoch or iteration, the training process is referred to as back propagation. The interesting thing about training the multilayer perceptron is that when a training pixel is presented at the input, the calculations are propagated forward through the network to generate the output. That output is checked against the correct class for that training pixel and, if found to be in error, the equations we derived in this lecture are used to propagate the adjustments to the weights backwards through the network.

The first question here draws your attention to the possibility of local minima in the error curve.
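The training strategy described in this lecture can be sketched in a few lines of code. This is an illustration under stated assumptions rather than the lecture's own implementation: it assumes one hidden layer, the logistic activation with b equals one, a squared error function, and thresholds handled by appending a unity input to each layer. The function names, the layer sizes, the four toy "pixel vectors", the value of eta and the number of passes are all hypothetical. Both sets of delta values are computed from the current weights before any weight is changed.

```python
import math
import random


def sigmoid(z):
    """Logistic activation with b = 1 (assumed form of the activation)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_ji, w_kj):
    """Propagate one pixel vector forward; returns (g_i, g_j, g_k)."""
    g_i = list(x) + [1.0]  # unity input so the last weight acts as theta_j
    g_j = [sigmoid(sum(w * g for w, g in zip(row, g_i))) for row in w_ji]
    g_j.append(1.0)        # unity input so the last weight acts as theta_k
    g_k = [sigmoid(sum(w * g for w, g in zip(row, g_j))) for row in w_kj]
    return g_i, g_j, g_k


def train_step(x, t, w_ji, w_kj, eta):
    """Adjust both weight sets in place for one training pixel.

    Returns E = 0.5 * sum((t_k - g_k)**2), measured before the adjustment.
    """
    g_i, g_j, g_k = forward(x, w_ji, w_kj)
    # delta_k first: it depends only on the actual and target outputs
    delta_k = [(tk - gk) * (1.0 - gk) * gk for tk, gk in zip(t, g_k)]
    # delta_j next: delta_k propagated back through the j to k weights
    delta_j = [
        (1.0 - g_j[j]) * g_j[j] * sum(dk * row[j] for dk, row in zip(delta_k, w_kj))
        for j in range(len(w_ji))
    ]
    for dk, row in zip(delta_k, w_kj):   # delta w_kj = eta * delta_k * g_j
        for j in range(len(row)):
            row[j] += eta * dk * g_j[j]
    for dj, row in zip(delta_j, w_ji):   # delta w_ji = eta * delta_j * g_i
        for i in range(len(row)):
            row[i] += eta * dj * g_i[i]
    return 0.5 * sum((tk - gk) ** 2 for tk, gk in zip(t, g_k))


def init_weights(n_rows, n_cols, rng):
    """Arbitrary small starting weights, one row per receiving node."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
w_ji = init_weights(3, 2 + 1, rng)  # 2 inputs plus threshold, 3 hidden nodes
w_kj = init_weights(2, 3 + 1, rng)  # 3 hidden nodes plus threshold, 2 outputs

# Hypothetical two-band training pixels with one-hot class targets
pixels = [
    ([0.1, 0.2], [1.0, 0.0]),
    ([0.2, 0.1], [1.0, 0.0]),
    ([0.8, 0.9], [0.0, 1.0]),
    ([0.9, 0.8], [0.0, 1.0]),
]

errors = []
for _ in range(500):  # repeated passes over the training pixels
    errors.append(sum(train_step(x, t, w_ji, w_kj, eta=0.5) for x, t in pixels))

print("summed error on first pass:", errors[0])
print("summed error on last pass: ", errors[-1])
```

Here the weights are adjusted after every training pixel and the loop simply runs for a fixed number of passes. A practical version would more likely stop once the summed error falls below a chosen tolerance, in line with the stopping condition described above.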