0:00

Now that we have the preliminaries out of the way, we can get back to the central issue,

which is how to learn multiple layers of features.

So in this video, I'm finally going to describe the backpropagation algorithm,

which was the main advance in the 1980s that led to an explosion of interest in

neural networks. Before I describe backpropagation, I'm

going to describe another very obvious algorithm that does not work nearly as

well, but is something that many people think of.

Now that we know how to learn the weights of the logistic units, we're going to

return to the central issue, which is how to learn the weights of hidden units.

If you have neural networks without hidden units, they are very limited in the

mappings they can model. If you add a layer of hand-coded features

as in a perceptron, you make the net much more powerful, but the difficult bit for a

new task is designing the features. The learning won't solve the hard problem;

you have to solve it by hand. What we'd like is a way of finding good

features without requiring insights into the tasks or repeated trial and error,

where we guess some features and see how well they work.

In effect, what we need to do is automate the loop of designing features for a task

and seeing how well they work. We'd like the computer to do that loop,

instead of having a person in that loop. So the thing that occurs to everybody who

knows about evolution is to learn by perturbing the weights.

You randomly perturb one weight. So that's meant to be like a mutation, and

you see if it improves performance. And if it improves performance of the net,

you save that change in the weight. You can think of this as a form of

reinforcement learning. Your action consists of making a small

change. And then you check whether that pays off,

and if it does, you decide to perform that action.
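
As a rough illustration of this perturbation idea, here is a minimal Python sketch. It is an assumption-laden toy, not anything from the lecture: the net is a single logistic unit, the data is a made-up logical OR task, and the names forward, total_error and perturb_one_weight are hypothetical. Note that every trial needs a pass over the whole set of training cases just to judge one change to one weight.

```python
import math
import random

def forward(weights, x):
    """Single logistic unit: weighted sum of the inputs, then the logistic function."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def total_error(weights, cases):
    """Squared error summed over a representative set of training cases."""
    return sum(0.5 * (t - forward(weights, x)) ** 2 for x, t in cases)

def perturb_one_weight(weights, cases, rng, scale=0.1):
    """Randomly change one weight (the 'mutation'); keep it only if the error drops."""
    before = total_error(weights, cases)
    k = rng.randrange(len(weights))
    trial = list(weights)
    trial[k] += rng.gauss(0.0, scale)
    if total_error(trial, cases) < before:
        return trial      # the change paid off, so save it
    return weights        # otherwise discard the change

rng = random.Random(0)
# Hypothetical toy data: logical OR, with a constant 1.0 input acting as a bias.
cases = [((0, 0, 1.0), 0.0), ((0, 1, 1.0), 1.0),
         ((1, 0, 1.0), 1.0), ((1, 1, 1.0), 1.0)]
weights = [0.0, 0.0, 0.0]
start = total_error(weights, cases)
for _ in range(2000):
    weights = perturb_one_weight(weights, cases, rng)
end = total_error(weights, cases)
print(start, end)
```

Because a change is only kept when it lowers the error, the error never goes up, but each step touches just one weight and costs two sweeps over the training cases.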

1:58

The problem is it's very inefficient. Just to decide whether to change one

weight, we need to do multiple forward passes on a representative set of training

cases. We have to see if changing that weight

improves things, and you can't judge that by one training case alone.

Relative to this method of randomly changing a weight and seeing if it helps,

backpropagation is much more efficient. It's actually more efficient by a factor

of the number of weights in the network, which could be millions.

2:33

An additional problem with randomly changing weights and seeing if it helps is

that towards the end of learning, any large change in weight will nearly always

make things worse, because the weights have to have the right relative values to

work properly. So towards the end of learning, not only do

you have to do a lot of work to decide whether each of these changes helps, but

the changes themselves have to be very small.

2:58

There are slightly better ways of using perturbations in order to learn.

One thing we might try is to perturb all the weights in parallel and then correlate

the performance gain with the weight changes.

That actually doesn't really help at all. The problem is that we need to do lots and

lots of trials with different random perturbations of all the weights, in order

to see the effect of changing one weight through the noise created by changing all

the other weights. So it doesn't help to do it all in

parallel. Something that does help is to randomly

perturb the activities of the hidden units, instead of perturbing the weights.

3:53

Since there are many fewer activities than weights, there are fewer things that you're

randomly exploring. And this makes the algorithm more

efficient. But it's still much less efficient than

backpropagation. Backpropagation still wins by a factor of

the number of neurons. So the idea behind backpropagation is

that we don't know what the hidden units ought to be doing.

They're called hidden units because nobody's telling us what their states

ought to be. But we can compute how fast the error

changes as we change a hidden activity on a particular training case.

4:57

So that allows us to compute error derivatives for all of the hidden units

efficiently at the same time. Once we've got those error derivatives for

the hidden units, that is, we know how fast the error changes as we change the

hidden activity on that particular training case, it's easy to convert those

error derivatives for the activities into error derivatives for the weights coming

into a hidden unit. So here's a sketch of how backpropagation

works, for a single training case. First we have to define the error, and

here we'll take the error to be the squared difference between the target value of

output unit j and the actual value that the net produces for output unit

j. And we're going to imagine there are several

output units in this case. We differentiate that, and we get a

familiar expression for how the error changes as you change the activity of an

output unit j. And I'll use a notation here where the

index on a unit will tell you which layer it's in.

So the output layer has a typical index of j, and the layer in front of that, the

hidden layer below it in the diagram, will have a typical index of i.

And I won't bother to say which layer we're in because the index will tell you.
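
Written out, the error and its derivative for an output unit might look like the following. This is a sketch under assumptions: the symbol t_j for the target, the sum over output units, and the conventional factor of one half (which cancels the 2 when differentiating) are not spelled out in the spoken description.

```latex
E = \frac{1}{2} \sum_{j \in \text{output}} (t_j - y_j)^2
\qquad\Longrightarrow\qquad
\frac{\partial E}{\partial y_j} = -(t_j - y_j)
```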

6:18

So once we've got the error derivative with respect to the output of one of these

output units, we then want to use all those error derivatives in the output layer

to compute the same quantity in the hidden layer that comes before the output layer.

So the core of backpropagation is taking error derivatives in

one layer and from them computing the error derivatives in the layer that comes

before that. So we want to compute dE/dy_i.

Now obviously, when we change the output of unit i, it'll change the activities of

all three of those output units, and so we have to sum up all those effects.

So we're going to have an algorithm that takes error derivatives we've already

computed for the top layer here and combines them using the same weights

as we use in the forward pass to get error derivatives in the layer below.

7:25

So, this slide is going to explain the backpropagation algorithm.

And you really need to understand this slide.

And the first time you see it, you may have to study it for a long time.

This is how you backpropagate the error derivative with respect to the output of a

unit. So we'll consider an output unit j and a

hidden unit i. The output of the hidden unit i will be

y_i. The output of the output unit j will be

y_j. And the total input received by the output

unit j will be z_j. The first thing we need to do is convert

the error derivative with respect to y_j into an error derivative with respect to

z_j. To do that we use the chain rule.

So we say dE/dz_j equals dy_j/dz_j times dE/dy_j.

8:23

And, as we've seen before when we were looking at logistic units, that's just y_j

times (1 - y_j) times the error derivative with respect to the output of

unit j. So now we've got the error derivative with

respect to the total input received by unit j.
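
In symbols, this first step can be sketched as below, assuming the output unit is a logistic unit, so that its output is the logistic function of its total input:

```latex
\frac{\partial E}{\partial z_j}
  = \frac{d y_j}{d z_j}\,\frac{\partial E}{\partial y_j}
  = y_j\,(1 - y_j)\,\frac{\partial E}{\partial y_j},
\qquad \text{where } y_j = \frac{1}{1 + e^{-z_j}}
```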

8:43

Now we can compute the error derivative with respect to the output of unit i.

It's going to be the sum over all of the three outgoing connections of unit i of

this quantity: dz_j/dy_i times dE/dz_j. So the first term there is how the total

input to unit j changes as we change the output of unit i.

And then we have to multiply that by how the error changes as we change the

total input to unit j, which we computed on the line above.

And as we saw before when studying the logistic unit, dz_j/dy_i is just the

weight on the connection, w_ij. So what we get is that the error

derivative with respect to the output of unit i is the

sum over all the outgoing connections to the layer above of the weight w_ij on that

connection times a quantity we would have already computed, which is dE/dz_j for

the layer above. And so you can see the computation looks

very like what we do on the forward pass, but we're going in the other direction.

What we do for each unit in that hidden layer that contains i is we compute the

sum of a quantity in the layer above times the weights on the connections.
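
The backward step through one layer can be written out as follows. The bias term b_j is an assumption added for completeness; it does not affect the derivative with respect to the output of unit i.

```latex
z_j = b_j + \sum_i y_i\, w_{ij}
\quad\Longrightarrow\quad
\frac{\partial z_j}{\partial y_i} = w_{ij},
\qquad
\frac{\partial E}{\partial y_i}
  = \sum_j \frac{\partial z_j}{\partial y_i}\,\frac{\partial E}{\partial z_j}
  = \sum_j w_{ij}\,\frac{\partial E}{\partial z_j}
```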

Once we've got dE/dz_j, which we computed on the first line here, it's very

easy to get the error derivatives for all the weights coming into unit j.

dE/dw_ij is simply dE/dz_j, which we computed already, times how z_j changes

as we change the weight on the connection. And that's simply the activity of the unit

in the layer below, y_i. So the rule for changing the weight is

just that you multiply this quantity you've computed at a unit, dE/dz_j, by the

activity coming in from the layer below. And that gives you the error derivative

with respect to the weight. So on this slide we have seen how we can

start with dE/dy_j and backpropagate to get dE/dy_i. We've come backwards through

one layer and computed the same quantity, the derivative of the error with respect

to the output in the previous layer. So we can clearly do that for as many

layers as we like. And after we've done that for all these

layers, we can compute how the error changes as you change the weights on the

connections. That's the backpropagation algorithm.

It's an algorithm for taking one training case and computing, efficiently, for

every weight in the network, how the error will change, on that particular

training case, as you change the weight.
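
To make the whole procedure concrete, here is a minimal Python sketch of backpropagation on one training case, for a tiny net with two inputs, two logistic hidden units and three logistic output units. Everything specific here is an assumption for illustration: the layer sizes, the numbers, the absence of biases, the half squared error, and the function names. The finite-difference check at the end perturbs one weight at a time, which is the slow approach that backpropagation replaces.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """W1[k][i] connects input k to hidden unit i; W2[i][j] connects hidden i to output j."""
    y_hid = [sigmoid(sum(x[k] * W1[k][i] for k in range(len(x))))
             for i in range(len(W1[0]))]
    y_out = [sigmoid(sum(y_hid[i] * W2[i][j] for i in range(len(y_hid))))
             for j in range(len(W2[0]))]
    return y_hid, y_out

def error(x, t, W1, W2):
    """Half the summed squared difference between targets and outputs."""
    _, y_out = forward(x, W1, W2)
    return 0.5 * sum((tj - yj) ** 2 for tj, yj in zip(t, y_out))

def backprop(x, t, W1, W2):
    y_hid, y_out = forward(x, W1, W2)
    # Output layer: dE/dy_j, then dE/dz_j = y_j (1 - y_j) dE/dy_j.
    dE_dy_out = [-(tj - yj) for tj, yj in zip(t, y_out)]
    dE_dz_out = [yj * (1 - yj) * d for yj, d in zip(y_out, dE_dy_out)]
    # Hidden layer: dE/dy_i = sum over j of w_ij dE/dz_j, then the same logistic step.
    dE_dy_hid = [sum(W2[i][j] * dE_dz_out[j] for j in range(len(dE_dz_out)))
                 for i in range(len(y_hid))]
    dE_dz_hid = [yi * (1 - yi) * d for yi, d in zip(y_hid, dE_dy_hid)]
    # Weights: dE/dw_ij = (activity coming in from the layer below) * dE/dz_j.
    dE_dW2 = [[y_hid[i] * dE_dz_out[j] for j in range(len(dE_dz_out))]
              for i in range(len(y_hid))]
    dE_dW1 = [[x[k] * dE_dz_hid[i] for i in range(len(dE_dz_hid))]
              for k in range(len(x))]
    return dE_dW1, dE_dW2

# Hypothetical training case and weights.
x = [0.5, -1.0]
t = [1.0, 0.0, 0.5]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.2, -0.5, 0.1], [0.7, 0.3, -0.4]]
dE_dW1, dE_dW2 = backprop(x, t, W1, W2)

def numeric_grad(W, a, b, eps=1e-6):
    """Central finite difference for one weight: two extra forward passes per weight."""
    old = W[a][b]
    W[a][b] = old + eps
    up = error(x, t, W1, W2)
    W[a][b] = old - eps
    down = error(x, t, W1, W2)
    W[a][b] = old
    return (up - down) / (2 * eps)

max_diff = max(
    [abs(numeric_grad(W1, k, i) - dE_dW1[k][i]) for k in range(2) for i in range(2)] +
    [abs(numeric_grad(W2, i, j) - dE_dW2[i][j]) for i in range(2) for j in range(3)]
)
print(max_diff)
```

One backward pass gives the derivative for every weight at once, whereas the numerical check needs its own pair of forward passes for each weight.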