So why do ResNets work so well?

Let's go through one example that illustrates why ResNets worked so well,

at least in the sense of how you can make them deeper and deeper without really

hurting your ability to at least get them to do well on the training set.

And hopefully as you understood from the third course in the sequence,

doing well on the training set is usually a prerequisite to doing well on your

hold-out, dev, or test sets.

So being able to at least train the ResNets to

do well on a training set is a good first step toward that.

Let's look at an example.

What we saw in the last video was that,

if you make a network deeper,

it can hurt your ability to train the network to do well on the training set.

And that's why sometimes you don't want a network that is too deep.

But this is not true,

or at least is much less true when you're training a ResNet.

So let's go through an example.

Let's say you have X feeding in to

some big neural network, and this outputs some activation a[l].

Let's say for this example that you're going to modify

the neural network to make it a little bit deeper.

So you take that same big NN, and this outputs a[l].

And we're going to add a couple extra layers to this network.

So let's add one layer there and another layer there.

And this will output a[l+2].

Only let's make this a ResNet block,

a residual block with that extra shortcut.

And for the sake of argument,

let's say throughout this network we're using the ReLU activation function.

So all the activations are going to be greater than or equal to zero,

with the possible exception of the input X.

Right? Because the ReLU activation outputs numbers that are either zero or positive.

Now let's look at what a[l+2] will be.

To copy the expression from the previous video,

a[l+2] will be the ReLU applied to z[l+2] plus a[l], that is, a[l+2] = g(z[l+2] + a[l]),

where this addition of a[l] comes from the shortcut,

from the skip connection that we just added.

And if we expand this out, this is equal to g(W[l+2] a[l+1] + b[l+2] + a[l]),

because z[l+2] is equal to W[l+2] a[l+1] + b[l+2].

Now, notice something: if you are using L2 regularization or weight decay,

that will tend to shrink the value of W[l+2].

If you are applying weight decay to b,

that will also shrink this.

Although I guess in practice,

sometimes you do and sometimes you don't apply weight decay to b.

But w is really the key term to pay attention to here.

And if W[l+2] is equal to 0,

and let's say for the sake of argument

that b[l+2] is also equal to 0,

then those terms go away because they are equal to 0.

And then g(z[l+2] + a[l])

is just equal to g(a[l]).

Right? Because we assumed we are using

the ReLU activation function, and so all of the activations are non-negative.

And so g(a[l]) is the ReLU applied to a non-negative quantity,

so you just get back a[l].
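As a sanity check, that argument can be sketched in a few lines of NumPy. The layer size and random weights here are purely illustrative, not from the lecture; the point is only that zeroed-out weights in the second layer reduce the block to the identity:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two layers plus a skip connection: a[l+2] = g(z[l+2] + a[l])."""
    a_l1 = relu(W1 @ a_l + b1)      # first added layer
    z_l2 = W2 @ a_l1 + b2           # second added layer, pre-activation
    return relu(z_l2 + a_l)         # shortcut adds a[l] before the ReLU

n = 4
a_l = relu(np.random.randn(n))      # activations are non-negative (ReLU)
W1, b1 = np.random.randn(n, n), np.random.randn(n)

# If weight decay drives W[l+2] and b[l+2] to zero, the block computes
# relu(0 + a[l]) = a[l], since a[l] is already non-negative: the identity.
W2, b2 = np.zeros((n, n)), np.zeros(n)
out = residual_block(a_l, W1, b1, W2, b2)
print(np.allclose(out, a_l))        # True
```

So no matter what the first added layer does, zero weights and biases in the second layer make the block output exactly a[l].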

So what this shows is that the identity function is easy for the residual block to learn.

It's easy to get a[l+2] = a[l] because of this skip connection.

And what that means is that

adding these two layers to the neural network

doesn't really hurt your network's ability to do as

well as the simpler network without these two extra layers,

because it's quite easy for it to learn the identity function

and just copy a[l] to a[l+2],

despite the addition of these two layers.

And this is why adding two extra layers,

adding this residual block

somewhere in the middle or at the end of this big neural network,

doesn't hurt performance.

But of course, our goal is not just to not hurt performance,

but to help performance.

And so you can imagine that if all of these hidden units,

if they actually learn something useful,

then maybe you can do even better than learning the identity function.

And what goes wrong in very deep plain nets,

in very deep networks without these residual or skip connections, is that

when you make the network deeper and deeper,

it's actually very difficult for it to choose

parameters that learn even the identity function,

which is why a lot of layers end up making

your result worse rather than better.

And I think the main reason the residual network

works is that it's so easy for these extra layers

to learn the identity function

that you are kind of guaranteed that it doesn't hurt performance.

And a lot of the time you maybe get lucky and it even helps performance.

Or at least it's easier to start from a decent baseline of not

hurting performance, and then gradient descent can only improve the solution from there.

So one more detail in the residual network that's worth discussing,

which is that in this addition here, we're assuming

that z[l+2] and a[l] have the same dimension.

And so what you see in ResNets is a lot of use of

"same" convolutions, so that the dimension of z[l+2] is equal to

the dimension of a[l] coming through the shortcut.

So you can actually do

this shortcut connection, because the same convolution preserves dimensions,

and so it makes it easier for you to carry out

this shortcut and then carry out this addition of two equal-dimension vectors.

In case the input and output have different dimensions, so for example,

if a[l] is 128-dimensional and

z[l+2] is 256-dimensional,

what you would do is add an extra matrix,

and we call that W_s here.

And W_s in this example would be a 256 by 128 matrix,

so that W_s × a[l] becomes 256-dimensional,

and this addition is now between 256-dimensional vectors.

And there are a few things you could do with W_s: it could be a matrix of parameters to be learned,

or it could be a fixed matrix that just implements zero padding,

which takes a[l] and then zero-pads it to

be 256-dimensional. Either of those versions could work.

So finally, let's take a look at ResNets on images.

So these are images I got from the paper by He et al.

This is an example of a plain network in which you input an image and

then have a number of conv layers

until eventually you have a softmax output at the end.

To turn this into a ResNet,

you add those extra skip connections.

And I just mention a few details.

There are a lot of 3 × 3 convolutions here, and most of these are 3 × 3 "same" convolutions,

and that's why you're adding equal-dimension feature vectors.

So rather than fully connected layers,

these are actually convolutional layers.

But because they're "same" convolutions,

the dimension is preserved,

and so the z[l+2] + a[l]

addition makes sense.
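To make that dimension-preservation point concrete, here is a minimal NumPy sketch of a residual block built from naive 3 × 3 "same" convolutions. A single feature map with no biases is used for brevity; this is illustrative, not the He et al. implementation:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 3x3 'same' convolution: zero-pad the input by 1 so the
    output has the same height and width as the input."""
    h, w = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i+3, j:j+3] * k)
    return out

x = np.random.randn(8, 8)                    # one 8x8 feature map
k1, k2 = np.random.randn(3, 3), np.random.randn(3, 3)

# Two 3x3 same convolutions preserve the 8x8 shape,
# so the skip-connection addition z[l+2] + a[l] is well defined.
out = np.maximum(0, conv2d_same(np.maximum(0, conv2d_same(x, k1)), k2) + x)
print(out.shape)                             # (8, 8)
```

With a "valid" convolution instead, each layer would shrink the map and the addition with x would fail, which is exactly why same convolutions dominate in ResNets.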

And similar to what you've seen in a lot of networks before,

you have a bunch of convolutional layers.

And then there are occasionally pooling layers as well, or pooling-like layers.

And whenever one of those happens,

you need to make an adjustment to the dimension, which, as we

saw on the previous slide, you can do with the matrix W_s.

And then as is common in these networks,

you have conv, conv, conv, pool, conv, conv, conv, pool, conv, conv, conv, pool.

And then at the end, you have

a fully connected layer that then makes a prediction using a softmax.

So that's it for ResNets.

Next, there's a very interesting idea behind using neural networks with 1 x 1 filters,

1 x 1 convolutions.

So what good is a 1 × 1 convolution?

Let's take a look at the next video.