[MUSIC] In this video, you will learn the tricks that help to train really deep networks. We start with activation functions.

You already know about the Sigmoid activation, which takes an x as input and outputs sigma(x). What happens when we do backpropagation? We get the gradient of the loss with respect to the output of this function, which is dL/d sigma. Then, using the chain rule, we can calculate dL/dx, which is dL/d sigma multiplied by d sigma/dx. The derivative of the Sigmoid function is sigma(x) multiplied by (1 - sigma(x)). What's wrong with that? The problem is that when the value of sigma is close to zero or one, our gradients vanish: they become very close to zero, which means that the parameters of all the previous layers will barely update, because by the chain rule their gradients are multiplied by d sigma/dx. This is called the problem of vanishing gradients. Another problem is that the output of Sigmoid is not zero-centered. Remember that neural networks like inputs with zero mean and unit variance, that is, normalized inputs, and this is not the case here. Also, computing the exponent of x is expensive when you have millions of neurons.

There is another activation function, the hyperbolic tangent or tanh. This function is zero-centered, which is a plus, but otherwise it behaves pretty much like Sigmoid, so replacing Sigmoid with tanh doesn't help that much.

There is a different activation function called ReLU, or rectified linear unit. It works by taking the maximum of x and zero. It is fast to compute, its gradients do not vanish for positive x's, and in practice it provides faster convergence. But it has problems too. The first one is that it is not zero-centered. The second one is that this activation is zero for negative x's, which means that if you're unlucky during initialization of your neuron, you can end up with weights that always give you zero activation.
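As a quick sketch of the activations discussed so far, here is a minimal NumPy version (the helper names are my own); it shows how the sigmoid gradient vanishes for large inputs while the ReLU gradient does not:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # 1 for positive x, 0 otherwise
    return (np.asarray(x) > 0).astype(float)

# For large |x| the sigmoid saturates and its gradient vanishes;
# the ReLU gradient stays at 1 for any positive x.
print(sigmoid_grad(10.0))  # tiny, about 4.5e-05
print(relu_grad(10.0))     # 1.0
```

The same saturation happens for tanh, which is why swapping sigmoid for tanh does not cure vanishing gradients.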
And if you're unlucky enough, this neuron will never update, because for the part where x is less than zero you have zero gradient. This is called the problem of the dying ReLU neuron: it is not activated and it never updates. We can easily change that with the Leaky ReLU activation, which adds a little bit of slope in the part where we had zero activation. The formula is max(ax, x), with a small positive a. This neuron will not die: even if you're unlucky with initialization, you will have a small gradient that keeps the weights changing, so the neuron can come back to life. One caveat is the parameter a: you cannot set it equal to one, because that would make the activation linear, and stacking linear activations just gives you another linear function, so it doesn't work.

Okay, now we know how activation functions work. Let's look at weight initialization. Maybe we can start with all zeros? Let's look at this simple example where we have four inputs and three neurons, sigma 1, sigma 2, and sigma 3, each with a Sigmoid activation function. If you look at how backpropagation works, you can see that dL/dw2, the gradient of the loss with respect to w2, equals the following. First, we take the gradient of the loss with respect to the output neuron, which is sigma 1. Then we take the derivative of the activation function, which for Sigmoid is sigma 1 multiplied by (1 - sigma 1). Then we take the derivative of the weighted sum of inputs with respect to w2, which is sigma 2. If you look at the same update rule for the weight w3, you will see that it stays the same, but sigma 2 is replaced with sigma 3. What does this mean? First, sigma 2 and sigma 3 are the same when you have zero initialization: they just crunch the same numbers. That means that w2 and w3, which are initially zero as well, will get the same updates.
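This can be checked numerically. Below is a minimal sketch (the variable names and input values are my own, chosen to mirror the four-input example): two sigmoid neurons initialized to zero produce identical outputs and receive identical gradients, so their weight rows stay equal.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0, 0.3])   # four inputs, as in the example
W = np.zeros((2, 4))                  # two hidden neurons, all-zero init

h = sigmoid(W @ x)                    # both neurons output sigmoid(0) = 0.5

# Assume the same upstream gradient reaches each hidden neuron; the
# gradient for each weight row is upstream * sigma'(z) * x.
upstream = np.ones(2)
grad_W = (upstream * h * (1.0 - h))[:, None] * x[None, :]

print(np.allclose(grad_W[0], grad_W[1]))  # True: both rows get the same update
```

Since the rows start equal and receive equal updates, they remain equal after every step, which is exactly the symmetry problem.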
That means that they will change in the same way. If you continue this reasoning using the chain rule, you can derive that sigma 2 and sigma 3 will always get the same updates for their parameters. That means you will have identical neurons and you will not learn complex representations. This is called the symmetry problem, and we need to break that symmetry.

How can we break it? Maybe we can start with small random numbers, right? But how small? Let's draw a variable from a normal distribution and multiply it by 0.03. Will that be enough? Linear models work best when inputs are normalized. A neuron is a linear combination of inputs followed by an activation, and the neuron's output will be used by consecutive layers. That means it would be great if we could normalize the outputs of the neuron. Let's take a neuron output before activation, which is a linear combination of the inputs, the x's. If the expected value of x is zero, and we ensure that because the inputs are normalized, and we generate the weights with zero mean as well, independently from our inputs, then the expected value of our linear combination is zero too. But the variance is a different story: it can grow with consecutive layers, and that can become a problem when you stack many network layers. Empirically, this hurts convergence for deep networks.

Let's look at the variance of our linear combination. It can be split into a sum of variances, provided that the weights are independent and identically distributed, because we generated them that way, and the x's are mostly uncorrelated. Then we can use the fact that we generate the weights independently from the inputs, the x's, and split the variance of each product into a sum of three summands. Notice that the first and the second summands turn to zero, because we generate the weights in such a way that they have zero mean.
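Written out, the decomposition just described looks like this (reconstructed from the narration, using E for expectation and Var for variance):

```latex
\operatorname{Var}\Big(\sum_{i=1}^{n} w_i x_i\Big)
  = \sum_{i=1}^{n} \operatorname{Var}(w_i x_i)
  \quad \text{(i.i.d.\ weights, uncorrelated inputs)}

\operatorname{Var}(w_i x_i)
  = \mathbb{E}[x_i]^2 \operatorname{Var}(w_i)
  + \mathbb{E}[w_i]^2 \operatorname{Var}(x_i)
  + \operatorname{Var}(w_i)\operatorname{Var}(x_i)
```

The first two summands contain E[w_i] and E[x_i], which are both zero here, so only the last product of variances survives.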
And we have zero-mean inputs as well, because we normalized them. That means we are left with a sum of products of the variance of x and the variance of w. If we assume that all our inputs and all our weights have the same variance, then we can rewrite that as Var(x) multiplied by n multiplied by Var(w). Let's see what we have done: the variance of the output translates into the variance of the input, multiplied by n and by the variance of the weights. Now imagine that the red part of the equation at the end, which is n times Var(w), is greater than 1. If it is greater than 1 and you have many hidden layers, then the variance of the output of each consecutive layer will grow. So what do we want? We want this factor to be 1. How do we make it 1? Let's use the fact that Var(aw) equals a squared times Var(w). For n times Var(aw) to be 1, we take the weights, which we have drawn from a standard normal distribution with variance 1, and multiply them by a = 1 over the square root of n. This way, n multiplied by Var(aw) equals 1. A related scheme is called Xavier initialization, and it multiplies the weights by the square root of 2 divided by the square root of the number of inputs plus the number of outputs of your hidden layer. The initialization for ReLU neurons multiplies by the square root of 2 over the square root of the number of inputs.

Now we know how to initialize our network to constrain the variance. But what if it changes during training? Then we don't control the variance anymore, and anything can happen. There is a technique known as batch normalization that controls the mean and variance of outputs before activations. Let's normalize the neuron output before activation, which is denoted by h. The first step is to enforce zero mean and unit variance: we subtract the mean value of the neuron output and divide by the square root of its variance.
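The initialization schemes just described can be sketched in NumPy as follows (helper names are my own; deep-learning frameworks ship equivalent initializers):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # scale = sqrt(2 / (n_in + n_out)), balancing forward and backward passes
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / (n_in + n_out))

def he_init(n_in, n_out):
    # scale = sqrt(2 / n_in), recommended for ReLU units
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

# Sanity check: with He init, n * Var(w) is about 2, which compensates
# for ReLU zeroing out roughly half of the activations.
W = he_init(1000, 1000)
print(1000 * W.var())  # close to 2
```

With plain 1/sqrt(n) scaling the same check would give n * Var(w) close to 1, which is the condition derived above for keeping the output variance constant.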
Then we multiply it by gamma, which gives us a new variance of gamma squared, and we add beta, so we have a new mean, which is beta. Where do mu and sigma come from? We can estimate them from the current training batch, and we can do that on every step of backpropagation. But what do we do at test time? During testing we use an exponential moving average over the training batches. How does it work? We take the current value of mu or sigma squared, multiply it by (1 - alpha), where alpha is some small number between 0 and 1, and add the current batch mean or variance multiplied by alpha. Doing this over all training batches gives us a moving average of these values, and in practice these averages work better at test time. What about gamma and beta? Normalization is a differentiable operation, so we can learn them with backpropagation. It's not a problem.

There is one more regularization technique, known as dropout, which is used to reduce overfitting. It works like the following: we keep neurons active, that is non-zero, with probability p, sampling them independently, so each neuron becomes inactive with probability 1 - p. This way we sample a sub-network during training and change only a subset of its parameters on every iteration. During testing all neurons are present, but their outputs are multiplied by p to maintain the scale of the inputs to the consecutive layers. Why does it work like that? Let's look at the image. On the left, during training, we had a neuron that was present with probability p, and the consecutive layer multiplied this neuron's output by the weight w. What happens during testing? If we don't change the weight w, the consecutive layers will see an expected value that is much bigger. If we calculate the real expected weight of that neuron, it is the following: with probability p it was multiplied by w, and with probability 1 - p it was 0.
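The dropout behaviour described so far can be sketched like this (a minimal NumPy version under my own naming; many frameworks instead use the "inverted" variant that rescales at training time, which is equivalent in expectation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, train):
    """Keep each activation with probability p; scale by p at test time."""
    if train:
        mask = rng.random(h.shape) < p   # 1 with probability p, else 0
        return h * mask
    return h * p                         # match the expected train-time scale

h = np.ones(100000)
train_out = dropout(h, p=0.8, train=True)
test_out = dropout(h, p=0.8, train=False)
print(train_out.mean())  # close to 0.8
print(test_out.mean())   # 0.8
```

The mean activation is about p in both modes, so the layers downstream see the same expected input scale during training and testing.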
That is, it was not active. This means that during testing we need to replace the weight w with pw. The authors of dropout say that it is similar to having an ensemble of an exponentially large number of smaller networks: on every iteration of backpropagation you sample a sub-network, but during testing you use all the neurons.

One more technique used in modern convolutional neural networks is data augmentation. Modern models have millions of parameters, as we will see later, but data sets are not that huge. We can generate new examples by applying distortions such as flips, rotations, color shifts, scaling, and so on. You can see this in the image with the cats: when you distort the image, the cats are still cats, but you get new training examples that can help our deep neural network to generalize better. Remember that convolutional neural networks are invariant to translation, so there is no need to add translation distortions.

We have reviewed activation functions, weight initialization, and a bunch of new techniques that help to train better networks. What are the takeaways? First, use ReLU activation: it doesn't saturate and it converges faster. Use He et al. initialization, which multiplies the weights by the square root of 2 divided by the square root of the number of inputs. Try adding batch normalization or dropout; maybe your network will converge better. And try to augment your training data. In the next video you will learn what modern convolutional networks look like. [MUSIC]