Welcome back to our notebook here on gradient descent. In this video, we're going to close out by discussing stochastic gradient descent. Rather than averaging the gradients across the entire dataset before taking any steps, we're now going to take a step for every single data point, as we discussed in lecture. Our exercise here is going to be to run the stochastic gradient descent function that we have, and then also modify the code so that it randomly reorders which of the data points it picks at each iteration, and I'll walk through that in just a second.

First, for stochastic gradient descent, much of the function will look similar to our gradient descent function. We'll set Theta equal to the initial Theta. We'll have our arrays of zeros, which are now sized not just by the number of iterations, but also by the number of observations. The idea here is that the number of iterations is how many times we go through the full dataset, but we're updating at every single data point. So by the time we run through the full dataset once, if our dataset has 100 values, we will have made 100 updates within that single iteration. If our number of iterations is 100 and our number of observations is 100, we'll be making 10,000 updates while only running through the entire dataset 100 times. This will become more familiar when we work with our deep learning models later in this course: there we'll talk about epochs, where an epoch is one run through the full dataset, and you'll also define the batch size, as we talked about with mini-batch gradient descent, to find the right balance between the batch size and how many times we want to run through the entire dataset.

So we're going to have zeros for the Theta path, sized as the number of iterations times the number of observations. We then set the first row of that Theta path equal to Theta initial, and our loss vector is again a vector of zeros with length equal to the number of iterations times the number of observations.

Then, for the main stochastic gradient descent loop, we're saying for i in range of the number of iterations, but now we're also saying for j in range of the number of observations. So it's for i, and then for j inside it, which lets us hit every single observation on every single pass: we do 100 iterations, and within each iteration we go through each of the observations, and that's your value for j. Up until the gradient vector, things are the same. For the gradient vector, we take the difference between the actual value for row j and the value predicted for that specific row, and then take the dot product with only the jth row of the X matrix, so we get the gradient vector for just that single value. With that new gradient vector, we update our Theta values at each step to get each of our new Thetas. Then we update the Theta path according to the count, and the count is incremented by one each time we run through the inner j loop. That's why we have this count variable that keeps adding on: rather than going 0 to 100 like the number of observations, or 0 to 100 like the number of iterations, the count goes from 0 to 10,000 as it works through both of these for loops.
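To make that walkthrough concrete, here is a minimal sketch of a stochastic gradient descent function along those lines, for a linear model with a squared-error loss. The function name, the exact shapes of theta_path and loss_vector, and the choice to record the full-dataset mean squared error after every update are assumptions for illustration; the notebook's actual code may differ in those details.

```python
import numpy as np

def stochastic_gradient_descent(X, y, theta_initial, learning_rate=1e-4, num_iter=100):
    """Take one gradient step per data point; num_iter counts full passes through the dataset."""
    num_obs = X.shape[0]
    theta = theta_initial.copy()

    # One row per update (plus the starting point): num_iter passes * num_obs updates per pass.
    theta_path = np.zeros((num_iter * num_obs + 1, len(theta_initial)))
    theta_path[0] = theta_initial
    loss_vector = np.zeros(num_iter * num_obs)

    count = 0
    for i in range(num_iter):          # passes through the full dataset
        for j in range(num_obs):       # one update per observation
            # Gradient of the squared error for the single jth observation.
            error = X[j].dot(theta) - y[j]
            gradient_vector = error * X[j]

            theta = theta - learning_rate * gradient_vector

            theta_path[count + 1] = theta
            # Record the mean squared error over the full dataset after this update.
            loss_vector[count] = np.mean((X.dot(theta) - y) ** 2)
            count += 1

    return theta_path, loss_vector
```

With a learning rate of 1e-4, 100 iterations, and 100 observations, this records the 10,000 updates discussed above while passing through the data only 100 times.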
Then we just return, as we did before, the Theta path as well as the loss vector. Here we set the learning rate to 1e-4, the number of iterations again is just 100, which means we're making 10,000 steps while only running through the full dataset 100 times, and the Theta initial is going to be 3, 3, 3 again. We can call stochastic gradient descent, get our path as well as our loss vector, and use that same plot all function that we defined above to plot out the Theta path, the loss vector, and the appropriate labels for the learning rate, the number of iterations, and the initial Theta. We run this, and we can see the path that was taken. If we look at the bottom left graph, that's your first clear observation that the path swerves back and forth rather than making a straight line toward where we're trying to aim, as it did with ordinary gradient descent. So it's a bit more random.

Now, something to note if you're doing something like stochastic gradient descent or mini-batch gradient descent: if we go through the values in that fixed ordering in the for loop, the update at iteration 20 will depend on the update from iteration 19, which depends on the update from iteration 18, and so on. So it will be biased according to the ordering of our actual dataframe. Rather than do that, we're going to make this a bit more random, and at each step set rand j equal to a random integer using np.random.randint, drawn from zero up to the number of observations. We run this, and now, rather than that fixed ordering, we have a bit more randomness. We can see the path is a bit more squiggly, but that ensures it doesn't have the clear back-and-forth pattern we saw before and isn't dependent on the ordering of the dataframe.

Now, we can play around with this: as we see here, we can increase the number of iterations. I'll run this now because it will take a second; again, we're making 10,000 times 100 updates along the way. So using stochastic gradient descent, hopefully we will still be able to get to that solution. In general, as mentioned during the lecture, stochastic gradient descent allows you to speed things up, but at the same time it may not land exactly on the right solution, since it bounces around on its way to that ultimate solution. That may take a second to run. As we see here, once we increase the number of iterations, that may even have been too many iterations; we didn't have to go as long as we did. We see that we end up at the final point we'd hoped for, with Theta_2 equal to 5, if we look at the bottom left, and Theta_1 equal to 2. We see a sharp decrease right at the beginning of the iterations and only a very slight amount of improvement along the way, but we do end up where we're aiming. Now, something to note here: as I said, we may not have needed as many iterations, because as we get towards the bottom of that slope, if we imagine the convex curve we discussed during lecture, the gradient gets smaller and smaller as it approaches the optimum value. So the updates will be smaller and smaller as we get closer and closer to the optimum.
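As a sketch of that modification, here is what the randomly sampled version of the loop might look like, along with an example call matching the settings from the walkthrough. The names stochastic_gradient_descent_random and plot_all, and their exact signatures, are assumptions based on the walkthrough rather than the notebook's exact code, and X and y stand in for the notebook's data matrix and targets.

```python
def stochastic_gradient_descent_random(X, y, theta_initial, learning_rate=1e-4, num_iter=100):
    """Same as above, but each update uses a randomly chosen observation instead of row j."""
    num_obs = X.shape[0]
    theta = theta_initial.copy()
    theta_path = np.zeros((num_iter * num_obs + 1, len(theta_initial)))
    theta_path[0] = theta_initial
    loss_vector = np.zeros(num_iter * num_obs)

    count = 0
    for i in range(num_iter):
        for j in range(num_obs):
            rand_j = np.random.randint(num_obs)       # random row, not tied to dataframe order
            error = X[rand_j].dot(theta) - y[rand_j]
            gradient_vector = error * X[rand_j]
            theta = theta - learning_rate * gradient_vector
            theta_path[count + 1] = theta
            loss_vector[count] = np.mean((X.dot(theta) - y) ** 2)
            count += 1

    return theta_path, loss_vector


# Example call; X and y are the notebook's data matrix and target vector.
learning_rate = 1e-4
num_iter = 100
theta_initial = np.array([3.0, 3.0, 3.0])
theta_path, loss_vector = stochastic_gradient_descent_random(
    X, y, theta_initial, learning_rate=learning_rate, num_iter=num_iter
)
# plot_all(theta_path, loss_vector, learning_rate, num_iter, theta_initial)  # assumed signature
```

Drawing a fresh random index at every update is one simple way to break the dependence on row order; shuffling the dataset once per pass is another common choice.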
That ensures that, if we have a small enough learning rate, we don't overshoot it; we just keep heading in the right direction and settle toward that optimal minimum point of the convex curve. Now, that closes out our section here on gradient descent, and with that, we'll get back to lecture and discuss working with our neural networks further. I'll see you there.