Welcome to our notebook here on gradient descent. In this notebook, we're going to have an overview of working with gradient descent in order to solve simple linear regression, as well as working with stochastic gradient descent, which we defined in lecture as taking a single row, seeing the error on just that row, and moving according to that error, compared with vanilla gradient descent, where we use the entire data set. To start off, we're going to import all the necessary libraries, here just NumPy, pandas, and Matplotlib. We're then going to generate data from a known distribution, so we'll know the actual values that we want to recover when we do our gradient descent.

If we think about working with linear regression in general, what we're trying to solve for is some Y, where that Y is equal to some Betas, or some different coefficients, multiplied by the different values in our X data set. Here we have Y equals b, which is just our intercept term, plus Theta_1 times X_1 plus Theta_2 times X_2 plus some error term. In order to generate the data where we specify each one of the different Thetas, we're going to have x1 and x2 each be random values between zero and 10, where any value between zero and 10 is equally likely to be picked, since we're drawing from a uniform distribution. We're then going to actually set the values for b, Theta_1, and Theta_2: b is going to be 1.5, Theta_1 is going to be equal to 2, and Theta_2 is going to be equal to 5. From there we can generate our y values as well as our feature matrix, which will hold our x1 and x2 values. How do we do that? The first thing we want to do is set the random seed, so that you back home are seeing the same solutions that we have here. We're then going to say that we want 100 observations, and our x1s and our x2s are going to be random values between zero and 10.
So np.random.uniform, values between zero and 10, and we want 100 different observations between those values. We set that equal to our x1 and our x2, and then for our constant term, we're just going to call np.ones, which creates an array of ones of whatever shape you define; here we define it as a one-dimensional array with 100 values. Then finally we're going to add on that error term. If you recall up here, we also want to include the error term; this is to ensure that the data doesn't fit exactly. We'll set that error term equal to values drawn from a normal distribution with a mean of zero and a standard deviation of 0.5, and again, we want 100 different values. We're then going to choose our b, our Theta_1, and our Theta_2 to match the values we defined above. Then y is just going to be equal to b times that constant term, which is just our ones, plus Theta_1 times our x1, which we defined as random values between zero and 10, plus Theta_2 times x2, which is again values between zero and 10, plus that small error value. We're then going to create an array out of our x1, x2, and our constant term so that we have our x matrix, or our feature matrix. We run this, and then we can see what our actual y values are. If we look at this x_mat, each y should be something along the lines of two times this value, plus five times this value, plus 1.5, since 1.5 will just be multiplied by 1. We'll have that for each x1 and x2. In order to get the right answer directly, we can look at the closed form version of this model rather than using something like gradient descent. With linear regression, we can actually use matrix algebra to get the exact solution that will find the maximum likelihood, or least squares, estimate for our data set, and that's just going to be this matrix algebra here.
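Before moving on to the closed form, the data-generation steps walked through above might look something like the following sketch. Note that the seed value of 42 and the column order of x_mat are assumptions here, since the notebook's exact code isn't shown:

```python
import numpy as np

# Assumed seed; the notebook's actual seed value may differ
np.random.seed(42)

n_obs = 100

# x1 and x2: uniform random values between 0 and 10
x1 = np.random.uniform(0, 10, n_obs)
x2 = np.random.uniform(0, 10, n_obs)

# Constant column of ones for the intercept term
const = np.ones(n_obs)

# Small noise term: normal with mean 0 and standard deviation 0.5
error = np.random.normal(0, 0.5, n_obs)

# The true parameters we will later try to recover
b, theta_1, theta_2 = 1.5, 2, 5

# y = b*1 + theta_1*x1 + theta_2*x2 + error
y = b * const + theta_1 * x1 + theta_2 * x2 + error

# Feature matrix: one row per observation, columns [x1, x2, constant]
x_mat = np.array([x1, x2, const]).T
```

Because the constant column is baked into x_mat, the intercept b is treated just like any other coefficient in what follows.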
It's not too important; all that's important here is to know that there is a closed form solution, and that for linear regression you do not necessarily have to use gradient descent to find each one of your parameters. Now, the reason we introduce gradient descent is that once we're doing deep learning, or even for many of our other models, we can't find a closed form solution, and we'll need to use gradient descent to move towards that optimal value, as we discussed in lecture. Here we're going to use sklearn's linear regression model, as well as the actual matrix algebra that we have defined here, which we can carry out with NumPy. From sklearn, we're going to import our linear regression model and call LinearRegression. We don't want it to fit an intercept, since the intercept is already included as the constant column in the x_mat that we defined earlier, so we set fit_intercept equal to False. Then we can fit our x_mat and our y and see what coefficients it comes up with. We can see that they're very close to the values that we wanted: b of 1.5, Theta_1 of two, and Theta_2 of five. Now sklearn's linear regression model is solving this same least squares problem, so we should get the same solution when we carry out this equation just using NumPy: that's the inverse of the dot product of X transpose and X itself, then the dot product of that with X transpose, and then the dot product of that with y. When we look at the solution that comes up with, again, it's exactly the same as what we just saw with linear regression from sklearn. Now that closes out this section, just getting an intro to the data that we're working with. In the next section, in the next video, we're going to discuss actually solving this problem using gradient descent, as well as how to visualize that process. I'll see you there.
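Both approaches described above can be sketched as follows. This is a minimal sketch, not the notebook's exact code, and it assumes the same data-generation setup as before (in particular, the seed value of 42 is an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate the data from the previous section
# (assumed seed; the notebook's actual seed may differ)
np.random.seed(42)
n_obs = 100
x1 = np.random.uniform(0, 10, n_obs)
x2 = np.random.uniform(0, 10, n_obs)
const = np.ones(n_obs)
error = np.random.normal(0, 0.5, n_obs)
y = 1.5 * const + 2 * x1 + 5 * x2 + error
x_mat = np.array([x1, x2, const]).T  # columns: x1, x2, constant

# sklearn: the constant column is already in x_mat,
# so we tell the model not to fit a separate intercept
lr = LinearRegression(fit_intercept=False)
lr.fit(x_mat, y)
print(lr.coef_)  # approximately [2, 5, 1.5]

# Closed form (normal equation): beta = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(x_mat.T.dot(x_mat)).dot(x_mat.T).dot(y)
print(beta_hat)  # matches lr.coef_ up to numerical precision
```

The estimates won't be exactly [2, 5, 1.5] because of the added noise, but the two methods agree with each other, which is the point of the comparison.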