Using NumPy, I create my training data. My first feature is the square footage, the second feature is the number of bedrooms, and the target value, the thing we're trying to predict with our hypothesis function, is the price. I literally just made these up, and I just made them go in increasing order. So that's the training data. Then I created another array, test_data, of square footage and number of bedrooms; after we've trained the network, we give it this data and see how it performs. We can extract the shape of this training data to get the number of rows and the number of columns, and I need to extract just the features out of the training data. So this extracts from row zero to row m, from column zero to column n minus one, because the price was in the third column, or column two if we use zero-based indexing. Column zero is the square footage, column one is the number of bedrooms, and column two is the price, y. I didn't want the prices mixed in; I want to extract the feature vectors by themselves, and then I want to extract the y values, which are just the prices. That's what happens on these two lines here. There are a bunch of comments and stuff left in there from when I was just playing around, getting a feel for all of this. So, here's the hypothesis function. You pass in the Theta array and the x array, and it runs the dot product, Theta transpose times x, also known as the inner product. So it's just a matrix multiplication: multiply, sum them all up, and return the value as our h of x, our hypothesis. I was looking at the names of these functions earlier and I'm not particularly happy with what I called them. This probably should have been called the error, and this should have been called something like next_thetas, but when you're in a hurry, as I was developing the code down below, I called one cost and the other error.
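A minimal sketch of the setup described above. The actual housing numbers and array names here are made up for illustration (the transcript only says the values increase); the slicing and the dot-product hypothesis follow the description.

```python
import numpy as np

# Hypothetical stand-in for the training data described:
# column 0 = square footage, column 1 = bedrooms, column 2 = price (the target).
training_data = np.array([
    [1000.0, 1.0, 100000.0],
    [1500.0, 2.0, 150000.0],
    [2000.0, 3.0, 200000.0],
    [2500.0, 4.0, 250000.0],
])

m, n = training_data.shape          # number of rows (samples) and columns
X = training_data[0:m, 0:n - 1]     # features only: drop the price column
y = training_data[0:m, n - 1]       # targets: just the prices

def hypothesis(theta, x):
    # h(x) = Theta transpose times x -- the inner (dot) product
    return np.dot(theta, x)
```

The slice `training_data[0:m, 0:n - 1]` is what pulls the feature vectors out by themselves, leaving the price column behind for `y`.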
But this is the rule for updating the Thetas: y sub i minus h of Theta of x sub i, times x sub j of i. That performs the math, and I use it to calculate the size of the error. These are some of the hyper-parameters I needed. Alpha is the learning rate, which I needed to make very small, and I needed a way to control the iterations, check for out-of-range values, and look for a delta threshold as I was updating the Theta values. If you remember from the slides on Tuesday, we want the Thetas to approach a constant value, and we detect some threshold at which we stop the iterating process. So, here's the definition of the batch gradient descent. I have some instrumentation in here to count the number of iterations and how many times the cost function and the error function are called, but the gist of it is right here. For each j in the range of n — these are all of our Theta values — we're going to do an update on the Theta values, our weights. For the very first one, we fall down here and go through every training sample. So for every Theta value, we run through all the training examples. Here you can see what I call the error function, which tells me how much error I had, and I sum it right here. This is doing the summation; I chose to do the summation out in the loop instead of building it into the function, and I count how many times I call it. Then the cost function gets called down here, where I pass in the Theta values and the current errors, and it returns the new Theta array. The new updated Theta values — that's one iteration — with some more instrumentation counting how many times I did that.
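The loop structure described — for each Theta value, run through all the training examples, sum the errors, then update — can be sketched roughly like this. This is not the author's exact code (the cost/error function split and instrumentation are omitted), and it assumes `X` already carries an intercept column of ones; the update rule is the one stated: Theta j plus Alpha times the sum over i of (y_i − h(x_i)) · x_j of i.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.0001, max_iters=1000, threshold=1e-6):
    # Sketch of batch gradient descent as described in the transcript.
    m, n = X.shape
    theta = np.ones(n)                  # Thetas initialized to all ones
    for _ in range(max_iters):
        new_theta = np.zeros(n)
        for j in range(n):              # one update per Theta value (weight)
            total_error = 0.0
            for i in range(m):          # run through every training sample
                error = y[i] - np.dot(theta, X[i])   # y_i - h_theta(x_i)
                total_error += error * X[i, j]       # summed out in the loop
            new_theta[j] = theta[j] + alpha * total_error
        # stop once the Thetas approach constant values (delta under threshold)
        if np.max(np.abs(new_theta - theta)) < threshold:
            theta = new_theta
            break
        theta = new_theta
    return theta
```

Note all n updates in one iteration are computed from the same old Thetas, which is what makes this the batch version.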
I was curious to see, given different datasets, how many times these functions would get called. Then this is all the code to check for convergence and to see if my Thetas were running away because my Alpha value, my learning rate, was way too high, and I used that to keep turning it down. I set it to one, then 0.1, then 0.01, 0.001, and 0.0001, until I got down to something where the Thetas weren't oscillating out of control with these huge swings. So, that's the routine. I initialized my Thetas to all ones, initialized a new Theta array to all zeros, initialized the cost and some of these counters, and then I called the batch gradient descent function. It does its thing, iterates through, cranks out the Theta values, and prints out some information about them: the Theta values, as in the slides, and how many times the error and cost functions were called. I won't spend time on plot_results; I needed a subroutine to plot all the values for me, so I'm not going to go through all of that. I want to get down to this stochastic gradient descent. It's very, very similar in structure, except the outer loop is for i in range of the m training samples. We just step through each of the 16 training examples one at a time, and then we iterate on the Theta values in exactly the same way we did before in the batch gradient descent. Other than that, the code is identical, as best I can tell; I was looking at it at lunchtime today and didn't see anything that jumped out at me as being substantially different. For huge numbers of training examples, it can save some computation time, if you have enough samples. As I alluded to, I believe the reason the stochastic gradient descent had much higher errors was that I just didn't have enough data. I only had 16 samples. If it had been 10,000 samples, it might have tightened up the error. That's a hypothesis on my part, unproven, but I suspect it's the case.
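A sketch of the stochastic variant as described: the outer loop steps through the m training samples one at a time, and the Thetas are updated after each single sample rather than after a full pass of summation. Again this is an illustration, not the author's exact code, and it assumes an intercept column of ones in `X`; the `epochs` parameter (repeated passes over the data) is an assumption on my part.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.0001, epochs=100):
    # Sketch of SGD: same update rule, but applied per sample.
    m, n = X.shape
    theta = np.ones(n)
    for _ in range(epochs):
        for i in range(m):                       # one training sample at a time
            error = y[i] - np.dot(theta, X[i])   # y_i - h_theta(x_i)
            for j in range(n):                   # update every Theta value
                theta[j] = theta[j] + alpha * error * X[i, j]
    return theta
```

The inner update is identical to the batch rule with the summation over i removed, which is why the two routines look nearly the same line for line.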
All this code for checking out-of-range values and all that is exactly the same. Again, when I get here, I do the same thing: I initialize Theta and the new Thetas, initialize my instrumentation, call the stochastic gradient descent function, print its results, and then plot the values. Then there's the direct calculation: the final Thetas are x transpose times x, take the inverse of that, multiply it by the transpose of x, multiplied by y, and that's what this line is right there. You just do one line of code. This only calculated Theta one and Theta two, so I appended the one on there to get the intercept term; I'd forgotten about that. I thought it was pretty cool that you're able to directly calculate what the weights are through linear algebra. When you run it, we get the results here in this IPython console window. It'll crank, and we'll see the graphs: the first graph for the batch gradient descent, then the graph of the stochastic, and then the directly calculated solution.
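That one-line direct calculation is the closed-form least-squares solution, Theta = (xᵀx)⁻¹ xᵀy. A minimal sketch, assuming the ones column for the intercept has already been appended to x (the step the transcript says was added after the fact); the example numbers are made up.

```python
import numpy as np

def direct_thetas(X, y):
    # Theta = (X^T X)^{-1} X^T y -- the one-line linear-algebra solution
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Made-up example: y = 3 + 2x exactly, with the intercept ones column included.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([5.0, 7.0, 9.0])
theta = direct_thetas(X, y)   # recovers intercept 3 and slope 2
```

In practice `np.linalg.lstsq` is the numerically safer way to get the same answer, but the explicit inverse matches the one-liner described.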