Let's now take a brief journey through the history of machine learning, to see how it has evolved over time into the deep learning neural networks that are so popular today. You'll notice that despite neural networks coming in and out of style over the last several decades, tricks and techniques developed for other algorithms can be applied to deep learning neural networks, making them very powerful.

Linear regression was invented for predicting the movement of planets, and the sizes of pea seeds based on the sizes of their parents. Sir Francis Galton pioneered the application of statistical methods to measurements of natural phenomena. He was looking at data on the relative sizes of parents and children in various species, including sweet peas. He observed something that is not immediately obvious, something really strange. Sure, a larger-than-average parent tends to produce larger-than-average children, but how much larger is the child relative to the average of the other children in its generation? It turned out that this ratio for the child is less than the corresponding ratio for the parent. For example, if the parent's size is 1.5 standard deviations from the mean within its own generation, then he would predict the child's size to be less than 1.5 standard deviations from the mean within its cohort. We say that, generation by generation, things in nature regress, or go back to the mean, hence the name linear regression. The chart here, from 1877, is the first-ever linear regression. Pretty cool. Compute power in the 1800s was somewhat limited, so they didn't even realize how well this would continue to work once we had large datasets. There is actually a closed-form solution for solving linear regression, but gradient descent methods can also be used, each with their pros and cons depending on your dataset.

Let's look under the hood at how linear regression works, and go a little more into detail to understand the motivations around it. We begin with a linear equation that we hypothesize describes our system: we multiply various weights by our observed feature values and then sum it all up. We can represent this with the equation y = w0·x0 + w1·x1 + w2·x2, and so on for each feature in our model. In other words, we apply this equation to every row in our dataset, where the weight values are fixed and the feature values come from each associated column in our machine learning dataset. This can be nicely packaged into the matrix equation y = Xw. This hypothesis equation is very important, not just for linear regression, but for other machine learning models, such as deep neural networks, which we will discuss later.

But how can I determine whether the weights I have chosen are making good or bad guesses? The answer is that we need to create a loss function, which is essentially just an objective function we want to optimize. As explained earlier, for regression problems the loss function is typically mean squared error, which in matrix form is (y − Xw)^T (y − Xw); I drop the constant factor since it will disappear later in the derivation. We first find the difference between the actual label values, y, and our predicted label values, y hat, which is just Xw. Remember, though, that my goal is to reduce the loss as much as possible, so I need to find a way to minimize it with respect to the weights.
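Before moving on, here is a minimal sketch of that hypothesis and loss in NumPy. The toy feature matrix, labels, and candidate weights are invented for illustration; only the equations y_hat = Xw and mean squared error come from the lecture.

```python
import numpy as np

# Toy dataset: 4 examples, 3 features (values invented for illustration).
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 1.0],
              [0.5, 1.5, 2.5],
              [3.0, 1.0, 0.0]])
y = np.array([14.0, 5.0, 11.5, 5.0])   # actual labels

w = np.array([1.0, 2.0, 3.0])          # one candidate set of weights

# Hypothesis: each prediction is a weighted sum of that row's features,
# y_hat = X @ w, the matrix form of w0*x0 + w1*x1 + w2*x2.
y_hat = X @ w

# Mean squared error between actual and predicted labels.
loss = np.mean((y - y_hat) ** 2)
print(loss)   # 0.0625 for these toy values
```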
To minimize the loss, I take the derivative with respect to the weights in the one-dimensional case, or more generally the gradient when I have multiple features. I can then use this to find the global minimum. The equation here (I won't get into the derivation) provides a closed-form analytical solution for linear regression: w = (X^T X)^(-1) X^T y. Meaning that if you plug the X and y values into this formula, you will get the values for the weights.

But this is not very practical; there are issues with the inverse. We are first assuming that the Gram matrix, X transpose X, is non-singular, meaning that all the columns of our feature matrix X are linearly independent. But in real-world datasets, you do have duplicate or nearly duplicate data: the same customer buying the same product again, two photos of the same sunrise taken seconds apart. Even if the columns are technically linearly independent, the Gram matrix could still be ill-conditioned, making it computationally singular and still causing problems for us. The inverse also has a time complexity of O(N^3) using the naive algorithm, and it doesn't get much better using fancier algorithms, which come with some of their own numerical issues. The same goes for even the multiplication to create the Gram matrix. We might instead solve the normal equations using something called a Cholesky or QR decomposition. At O(N^3), or even O(N^2.5), once N reaches 10,000 or more, the algorithm can be very slow. So yes, you can solve exactly for the weights using the normal equation, but it is very dependent on your data, your model, which linear algebra matrix algorithms you are using, and so on.

Thankfully, there is the gradient descent optimization algorithm, which is, one, less expensive computationally in both time and/or memory; two, more amenable to generalization; and three, generic enough to work on most problems. In gradient descent, we have our loss function, or more generally our objective function, which is parameterized by the weights of our model. Within this space there are hills and valleys, just like the Earth has. However, in many machine learning problems there will be many more dimensions than the 3D spatial world we live in. Since this is gradient descent, minimization along the gradient, and not ascent, which instead would be maximization, we want to traverse the loss hypersurface searching for the global minimum. In other words, we hope to find the lowest valley, regardless of where we start on the hypersurface. This can be done by finding the gradient of the loss function, multiplying it by a hyperparameter, the learning rate, and then subtracting that value from the current weights. This process iterates until convergence. The cost of choosing the optimal learning rate and waiting through many iterations might make you choose the normal equation instead, assuming the number of features is small, there are no collinearity issues, and so on. Or you might add a gradient descent optimizer, such as momentum, or use a decaying learning rate. We'll talk much more about the details of gradient descent in the next module.
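To ground the trade-off just described, here is a small sketch in NumPy that fits the same synthetic linear problem both ways: once by solving the normal equations directly, and once with batch gradient descent. The data, learning rate, and iteration count are all assumed values for illustration, not from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic problem: 100 examples, 3 features, known true weights.
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Closed form: solve the normal equations (X^T X) w = X^T y.
# np.linalg.solve factorizes rather than forming an explicit inverse;
# a Cholesky solve would additionally exploit the Gram matrix's symmetry.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on mean squared error. The gradient is
# -2/N * X^T (y - X @ w); the 2 is folded into the learning rate,
# matching the constant dropped in the lecture.
w = np.zeros(3)
learning_rate = 0.01   # assumed value: too high diverges, too low crawls
for _ in range(1000):
    gradient = -X.T @ (y - X @ w) / len(y)
    w -= learning_rate * gradient

print(w_closed)  # both estimates land near [2.0, -1.0, 0.5]
print(w)
```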
What is a hyperparameter that helps determine gradient descent's step size along the hypersurface, to hopefully speed up convergence? The correct answer is the learning rate. The learning rate, along with some other hyperparameters that you will learn about in future modules, helps decide the step size in gradient descent. If set too low, gradient descent takes a very long time to converge. If set too high, gradient descent might even diverge, increasing the loss more and more. The other three answers all have to do with collinearity and conditioning, which we don't have to worry about with gradient descent the way we would when using the normal equation.
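As a quick illustration of that failure mode, here is a hedged one-feature sketch, again with made-up data and rates, where a reasonable learning rate converges and an overly large one makes the loss blow up.

```python
import numpy as np

rng = np.random.default_rng(1)

# One-feature problem: y is roughly 3x (data invented for illustration).
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)

def final_loss(learning_rate, steps=20):
    """Run gradient descent from w = 0 and return the ending MSE."""
    w = 0.0
    for _ in range(steps):
        gradient = -2.0 * np.mean(x * (y - w * x))
        w -= learning_rate * gradient
    return np.mean((y - w * x) ** 2)

print(final_loss(0.1))   # small: converges toward the noise floor
print(final_loss(1.5))   # enormous: each step overshoots and diverges
```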