In this video we'll discuss linear regression, which is perhaps the most widely used predictive model. Why are we interested in linear regression models? There are at least three reasons. First, linear regression models are easy to interpret. Second, the model is not too complex and is relatively concise. Finally, even if we are interested in more complex models, linear regression can still serve as a useful baseline.

For those of you who have taken a statistics class before, linear regression is likely not new. I caution, however, that using linear regression as a predictive model is somewhat different from the linear regression covered in most high school and college-level statistics classes. In predictive modeling, there is a strong emphasis on prediction, which differs from the focus of classical statistics. We will also use linear regression as a context in which to discuss important issues in predictive modeling.

To make our discussion concrete, I would like to start with an example using some sample data. This dataset contains 314 homes listed for sale in Boulder, Colorado during July 2014. The original dataset has many columns; however, I will only use a few of them to illustrate the concepts. Here is a list of the data columns. I would like to use this dataset to understand what factors determine the list prices of the houses for sale. Therefore, list price is the target variable, and all other variables are predictor variables. Most of the variables are continuous, with the exception of home type, parking type, and ZIP. It is worth pointing out that even though ZIP takes numerical values, it should be treated as a categorical variable, because the numbers in ZIP codes have no quantitative meaning.

Before building our first predictive model, it is important to ask whether this data is appropriate. Obviously, the answer depends on our purpose. Our purpose here is to understand what factors determine the list prices of homes for sale.
This dataset seems to be the relevant one to look at. Of course, we could supplement it with additional data such as historical sales, crime rates, and school districts. From my own experience, these additional fields would be very helpful for our analysis, and they are available if we are willing to spend the time or money to collect them. I choose not to include them here, but you should note this as a limitation of our discussion.

This is a scatter plot showing the relationship between square footage and list price for all homes in our dataset. As we can see, there seems to be a positive association between the two: as square footage increases, list price increases. This positive association is quite intuitive, since larger homes cost more. Linear regression can help us understand this relationship better.

In linear regression, we would like to find the line that best fits the scatter plot. Here, I show a couple of alternatives. Which one is the best-fitting line? Linear regression helps us answer exactly that question. Mathematically, a line can be written as the equation y hat = b0 + b1 times x. Here, y hat is the predicted value of the target variable and x is the value of the predictor variable. b0 is called the intercept and b1 is called the slope. For a given b0 and b1, the value of y hat changes as the value of x changes. When x equals 0, y hat is equal to b0. When the value of x increases by 1 unit, the value of y hat increases by b1 units.

Linear regression gives us a way to find, in some sense, the best-fitting line. For the dataset we have, we obtain that b0 is about -125 and b1 is about 0.43. The red line in the graph shows the fitted line. Note that b0 is the value of the line when the square footage is 0; it is the value on the y-axis where the line crosses it. This explains why b0 is called the intercept. The value of b1 is 0.43.
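To make the fitting step concrete, here is a minimal sketch of how the least-squares estimates of b0 and b1 are computed. The data below is made up for illustration only; the actual Boulder dataset is not reproduced here, so the resulting coefficients will not match the ones quoted above exactly.

```python
import numpy as np

# Hypothetical data: square footage and list price (in thousands of dollars).
# These values are invented for illustration, not taken from the Boulder dataset.
sqft = np.array([1200.0, 1800.0, 2400.0, 3100.0, 4000.0])
price = np.array([400.0, 620.0, 900.0, 1150.0, 1600.0])

# Least-squares estimates: the slope b1 is the sample covariance of x and y
# divided by the sample variance of x; the intercept b0 makes the line pass
# through the point of means.
b1 = np.cov(sqft, price, ddof=1)[0, 1] / np.var(sqft, ddof=1)
b0 = price.mean() - b1 * sqft.mean()

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.4f}")

# Use the fitted line to predict the list price of a 2,000 sq ft home:
y_hat = b0 + b1 * 2000.0
print(f"predicted list price: {y_hat:.1f} thousand dollars")
```

The same estimates can be obtained with `np.polyfit(sqft, price, 1)`; the explicit covariance-over-variance form is shown here only to mirror the textbook formula.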
For each unit increase in square footage, the predicted value y hat increases by 0.43. Recall that list price is in thousands of dollars. Therefore, for each additional square foot, the predicted list price increases by about $430.

Also note that the fitted line from linear regression does not perfectly explain the relationship between square footage and list price. Indeed, on the scatter plot, most, if not all, points lie either above or below the line, meaning that the predicted list price is either below or above the list price in the dataset. The part of an observed list price that is not explained by the fitted value is called the residual, which is given by the observed value minus the fitted value.

Let's look at the scatter plot again. The big red point is an observation in the dataset, which corresponds to a pair of values of square footage and list price. The big blue point on the line shows the predicted value from the fitted line for the same square footage. Obviously, this predicted value is quite far off. How far off it is from the observed value can be measured by the residual, the difference between the observed value and the fitted value. Note that if a point is above the line, the observed value is higher than the predicted value and the residual is a positive number. If the point is below the line, the observed value is smaller than the predicted value and the residual is a negative number.

For the big red point, the observed list price is 6,499. The predicted value can be calculated using the estimated coefficients b0 and b1 and the square footage, which is 5,588. Therefore, the predicted value is about 2,260, and the residual is the difference between the two, which is about 4,238.

Is our fitted line a good fit to the data? One way to assess the accuracy of our fitted line is to see whether it explains the relationship between the two variables well.
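The residual calculation for the big red point can be sketched in a few lines. Note that with the rounded coefficients quoted in this video (b0 about -125, b1 about 0.43), the numbers come out slightly different from the 2,260 and 4,238 above, which were computed from the unrounded estimates.

```python
# Reproduce the residual calculation for the highlighted observation,
# using the rounded coefficient estimates quoted above.
b0, b1 = -125.0, 0.43
sqft = 5588.0        # square footage of the highlighted home
observed = 6499.0    # observed list price, in thousands of dollars

fitted = b0 + b1 * sqft        # predicted list price from the fitted line
residual = observed - fitted   # residual = observed value minus fitted value

print(f"fitted value: {fitted:.2f}")   # about 2277.84 with these rounded coefficients
print(f"residual: {residual:.2f}")     # positive, since the point lies above the line
```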
If all data points lie exactly on the regression line, the line perfectly explains the relationship. In most cases, however, the points will scatter around the regression line. When the data points are close to the line, the line does a better job of explaining the relationship. To capture this measure of accuracy numerically, we use r squared, which can be interpreted as the percentage of the variation in y that is explained by changes in x. r squared is also called the coefficient of determination and takes values between 0 and 1. The bigger the r squared, the better the model fit, because more of the variation in y is explained by changes in x. For our linear regression model, r squared is about 0.64. In other words, 64% of the variation in list prices can be explained by square footage. This model fit is considered quite good. Intuitively, this is not too hard to understand: square footage is perhaps one of the most important factors when people assess the value of a house.

Another question is whether the coefficient estimates are reliable. This question can be answered using p-values, which tell us whether the coefficient estimates are statistically significant. In other words, p-values tell us how reliable the coefficient estimates are. Smaller p-values imply stronger statistical significance; coefficient estimates with smaller p-values are considered more reliable. For our model, the p-value for b0 is 0.0266 and that for b1 is close to 0. This shows that both coefficient estimates are statistically significant. We typically use a cutoff of 0.05 for p-values: any coefficient estimate with a p-value less than 0.05 is considered statistically significant.

In classical statistics, we make a number of assumptions in linear regression. I choose not to discuss them in detail here. I would like to comment that we are less concerned with violations of these classical assumptions in predictive modeling.
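Both r squared and the slope's p-value can be obtained in one call with `scipy.stats.linregress`. The sketch below uses synthetic data generated to roughly resemble the relationship described in this video (b0 = -125, b1 = 0.43 plus noise); the exact output values are illustrative, not the dataset's.

```python
import numpy as np
from scipy import stats

# Illustrative only: generate synthetic data with a noisy linear relation,
# mimicking the coefficients quoted in the video (not the real Boulder data).
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 5000, size=100)
price = -125 + 0.43 * sqft + rng.normal(0, 300, size=100)

result = stats.linregress(sqft, price)

# r squared: fraction of the variation in y explained by changes in x.
r_squared = result.rvalue ** 2
print(f"r squared = {r_squared:.2f}")

# p-value for the slope estimate; apply the usual 0.05 cutoff.
print(f"slope p-value = {result.pvalue:.3g}")
print("slope is significant" if result.pvalue < 0.05 else "slope is not significant")
```

Note that `linregress` reports the p-value for the slope only; tools such as statsmodels' `OLS` report p-values for the intercept as well, matching the kind of output discussed above.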
However, we still need to be careful when interpreting our results if one or more of these assumptions are violated.