Now that you have an understanding of what simple linear regression analysis is, I'd like to tell you about the assumptions needed so that the model can be applied correctly. One commonality among all the assumptions is that they check the model's validity by examining the error terms, also known as residuals, in the predictions. Recall that when we predict the value of y for a given x, we will have some error, so the actual y value will be the prediction plus some error value. Residuals are checked to make sure that simple linear regression is a valid model to use. If the regression assumptions are valid, the population of potential error terms will be normally distributed with a mean equal to zero. This means that we will be over-predicting and under-predicting by equal amounts as a whole, so that when we add up all the error terms they cancel each other out and the mean error is therefore zero, ensuring that our model is not biased toward over-predicting or under-predicting. The second assumption is the assumption of constant variance. That means that for any given value of x, the population of potential error term values has a variance that doesn't depend on the value of x, the independent variable. As a result, the variance is constant across all values of x. Third is the normality assumption, which builds on the first two: the residuals should look as if they have been randomly and independently selected from a normally distributed population with a mean of zero and a constant variance sigma squared. However, with any real data the assumptions will not hold exactly. Mild departures do not affect our ability to make statistical inferences; in checking the assumptions, you are looking for pronounced departures. As a result, we only require that the residuals approximately fit these descriptions. The assumptions are checked by plotting the error terms.
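Since the lesson itself has no code, here is a minimal Python sketch of the numeric side of these checks, using made-up data (the GPA and salary values and all function names here are hypothetical, not from the study in the lesson): fit a least-squares line, confirm the residuals average to zero, and crudely compare residual spread for small versus large x as a numeric stand-in for the visual constant-variance check.

```python
import statistics

# Minimal sketch with made-up data: fit y-hat = b0 + b1*x by least squares,
# then inspect the residuals e = y - y-hat numerically.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

xs = [2.5, 2.8, 3.0, 3.2, 3.6, 3.9]   # hypothetical GPAs
ys = [48, 51, 52, 55, 59, 62]         # hypothetical salaries, $1000s
b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(abs(sum(residuals)) < 1e-9)     # True: least squares forces mean zero

# Crude numeric stand-in for the visual fan-shape check: compare residual
# spread in the low-x half vs. the high-x half (residuals sorted by x);
# a ratio near 1 is consistent with constant variance, a large ratio is not.
def spread_ratio(res_by_x):
    half = len(res_by_x) // 2
    lo = statistics.pstdev(res_by_x[:half])
    hi = statistics.pstdev(res_by_x[half:])
    return max(lo, hi) / min(lo, hi)

steady = [0.5, -0.4, 0.3, -0.6, 0.5, -0.3]    # contained horizontal band
fanning = [0.1, -0.1, 0.2, -1.5, 2.0, -2.5]   # spread grows with x
print(spread_ratio(steady) < 2)    # True
print(spread_ratio(fanning) > 5)   # True
```

Note that the mean-zero property here is automatic for least squares with an intercept; the assumptions in the lesson concern the underlying population of errors, which is why we also look at the shape and spread of the residuals rather than just their sum.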
If the normal plot of the error terms looks more or less like a straight line, then the normality assumption holds. Here's the normal probability plot of the error terms for the effect-of-GPA-on-starting-salary study I showed you in the last lesson. The error terms line up nicely and look like a straight line, so the normality assumption holds in this example. For the same example, here's the distribution of error terms around the center line, which represents a mean of zero. First, you can see about the same number of error terms above and below the zero line, which gives us an overall error of zero, so the mean-of-zero assumption holds as well. Now look at the shape of the distribution of these errors: the residuals vary up and down within a contained horizontal band. This meets the assumption of constant variance. The constant variance assumption would have been violated if the plot of errors fanned out, starting small and getting larger, which would mean an increasing variance, or if the opposite occurred, starting large and decreasing. So again, with visual inspection of these plots we can check the constant variance assumption. Next is the linearity assumption, a condition that is satisfied if the scatter plot of x and y looks straight. Then there is the independence assumption: any one value of the error term is statistically independent of any other value of the error term. This really is an important assumption; the error terms must be truly independent of one another. The independence assumption is usually only violated when the data are time-series data. That is because the assumption is positional: it depends on the order of the data. When the data are not time-series, they have no meaningful order, so any order is acceptable. Independence is violated when the value of a variable observed in the current time period is influenced by its value in the previous period, or even the period before that, and so on. This is also known as autocorrelation.
Positive autocorrelation, which is more common, is when a positive error term in time period i tends to be followed by another positive error term in some future period i plus k. For example, if you're looking at money spent on leisure and travel, you know we tend to do more of this in the summer months. So the fact that spending starts to go up in June means July will go up as well, and so on. There is some time-series impact and positive autocorrelation between the months of summer, and we expect these patterns of spending to reappear again in 12 months. Of course, we can also have negative autocorrelation, which is just the opposite: a positive error term in time period i tends to be followed by a negative error term in some future period i plus k. Linear regression is not appropriate for these types of data. If the data more or less do not violate the assumptions mentioned, then linear regression can be used. Then you should be mindful of how to apply the model: the model is only valid for the range of data you have analyzed. Consider this case: you did a study that established a relationship between electricity usage and house size in square feet. You have data on each house's size in square feet and how many kilowatt-hours of electricity it uses per month. The houses in your data are all between 1,800 and 3,000 square feet. The power company sees a new housing development coming up and wants to make sure it will have enough capacity for the additional electricity demand from this new development. Homes in this development will be between 5,000 and 7,500 square feet. Can the power company use the model we developed? The answer is no. We must be very careful not to extrapolate beyond the range of the data we used when we developed the regression equation. So in summary, let me remind you that the objectives of regression are to understand the relationship between variables in past data, to make predictions, and to conduct what-if analysis.
In making predictions, we must be careful not to extrapolate beyond the range within which the model was estimated. Prior to estimating a regression model, it is good practice to use scatter plots showing the relationship between pairs of variables. Scatter plots provide insight into the strength of the relationship between two variables and into the type of relationship: straight line, curve, inverse, and so on. Correlation, ranging from -1 to +1, gives the extent and the direction of the linear relationship between the two variables. R-squared, the coefficient of determination, is the proportion of variability in the dependent variable that can be explained by the independent variable. R-squared values are bounded by 0 and 1. While it's very exciting to find and describe a relationship between two variables that allows us to make predictions, you should not confuse correlation with causation. Proving causation requires far greater evidence than most of these studies can provide, or need to. Just remember how long it took to establish smoking as a cause of lung cancer: we knew that smoking and cancer were correlated for a long time, but establishing smoking as a cause took far longer. You can see the same argument today with global warming: while most everyone agrees that the climate is changing, the argument about its cause is not as well accepted. So when you do regression, don't claim that you have found the cause. Just knowing the correlation on its own gives us a great ability to predict, and that allows us to change outcomes, when we don't like what is about to happen, by changing the values of the independent variable.
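To make the summary concrete, here is a small sketch, again with hypothetical data, computing the correlation r and, for simple (one-predictor) linear regression, R-squared as r squared:

```python
import math

# Sketch with hypothetical data: Pearson correlation r between x and y,
# and R^2 = r^2 for simple linear regression.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))       # sign gives direction, magnitude gives strength
print(round(r * r, 4))   # 0.6: x explains 60% of the variability in y
```

As the lesson stresses, an r near 1 here would justify prediction within the observed x range, but it would say nothing about causation.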