Prediction is a fundamentally important activity that also turns out to be extremely challenging. In this module, we have discussed the use of linear regression as a predictive modeling tool. Anyone who is involved in predictive modeling can attest to the challenges of the task, and we discuss some of them here.

When we use linear regression, we have to bear in mind the many explicit and implicit assumptions we make. First and foremost, we assume that the relationship between the predictor variables and the target variable is linear. This may or may not be a reasonable assumption. It is probably safer to assume that the relationship is non-linear, since non-linear relationships are more prevalent in practical data sets. However, estimating a non-linear model can be much more challenging.

We also often have to deal with data limitations. In the housing data example, we expect a strong relationship between list price and square footage. But as we include more and more predictors in the model, it becomes difficult to sort out which relationships matter. Sometimes a relationship is hidden; other times it is simply not there. People engaged in stock price forecasting can attest to this dilemma: there are many factors that are potentially related to stock price movements, but figuring out which ones are most important has proven to be an enduring challenge. There are also situations where there are many variables but very limited data with which to estimate them. We need a sufficient amount of data to obtain meaningful results. When constructing predictive models, it is helpful to bear these data limitations in mind.

It is tempting to believe that models that are more complex and include more variables will perform better. This turns out not to be true: there is a trade-off between model complexity and performance. Consider the example here. For the given training data, our goal is to find the best-performing model, where performance is typically measured on data not seen by the model. If we choose a model that is too complex, we may be fitting to the noise in the data. This is a core issue in predictive modeling. A model that fits the noise in the training data is said to be overfitting.

Overfitting can take many forms. For example, if a validation set is used to compare many different models, overfitting can occur on the validation set, in the sense that we end up choosing the model that best fits the noise in the validation set. This, of course, means that the chosen model will not perform well on new data. This can happen, for example, in best subset selection, where we use a validation set to evaluate many possible combinations of predictor variables included in the model. To avoid this issue, we sometimes partition the data into three sets instead of two. The first set is used to train the models. The second set is used to validate the models and perform model selection. The tuned model is then tested on the last set, which we call the test set.

Linear regression models can be generalized in many ways. We briefly discussed interaction terms and data transformations, both of which introduce terms that are non-linear in the predictor variables. The model is still a linear regression model, however, because it remains linear in the coefficients to be estimated. Another example of a non-linear term is a polynomial term, such as x squared. Through data transformation, we can readily add non-linear functions of the predictor variables and still estimate the resulting models as linear regression models.
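To make this concrete, here is a minimal sketch of adding non-linear transformations of a predictor while the model stays linear in its coefficients. The library (scikit-learn), the simulated data, and the variable names (sqft, list_price) are illustrative assumptions, not something prescribed in this module.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: square footage (predictor) and list price (target).
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3500, size=200)
list_price = 50_000 + 120 * sqft + 0.02 * sqft**2 + rng.normal(0, 20_000, size=200)

# Add non-linear transformations of the predictor (a squared term and a log term).
# The model remains linear in its coefficients, so ordinary least squares still applies.
X = np.column_stack([sqft, sqft**2, np.log(sqft)])

model = LinearRegression().fit(X, list_price)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```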
There are many possible forms of non-linear terms, and we can end up estimating models with many coefficients, which leads to a greater risk of overfitting. For this reason, care has to be taken when adding many non-linear terms to a linear regression model. Model selection techniques can come in handy in such situations.

Another possible extension is a model that is non-linear in both the variables and the coefficients. In the example shown here, g_k is a non-linear function. Such a model is called a neural network, which we will discuss in module four of the course. Neural networks are very flexible and therefore susceptible to overfitting.

Linear regression is a great tool for predictive modeling. It has deep roots in classical statistics and can be used for small or large data sets. It is probably the most well-understood predictive model, and it is easy for users to influence the model in various ways. However, it also suffers from limited flexibility, and the fact that it requires user input means that it may not scale well when the amount of data or the number of predictors is large.

There are many challenges in predictive modeling. Here are the most common ones, each of which deserves careful examination in the predictive modeling process. What is the right target variable? What performance measure should we use? How valid is the model? Which predictors should we use? Should we transform them? How complex should the model be? How can we avoid overfitting? Answering these questions is essential to building predictive models effectively.
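As one way to act on the last two questions, here is a minimal sketch of the three-way split described earlier: polynomial models of increasing complexity are trained on a training set, a degree is selected on a validation set, and the tuned model is evaluated once on a held-out test set. The simulated data, the degree range, and the use of scikit-learn are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Hypothetical noisy data with a mildly non-linear trend.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=300).reshape(-1, 1)
y = 2.0 + 1.5 * x.ravel() - 0.5 * x.ravel() ** 2 + rng.normal(0, 1.0, size=300)

# Partition into training, validation, and test sets (60/20/20).
idx = rng.permutation(len(x))
train, val, test = idx[:180], idx[180:240], idx[240:]

best_degree, best_val_mse = None, np.inf
for degree in range(1, 10):
    # Higher degree means more coefficients, hence a more complex model.
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x[train]), y[train])
    val_mse = mean_squared_error(y[val], model.predict(poly.transform(x[val])))
    if val_mse < best_val_mse:
        best_degree, best_val_mse = degree, val_mse

# The tuned model is evaluated once on the untouched test set.
poly = PolynomialFeatures(best_degree)
model = LinearRegression().fit(poly.fit_transform(x[train]), y[train])
test_mse = mean_squared_error(y[test], model.predict(poly.transform(x[test])))
print(f"selected degree: {best_degree}, test MSE: {test_mse:.3f}")
```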