I've previously told you about how stepwise regression methods can be used to help with variable selection, but that this approach has significant drawbacks. So faced with a large data set and many potential variables, how should you decide which ones to include? Well, there's no single recommended approach to modeling, but what's important with model development is to be clear about what has been done and why. It's the transparency in presenting and reporting our methods that allows others to evaluate and debate the decisions made. So in this lecture, I'm going to give you some suggestions on how you can approach this. So the first stage is to identify known predictors and their interactions from the published literature. It will be important to include all known predictors in your model regardless of the level of significance you observe, as this data set is only one sample drawn from the population. And when we say known predictors, what we're saying is we believe there exists a true relationship between the predictor and the outcome in the population, which may or may not reach a 5 percent significance threshold in this particular sample. So also at this initial stage, it's good to think about which variables have weak or controversial evidence that you're interested in investigating further. The next thing I suggest is to thoroughly examine the data, as there may be some variables it's not sensible or feasible to include. When there's only a small or moderate amount of missing data, it's appropriate to use imputation techniques. There's no hard and fast rule as to what percentage of missing data is acceptable, but certainly 50 percent seems to be a tipping point. The decision to exclude a variable because of missing data will depend on the importance of the variable and the percentage that's missing, and that decision will require your judgment.
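To make the missing data check concrete, here is a minimal sketch of how you might tabulate the percentage missing for each variable before deciding what to do with it. The records, the variable names (age, fev1, walk_distance) and the review threshold are all made up for illustration; they are not taken from the course data set, and the 50 percent cut-off is only the rough tipping point mentioned above, not a rule.

```python
# Hypothetical records: None marks a missing value.
# Variable names are illustrative assumptions, not the course data set.
records = [
    {"age": 71, "fev1": 1.8, "walk_distance": None},
    {"age": 64, "fev1": None, "walk_distance": None},
    {"age": 58, "fev1": 2.4, "walk_distance": 410},
    {"age": 69, "fev1": 1.5, "walk_distance": None},
]


def percent_missing(rows):
    """Return {variable: percentage of rows in which the value is None}."""
    variables = rows[0].keys()
    return {
        var: 100.0 * sum(row[var] is None for row in rows) / len(rows)
        for var in variables
    }


missing = percent_missing(records)

# Assumed review threshold: flag variables with 50 percent or more missing.
# Flagged variables still need a judgment call; this does not decide for you.
REVIEW_THRESHOLD = 50.0
needs_review = [var for var, pct in missing.items() if pct >= REVIEW_THRESHOLD]

for var, pct in missing.items():
    print(f"{var}: {pct:.0f}% missing")
print("Review before including:", needs_review)
```

A variable that gets flagged here is not automatically excluded. How important it is to the question you're asking still has to be weighed against how much of it is missing.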
So another reason you may not want to include a variable is if it's got a very narrow distribution, as this means there will be very limited variability to explore. With categorical data, some cell frequencies may be too low to include. For example, if there were three men and a hundred women, then gender is not a variable we could assess. We saw before that if collinear variables are included in the model, it will cause significant problems with estimation, so if you've got highly correlated candidate predictors, you will need to decide which of these are the most appropriate or interesting to include. And again, this will require your judgment. So my next suggestion is one that needs to be used with caution. Instead of including several variables separately, it may be useful to combine them to make one variable. For example, a validated symptom score, which is calculated from several different variables, could be included instead of including each of these variables separately. This means you can adjust for many variables in one go whilst only using one regression parameter, but it won't be as accurate as including the individual variables that make up that score, and it also means you won't obtain the individual estimates for these variables either. So it depends a bit on the purpose of the model and what you need to get out of it. So after examining the data and deriving a list of candidate predictors to include, that's a good time to think about what interactions you may also want to examine. It may be just those that you've already identified from the literature, or you might have a particular hypothesis in mind. It's good to pre-specify all the interactions of interest, because there are so many combinations that this can end up as a bit of a data dredging exercise if you're not careful.
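Two of the data checks described above, sparse categories and highly correlated candidate predictors, can be sketched in a few lines. This is only an illustration under stated assumptions: the gender counts mirror the three men and a hundred women example, the lung function values (fev1, fvc) are invented, and the cut-offs MIN_CELL and HIGH_CORRELATION are hypothetical choices you would need to justify yourself.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Check 1: cell frequencies of a categorical variable.
gender = ["F"] * 100 + ["M"] * 3
counts = Counter(gender)
MIN_CELL = 10  # assumed cut-off for illustration only
sparse_levels = [level for level, n in counts.items() if n < MIN_CELL]

# Check 2: correlation between two candidate predictors (invented values).
fev1 = [1.2, 1.8, 2.1, 2.6, 3.0]
fvc = [2.0, 2.9, 3.3, 4.1, 4.8]
r = pearson_r(fev1, fvc)
HIGH_CORRELATION = 0.9  # assumed cut-off for illustration only
collinear = abs(r) > HIGH_CORRELATION

print("Cell counts:", dict(counts))
print("Sparse levels:", sparse_levels)
print(f"Correlation between fev1 and fvc: {r:.3f}, flag as collinear: {collinear}")
```

Neither check makes the decision for you. They just show where your judgment is needed, for example which of two collinear predictors is the more appropriate or interesting one to keep.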
And finally, if you still have too many predictors, then a statistical data reduction method can be used. There aren't many that exist that don't suffer the same limitations as stepwise regression, but these are outside the scope of this course and you won't need to know about them for this exercise. You may want to read more about these techniques when you get more experience with regression. So writing down a strategy will help you keep a clear focus as you develop your model and should reduce the temptation of data dredging. So those are my suggestions to help you structure your approach to model building. Now I just want to summarise the practices to avoid. Don't test each variable at the 5 percent significance level as a way to select variables to include in your model. This approach is even worse than using stepwise regression, as relationships at the univariable level can disappear or appear at the multivariable level depending on confounding. Don't use forwards or stepwise selection procedures. If you're tempted to use any, then a backwards selection procedure is the least offensive. And finally, don't include known predictors only if they are statistically significant at the 5 percent level. If they are known predictors, they should be included in the model because, regardless of their p-value, they will alter the estimates of the other regression coefficients in the model. So in summary, developing a strategy will help you navigate the decisions needed when developing a model. The variables should be included based on the published literature and your opinion as an investigator, and not wholly on automated selection approaches. Examining the data can help rule out variables based on poor quality, narrow distributions and collinearity issues. If need be, an appropriate data reduction method can be used to help with variable selection.
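To see why screening variables one at a time is risky, here is a small made-up example in which a relationship that shows up at the univariable level disappears once a confounder is adjusted for. The data are constructed by hand so the arithmetic is easy to follow; the names exposure, confounder and outcome are hypothetical, and the two helper functions are a bare-bones least squares sketch rather than a replacement for a proper regression routine.

```python
def centred(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def slope_univariable(x, y):
    """Least squares slope of y on a single predictor x."""
    xc, yc = centred(x), centred(y)
    return dot(xc, yc) / dot(xc, xc)


def slopes_two_predictors(x1, x2, y):
    """Least squares slopes of y on x1 and x2 together (2x2 normal equations)."""
    a, b, c = centred(x1), centred(x2), centred(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Constructed data: the outcome depends only on the confounder (plus noise),
# but the exposure is correlated with the confounder.
confounder = [0, 0, 1, 1, 0, 0, 1, 1]
exposure = [-0.5, 0.5, 0.5, 1.5, -0.5, 0.5, 0.5, 1.5]
outcome = [0.1, -0.1, 1.9, 2.1, -0.1, 0.1, 2.1, 1.9]

crude = slope_univariable(exposure, outcome)
adjusted_exposure, adjusted_confounder = slopes_two_predictors(
    exposure, confounder, outcome
)

print(f"Univariable slope for exposure: {crude:.2f}")
print(f"Adjusted slope for exposure: {adjusted_exposure:.2f}")
print(f"Adjusted slope for confounder: {adjusted_confounder:.2f}")
```

In this constructed example the exposure looks related to the outcome on its own, but its coefficient is essentially zero once the confounder is in the model. The reverse can also happen, which is why a univariable screen can both let in and throw out the wrong variables.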
So now it's your chance to bring together everything you've learned on this course: choose an outcome from the COPD data set, map out a model building strategy, then interpret and report your results. Good luck.