0:02

So up to this point, all that we've talked about is the impact of including unnecessary variables, or omitting necessary ones, on our regression coefficients. But there's one other parameter that we estimate in our regression model: the residual variance, sigma squared. So what happens there? If we assume IID errors, and that our linear model is additive, so that that part of the model is correct, then we can mathematically describe the impact of omitting necessary variables or including unnecessary ones.

0:39

And it falls along the same lines as what we discussed before. If we underfit the model, in other words we omit important variables, then the variance estimate is biased. Why is that? Because we've attributed to random variation things that are actually systematically explained by the covariates that we've omitted.

1:01

On the other hand, if we either correctly fit the model, including all the right terms, or if we overfit the model, then the variance estimate is unbiased. However, the variance of the variance estimate gets inflated if we include unnecessary variables. So it's actually the same rule as we discussed before with the coefficients: if we omit important variables, then we get bias; if we include unnecessary variables, then we get a less reliable estimate. So it's roughly the same impact going on.
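To make the underfitting result concrete, here is a hypothetical simulation, not from the lecture, sketched in Python rather than R. Data are generated as y = 2x + error with error variance 1; fitting the correct model recovers a residual variance near 1, while omitting x (an intercept-only fit) absorbs the systematic 2x signal into the residuals and biases the estimate upward, toward Var(2x) + 1 = 5:

```python
import random

# Hypothetical setup: y = 2*x + e, with Var(e) = 1.
random.seed(42)
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

# Correct fit: simple linear regression of y on x, closed form.
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
rss_full = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2_full = rss_full / (n - 2)  # should land near the true value, 1

# Underfit: omit x entirely (intercept-only model). The residual
# variance now also soaks up the systematic 2*x part of the signal,
# so it is biased upward.
rss_under = sum((yi - ybar) ** 2 for yi in y)
s2_under = rss_under / (n - 1)

print(s2_full, s2_under)
```

The variable names and the specific coefficient 2 are made up for illustration; the point is only the direction of the bias.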

1:40

So let's talk about model selection in general, and automated model selection. I just want to briefly mention again that automated model selection is something we cover in the machine learning class. I think at one point it was a statistical topic, but it's really moved into the realm of machine learning for the most part. I would say, though, that even for relatively simple linear regression models, the space of model terms that you have to search among explodes really quickly when you start including interactions, polynomial terms like the square of a regressor, and so on. If you have a lot of regressors and you're interested in how to reduce this space, then there are factor analytic techniques, and things like principal components, that are available to you to reduce your covariate space down to size. However, those come with consequences: the principal components or factors that you obtain might be less interpretable than the original data you're interested in. Again, this is probably better served in a multivariate class or a class on machine learning.

2:52

But for us, we're going to mostly consider the case where we have a relatively small number of regressors, and we're going to pick through them with a highly interactive process between the analyst, the data, and the scientific context.

3:10

Another thing I would mention is that good design can often eliminate the need for a lot of this model discussion. We've talked a lot about how randomization can prevent many of the problems we're describing, by making our variable of interest unrelated to nuisance variables, whether those are nuisance variables we're not interested in or nuisance variables that we don't even know about.

3:33

However, there are other aspects of design that can serve the same purpose, for example if we stratify and randomize within strata. The classic example, from when these methods were developed, comes from R.A. Fisher's work on field crop experiments. Let's say you're trying different kinds of seed. You might block on different areas of the field that you were going to plant in, and randomize the different seeds to those areas. So you might have two different kinds of seed, but they will have been distributed in a systematic way that is fair across the field, and within that design there will also be some randomization. This topic of experimental design is a pretty broad one. Another great example is in biostatistics, the field I work in most, where a very common kind of design is a crossover design. In that case, you try to use every subject as their own control. So let's say, for example, you're interested in comparing two different kinds of aspirin; say they have different gel coatings or whatever that determine how they get absorbed in your stomach. You might give one aspirin to one group of people and the other aspirin to another group of people. If those two groups aren't the same, either the randomization wasn't very good and there was some sort of imbalance that you just got unlucky about, or, if the study was observational, then the comparison of those two groups might be biased by whatever differentiates the groups, rather than by group one receiving one kind of aspirin and group two receiving a different kind.

5:21

On the other hand, if you can give a person one kind of aspirin, and later on give them the other kind of aspirin when they have another headache, that would compare each person to themselves, right? You block on the person, so to speak. So that's a design strategy. Now, there's some nuance with this design strategy as well.

5:41

What happens if there's some residual effect of the first aspirin when you give the second one? Maybe you could handle that with some sort of wash-out period, a long wash-out period, something like that. But at any rate, the point of that design is that you're comparing people with themselves, to control for everything that's intrinsic to the person, at least across time periods. You control for that by giving both aspirins to each person, and maybe you would randomize the order in which they received them. That's called a crossover design. At any rate, the broader point I'm trying to make is that it's often the case that good, thoughtful experimental design can eliminate the need for some of the main considerations you'd have to go through in model building if you were to just collect data in an observational fashion.

6:30

The last thing I'd say is that there's one automated model search technique that I like quite a bit and find very useful, and it's the idea of looking at nested models. So I'm often interested in a particular variable, and I'm very interested in how the other variables I've collected will impact it. So I'm interested in a treatment, or some similarly important variable, but I'm worried that my treatment groups are imbalanced with respect to potentially some of these other variables. So what I'd like to look at is the model that includes just the treatment by itself, then the model that includes the treatment and, let's say, age, if the ages weren't really balanced between the two treatment groups, and then one that includes age and gender, if maybe the genders between the groups weren't really balanced, and so on. And this idea of creating models that are nested, where every successive model contains all the terms of the previous model, leads to a very easy way of testing each successive model. These nested model comparisons are very easy to do, so I'm just going to show you some code right here

7:40

on how you do nested model testing in R. So I've fit three linear models to our Swiss dataset. The first one includes just agriculture; let's pretend that that's the variable we're interested in, okay? The next one includes agriculture plus examination and education. I've put both of those in because I'm thinking they're kind of measuring the same thing. Now, after this lecture, I'm concerned about the possibility that they're measuring a little too much of the same thing, but let's put that aside for the time being. And then the third model adds Catholic plus infant mortality, so all the terms. So now I have three nested models, and I'm interested in seeing what happens to my effect as I go through those three models.

8:27

The point being, in this case you can test whether or not the inclusion of each additional set of terms is necessary with the anova function. So I do anova(fit1, fit3, fit5); okay, that's what I named them, one, three, five. And then you see down here that what you get is a listing of the models: model one, model two, model three. Then it gives you the residual degrees of freedom, that's the number of data points minus the number of parameters it had to fit, the residual sums of squares, and then Df, the excess degrees of freedom in going from model one to model two, and then from model two to model three. So we added two parameters in going from model one to model two; that's why that Df is two. And then we added two additional parameters going from model two to model three. The two parameters we added in going from model one to model two are examination and education, their two regression coefficients; going from model two to model three we added Catholic and infant mortality, their two regression coefficients. Okay, so with the residual sums of squares and the degrees of freedom you can calculate the so-called F statistic, and thus get a P value. This output gives you the F statistic and the P value associated with each comparison. And here it shows that yes, the inclusion of examination and education appears to be necessary over just looking at agriculture by itself. Then when I look at the next one, it says yes, the inclusion of Catholic and infant mortality appears to be necessary beyond just including examination, education, and agriculture.
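The F statistic that anova reports can be computed by hand from exactly those two columns, the residual sums of squares and the degrees of freedom. Here is a sketch of the arithmetic, written in Python rather than R, using made-up numbers purely for illustration:

```python
def nested_f_stat(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic for comparing two nested linear models.

    rss_*: residual sums of squares; df_*: residual degrees of
    freedom (data points minus fitted parameters) for the reduced
    (smaller) model and the full (larger) model.
    """
    # Average drop in RSS per extra parameter, relative to the full
    # model's residual variance estimate.
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    denominator = rss_full / df_full
    return numerator / denominator

# Hypothetical numbers: adding 2 parameters drops the RSS from
# 100 to 50, leaving 18 residual degrees of freedom in the full model.
print(nested_f_stat(100.0, 20, 50.0, 18))  # → 9.0
```

The P value then comes from referring this statistic to an F distribution with (df_reduced - df_full, df_full) degrees of freedom, which is what R's anova output tabulates for you.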

10:13

So if the way in which you're interested in looking at your data naturally falls into a nested model search, as it often does, I think, when you're interested in one variable specifically, as in this case, then some kind of nested model search is a reasonable thing to do. It doesn't work if the models you're looking at aren't nested. For example, if model two had examination but not education, and the third model had education but not examination, this wouldn't apply; you'd have to do something else. And there I think you get into the harder world of automated model selection, with things like information criteria. So I would put all that off to our prediction class, and just leave you with this one technique, which is useful in the one specific instance where you've decided to look along a series of models, each getting increasingly more complicated but including the previous one.

11:16

Okay, so I hope in this lecture that you've gotten a couple of model selection techniques you can use. I hope you've also learned that there are some basic consequences if you include variables that you shouldn't have, or exclude variables that you should have. This has consequences for the coefficients you're interested in, and for your residual variance estimate. We didn't even touch on some other aspects of poor model fit that can occur, such as absence of linearity, non-normality, and so on. So again, it's generally necessary to take your model fits with a grain of salt, because more than likely, some aspect of your model is wrong.

11:56

And I'll leave you, then, with this famous quote from George Box, who very famously said that all models are wrong, but some models are useful. I think that's a very good credo to go along with: yes, for sure your model is wrong, but it might be useful in the sense of being a lens that teaches you something useful and true about your data set.