0:00

This lecture is about Regularized regression.

We learned about linear regression and generalized linear regression previously.

The basic idea here is to fit one of these regression models.

And then penalize, or shrink, the large coefficients corresponding to some of the predictor variables.

The reason we might do this is that it can help with the bias-variance tradeoff.

If certain variables are highly correlated with each other, for example, you might not want to include them both in the linear regression model, as their coefficient estimates will have a very high variance. Leaving one of them out might slightly bias your model.

In other words, you might lose a little bit of prediction capability, but you'll save a lot on the variance and therefore improve your prediction error.

It can also help with model selection, in certain cases, for regularization techniques like the lasso.

0:46

It may be computationally demanding on large data sets.

And in general, it appears that it does not perform quite as well as random forests or boosting when applied to prediction in the wild, for example in Kaggle competitions.

So as a motivating example, suppose we fit a very simple regression model: there's an outcome Y, and we're trying to predict it with two covariates, x1 and x2. So we have an intercept term, which is a constant, plus another constant times x1, plus another constant times x2.

So assume that x1 and x2 are nearly perfectly correlated; in other words, they're almost exactly the same variable.

1:20

The word for this in linear modeling is collinear. You can then approximate this more complicated model by saying: what if we include only x1 and multiply it by the sum of the coefficients for x1 and x2?

It won't be exactly right, because x1 and x2 aren't exactly the same variable, but it will be very close to right if x1 and x2 are very similar to each other.

The result may be that you still get a very good estimate of Y, almost as good as you would have gotten by including both predictors in the model.

It will be a little bit biased, because we chose to leave one of the predictors out, but we may reduce the variance if those two variables are highly correlated with each other.
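A tiny numeric sketch of this approximation (the coefficients and the 0.001 offset are made-up numbers, not from the lecture):

```python
# If x2 is nearly identical to x1, then b1*x1 + b2*x2 is well
# approximated by (b1 + b2)*x1, the "combined coefficient" model.
b0, b1, b2 = 1.0, 2.0, 3.0
x1 = [0.1 * i for i in range(10)]
x2 = [v + 0.001 for v in x1]          # x2 almost equals x1 (collinear)

full    = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]
reduced = [b0 + (b1 + b2) * a for a in x1]   # drop x2, combine coefficients

# The gap is only b2 times the offset between x1 and x2: 3 * 0.001
max_err = max(abs(f - r) for f, r in zip(full, reduced))
print(max_err)  # approximately 0.003, up to floating point
```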

2:03

So here's an example with the prostate cancer data set from the Elements of Statistical Learning library. We can load the prostate data and look at it: the data set has 97 observations on ten variables.

And one thing that we might want to be able to do is make a prediction about prostate cancer PSA based on a large number of predictors in the data set.

And this is very typical of what happens when you build these models in practice.

So suppose we predict with all possible combinations of predictor variables: we fit a linear regression model for the outcome, building one regression model for every possible combination of predictors.

Then we can see that as the number of predictors increases from left to right here, the training set error always goes down. It has to go down: as you include more predictors, the training set error will always decrease.

But this is a typical pattern of what you observe with real data: the test set error, on the other hand, goes down at first as the number of predictors increases, which is good, but then eventually it hits a plateau and starts to go back up again. This is because we're overfitting the data in the training set, and eventually we may not want to include so many predictors in our model.

3:19

This is an incredibly common pattern.

Basically, for any measure of model complexity on the x-axis versus the expected residual sum of squares of the prediction error, you can see that in the training set the error almost always goes monotonically down. In other words, as you build more and more complicated models, the training error will always decrease.

But on a test set, the error will decrease for a while, eventually hit a minimum, and then start to increase again as the model gets too complex and overfits the data.

So, in general, the best approach when you have enough data and enough computation time might be to split samples. The idea is to divide your data into training, test, and validation sets. You treat the validation set as the test data: you train every possible competing model, so all possible subsets, on the training data, and pick the one that works best on the validation data set.

Â 4:19

But now that we've used the validation set for model selection, we need to assess the error rate on a completely independent set. So we appropriately assess performance by applying our prediction to the new data in the test set.
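The three-way split described above can be sketched as follows (the 60/20/20 proportions and the seed are illustrative choices, not from the lecture):

```python
import random

# Shuffle the observation indices, then cut them into three disjoint sets.
random.seed(42)
indices = list(range(97))            # e.g. the 97 prostate observations
random.shuffle(indices)

train = indices[:58]                 # ~60%: fit all candidate models
valid = indices[58:78]               # ~20%: pick the best-performing model
test  = indices[78:]                 # ~20%: report its error once, at the end

print(len(train), len(valid), len(test))  # 58 20 19
```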

4:33

Sometimes you may re-perform the splitting and analysis several times in order to get a better average estimate of what the out-of-sample error rate will be.

But there are two common problems with this approach. One is limited data: here we're breaking the data set up into three different data sets, and it might not be possible to get a very good model fit when we split the data that finely.

Second is computational complexity: trying all possible subsets of models can be very demanding, especially if you have a lot of predictor variables.

5:04

So another approach is to try to decompose the prediction error and see if there's a way to directly get at including only the variables that need to be included in the model.

So if we assume that the variable Y can be predicted as a function of X plus some error term, then the expected prediction error is the expected squared difference between the outcome and the prediction of the outcome. And f-hat lambda here is the estimate from the training set, using a particular set of tuning parameters lambda.

5:38

Then if we look at a new point, so we bring in a new data point and look at the distance between our observed outcome and the prediction at the new data point, that quantity can be decomposed, after some algebra, into the irreducible error, this sigma squared; the bias, which is the difference between our expected prediction and the truth; and the variance of our estimate.

So whenever you're building a prediction model, the goal is to reduce this overall quantity, which is basically the expected mean squared error between our outcome and our prediction.

The irreducible error usually can't be reduced; it's just part of the data you're collecting. But you can trade off bias and variance, and that's the idea behind regularized regression.
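The decomposition described above can be written out explicitly. Assuming Y = f(X) + epsilon with mean-zero noise of variance sigma squared, at a new point x*:

```latex
\mathbb{E}\big[(Y - \hat{f}_{\lambda}(x_*))^2\big]
  = \underbrace{\sigma^2}_{\text{irreducible error}}
  + \underbrace{\big(\mathbb{E}[\hat{f}_{\lambda}(x_*)] - f(x_*)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big[\hat{f}_{\lambda}(x_*)\big]}_{\text{variance}}
```

This is the standard bias-variance decomposition, with f-hat lambda the training-set estimate under tuning parameter lambda, as in the lecture's notation.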

6:32

Another issue arises for high-dimensional data. Here I'm showing you a simple example of what happens when you have a lot of predictors: I'm subsetting a small piece of the prostate data, so imagine that I only had five observations in my training set.

But the data set has more than five predictor variables. So if I fit a linear model relating the outcome to all of these predictor variables, and there are more than five, then some of them will get estimates, but some of them will be NA. In other words, R won't be able to estimate them, because you have more predictors than you have samples, and so you have a design matrix that cannot be inverted.
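A minimal sketch (with made-up numbers) of why this happens: with more predictors than observations, X transpose X is rank-deficient, so the usual least squares solution does not exist.

```python
# 2 observations, 3 predictors: X'X is 3x3 but has rank at most 2,
# so its determinant is 0 and it cannot be inverted.
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]            # rows = observations, columns = predictors

# Compute X'X, a 3x3 matrix
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]

# Determinant by cofactor expansion along the first row
m = XtX
det = (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
     - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
     + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
print(det)  # 0.0: singular, which is why lm() reports NA coefficients
```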

7:13

So one approach to dealing with this problem, and with the other problem of trying to select our model, is to take the model that we have and assume that it has a linear form, like this. So assume it's a linear regression model, like we've talked about before, and then constrain only lambda of the coefficients to be non-zero.

And then the question is, after we pick lambda, so suppose there are only three non-zero coefficients, we have to try all possible combinations of three non-zero coefficients and fit the best model. So that's still computationally quite demanding.

So another approach is to use regularized regression. If the beta j's, the coefficients that we're fitting in the linear model, are unconstrained, in other words we don't require them to take any particular form, they may explode if you have very highly correlated variables that you're using for prediction. So they can be susceptible to high variance, and that high variance means you'll get predictions that aren't as accurate.

8:15

So to control the variance, we might regularize, or shrink, the coefficients. Remember that what we might want to minimize is some kind of distance between the outcome we have and our linear model. Here, this is the squared distance between the outcome and the linear model fit; that's the residual sum of squares.

Then you might also add a penalty term here, which basically says: if the beta coefficients are too big, shrink them back down. The penalty is usually used to reduce complexity. It can be used to reduce variance, and it can respect some of the structure in the problem if you set the penalty up in the right way.

The first approach that was used in this sort of penalized regression, ridge regression, is to fit the regression model where again we're penalizing the distance between our outcome y and our regression model, and then we also have a term that is lambda times the sum of the squared beta j's.

So what does this mean? If the beta j's are really big, then this term will get big, this whole quantity will end up being very big, and we won't get a very good fit. So it basically requires that some of the beta j's be small.

It's actually equivalent to solving this constrained problem, where we're looking for the smallest sum of squared differences, subject to the constraint that the sum of the squared beta j's is less than s.
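In symbols, the two equivalent forms of ridge regression just described are:

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2
    + \lambda \sum_{j=1}^{p} \beta_j^2
\quad\Longleftrightarrow\quad
\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2
  \;\;\text{subject to}\;\; \sum_{j=1}^{p} \beta_j^2 \le s
```

Each lambda in the penalized form corresponds to some constraint budget s in the second form.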

So the idea here is that the inclusion of this lambda penalty may also make the problem non-singular, even when X transpose X is not invertible. In other words, in the model fit where we have more predictors than observations, the ridge regression model can still be fit.
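The ridge solution has a closed form, beta = (X'X + lambda I)^(-1) X'y. A minimal pure-Python sketch (illustrative numbers, lambda = 1) showing that it can be computed even when the two predictors are perfectly collinear and ordinary least squares would fail:

```python
# With x2 identical to x1, X'X is singular, but X'X + lambda*I is not.
lam = 1.0
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # x2 == x1: perfectly collinear
y = [2.0, 4.0, 6.0]                         # y = x1 + x2

# Form A = X'X + lambda*I and b = X'y (2 predictors, so A is 2x2)
A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
      for j in range(2)] for i in range(2)]
b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]

# Solve the 2x2 system by direct inversion
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
beta = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
        (A[0][0] * b[1] - A[1][0] * b[0]) / det]
print(beta)  # both coefficients equal 28/29: shrunk and shared equally
```

Note how the penalty resolves the collinearity by splitting the effect evenly between the two identical copies, with each coefficient shrunk slightly below 1.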

10:08

So this is what the coefficient path looks like. What do I mean by coefficient path?

For every different choice of lambda in the penalized regression problem on the previous page, as lambda increases, we penalize the big betas more and more. So we start off with the betas equal to a certain set of values when lambda equals 0; those are just the standard linear regression values. And as you increase lambda, all of the coefficients get closer to 0, because we're penalizing the coefficients and making them smaller.

10:43

So the tuning parameter lambda controls the size of the coefficients; in other words, lambda controls the amount of regularization. As lambda gets closer and closer to 0, we basically go back to the least squares solution, which is what you get from a standard linear model.

And as lambda goes to infinity, in other words as lambda gets really big, it penalizes the coefficients a lot, so all of the coefficients go toward 0 as the tuning parameter gets really big.

Picking that tuning parameter can be done with cross-validation or other techniques that try to find the optimal value trading off bias for variance.

11:20

A similar approach can be done with a slight change of penalty.

So again, here we might be solving the least squares problem, the standard problem of trying to identify the beta values that make the distance to the outcome smallest.

And here we can constrain it subject to the sum of the absolute values of the beta j's being less than some value. You can also write that as a penalized regression of this form, where we're trying to solve this penalized sum of squares.

For an orthonormal design matrix, which you can read about on Wikipedia, this actually has a closed-form solution.

The closed-form solution is basically: take the absolute value of the beta hat j, subtract off a gamma value, and take only the positive part.

In other words, if gamma is bigger than your least squares beta hat j, then this will be a negative number, and since you're taking only the positive part, you set it equal to 0. If instead the absolute value of beta hat j is bigger than the gamma value, then this whole number will be a smaller positive number: it will be shrunk by the amount gamma. And then we multiply by the sign of the original coefficient.

So what is this doing? It's basically saying the lasso shrinks all of the coefficients and sets some of them to exactly 0. And some people like this approach because it both shrinks coefficients and, by setting some exactly to 0, performs model selection for you in advance.
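The soft-thresholding rule just described (shrink the absolute value of beta hat j by gamma, keep only the positive part, then restore the sign) can be sketched as:

```python
# Lasso soft-thresholding for an orthonormal design:
# beta_lasso = sign(beta_ols) * max(|beta_ols| - gamma, 0)
def soft_threshold(beta_ols, gamma):
    shrunk = abs(beta_ols) - gamma
    if shrunk <= 0:
        return 0.0                       # small coefficients set exactly to 0
    sign = 1.0 if beta_ols > 0 else -1.0
    return sign * shrunk                 # larger ones shrunk toward 0 by gamma

print(soft_threshold(2.5, 1.0))   # 1.5: shrunk but kept
print(soft_threshold(-0.4, 1.0))  # 0.0: zeroed out, i.e. dropped from the model
```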

12:51

There's a really good set of lecture notes from Hector Corrada Bravo that you can find at this link. He also has a very nice list of a large number of penalized regression models. And the Elements of Statistical Learning book covers this penalized regression idea in quite extensive detail, if you want to follow along there.

In caret, if you want to fit these models, you can set the method to ridge, lasso, or relaxo to fit different kinds of penalized regression models.
