
Let's go through calculating residuals and we're going to use the diamond dataset.

Let's see. So, our data is the diamond dataset.

So let's do that.

I'm going to redefine price as y, carat as x, and n as the number of pairs, just so I don't have to type so much.

Now, I'm going to assign to a variable named fit the linear regression object that gets created by lm.

So let me do that.

Now the easiest way to get the residuals is just to do resid of fit, so I'm going to define those as e.
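Roughly, the setup described so far looks like this, assuming the diamond data from the UsingR package:

    library(UsingR)
    data(diamond)
    y <- diamond$price    # outcome: price (Singapore dollars)
    x <- diamond$carat    # predictor: mass in carats
    n <- length(y)        # number of (x, y) pairs
    fit <- lm(y ~ x)      # the linear regression object
    e <- resid(fit)       # the residuals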

Let me show you another way to get the residuals that of course does the same thing.

If I were to get my predicted fitted values, and remember, if I don't give the predict function new data, if I just give it the output from lm, then it will just predict at the observed x values.

So yhat now is a vector of predictions at the observed carat values.

Now, I just want to show you that the residuals calculated via the resid function are the same as the residuals that I calculate manually, which is just subtracting yhat from y.

So the way to do that is just to take the absolute differences and find the largest one.
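A sketch of that check, using the fit from above:

    yhat <- predict(fit)      # predictions at the observed carat values
    max(abs(e - (y - yhat)))  # largest absolute difference between the two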

And I see that the largest one is on the scale of 10 to the minus 13th.

So, up to numerical precision, it's the same thing.

Then lastly, I just want to show that if I even calculate the fitted values manually, coef(fit)[1] plus coef(fit)[2] times x, I of course get exactly the same numbers.
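Something along these lines:

    max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))  # manual fitted values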

So, up to numeric precision, exactly the same.

So the way you want to get the residuals is resid, but hopefully showing you this other code illustrates what's going on in the background, the actual calculation that resid is doing.

Finally, let me show you that the sum of my residuals is zero.

Well, it's 10 to the minus 14th, which is close enough to zero for me.

And then also the sum of my residuals times the predictor variable x, the carat, that also has to be zero.

Well, 10 to the minus 15th.

So, up to numerical precision, it's zero in both cases.
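In code, those two checks are roughly:

    sum(e)      # essentially zero
    sum(e * x)  # essentially zero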

So the residuals are the signed lengths of the red lines shown in the following plot.

And I'm going to do this using base R graphics, just so

I mix a little base R with some ggplot graphics.

So, I'm going to create my plot here, there's my plot.

I'm going to add the fitted line, and in base R, if you've fit a regression line and want to add it to the plot, you can just call abline with the object you assigned from the lm fit as an argument, and it will add the regression line.

Here I want the line width to be two, so it shows up a little bit better.

And then I'm just going to loop over the data values to add in the red lines.
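A sketch of that base R plot; the styling choices (colors, point type, line width) are my guesses:

    plot(x, y, xlab = "Mass (carats)", ylab = "Price (SIN $)",
         bg = "lightblue", col = "black", cex = 1.1, pch = 21, frame = FALSE)
    abline(fit, lwd = 2)    # add the fitted regression line
    for (i in 1:n)          # red segments from each point to the line
      lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red", lwd = 2)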

Let me zoom in and show you that plot.

There's my plot.

Now my residuals are these red lines.

These distances, where if the point is above the line,

the residual will be positive.

And if the point is below the line, the residual will be negative.

This scatter plot isn't particularly useful for assessing residual variation.

Notice all of the blank space in this part and

this part of the graph, making the plot kind of useless for that purpose.

So, instead, why don't we plot the residuals

on the vertical axis versus mass on the horizontal axis?
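Something like this, again with the styling as my assumption:

    plot(x, e, xlab = "Mass (carats)", ylab = "Residual price (SIN $)",
         bg = "lightblue", col = "black", cex = 2, pch = 21, frame = FALSE)
    abline(h = 0, lwd = 2)  # horizontal reference line at zero
    for (i in 1:n)          # red segments from each residual to zero
      lines(c(x[i], x[i]), c(e[i], 0), col = "red", lwd = 2)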

Let's go ahead and run the code and here's the plot.

Now we can see the residual variation much more clearly.

When you look at a residual plot, you're looking for any form of pattern.

The residuals should be mostly patternless.

Also, remember that if we've included an intercept, residuals have to sum to zero.

So they have to lie above and below this horizontal line at zero,

and you'd like them to be distributed in a nice, random-looking fashion both above and below zero.

We can see some interesting patterns by honing in on the residual plot here.

For example, we can see that there were lots of diamonds of exactly the same measured mass. That sort of feature gets lost in the scatter plot, but by zooming in this way we notice it.

Next, we're going to go through some pathological residual plots, just to highlight what residual plots can do for you.

So, I've concocted some examples that will help us understand how residuals can hone in on problems with the model fit.

So let's look in the R markdown file, and I'm going to show it again at the console rather than going through the slides, so you can actually watch me doing it.

x here is just going to be uniform from minus 3 to plus 3.

So, I've created an x variable that's just a kind of random smattering of points between the values minus 3 and 3.

My y is equal to x, so it's an identity line, but then I'm going to add another term that's sin x.

So, it should look like an identity line, but oscillating around it a little bit, and then I'm adding some normal noise on top of it.

So let me add my y, and I'm going to switch back to ggplot, because I like it better. So, no more base graphics for now.
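Roughly, the data generation looks like this; the sample size and noise level are my guesses rather than the exact values on screen:

    x <- runif(100, -3, 3)                  # random smattering of points between -3 and 3
    y <- x + sin(x) + rnorm(100, sd = 0.2)  # identity line, plus a sin term, plus noise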

So I've created my ggplot.

I'm going to go ahead and add the smooth first, because I want it as the bottommost layer, and then I'm going to add my two sets of points, and there's my scatter plot.
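A sketch of that ggplot; the point sizes, colors, and transparency are my assumptions:

    library(ggplot2)
    g <- ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
    g <- g + geom_smooth(method = "lm", colour = "black")         # smooth first, as the bottommost layer
    g <- g + geom_point(size = 7, colour = "black", alpha = 0.4)  # first set of points
    g <- g + geom_point(size = 5, colour = "red", alpha = 0.4)    # second set, layered inside the first
    g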

And so let me zoom in. It's a little difficult to see the non-linearity; that sin x term is a little apparent, but it's kind of hard to see.

I think if I were looking at this, I would immediately notice that something was off about the fit here.

But nonetheless, it's maybe a little bit hard to see.

Before I move on to the residual plot, let me make a comment.

This model is actually not the correct model for this data, and this might happen in practice.

This doesn't mean that the model is unimportant, right?

There is a linear trend and the model is accounting for it; it's just not accounting for the secondary variation in the sin term.

So, I just want to emphasize that in regression modelling, just because you aren't fitting the exactly correct model, that doesn't mean the model itself is useless.

You have, roughly, an identity line here that represents the relationship between y and x, and it explains a lot of the variation.

So, I just want to remind you that in regression,

having the exact right model is not always the primary goal.

You can get meaningful information about trends from incorrect models.

So, I just want to get that statement out of the way, but

now let's hone in on the residuals.

So let's plot the residuals versus x and see if it makes this component of the fit, the sin x term, more apparent.

Okay.

So let me plot the residuals versus the x variable.

So just to describe what I have, I'm going to assign g as my ggplot.

My x in this case is x; I've defined the x variable as the variable named x.

But now my y is not the y variable; it's going to be the residual from the linear model fit, which I just grab in that R command right there.

Then the aesthetic for my ggplot just has x and y as the names of the horizontal and vertical axis variables.

So let me run that command, then I want to put a horizontal reference line at zero, then I want to add my points and set the axes how I'd like, and then let's see the plot. And there's the plot.
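A sketch of that residual plot; the labels and styling are my guesses:

    g <- ggplot(data.frame(x = x, y = resid(lm(y ~ x))), aes(x = x, y = y))
    g <- g + geom_hline(yintercept = 0)                            # horizontal reference line at zero
    g <- g + geom_point(size = 7, colour = "black", alpha = 0.4)
    g <- g + geom_point(size = 5, colour = "red", alpha = 0.4)
    g <- g + xlab("X") + ylab("Residual")
    g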

Let me zoom in.

And here is the plot, and I think what you can see is that this sin term is now extremely apparent.

What the residual plot has done is zoom in on this part of the model inadequacy and really highlight it, and that's one thing that residual plots are especially good at.

I'm going to show you another one where, by all appearances, the points fall perfectly on a line.

But when you hone in on the residuals, it looks quite different.

So let me run the commands and then I'll show you.

So there's the plot, and look at that, it seems like the points fall exactly on an identity line.

Now let me hone in on the residuals, and you see this trend toward greater variability as you move along the x variable.

That property, where the variability increases with the x variable, is called heteroscedasticity.

Heteroscedasticity is one of those things that residual plots are quite good at diagnosing, and you couldn't see it before.

If I go back to this earlier plot, you can't see it at all here.

Zoom in on the residuals, and there you see it.

Let me just zoom back here to how I generated the data,

just to illustrate it for you.

My x variable is a bunch of uniform random variables.

My y variable is my x variable, so an identity line.

But then when I added the errors, the standard deviation of the errors, look right here, has the x term involved in it, and that's how I generated data with heteroscedasticity.
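Roughly, the generation looks like this; the sample size and the exact coefficient in the standard deviation are my assumptions:

    x <- runif(100, 0, 6)
    y <- x + rnorm(100, sd = 0.001 * x)  # noise sd grows with x: heteroscedasticity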

So let's run the residual plot for the diamond data.

So here, I'm just going to add a column to the diamond data that is the residuals from the regression model fit where price is the outcome and carat is the predictor.

So, I run that, and now my data frame has carat, price, and the residuals.
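Something like:

    diamond$e <- resid(lm(price ~ carat, data = diamond))
    head(diamond)   # carat, price, and now the residual column e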

So, I'm going to create my ggplot.

My x label is going to be Mass in carats.

My y label is going to be Residual price and

I just want to emphasize that the residuals have the same units as the ys.

So the residual price is in Singapore dollars.

I'm going to add my horizontal line.

I'm going to add my points and then there's the plot.
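A sketch of that plot, with the styling as my assumption:

    g <- ggplot(diamond, aes(x = carat, y = e))
    g <- g + xlab("Mass (carats)") + ylab("Residual price (SIN $)")
    g <- g + geom_hline(yintercept = 0)
    g <- g + geom_point(size = 5, colour = "blue", alpha = 0.2)
    g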

So there doesn't appear to be a lot of pattern in the plot, so this is good; it seems like a pretty good fit.

Let me illustrate something about variability in the diamond dataset that will help us set the stage for defining some new properties of our regression model fit.

So I'm going to create now two residual vectors.

The first residual vector is the one where I just fit an intercept, so the residuals are just the deviations around the average price.

Then I'm going to find the residuals where I add in the slope, so that's the variation around the regression line.

So the first one is variation around the average price.

The second is the variation around the regression line with carats

as the explanatory variable and price as the outcome.

So, I'm going to run that and then I want to create

a factor variable that labels the set of residuals.

The first one is just going to be labeled as a bunch of intercept-only model residuals, and the second set is going to be labeled as a bunch of intercept-and-slope residuals. Then I want to create a ggplot.

With this data frame, my y variable is going to be the residuals, and my x variable is going to be the fit variable, which labels the two fits: the linear model with just an intercept, or the linear model with carat, the mass, as the predictor.

And I want to fill in my points with color based on which fit it was.

So, I'm going to do that, and then the kind of plot I want is a dot plot, and then I want to put my axis labels the way I'd like.
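A sketch of that comparison; the factor labels and dot-plot binwidth are my guesses:

    e <- c(resid(lm(price ~ 1, data = diamond)),       # intercept only: deviations around the mean price
           resid(lm(price ~ carat, data = diamond)))   # intercept and slope: deviations around the line
    fit <- factor(c(rep("Itc", nrow(diamond)),
                    rep("Itc, slope", nrow(diamond))))
    g <- ggplot(data.frame(e = e, fit = fit), aes(y = e, x = fit, fill = fit))
    g <- g + geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 20)
    g <- g + xlab("Fitting approach") + ylab("Residual price")
    g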

Now let's see the plot.

There it is.

So what we see in this left-hand plot, with just the intercept, is the variation in diamond prices around the average diamond price.

So, basically, the variation in diamond prices in the sample.

What we're seeing in the rightmost plot is the variation around the regression line.

So what we've done is we've explained a lot of this

variation with the relationship with mass.

And what we're going to talk about next is R squared, which basically says we can decompose this total variation, the variation in the y's around their mean, into the variation explained by the regression model and the variation that's left over after accounting for the regression model.

So this is the variation that's left over after accounting for the regression, but also the subtraction of these two would be the variation that was explained by the regression model.

But there's a formula for that and we're going to dive into that next.