0:32

Now R squared is the number that measures the proportion of

variability in Y explained by the regression model.

It turns out simply to be the square of the correlation between Y and X, but

it has a nicer interpretation than a straightforward correlation.

It is interpreted as the proportion of variability in Y explained by X.

And so, all other things being equal, one typically prefers a higher

R squared over a lower one, because you're explaining more variability.

RMSE is a different one-number summary from a regression, and

what RMSE is doing for you is measuring the standard deviation of the residuals.

The residuals, remember, are the vertical distances from the points to the least

squares, or fitted, line, and

the standard deviation is a measure of spread.

So it's telling you how much spread there is about the line in the vertical

direction. I would often informally call that the noise in the system, and so

RMSE is a measure of the noise in the system.
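As a small sketch of these two summaries, here is how R squared and RMSE can be computed from a least squares fit. The numbers below are made up for illustration, loosely in the spirit of the diamonds example, and are not the course's actual dataset.

```python
import numpy as np

# Hypothetical data, not the course's diamonds dataset:
# x = weight in carats, y = price in dollars.
x = np.array([0.15, 0.18, 0.20, 0.22, 0.25, 0.28, 0.30])
y = np.array([300.0, 410.0, 480.0, 560.0, 670.0, 780.0, 850.0])

# Fit the least squares line y = b0 + b1 * x.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# R squared: the square of the correlation between Y and X,
# interpreted as the proportion of variability in Y explained by X.
r_squared = np.corrcoef(x, y)[0, 1] ** 2

# RMSE: the standard deviation of the residuals -- the "noise in the system".
rmse = np.sqrt(np.mean(residuals ** 2))

print(round(r_squared, 3), round(rmse, 1))
```

Note that for a least squares fit with an intercept, the squared correlation agrees with the "proportion of variability explained" form, 1 minus the residual sum of squares over the total sum of squares.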

What I've shown you in the table at the bottom of this slide are the calculations

of R squared and RMSE for the three datasets that we've had a look at.

There was the diamonds dataset, the fuel economy and the production time dataset.

And if you look at R squared, it's frequently reported on a percentage basis.

We've got a 98% R squared for the diamonds dataset,

that's because there was a very strong linear association going on there.

In the fuel economy dataset, it was 77%.

And in the production time dataset, it was only sitting there at 26%.

Now, one of the things you have to be careful about with R squared is that there

is no magic value for R squared.

It's not as if R squared has to be above a certain number for

the regression model to be useful.

You want to think much more about R squared as a comparative benchmark

as opposed to an absolute one.

So if I'm comparing two regression models for the same data,

all other things being equal,

I'm going to typically prefer the one with the higher R squared. But

just because I've got a model with an R squared of, say, 5% or 10% doesn't

necessarily mean that that model isn't going to be useful in practice, and

R squared is still a useful comparison metric.

Now the other number, Root Mean Squared Error, I've calculated it for

the three examples here.

And it's 32, 4 and 32, with the production time dataset

somewhat coincidentally also coming in at 32.

Now, one key difference between R squared and RMSE are the units of measurement.

So R squared, because it's a proportion,

actually has no units associated with it at all.

So it's easier to compare R squared in that sense, whereas RMSE certainly does

have units, because it's the standard deviation of the residuals, and

the residuals are distances from point to line in the vertical direction.

Vertical direction is the Y variable direction.

So RMSE has the units of Y associated with it.

So for the diamonds dataset, that RMSE of roughly 32 is in the units of price.

You can say, $32.

And for the fuel economy, RMSE is 4.23.

It's 4.23 gallons per thousand miles in the city to be formal about it.

And the 32 for the production time dataset, that's an RMSE of 32 minutes.

So these two one number summaries are frequently reported with a regression.

Most software will calculate them automatically as soon as you run your

regression model, for example, within a spreadsheet environment.

And all other things being equal, we like higher values of R squared.

We're explaining more variability and

we like lower values of Root Mean Squared Error.

If there's a low standard deviation of the residuals around the regression line,

that's tantamount to saying that the residuals are low, they're small, and

the points are therefore close to the regression line, which is what we like.

So those are the two one number summaries that accompany most regression models.

Now perhaps, the most useful thing you can do with a Root Mean Squared Error

is to use it as an input into what we call a prediction interval.

So remember that when you have uncertainty in a process, you don't just want to

give a forecast, you want to give some range of uncertainty about that forecast.

That's just so much more useful in practice.

And with suitable assumptions, we can tie in Root Mean Squared Error

to come up with a prediction interval for a new observation.

So here's our assumption: we're going to assume that at a fixed value of X,

the distribution of points about the true regression line follows a normal

distribution.

So, another module has discussed the normal distribution.

This is one of the places where normality assumptions are very common in

a regression context and what we're assuming is that the distribution of

the points about the true regression line has a normal distribution.

We'll talk about checking that in just a minute, but

let's work with that as an assumption.

Furthermore, that normal distribution is centered on the regression line.

So you can see in the graphic on the page, the assumption being shown to you.

Note there's no data here, because we're positing a true model, so

to speak.

So there's a true regression line there at any particular value of X.

Let's take the left hand normal distribution.

Let's say, we took lots and lots of diamonds that weighed 0.15 of a carat.

What do we expect their distribution to look like around the regression line?

We expect the distribution of the prices to be normally distributed with the center

of the normal distribution sitting on top of the regression line and

we believe that's true for any value of X.

That's one of the standard assumptions that a regression model involves.

Furthermore, we're going to assume that all of these normal distributions

around the true line have the same standard deviation.

That's often termed the constant variance assumption, and with that assumption,

we can estimate that common standard deviation,

the spread of the points about the line in the vertical direction, with RMSE.

So RMSE will be our estimate of the noise in the system and

with this assumption of normality,

it's estimating the standard deviation associated with the normal distribution

that captures the spread of the points around the true regression line.
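The assumption can be illustrated with a small simulation, using hypothetical numbers loosely echoing the diamonds example. If points really do scatter normally about a true line with the same standard deviation at every X, then the RMSE from a least squares fit recovers that common standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" model: at any fixed x, points scatter normally
# about the true line with the SAME standard deviation sigma
# (the constant variance assumption).
true_intercept, true_slope, sigma = -260.0, 3721.0, 32.0

x = rng.uniform(0.12, 0.35, size=2000)  # weights in carats
y = true_intercept + true_slope * x + rng.normal(0.0, sigma, size=x.size)

# Fit by least squares; RMSE estimates the noise in the system.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
rmse = np.sqrt(np.mean(residuals ** 2))

print(round(rmse, 1))  # close to the true sigma of 32
```

The simulated RMSE lands close to the sigma of 32 that generated the data, which is exactly the sense in which RMSE estimates the standard deviation of those normal distributions around the true line.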

So on this slide, I've introduced an important assumption behind regression.

That of the normality of the points about the regression line.

7:11

Now, we know about root mean squared error as an estimate of the spread of

the points about the regression line.

And furthermore,

we believe at least we assume that that spread is normally distributed.

What can we do with that information?

Well, here is what we do with it.

We can put that information together to come up with what we term an

approximate 95% prediction interval for a new observation.

So I'm going to present to you a rule of thumb that comes out of a regression, but

you've gotta be careful with this rule of thumb.

You can only use it within the range of the data.

So if you are extrapolating forecasts outside of the range of the data,

don't use this rule of thumb.

But at least within the range of the data, it's extremely useful.

With the normality assumption, and overlaying the Empirical Rule,

which was discussed in a separate module,

we get, within the range of the data,

an approximate 95% prediction interval for a new observation.

So the idea is that somebody comes to me with a new diamond.

A diamond that wasn't used in the calculation of the regression line,

they got a new diamond, they give it to me.

They say it weighs 0.25 of a carat.

What do you think it's going to go for?

What do you think the price is going to be?

I could use the prediction interval to give a range of feasible values.

The 95% prediction interval is the forecast, which means go up to the regression line

and read off the value, and then plus or minus twice the Root Mean Squared Error.

And that plus or

minus twice the Root Mean Squared Error is coming straight out of the Empirical Rule:

the 2 is coming because we want a 95% prediction interval, and the RMSE is

our estimate of the standard deviation of the underlying normal distribution.

So this interval really captures one of the key goals of a regression,

which is to provide uncertainty with our forecast.

Not just a forecast, but an uncertainty range associated with that forecast.

So with the normality assumption and Root Mean Squared Error,

you're in a position, at least within the range of the data, to get a sense of

the precision of forecasts coming out of a model.

So let's have a look at that idea for the diamonds data set.

For the diamonds dataset, the RMSE was equal to 32, and with the normality

assumption, at least within the range of the collected data, for

diamonds that are similar to the set that were used in the regression analysis,

the width of an approximate 95% prediction interval for

a new observation is plus or minus twice the root mean squared error.

2 times 32 is 64, so this model is able to price diamonds

using a 95% prediction interval to within about plus or minus $64.

That's the calculation that is done at the bottom of the slide and

working it out, if a diamond weighs 0.25 of a carat and

I put 0.25 of a carat into the regression equation.

That's the -260 + 3721 x 0.25.

That's my forecast, or prediction, and then I do plus or minus twice the root mean

squared error, which here is plus or minus 64, and I get a range of feasible values.

Somewhere between $606 and $734.
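That worked example, using the fitted line and the diamonds RMSE of 32 quoted on the slide, is just a few lines of arithmetic:

```python
# Approximate 95% prediction interval for a new 0.25 carat diamond,
# using the slide's fitted line: price = -260 + 3721 * weight,
# and the diamonds RMSE of 32.
b0, b1, rmse = -260.0, 3721.0, 32.0
weight = 0.25

forecast = b0 + b1 * weight  # point prediction from the regression line
lower = forecast - 2 * rmse  # forecast minus twice RMSE
upper = forecast + 2 * rmse  # forecast plus twice RMSE

print(forecast, lower, upper)  # 670.25 606.25 734.25
```

Remember this rule of thumb is only valid within the range of the data used to fit the line.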

And I'd say that really captures the essence of what these probabilistic models

are able to do for you that you couldn't do with a deterministic model:

give a range of uncertainty.

So there's the 95% prediction interval.

So we've now seen a 95% prediction interval.

Remember that it relied on a normality assumption for the noise in the system,

for the spread of the points about the regression line.

We're assuming that was normally distributed.

Now when you make assumptions, part of the modeling process should be to think

carefully about those assumptions and make a call on whether or

not they seem reasonable.

So always check your assumptions, if you can.

Now one way that I could check this Normality assumption is to

take the residuals from the regression.