0:00

What about the questions that we can answer once we've run a regression?

Well, perhaps the most used aspect

of a regression model is as a methodology for predictive analytics.

So, businesses have really embraced predictive analytics in the last

few years.

Always trying to predict outcomes.

Predicting for example, a product that an individual might buy on a website.

We might want to predict the rating that somebody gives to

a movie that they watch on a streaming service.

We might try and predict the price of a stock tomorrow.

So prediction is a very common task that we face in business.

We call our approaches to prediction, predictive analytics in general.

And if you have a regression, you certainly have a tool for prediction.

Because once you've got that regression line there,

the prediction is pretty straightforward.

It's take a value of x, go up to the line and read off the value in the y direction.

So, an example question would be, based on our regression model for the diamonds

data set, what price do you expect to pay for a diamond that weighs 0.3 of a carat?

The answer would be, take 0.3 on the x-axis, go up to the line and

read off the value.

Or equivalently you can plug in 0.3 to

the regression equation to work out that expected value.
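
As a minimal sketch of plugging a weight into the regression equation: the slope of 3,720 is the value quoted later in this lecture, but the intercept of -260 is a hypothetical placeholder, because the fitted intercept is not stated here. The function name `predict_price` is also just illustrative.

```python
# Slope quoted in the lecture: dollars per carat.
SLOPE = 3720.0
# HYPOTHETICAL intercept, for illustration only (not given in the lecture).
INTERCEPT = -260.0


def predict_price(weight_carats: float) -> float:
    """Read the expected price off the regression line at a given weight."""
    return INTERCEPT + SLOPE * weight_carats


price = predict_price(0.3)
print(f"Expected price for a 0.3 carat diamond: ${price:,.2f}")
```

With the real fitted intercept in place of the placeholder, this is the same calculation as going up to the line at 0.3 and reading off the value.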

Now there's one other thing, though, that regression will do for you.

It won't just give you a prediction.

With suitable assumptions, which we will have a look at in a while in this module,

with suitable assumptions we're able to get a prediction interval as well.

And that prediction interval gives us a range of feasible values for

where we think the outcome or forecast is going to lie.

And that in practice tends to be much more realistic

than just trying to give a single best guess.
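
One rough way to picture a prediction interval, sketched here under assumptions that the lecture has not yet covered: if the noise around the line is roughly normal with constant spread, a common rule of thumb for an approximate 95% interval is the forecast plus or minus two residual standard deviations. The forecast and spread values below are hypothetical, and an exact interval would use a t quantile and come out slightly wider.

```python
def approx_prediction_interval(forecast, residual_sd, multiplier=2.0):
    """Rule-of-thumb interval: forecast +/- multiplier * residual spread.

    Assumes roughly normal noise with constant spread around the line.
    """
    half_width = multiplier * residual_sd
    return forecast - half_width, forecast + half_width


# HYPOTHETICAL inputs: a point forecast and a residual standard deviation.
low, high = approx_prediction_interval(forecast=856.0, residual_sd=32.0)
print(f"Approximate 95% prediction interval: ${low:,.0f} to ${high:,.0f}")
```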

1:51

Another thing that we do with these regression models is

interpret coefficients coming out of the model.

The coefficients themselves can tell us things.

They can give us information.

And so I might ask a question.

How much on average do you expect to pay for

diamonds that weigh 0.3 of a carat versus diamonds that weigh 0.2 of a carat?

Well that's a change in x of 0.1.

And given a linear regression with a slope that happens to equal

3,720 basically, what we can do is say.

Well, if we look at diamonds weighing 0.3 of a carat versus 0.2 of a carat,

we can anticipate paying an additional $372 for them,

given the underlying regression equation.

So we're essentially interpreting the slope in the regression.
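
The arithmetic behind that $372 can be written out as follows. The symbols b_0 (intercept) and b_1 (slope) are notation assumed here for illustration; the point is that the intercept cancels, so only the slope matters for the difference.

```latex
\hat{y}(0.3) - \hat{y}(0.2)
  = (b_0 + b_1 \cdot 0.3) - (b_0 + b_1 \cdot 0.2)
  = b_1 \cdot 0.1
  = 3720 \times 0.1
  = 372
```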

Likewise, intercepts sometimes have interpretations.

An intercept might be interpreted as a fixed cost,

it might be interpreted as a start-up time.

So, we often want to interpret coefficients.

4:20

And then we come across a diamond, and that diamond weighs

0.25 of a carat and it's being sold for $500.

So I've added that point to the graph here, and it's the big red dot.

Now, if I see a point like that, which is a long,

long way beneath the regression line, then it's potentially of great interest to me.

Because if I believe my model, and this is a huge caveat here.

Given that I believe my model,

then there's something going on with this particular diamond.

Now one of the possibilities is it's been mispriced by the market.

And if it's been mispriced by the market,

then it's potentially a great investment opportunity.

There is another explanation though, which is that maybe there's some flaw associated

with this diamond and that's why it's going for such a low price.

I don't know which of those two is a potential explanation

until I've gone to have a look at the diamond.

The point that I'm making here, is that this activity of looking

to see how far away the points are away from the regression line,

is a technique for ranking potential candidates.

And some people use the word, triaging,

them to come up with a set of candidates that look the most interesting to me.

And so that's one of the uses that you can put a regression model to.
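
A small sketch of that ranking idea, under stated assumptions: the slope is the 3,720 quoted in the lecture, the intercept is a hypothetical placeholder, and the listings are made up apart from the 0.25 carat diamond at $500. Each listing is scored by how far its price sits from the line, and the ones furthest below the line come first.

```python
SLOPE = 3720.0      # dollars per carat, as quoted in the lecture
INTERCEPT = -260.0  # HYPOTHETICAL, for illustration only

# (label, weight in carats, asking price). Only "B" comes from the lecture;
# the other listings are invented for this example.
listings = [
    ("A", 0.20, 480.0),
    ("B", 0.25, 500.0),
    ("C", 0.30, 870.0),
    ("D", 0.35, 1010.0),
]


def distance_from_line(weight, price):
    """Vertical distance from the line: negative means priced below it."""
    return price - (INTERCEPT + SLOPE * weight)


# Most negative distance first: the candidates furthest beneath the line.
ranked = sorted(listings, key=lambda row: distance_from_line(row[1], row[2]))

for label, weight, price in ranked:
    gap = distance_from_line(weight, price)
    print(f"{label}: {weight:.2f} carat at ${price:,.0f}, {gap:+.0f} vs. the line")
```

The ranking only says which candidates to go and look at first. It cannot tell a mispriced diamond from a flawed one.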

5:43

In summary, points a long way from the line can be of great interest.

I've shown you some regression lines, but I haven't yet

told you how they're calculated.

So, where does this regression line come from,

sometimes called the line of best fit?

Well there's a methodology, and

that methodology is called the method of least squares.

That is the most frequently used one to calculate these best-fitting lines.

And so it's not the only way of calculating

the line to go through the data, but it's a very commonly used one.

And if you pick up a typical spreadsheet program, it's the one that's going to be

implemented when you run your regressions there.

So, the optimality criterion, because we are going to fit the best line,

is known as the method of least squares.

And in words, what the least squares line is doing is finding the line amongst all

the infinite number of lines that you could potentially draw through the data.

It's finding the line that minimizes

the sum of the squares of the vertical distance from the points to the line.

And I've illustrated that idea by zooming in on the diamonds data,

I've taken a small range and I've drawn a line there.

I've drawn the points around it.

And the red lines are picking up the vertical distance from the point

to the line.

And what we want to do is find a line that minimizes the sum of the squares

of those vertical distances.

And we're going to call such a line, the least squares line, or

the line of best fit.
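
Written as a formula, with notation that is assumed here rather than taken from the slides (points (x_i, y_i) for i = 1, ..., n, and a candidate line with intercept b_0 and slope b_1), the least squares line is the choice of b_0 and b_1 that solves:

```latex
(\hat{b}_0, \hat{b}_1)
  = \arg\min_{b_0,\, b_1} \sum_{i=1}^{n} \bigl( y_i - (b_0 + b_1 x_i) \bigr)^2
```

Each term in the sum is the square of one vertical distance from a point to the line.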

So basically what you're trying to do,

is find the line that most closely follows the data.

That's another way of thinking about it.

But there is a formal criterion.

That criterion is implemented in software, and

you will use that software to actually calculate a least squares line,

a regression for any particular data set that you might have.
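
To make concrete what the software is doing, here is a minimal sketch of the least squares calculation for a single x variable, using the standard closed-form solution. The function names and the five data points are made up for illustration; real software does the same job with more numerical care.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line that minimizes the sum of
    squared vertical distances from the points to the line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_of_squares(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up (weight, price) data, for illustration only.
weights = [0.15, 0.20, 0.25, 0.30, 0.35]
prices = [300.0, 480.0, 690.0, 850.0, 1050.0]

b0, b1 = least_squares_line(weights, prices)
best = sum_of_squares(weights, prices, b0, b1)
nudged = sum_of_squares(weights, prices, b0, b1 * 1.05)  # a slightly steeper line

print(f"intercept = {b0:.1f}, slope = {b1:.1f}")
print(f"sum of squares: fitted line {best:.1f}, nudged line {nudged:.1f}")
```

Any other line, such as the nudged one here, gives a larger sum of squares than the fitted line.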

8:23

For any given value of x, the fitted value is found by going up to the blue line.

And then the residual is that vertical distance from the blue line to the point,

so you can see, you can ultimately get to one of those points in two steps.

You take your x value beneath it.

First of all, you take a step up to the line, and

then once you're on the line, you add on the little red line, the residual, and

you'll get to the data point.

So that says that the data point can be expressed in two components.

One, the line.

And two, the residual about that line.

So, that decomposing of the data into two parts,

mirrors a basic idea that we take to fitting these regression models.

And that idea is that the data we see is made up of two parts.

We often call that the signal and the noise.

And the regression line is our model for the signal.

And the residuals are encoding the noise in the problem.
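
The two-step picture, up to the line and then along the residual, can be checked directly. In this sketch the slope is the lecture's 3,720, while the intercept and the three data points are hypothetical.

```python
SLOPE = 3720.0      # dollars per carat, as quoted in the lecture
INTERCEPT = -260.0  # HYPOTHETICAL, for illustration only

# Made-up (weight, observed price) pairs.
data = [(0.20, 480.0), (0.25, 690.0), (0.30, 870.0)]

# Step 1: go up to the line (the fitted value, our model for the signal).
fitted_values = [INTERCEPT + SLOPE * weight for weight, _ in data]
# Step 2: the leftover vertical distance (the residual, the noise).
residuals = [price - fit for (_, price), fit in zip(data, fitted_values)]

for (weight, price), fit, res in zip(data, fitted_values, residuals):
    # Data point = fitted value + residual.
    print(f"{price:.0f} = {fit:.0f} + ({res:+.0f})  at {weight:.2f} carat")
```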

Both of these components that come out of the regression, both the fitted values and

the residuals, are useful.

The fitted values become our forecasts.

If you bring me a new diamond for a given weight,

let's say 0.25 of a carat, what do I think its price is going to be?

I simply go up to the regression line, the so called fitted values, and

I read off the value of y, the price.

Now, the residuals are useful as well because they allow me

to assess the quality of fit of the regression model.

Ideally, all our residuals would be zero.

That would mean that the line went through all the points.

In practice, that is simply not going to happen, but

we will often examine the residuals from a regression, because by examining

the residuals we can potentially gain insight into that regression.

And typically, when I run a regression, one of the very first things I'm going to

do is take all of the residuals out of the regression.

I'm going to sort that list of residuals.

And I'm going to look at the most extreme residuals.

The points with the biggest residuals are by definition those points that are not

well fit by the current regression.

10:36

If I'm able to look at those points and explain why they're not well fit,

then I have typically learned something that I can

incorporate in a subsequent iteration of the regression model.
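
A sketch of that routine, with everything in it assumed for illustration (the labels, the data, and the helper name `fit_line`): fit the line, pull out the residuals, sort them by size, and look at the most extreme ones first.

```python
def fit_line(xs, ys):
    """Least squares (intercept, slope) for a single x variable."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up (label, x, y) records; the labels let us drill down afterwards.
records = [
    ("p1", 1.0, 2.1),
    ("p2", 2.0, 3.9),
    ("p3", 3.0, 6.2),
    ("p4", 4.0, 12.0),
    ("p5", 5.0, 9.8),
]

intercept, slope = fit_line([r[1] for r in records], [r[2] for r in records])

# Residual = observed y minus fitted y, kept alongside its label.
labelled = [(label, y - (intercept + slope * x)) for label, x, y in records]

# Biggest residuals (in absolute size) first: the points the line fits worst.
by_size = sorted(labelled, key=lambda item: abs(item[1]), reverse=True)

for label, res in by_size:
    print(f"{label}: residual {res:+.2f}")
```

Keeping a label next to each residual is what makes it possible to go back to the underlying data and ask what is special about the worst-fit points.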

Now if that all sounded a little bit abstract,

I've got an example to show you right now.

So here's another data set that lends itself to a regression analysis.

And in this data set I've got two variables.

The outcome variable, or the y variable, is the fuel economy of a car.

And to be more precise,

it's the fuel economy as measured by gallons per thousand miles in the city.

So let's say you live in the city and

you only drive in the city, how many gallons are you going to have to put

in the tank to be able to drive your car 1,000 miles over some course of time?

That's the outcome variable.

Clearly the more gallons you have to put in the tank,

the less fuel efficient the vehicle is.

That's the idea.

Now we might want to create a predictive model for

fuel economy as a function of the weight of the car.

And so here I've got an X variable as weight.

And I'm going to look for the relationship between the weight of a car and

its fuel economy.

We collect the set of data.

That's what you can see in the scatter plot.

The bottom left-hand graph on this slide.

And each point is a car.

And for each car, we've found its weight, we've found its fuel economy,

we've plotted the variables against one another.

And we have run a regression through those points

through the method of least squares.

And that regression gives us a way of predicting the fuel economy of a vehicle

of any given weight.

Now why might you want to do that?

Well, one of the things that many vehicle manufacturers are thinking about these

days, is creating more fuel efficient vehicles.

And one approach to doing that is to actually change the materials that

vehicles are manufactured from.

So for example, they might be moving from steel to aluminum.

Well, that will reduce the weight of the vehicle.

Well, if the vehicle's weight is reduced,

I wonder how that will impact the fuel economy?

And so that's the sort of question that we'd be able to start addressing through

such a model.

So that's a setup for this problem, but

I want to show you why looking at the residuals can be such a useful thing.

So when I look at the residuals from this particular regression, I note one of

the residuals, actually I found the biggest residual in the whole data set.

And that's the point that I have identified in red on the scatter plot.

And it is the biggest residual, it's a big positive residual.

Which means that the reality is, that this particular vehicle

needs a lot more gas going in the tank than the regression model would predict.

The regression model would predict the value on the line.

The red data point is the actual observed value.

It's above the line, so it's less fuel efficient than the model predicts.

It needs more gas to go in the tank than the model predicts.

So is there anything special about that vehicle?

Well, at that point I go back to the underlying data set and I drill down.

So, when I see big residuals, I'm going to drill down on those residuals.

And drilling down on this residual actually identifies the vehicle.

And the vehicle turns out to be something called a Mazda RX-7.

And this particular vehicle is somewhat unusual,

because it had what's termed a rotary engine,

which is a different sort of engine than any other single vehicle in this data set.

Every other vehicle had a standard engine, but the Mazda RX-7 had a rotary engine.

And that actually explains why its fuel economy is bad in the city.

And so by drilling down on the point, by looking at the residuals,

I've identified a feature that I hadn't originally incorporated into the model.

And that would be the type of engine.

And so, the residual and the exploration of the residual has

generated a new question for me that I didn't have prior to the analysis.

And that question is,

I wonder how the type of engine impacts the fuel economy as well?

So that's one of the outcomes of a regression that can be very, very useful.

It's not the regression model directly talking to you.

It's the deviations from the underlying model that can sometimes be the most

insightful part of the model itself or the modeling process.

So remember in one of the other modules,

I talked about, what are the benefits of modelling?

And one of them is serendipitous outcomes, things that you find that you hadn't

expected to at the beginning, and I would put this up there as an example of that.

By exploring the residuals carefully, I've learned something new,

something that I hadn't anticipated.

And I might be able to subsequently improve my model by incorporating this

idea of type of engine into the model itself.

So the residuals are an important part of a regression model.