Hi. In this lecture, we're gonna talk about a very simple class of models that

help us make sense of data, and these are known as categorical models. In a

categorical model, what you do is you basically bin reality into different

categories, and you hope these categories help you make better sense of the data.

That they explain some of the variation in the data. I wanna start out by just

describing what a categorical model's like and we'll talk about how they can help us

make sense of data. So let me give an example. A long time ago, over a decade

ago, I was at a conference, and Amazon, which is a company that, you know, sells all

sorts of stuff over the web, [laugh] right, was just going public. And

there was a discussion whether Amazon was a good investment or not. So one person

who was a Wall Street investor said, you know, I think it's a horrible investment.

If you think of Amazon all it is, is it's just a delivery company. Like they just

got a big warehouse, you order stuff, they deliver it. The margins in that industry

are really small. Like there's already [laugh] UPS and FedEx and DHL and

all those sorta places. I just don't think there's any money in it. Now, another

person said, you know what, I'm gonna put Amazon in a very different box. I'm gonna

put it over here in a box that says information, because I think it's part of

the new information economy. They're gonna gather all this information about what

consumers want. It's all going to be centrally held. They're gonna be worth a

ton of money. Now it turns out if you put Amazon in this information box, you

probably would have invested in it, and you'd have made a lot of money. If you put

Amazon in this delivery box, you wouldn't have invested in it, and you wouldn't have

made a lot of money. So which box you use, [inaudible] how you categorize things,

affects how you think about things and, again, what sort of decisions you make. So

this leads to a phrase that one of my friends, who's a psychologist,

[laugh] once said: lump to live. What my friend meant is this: we create

these lumps, these boxes, these categories in order to make sense of the world. So I

look out there in the street and I see a vehicle. I don't say oh it looks like a

1997 Ford F150 pickup truck, right? Instead I just say truck or I just say

car. Or if I look at a piece of furniture I just say it's a dresser. I don't say

it's an 1874 Chippendale dresser. I don't completely break it down. I just put

things in categories. And these [inaudible]. They're shortcuts, right?

They help us make sense of the world. And let's think about why we model again,

right? One of the reasons we model is to help us decide, strategize and design.

Right. So one of the reasons we lump is it helps us make quicker decisions,

where we just put things in categories and say this is something I like, this is

something I don't like, this is something that's risky, this is something that's not

risky. Let me give some fun examples. So the first one is,

let's suppose you're a kid and you gotta decide what am I gonna eat and what am I

not gonna eat. Well, one sorta categorization you might use is the green

categorization. So you might say: anything that's green. Broccoli's green. A

grasshopper's green, asparagus is green. All these things are green, right?

Everything else: bananas, those are yellow. Candy bars, brown. Oranges, they're

orange. Pears, pears can be green, but we'll assume they're yellow. And

strawberries, they're red. These other things aren't green. And so your rule

could be, I'm gonna eat anything that's not green and I won't eat anything that's

green. And so that rule will keep you safe from things like grasshoppers and

asparagus. Right, so that's a rule you might follow. Now it's not an optimal rule

because you might run into a green pear and that green pear might be something

you'd really like, but if you've been avoiding green things, you may decide,

well, not gonna risk it. So that's an example of a simple rule. Let's now

show how you can use a rule like that to make sense of data. So now let's suppose

I've got a bunch of data here, and these are different food items, and what you've

got in this column right here are calories. So this is how many calories

there are in each of these food items. So what I want to do is

make sense of why some things have a lot of calories and some things don't

have a lot of calories. And so I've got this list of items. Well, the first thing

that I need to try to make sense of is how much variation is there in this data. Well

to understand how much variation there is, first I've got to find out, sort of what's

the average value? And then variation tells me, how far are things, on average,

from that value? So if I add all this up, I've got 100 plus 250, that's 350, plus 90 is 440, plus 110 is 550,

plus 350 is 900, right? So we've got 900 divided by five, so that means the mean here is 180.

So on average, everything in this group has about 180 calories. But

some things are higher, right, this is 350, and some things are lower, this is 90. I

want some understanding of how much variation is in that data. So one way to

do that is we just subtract the mean from everything. So if we take 100 minus 180,

that's gonna be minus 80. 250 minus 180, that's gonna be 70. 90 minus 180, that's

minus 90, right? 110 minus 180 is minus 70, and 350 minus 180 is 170. Well, if I add all

these things up, I'm gonna get minus 80, plus 70, minus 90, minus 70, plus 170, and

it'll be zero, because differences from the mean always cancel out. So what I need is I need

all these differences to be positive. So one thing I can do is I can just take the

absolute value of all these things, right? And then I could add up the

absolute values. And I could get 80 plus 70 is 150. Plus 90 is 240. Plus 70 is 310.

Plus 170 is 480. So we could say the total difference from the mean is 480. But

what we do in statistics, is we tend to do something different. We actually tend to

take the difference and square it. And the reason we square it is really twofold. One

is that again it makes everything positive, which is what the absolute value did before. And

the other thing is that it amplifies larger deviations. Because what we'd

really like to do is prevent those huge deviations. This is gonna amplify large

deviations. So if I look at the pear, I would have 100 minus 180, which is minus 80,

squared, which is 6,400. So that's how much variation there would be. So that's

the difference from the pear to the mean, squared. And if

[inaudible] for the cake, I'm gonna get 250 minus 180, which is 70. And if I square that,

right, I'm gonna get 4,900. Now I could do this for all of them, right? So

for the pear, I get 6,400, for the cake I get 4,900, for

the apple 8,100, for the banana 4,900, for the pie, 28,900 [laugh].

So this is, again, a long way from the mean, and when you square

it, it has a huge effect, so squaring amplifies larger mistakes. Now if I add up all these

numbers, I'm gonna get 53,200. That's what we call the total variation. So

I plotted that data, and this tells me sort of how much variation is in that data.

What I'd like to do is [inaudible] categories [inaudible] that reduce that

variation, that somehow explain why some things are high and some things are

low. So what's the obvious categorization? The obvious categorization

here is that pears and apples and bananas are fruit, and cakes and pies, right, are

desserts. So let's create a fruit category and a dessert category. So in the fruit,

I've got one thing that's 90, one thing that's 100 and one thing that's 110. And

in the dessert category, I've got one thing that's 250 and one thing that's 350.
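
The calculation so far, before splitting into categories, can be sketched in a few lines of Python. The item names and calorie counts are the ones read out in the lecture; the variable names are just illustrative choices.

```python
# Calorie counts as read out in the lecture.
calories = {"pear": 100, "cake": 250, "apple": 90, "banana": 110, "pie": 350}

values = list(calories.values())
mean = sum(values) / len(values)                   # 900 / 5

# Difference of each item from the mean; these always add up to zero.
deviations = [v - mean for v in values]

# Two ways to make the differences positive before adding them up.
abs_total = sum(abs(d) for d in deviations)        # absolute values
total_variation = sum(d ** 2 for d in deviations)  # squares (the usual choice)

print(mean, sum(deviations), abs_total, total_variation)
# prints: 180.0 0.0 480.0 53200.0
```

Squaring makes the pie (170 away from the mean) contribute 28,900 of the 53,200, which is the "amplifies larger deviations" effect described above.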

So let's look at them in more detail. I've got 90, 100, 110, so the mean is gonna be 100

here, right? The average is 100. What's the total variation? Well, 90 minus

100 is minus ten, so if I square that, I get 100. 100 minus 100, right, is zero, so

if I square that, I get zero. And 110 minus 100 is ten, so if I square

that, I get 100. So the total variation here is just gonna be 100 plus 100, or 200. So

now what I've done is I've got a mean of 100, and a total variation of 200. And

now, if I go to this case, the mean is gonna be 300, right, for the desserts. And

what's the total variation? Well, for the cake, it's 250 minus 300, which is minus 50, squared,

which is 2,500. And for the pie, it's 350 minus 300, which is 50, squared, which

is 2,500 [inaudible]. Add those up, I get 5,000. Alright, so let's clean this up a

little bit. So what I did is by creating two categories, a fruit category and a

dessert category, I now have a mean in the fruit category of 100 and a variation of

200, and a mean in the dessert category of 300 and a variation of 5000. Now one

[inaudible] I started out with, right? When I had all the stuff together, I had a

mean of 180 and I had a variation of 53,200. Now look at how much my variation

has gone down. It went from about 53,000 to about 5,000. So here's the idea. These

categories substantially reduce the amount of variation I have left over. So think of

the variation as what's unexplained. So initially I say, look, I can just say things

on average have 180 calories, and we've got 53,200 units of variation that's [laugh]

unexplained. Now I say, wait a minute, I'm gonna create a categorical model that says

there's fruits and desserts, and fruits have fewer calories than desserts. And you can

say, well, look, it appears to be the case. Fruits have a mean of 100. Desserts have a

mean of 300 and the variation in the fruits is only 200 and the variation in

the desserts is 5,000. So I've reduced variation a ton. What we want is we want a

formal measure of how much we've reduced variation. That's actually fairly simple,

right? So we had a total variation of 53,200. Fruit variation is 200, dessert variation

is 5,000. So that gives me 5,200. So 53,200 at the start, and I've got 5,200 left. So what we

wanna ask is how much did I explain. That's the question. How much of that

variation did I explain? Well, I started out with 53,200, right? And

now I only have 5,200 left over, and so the amount I explained is just 53,200 minus

5,200, right, which is 48,000, over 53,200. So the percentage of [inaudible]

I explained was 48,000 divided by 53,200, which is a huge amount. And I can write

this more simply as just one minus the amount that's left over: one

minus 5,200 over 53,200, right? Because that's just a simpler way to do

it. And so when I do that, I get that the amount of [inaudible] I explained was 90

percent, so 90.2%. So that's how much of that variation I explained. This is equal

to, again, that 48,000, right, divided by 53,200, the amount of variation that I

explained. Now, formally, this is called the R squared. So this is the [inaudible],

the percentage of variation that I explained just by that simple

categorization. So, if the R squared is near one, that means I explained almost

all the variation, so the model explains a lot. Right? If the R squared is near zero,

that means I didn't explain any of the variation really, and the model doesn't

explain very much at all. Now, the better the model, the larger the R squared

it'll have. But it depends: there could be so much variation in the data that even

a great model only has an R squared of five or ten percent. There also could be

situations where the thing you're trying to explain is pretty understandable and a

good model has to have an R squared of 90%. So there's no fixed rule for,

you know, what a good R squared is. It depends what the data looks like. But within

a class, you know, sort of class of models, or you know that's [inaudible] data class,

you can sorta figure out this is a good model, this is a bad model. Based on

experience. Let's push this a little bit further. We had, you know, fruit and

desserts, right? Those are our two categories. But if I had, you know, a

whole kitchen worth of food, it may be the case, that, like, I'd wanna have more

categories. So I might create a vegetable category and a grains category. And then I

could put everything in one of these four boxes. So one of the differences between

sort of experts and nonexperts is experts tend to have more boxes. They also tend to

put the right things in the right boxes, so they tend to have useful boxes. So if

you want to be good at sort of predicting things or understanding how the world

works, what you have to have is a lot of categories and you have to have those

categories be the right categories. They've got to explain a lot of the

variation. And we can measure how much of the variation your model

explains by using that R squared. One last point. Even if you explain a lot of

variation, it doesn't mean you've got a good model. Let's go back to the schools

case. So suppose I'm trying to figure out what makes a good school, what really

leads to a good school performance. So I try all sorts of different boxes. I look

at schools that spend a lot of money versus schools that don't spend a lot of

money. Schools that have small class sizes and big class sizes. Schools that are big

and schools that are small, right? And nothing really seems to explain too much

of the variation. And then I create a box that I call the equestrian box. And I put

all the schools in here that have equestrian teams. And I find, oh my

goodness, every school with an equestrian team is great. Well, the thing is, that

doesn't mean that the equestrian team made the school good, right? So statisticians

make a distinction between correlation, which is: is there a statistical

relationship between having an equestrian team and being a good school? And

causation: did the equestrian team cause the school to be good? So remember, when

you think about putting things in boxes: this box right here is a

bunch of good outcomes. And this box here has mostly bad outcomes. That doesn't

necessarily mean that the thing that created this box, if it's the equestrian

box, is the reason that the schools were good. It could be that there's some other

reason. So why would you have an equestrian team? Well you only have an

equestrian team if you have a lot of money. And you probably also only have an

equestrian team if you have a lot of parental involvement, things like that.

Like a lot of support from the community. And you only have an equestrian team if

you have a lot of open space. So it could be that having an equestrian team is a

proxy for things like money, parental involvement, open space, right, those

sorts of things that actually do make a school good. So even if your boxes work,

that's no guarantee that they're actually the cause of why it works. Okay, so what

do we have? We have one way, [inaudible] the simplest model you can possibly have

is a categorical model. You can say, Amazon is a delivery company. Amazon's an

information company. You can say things are either fruits or desserts. Right? And

by creating these boxes it can help you sort of explain the variation in data.

What we saw then is that a simple way to measure how good a categorization is, is

the percentage of the variation you explained,

and that's what we called R squared, right? R squared was: just take all the

variation that was there and then ask how much was left over.

Then we subtract those two, and that tells you how much you explained. So you

ask what percentage of all that variation do you... Let me start over. Okay, what have

we learned. What we've learned is this. We've learned that the simplest kind of

model you can have is just a category based model, right where you just sorta

lump the world in different categories, and you place your data in different boxes

depending on what different data it is. So that could be information companies versus

delivery companies. That could be fruits and desserts, right? And in doing that,

what you could do is you can reduce the amount of variation you see in the data.
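
Written as a formula, that reduction for the calorie example looks like the sketch below. The notation is an assumption for illustration and is not from the lecture, which only gives the numbers: $y_i$ are the calorie counts, $\bar{y}$ is the overall mean (180), and $\bar{y}_c$ is the mean within category $c$ (100 for fruits, 300 for desserts).

```latex
R^2 \;=\; 1 - \frac{\sum_{c}\sum_{i \in c} (y_i - \bar{y}_c)^2}{\sum_{i} (y_i - \bar{y})^2}
    \;=\; 1 - \frac{200 + 5{,}000}{53{,}200} \;\approx\; 0.902
```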

So there's a total variation, which is sorta just how much unexplained

variation there was out there in the world. By putting it in boxes you organize

it in such a way that you reduce the variation. The amount by which you reduce

the variation is what we call the r squared. That's the percent of variation

explained. And the more variation explained the better your categorization

is. Of course if you create more boxes you can explain more of the variation. Where

we're going next is linear models, which in effect can create a

different box for each value of x, our independent variable. Okay, thanks.
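
As a recap, the whole fruit-versus-dessert calculation can be sketched in Python. The function names `variation` and `r_squared` are hypothetical helpers written for this example, not part of any library; the numbers are the ones from the lecture.

```python
def variation(values):
    """Sum of squared differences from the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def r_squared(categories):
    """Share of the total variation explained by putting values into boxes.

    `categories` maps a box name to the list of values placed in that box.
    """
    all_values = [v for group in categories.values() for v in group]
    total = variation(all_values)                                  # before boxes
    left_over = sum(variation(g) for g in categories.values())    # inside boxes
    return 1 - left_over / total


two_boxes = {"fruit": [100, 90, 110], "dessert": [250, 350]}
one_box = {"food": [100, 250, 90, 110, 350]}

print(variation(two_boxes["fruit"]), variation(two_boxes["dessert"]))
# prints: 200.0 5000.0
print(round(r_squared(two_boxes), 3))  # prints: 0.902
print(r_squared(one_box))              # prints: 0.0
```

With a single box nothing is explained, so the R squared is zero; the fruit and dessert boxes leave 5,200 of the 53,200 unexplained, which gives the 90.2% from the lecture.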