Hi. In this lecture, we're gonna talk about a very simple class of models that helps us make sense of data, and these are known as categorical models. In a categorical model, what you do is you bin reality into different categories, and you hope these categories help you make better sense of the data: that they explain some of the variation in the data. I wanna start out by describing what a categorical model is like, and then we'll talk about how these models can help us make sense of data.
So let me give an example. A long time ago, over a decade ago, I was at a conference, and Amazon, which is a company that sells all sorts of stuff over the web, was just going public. There was a discussion about whether Amazon was a good investment or not. One person, who was a Wall Street investor, said: I think it's a horrible investment. If you think about Amazon, all it is is a delivery company. They've just got a big warehouse; you order stuff, they deliver it. The margins in that industry are really small. There's already UPS and FedEx and DHL and all those sorts of places. I just don't think there's any money in it. Now, another person said: you know what, I'm gonna put Amazon in a very different box. I'm gonna put it over here in a box that says information, because I think it's part of the new information economy. They're gonna gather all this information about what consumers want. It's all going to be centrally held. They're gonna be worth a ton of money.
Now it turns out that if you put Amazon in the information box, you probably would have invested in it, and you'd have made a lot of money. If you put Amazon in the delivery box, you wouldn't have invested in it, and you wouldn't have made a lot of money. So which box you use, [inaudible] how you categorize things, affects how you think about things and, again, what sort of decisions you make.
So this leads to a phrase that one of my friends, who's a psychologist, once said: lump to live. What my friend meant is that we create these lumps, these boxes, these categories, in order to make sense of the world. So I look out there in the street and I see a vehicle. I don't say, oh, it looks like a 1997 Ford F-150 pickup truck, right? Instead I just say truck, or I just say car. Or if I look at a piece of furniture, I just say it's a dresser. I don't say it's an 1874 Chippendale dresser. I don't completely break it down. I just put things in categories. And these [inaudible], they're shortcuts, right? They help us make sense of the world.
And let's think about why we model again, right? One of the reasons we model is to help us decide, strategize, and design. So one of the reasons we lump is that it helps us make quicker decisions: we just put things in categories and say, this is something I like, this is something I don't like, this is something that's risky, this is something that's not risky.
Let me give some fun examples. The first one: let's suppose you're a kid and you've got to decide, what am I gonna eat and what am I not gonna eat? Well, one sort of categorization you might use is the green categorization. So you might pick out anything that's green. Broccoli's green. A grasshopper's green, asparagus is green. All these things are green, right? Everything else: bananas, those are yellow. Candy bars, brown. Oranges, they're orange. Pears, pears can be green, but we'll assume they're yellow. And strawberries, they're red. These other things aren't green. And so your rule could be: I'm gonna eat anything that's not green, and I won't eat anything that's green. And that rule will keep you safe from things like grasshoppers and asparagus. Right, so that's a rule you might follow.
Now, it's not an optimal rule, because you might run into a green pear, and that green pear might be something you'd really like, but if you've been avoiding green things, you may decide, well, I'm not gonna risk it. So that's an example of a simple rule. Let's now show how you can use a rule like that to make sense of data.
So now let's suppose I've got a bunch of data here. These are different food items, and what you've got in this column right here are calories: how many calories there are in each of these food items. What I wanna do is make sense of why some things have a lot of calories and some things don't. So I've got this list of items: a pear at 100, a cake at 250, an apple at 90, a banana at 110, and a pie at 350. Well, the first thing I need to make sense of is how much variation there is in this data. To understand how much variation there is, first I've got to find out what the average value is, and then variation tells me how far things are, on average, from that value. So if I add all this up, I've got 100 plus 250, that's 350, then 440, 550, 900, right? So we've got 900 divided by five, so the mean here is 180. On average, everything in this group has about 180 calories. But some things are higher, right, this is 350, and some things are lower, this is 90. I want some understanding of how much variation is in that data.
One way to do that is to just subtract the mean from everything. So if we take 100 minus 180, that's gonna be minus 80. 250 minus 180, that's gonna be 70. 90 minus 180, that's minus 90, right? 110 minus 180 is minus 70, and 350 minus 180 is 170. Well, if I add all these things up, I'm gonna get minus 80, plus 70, minus 90, minus 70, plus 170, and it'll be zero, because the differences above and below the mean always cancel out. So what I need is for all these differences to be positive. One thing I can do is just take the absolute value of all these things, right? And then I could add up the absolute values.
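As a minimal sketch of the arithmetic so far, the following Python reproduces the mean and the raw differences from it. The calorie figures are the ones quoted in the lecture; the variable names are illustrative assumptions, not anything shown on the slides.

```python
# Calorie figures quoted in the lecture; the names are illustrative.
calories = {"pear": 100, "cake": 250, "apple": 90, "banana": 110, "pie": 350}

# Mean: add everything up and divide by the number of items.
mean = sum(calories.values()) / len(calories)

# Raw difference of each item from the mean (negative means below the mean).
deviations = {food: value - mean for food, value in calories.items()}

print(mean)                      # 180.0
print(deviations["pie"])         # 170.0
print(sum(deviations.values()))  # 0.0
```

The last line shows why the raw differences cannot serve as a measure of variation on their own: they always sum to zero.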
And I'd get 80 plus 70 is 150, plus 90 is 240, plus 70 is 310, plus 170 is 480. So we could say the total difference from the mean is 480. But what we do in statistics is we tend to do something different. We actually tend to take the difference and square it. And the reason we square it is really twofold. One is that, again, it makes everything positive, which is what the absolute value did before. And the other is that it amplifies larger deviations, because what we'd really like to do is prevent those huge deviations.
So if I look at the pear, I would have 100 minus 180, which is minus 80, and squared that's 6,400. So that's the difference from the pear to the mean, squared. And if [inaudible] for the cake, I'm gonna get 250 minus 180, which is 70, and if I square that, I'm gonna get 4,900. Now I could do this for all of them, right? So for the pear I get 6,400, for the cake I get 4,900, for the apple 8,100, for the banana 4,900, and for the pie 28,900. So the pie is, again, a long way from the mean, and when you square it, that has a huge effect; squaring amplifies larger deviations. Now if I add up all these numbers, I'm gonna get 53,200. That's what we call the total variation.
So I've plotted that data, and this tells me how much variation is in it. What I'd like to do is [inaudible] categories [inaudible] that reduce that variation, that somehow explain why some things are high and some things are low. So what's the obvious categorization? The obvious categorization here is that pears and apples and bananas are fruit, and cakes and pies are desserts. So let's create a fruit category and a dessert category. In the fruit category, I've got one thing that's 90, one thing that's 100, and one thing that's 110.
And in the dessert category, I've got one thing that's 250 and one thing that's 350. So let's look at them in more detail. For the fruit, I've got 90, 100, 110, so the mean is gonna be 100 here, right? What's the total variation? Well, 90 minus 100 is minus ten, so if I square that, I get 100. 100 minus 100 is zero, so if I square that, I get zero. And 110 minus 100 is ten, so if I square that, I get 100. So the total variation here is just gonna be 100 plus 100, or 200. So now I've got a mean of 100 and a total variation of 200. And if I go to the desserts, the mean is gonna be 300, right? And what's the total variation? Well, for the cake, it's 250 minus 300, which is minus 50, and squared that's 2,500. And for the pie, it's 350 minus 300, which is 50, and squared that's also 2,500 [inaudible]. Add those up, I get 5,000.
Alright, so let's clean this up a little bit. By creating two categories, a fruit category and a dessert category, I now have a mean in the fruit category of 100 and a variation of 200, and a mean in the dessert category of 300 and a variation of 5,000. Now [inaudible] what I started out with, right? When I had all the stuff together, I had a mean of 180 and a variation of 53,200. Now look at how much my variation has gone down. It went from about 53,000 to about 5,000. So here's the idea: these categories substantially reduce the amount of variation I have left over. Think of the variation as what's unexplained. Initially I say, look, things have on average 180 calories, and we've got 53,000 units of variation that's unexplained. Now I say, wait a minute, I'm gonna create a categorical model that says there are fruits and desserts, and fruits have fewer calories than desserts. And you can say, well, look, that appears to be the case. Fruits have a mean of 100, desserts have a mean of 300, and the variation in the fruits is only 200 and the variation in the desserts is 5,000.
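The sums of squares worked through above can be checked with a short Python sketch. The function name `total_variation` and the two lists are assumptions made for illustration; the numbers are the calorie figures from the lecture.

```python
def total_variation(values):
    """Sum of squared differences between each value and the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

# Calorie figures from the lecture, grouped into the two categories.
fruit = [100, 90, 110]   # pear, apple, banana
dessert = [250, 350]     # cake, pie

print(total_variation(fruit + dessert))  # 53200.0 (everything in one box)
print(total_variation(fruit))            # 200.0
print(total_variation(dessert))          # 5000.0
```

Each category is measured against its own mean (100 for fruit, 300 for desserts), which is why the leftover variation is so much smaller than the 53,200 measured against the overall mean of 180.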
So I've reduced variation a ton. What we want is a formal measure of how much we've reduced variation. That's actually fairly simple, right? We had a total variation of 53,200. The fruit variation is 200 and the dessert variation is 5,000, so that gives me 5,200. So I start with 53,200 and I have 5,200 left. What we wanna ask is: how much did I explain? That's the question. How much of that variation did I explain? Well, I started out with 53,200, and now I only have 5,200 left over, so the amount I explained is just 53,200 minus 5,200, which is 48,000. So the percentage of variation I explained was 48,000 divided by 53,200, which is a huge amount. And I can write this more simply as one minus the fraction that's left over: one minus 5,200 over 53,200. And when I do that, I get that the amount of variation I explained was 90 percent, 90.2%. So that's how much of that variation I explained. This is equal to, again, that 48,000 divided by 53,200. Now, formally, this is called the R squared. So this is the [inaudible], the percentage of variation that I explained just by that simple categorization.
So if the R squared is near one, that means I explained almost all the variation, so the model explains a lot. If the R squared is near zero, that means I didn't really explain any of the variation, and the model doesn't explain very much at all. Now, the better the model, the larger the R squared it'll have. But there could be so much variation in the data that even a great model only has an R squared of five or ten percent. There could also be situations where the thing you're trying to explain is pretty understandable, and a good model has to have an R squared of 90%. So there are no fixed rules for what a good R squared is. It depends on what the data looks like.
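The R squared calculation just described, one minus the leftover variation over the total variation, can be sketched in Python as follows. The function names and the dictionary layout are assumptions chosen for illustration, not a standard library API.

```python
def total_variation(values):
    """Sum of squared differences between each value and the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def r_squared(categories):
    """Fraction of total variation explained by a categorization.

    `categories` maps a category name to the list of values placed in that box.
    """
    all_values = [v for values in categories.values() for v in values]
    total = total_variation(all_values)
    # Leftover variation: each box is measured against its own mean.
    leftover = sum(total_variation(values) for values in categories.values())
    return 1 - leftover / total

foods = {"fruit": [100, 90, 110], "dessert": [250, 350]}
print(round(r_squared(foods), 3))  # 0.902

# Hypothetical extreme: one box per item leaves nothing unexplained.
one_box_each = {"pear": [100], "apple": [90], "banana": [110],
                "cake": [250], "pie": [350]}
print(r_squared(one_box_each))  # 1.0
```

The second call is a caution rather than a recommendation: adding boxes always pushes R squared up, so a high value by itself does not show that the categories are useful ones.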
But within a class of models, or [inaudible] class of data, you can sort of figure out, this is a good model, this is a bad model, based on experience. Let's push this a little bit further. We had fruit and desserts, right? Those are our two categories. But if I had a whole kitchen's worth of food, it may be the case that I'd wanna have more categories. So I might create a vegetable category and a grains category, and then I could put everything in one of these four boxes. One of the differences between experts and nonexperts is that experts tend to have more boxes. They also tend to put the right things in the right boxes, so they tend to have useful boxes. So if you want to be good at predicting things or understanding how the world works, you have to have a lot of categories, and those categories have to be the right categories. They've got to explain a lot of the variation. And we can measure how much of the variation your model explains by using that R squared.
One last point. Even if you explain a lot of variation, it doesn't mean you've got a good model. Let's go back to the schools case. Suppose I'm trying to figure out what makes a good school, what really leads to good school performance. So I try all sorts of different boxes. I look at schools that spend a lot of money versus schools that don't. Schools that have small class sizes and big class sizes. Schools that are big and schools that are small, right? And nothing really seems to explain too much of the variation. And then I create a box that I call the equestrian box, and I put in it all the schools that have equestrian teams. And I find, oh my goodness, every school with an equestrian team is great. Well, the thing is, that doesn't mean that the equestrian team made the school good, right? So statisticians make a distinction between correlation.
Which is: is there a statistical relationship between having an equestrian team and being a good school? And causation: did the equestrian team cause the school to be good? So remember, when you think about putting things in boxes: this box right here is a bunch of good outcomes, and this box here has mostly bad outcomes. That doesn't necessarily mean that the thing that created this box, if it's the equestrian box, is the reason that the schools were good. It could be that there's some other reason. So why would you have an equestrian team? Well, you'd only have an equestrian team if you had a lot of money. And you'd probably also only have an equestrian team if you had a lot of parental involvement, things like that, a lot of support from the community. And you'd only have an equestrian team if you had a lot of open space. So it could be that having an equestrian team is a proxy for things like money, parental involvement, open space, those sorts of things that actually do make a school good. So even if your boxes work, that's no guarantee that they're actually the cause of why it works.
Okay, so what do we have? We have one way, [inaudible] the simplest model you can possibly have is a categorical model. You can say Amazon is a delivery company, or Amazon's an information company. You can say things are either fruits or desserts, right? And creating these boxes can help you explain the variation in data. What we saw then is that a simple way to measure how good a categorization is, is the percentage of the variation you explained, and that's what we called R squared, right? For R squared, you just take all the variation that was there and then ask how much was left over. Then we subtract those two, and that tells you how much you explained. So you ask what percentage of all that variation do you. Let me start over. Okay, what have we learned? What we've learned is this.
We've learned that the simplest kind of model you can have is just a category-based model, where you lump the world into different categories and place your data in different boxes depending on what kind of data it is. So that could be information companies versus delivery companies. That could be fruits and desserts, right? And in doing that, you can reduce the amount of variation you see in the data. So there's a total variation, which is how much unexplained variation there was out there in the world, and by putting the data in boxes you organize it in such a way that you reduce the variation. The fraction by which you reduce the variation is what we call the R squared. That's the percent of variation explained. And the more variation explained, the better your categorization is. Of course, if you create more boxes, you can explain more of the variation. Where we're going next is linear models, which in effect can create a different box for each value of x, our independent variable. Okay, thanks.