So let's talk about base rate. It's a concept that's critical to many prediction problems. Again, we'll start this segment with a quiz. Imagine that a physician has administered a test for prostate cancer to a very large number of men in their 50s. He finds that if a man has cancer, there's an 80% chance that the test will reveal it. That is to say, an 80% chance that the man will test positive for cancer. Charles, who is 53 years old, has taken the test, and it comes out positive. So, what's the probability that Charles has prostate cancer? The answer is, we can't possibly know, because we also have to know the probability that a man will test positive for cancer when he doesn't have it. And in fact, almost all medical tests do produce such results; they're called false positives. So let's say that you know that the false positive rate is 10%: 10% of men who do not have prostate cancer will nevertheless test positive for it. So the question is, what is the probability that Charles has prostate cancer given that he tested positive? We still can't know, because we also have to know the base rate for prostate cancer in the population in question. Let's say the base rate is 1%: 1% of all men in their 50s have prostate cancer. That's the base rate. So now, what is the probability that Charles has prostate cancer given that he's tested positive for it? We can actually now answer the question, although it's complicated. To make it simpler, let's think about a population of 1,000 men. And the statistics I'm about to show you are not real ones; they're just chosen to make it reasonably easy to figure out the probability that someone has a disease given that the test says the person does. So it's another old friend here, a two-by-two table. We have some people who get a positive test and some people who do not test positive. And we have some people who have cancer and some people who don't. And we're trying to predict the likelihood that someone has cancer when the test says he does.
So we've got 1,000 people. We know that 10 of these people, that's 1 in 100, have cancer, and 8 of them have been correctly diagnosed with it; that's 80% of 10, which is 8. Two of them have been incorrectly diagnosed: they have cancer, but it doesn't show up. Then we know that a total of 990 men don't have cancer. We know that 10% of those men are going to get false positives, and 10% of 990 is 99. Now, how do we find out the probability that Charles actually has prostate cancer? Well, we divide the number of men who have prostate cancer and test positive by the total number of men who test positive. That's 8 plus 99, so the ratio is 8/(8+99), or 8/107, which is 7.5%. So Charles, who panicked when a test that correctly identifies men who have prostate cancer with 80% accuracy identified him as having it, is going to be relieved when he finds out that the actual probability that he has cancer is not 80%, it's only 7.5%. So if you're given a positive diagnosis for some disease, here's what you need to know. You must know the percent of people who have the disease who will correctly test positive, the percent of people who do not have the disease who will incorrectly test positive, and the overall percent of people who have the disease; that's the base rate. Then you divide the number of positives who have the disease by the total number of positives. If you were unable to figure out the probability that Charles has the disease given those three numbers, the good news is that most physicians can't figure that out either. But of course that's also bad news, because many of those men who are false positives are going to have biopsies, which involve some risk and some discomfort, so testing has its dangers. A decision about whether to test depends on knowing what fraction of men who test positive, and do in fact have the disease, would have died of it.
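The counting argument above can be sketched in a few lines of Python. The numbers are the lecture's illustrative ones, not real clinical statistics:

```python
# Illustrative numbers from the lecture, not real clinical statistics.
population = 1000
base_rate = 0.01            # 1% of men in their 50s have the disease
sensitivity = 0.80          # P(test positive | cancer)
false_positive_rate = 0.10  # P(test positive | no cancer)

with_cancer = population * base_rate                    # 10 men
true_positives = with_cancer * sensitivity              # 8 men
without_cancer = population - with_cancer               # 990 men
false_positives = without_cancer * false_positive_rate  # about 99 men

# P(cancer | positive) = true positives / all positives
p_cancer_given_positive = true_positives / (true_positives + false_positives)
print(round(p_cancer_given_positive, 3))  # 0.075, i.e. about 7.5%
```

This is just Bayes' rule done by counting: the 80% figure that frightened Charles never appears alone in the final answer; it is swamped by the 99 false positives generated from the 990 healthy men.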
And in fact a very high fraction of men who test positive with currently available techniques would not have died of the disease. As a consequence, the government is now recommending against routine testing for prostate cancer. The exact same logic prompted the government recently to recommend that there should not be routine testing for breast cancer for women below the age of 40. Now, I should say, both of these recommendations were met with anger and disbelief on the part of many people. It's safe to say, however, that most such people were not sufficiently aware of the implications of the following facts: the disease has a low base rate of occurrence, it's uncommon; the test has a high false positive rate; further diagnostic procedures pose problems; and even if you have the disease, you might not die from it. Now let's look at another kind of problem. A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data. 85% of the cabs in the city are Green and 15% are Blue. A witness identified the cab as Blue. The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time. What is the probability that the cab involved in the accident was Blue rather than Green? Please say your answer out loud, or write it down. And now a similar problem, the panel truck problem. A panel truck was involved in a hit-and-run accident at dawn in an alley behind a restaurant. The only panel trucks that come down that alley are the white ones of the Pacific Bread Company and the yellow ones of the Mountain Milk Company. You are given the following data. The white Pacific trucks cause slightly more than five times as many panel truck accidents in the city as the yellow Mountain trucks. A witness identified the truck as yellow.
The court tested the reliability of the witness under the circumstances that existed in the early morning hours of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time. What is the probability that the truck involved in the accident was a yellow Mountain truck rather than a white Pacific truck? Okay, so I hope you didn't say, for the cab problem, that there's an 80% chance the cab is blue, because the base rate for blue cabs versus green cabs is only 15%. So you have to take that into account, somehow, in the problem. To see that, suppose the percent of cabs that are blue was one half of 1%. Very rare, in other words. So 99.5% of the cabs are green. And the witness is wrong 20% of the time. This means that a very high fraction of the time that the witness says the cab is blue, it's actually green. In fact, nearly every time that the witness says it's blue, it's green. So you have to take into account the prior odds for blue, or the base rate for blue. The actual probability that the cab is blue, given that the witness says the cab is blue, is a little tricky to calculate, but it's actually the same procedure that we used for the prostate cancer diagnosis. Let's assume that there are a total of 100 cabs in the city. We're going to set up a two-by-two table. So the witness says the cab is blue, or he says it's green, and the cab is in fact blue or in fact green. And we know that 15 of these cabs are blue, and 85 are green. We know that in 12 cases the witness says a cab is blue and it is blue, because he's right 80% of the time, and 80% of 15 is 12. And we know that in 68 cases, when the witness says the cab is green, it is green; that's 80% of 85. And we know that 17 times, when the witness says it's blue, it's actually green. And 3 times, when the witness says it's green, it's actually blue.
So how do we figure out the likelihood that the cab is blue, given that the witness said that it's blue? We need to divide the number of times that the witness says it's blue and the cab is blue by the total number of times the witness says it's blue, that is, the times he says it's blue and it is blue plus the times he says it's blue but it's actually green. That's 12/29, which is 0.41. So the probability that the cab is blue, given that a witness with 80% accuracy says it's blue, when only 15% of the cabs are blue, is 41%. So the fact that the witness said it's blue doesn't even get us up to having our best guess be that it was blue, if that's what the base rate is. Because of the weakness of our grasp on the relevance of base rate, we place too much confidence in information about a particular case, relative to the base rate for the kind of event we're trying to predict. Information about a particular case is sometimes called individuating information. This gets us in big trouble when that information is fallible. But even when people know the information is fallible, they tend to ignore the base rate. In trials, when a witness points to someone he saw fleetingly at dusk and says, that's the man, such testimony is usually treated as if it's almost surely true, even when the numbers tell us it's probably false. Well, I'm betting that you made more use of the base rate for the panel truck problem than you did for the cab problem. We're actually pretty good at taking base rate information into account when there's a causal interpretation. When you hear that white trucks cause more than five times as many accidents as yellow trucks, this creates an image of hellions behind the wheel of white trucks playing bumper cars in the city streets. And we don't forget that when the witness says that the offending truck was yellow. Now we'll look at a problem that deals with some pretty familiar-seeming events. A college senior that we'll call David L was admitted to two colleges.
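The cab calculation follows exactly the same pattern as the cancer calculation, so it can be sketched the same way in Python (the 100-cab fleet is the lecture's simplifying assumption):

```python
# Illustrative numbers from the lecture's cab problem.
total_cabs = 100
blue = 15        # base rate: 15% of the cabs are blue
green = 85       # 85% are green
accuracy = 0.80  # the witness is right 80% of the time

says_blue_and_is_blue = blue * accuracy          # 12 cases
says_blue_but_is_green = green * (1 - accuracy)  # 17 cases

# P(cab is blue | witness says blue)
p_blue = says_blue_and_is_blue / (says_blue_and_is_blue + says_blue_but_is_green)
print(round(p_blue, 2))  # 0.41
```

Notice that the structure is identical to the diagnosis code: true identifications divided by all identifications, with the base rate doing most of the work.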
One was a large public university and one was a small private college. The two schools were about equally expensive, equally selective, and about equally far from home. David had several friends at each college. His friends at the large university liked it. They found it intellectually stimulating, and they had good friends there. His friends at the small college were not so happy. They didn't find it a very stimulating place to be, and they weren't crazy about either their teachers or their fellow students. David went to both schools for a day. At the large university, he was given the brushoff by a couple of teachers he wanted to see, the students he met didn't seem terribly interesting, and he just didn't like the feel of the place. At the small college, he met several students who were bright and fun to be with and a couple of professors who took a personal interest in him, and he left feeling upbeat. Which school do you think David L should attend? Why? The great majority of University of Michigan students with no statistics training say he ought to go to the small college. Why is that? Well, because he's choosing for himself, not his friends. Can you think of a principle that might suggest this reasoning is dubious? I hope you're thinking about the law of large numbers. His friends have spent scores, if not hundreds, of days at their schools. He spent one day. His sample size is one, compared with hundreds in total for his friends. We know that sample values can be pretty far from population values if the samples are small. And intuitively, you know that you could meet a few atypical students and encounter a few atypical professors. And you might be off your game. You've got a cold, or a spat with your parents. The whole college might be a bit off. Maybe it's a particularly bleak day during exam time. On the other hand, maybe the school just won the play-off in basketball.
And it's true, he's not his friends. But, and now I'm going to tell you one of the most important things a social psychologist can tell you, and that is: if you're like most people, then, like most people, you think you're not like most people. But you are. If everybody you know liked the movie, odds are strong that you will too. And by the way, your friends are not most people. They're people you selected as friends, partly because of shared interests and values. So your sample of friends is biased to be more like you than most people are. And that bias means you should be heavily influenced by their experiences. In short, David L has extremely good base rate data, and the firsthand information he has could be misleading. Just for interest's sake, I'll tell you that I actually live my base rates. Some of my friends think I'm a little far gone on that, but I don't judge books by their covers. I basically don't read books unless they've been highly recommended by people I trust. The same is true for movies. Let me tell you about how I made my decision to come to the University of Michigan. I was at a school on the East Coast, a good school, and I was asked by folks at Michigan, would I come to interview? Everything I knew about the University of Michigan, the psych department there, and the city of Ann Arbor was good. I made up my mind to go if I got an offer. And I made up my mind not to let myself be influenced by the visit. A good thing, too, because my visit happened in February: very cold, patchy, dirty snow on the ground. The chairman and his executive committee spoke mostly about baseball during my interview. The dean obviously didn't hear a word I said. One of the least interesting and pleasant people in the department monopolized my time. Fortunately, I stuck to my guns and went to the University of Michigan. As it turns out, I actually never heard the word baseball again. Football is a different matter, but no baseball.
I found out that the dean was in the midst of an incredible student crisis. There was a demonstration that it was feared would become violent. So it's hardly surprising that the dean was not mentally present. And I was able to avoid pretty easily that guy who wasn't so interesting and pleasant. I want to make one last useful point about base rates for you. In July 1997, the proposed new Scottish Parliament building was estimated to cost 40 million pounds. Two years later, the budget had become 109 million. The estimated cost increased twice in 2002, ending the year at 295 million pounds. It rose three times more in 2003, reaching 375 million pounds. The building was finally completed in 2004, way behind schedule, at a final cost of 431 million. A 2005 study examined rail projects undertaken worldwide over the 30-year period beginning in 1969. In more than 90% of the cases, the number of passengers projected to use the system was overestimated. Even though these passenger shortfalls were widely publicized, forecasts did not improve over those 30 years. On average, planners overestimated how many people would use the new rail projects by 106%, and the average cost overrun was 45%. There is no evidence that that changed over the course of 30 years. Rail planners didn't bother to take this dismal base rate into account. You might say those are government projects, and even government's biggest fans don't tend to praise government for timeliness or thriftiness or efficiency of planning. How about ordinary citizens? How do they do at planning projects? How well do you think most Americans do at estimating the cost of kitchen remodeling projects? Do you think they overestimate and are pleasantly surprised at how little the kitchen costs? Maybe they hit it on the nose, or underestimate by 10%, 30%, 50%, 70%? A survey of homeowners who had remodeled their kitchens found that, on average, they had expected the job to cost $19,000.
In fact, they ended up paying an average of $39,000. They underestimated by more than 100%. How about you and me? Do the papers and reports that you write get produced in the amount of time that you anticipated? Mine don't. Why not? I sit down and look over my notes and plans and carefully estimate how much time each part of the project will take. I say, let's see, the introduction just covers standard stuff I've done before, a half day is enough. Methods are always a breeze to write, let's say a day at most. Results could be a little tricky, there are some things I don't know exactly how I'm going to handle, so let's say a day and a half. And the discussion is never a big deal, it's never very long. So I'm going to say three and a half days, four days tops. And how long does a paper usually take after those calculations? Nine or ten days, if I'm lucky. What am I doing wrong? I'm paying attention only to the things I know have to be done, and I can't see around the corners to the myriad little things that are going to crop up. I can't find that reference by Snerdly. My colleague never finished one of the analyses, and I don't really know how to do it myself. In short, I pay attention only to the individuating information about this particular case, and I'm never going to foresee all the things that produce delays. How could I do better? Pay attention to my own base rate for paper-writing projects. At the very least, after I do the calculations and come up with the estimated time, I should triple it. Do I do this? Not really. I usually just say vaguely that I'm often surprised, so I realize it might take longer than I think, but secretly I think this time will be different. It usually isn't. So do what I say here, not what I do. I have learned about other people's foibles. When a tradesman says a project will take two days, I triple it in my mind so that I won't be unpleasantly surprised.
When I set deadlines for other people, I make them a week or two earlier than I actually need the job done, so I'm not likely to be caught in a bind. Now, that should be as easy for you as it is for me. Set deadlines for other people with an eye to your safety and convenience. The next lesson is on cognitive biases: the mental procedures that we use to understand the world, and how those procedures can sometimes go awry.