Sampling variability and the central limit theorem should not be new concepts to you anymore. However, in this unit we're shifting the focus away from numerical variables and focusing on categorical variables only. So, in this video, we're going to start by talking about the sampling distribution for a sample proportion, because remember, when we're dealing with categorical variables, the parameter of interest is no longer a mean but a proportion. We're also going to define the central limit theorem for proportions, which is very similar to what we've seen before but, as expected, with a different measure of the standard error. And we're going to walk through the conditions for that central limit theorem to hold as well. Let's quickly revisit what we mean by a sampling distribution. Say you have a population of interest and you take a random sample from it. Based on that random sample, you calculate a sample statistic. If the variable of interest in that sample is a categorical variable, the sample statistic is going to be a sample proportion. Then we take another sample and also calculate the sample proportion from that. And then another one, and then another one. This goes on for a long time, because we want to think about taking as many samples as we can. The distributions of the observations within each sample are called sample distributions. When we look at the distribution of the sample statistics, however, that is what we call the sampling distribution. And remember that these two are not the same thing at all. In the sample distributions, the observations are individual people or cases, whatever it is that you're sampling, whereas in a sampling distribution the observations are sample statistics. Let's give a slightly more concrete example. Say we want to estimate the proportion of smokers in the world. So our population is the world population, and capital N is going to be our population size, so this is everybody in the world.
And our parameter of interest is p, the true proportion of smokers in the world. If we actually had data from the entire population, we could calculate this p as the number of smokers in the world divided by the total population size. But we don't have data from every single person in the world, so let's say that you're taking many samples instead. The idea here is not necessarily a realistic data analysis situation per se; we're trying to illustrate what we mean by a sampling distribution. So you start with the first country on the roster, Afghanistan, and you sample 1,000 people from Afghanistan. You ask each individual, are you a smoker or not, and record a yes or a no for each person. Then so on and so forth, you go through many countries. Let's say you take another random sample of 1,000 from the U.S., again asking each person, are you a smoker or not, and recording a yes or a no. And finally you end up in Zimbabwe, the last country on the roster, with another random sample of 1,000 people from there as well, again asking them, are you a smoker or not? So now you have a bunch of samples of 1,000 observations each, where each observation represents a person from that country. Say we summarize these samples. We calculate the proportion of smokers in Afghanistan; this is a sample proportion. Then the sample proportion of smokers in the U.S. And you do this for every country, all the way up to the proportion of smokers in Zimbabwe. So now our data set is not individual people and whether or not they smoke; we actually have a data set of proportions. The distribution of these proportions is what we call the sampling distribution. And as you can imagine, each of these should individually be a somewhat good guess for the true p.
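To make this thought experiment concrete, here's a quick simulation sketch in Python (the course itself works in R, and the true proportion, sample size, and number of samples below are all made-up values for illustration): we repeatedly draw samples, summarize each one by its sample proportion, and collect those proportions into a sampling distribution.

```python
import random

random.seed(42)

p_true = 0.22        # hypothetical true proportion of smokers (made up)
n = 1000             # size of each sample
num_samples = 500    # number of samples we take, one per "country"

# Each sample: draw n yes/no answers, then summarize the whole sample
# by a single number, its sample proportion.
p_hats = []
for _ in range(num_samples):
    smokers = sum(1 for _ in range(n) if random.random() < p_true)
    p_hats.append(smokers / n)

# The distribution of these 500 proportions is the sampling distribution.
mean_of_p_hats = sum(p_hats) / len(p_hats)
print(round(mean_of_p_hats, 3))  # should land close to p_true
```

Note how the data set at the end is a list of proportions, not a list of people, which is exactly the shift the transcript describes.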
Although we probably expect more variability between these than in the example we gave before, when we were talking about average heights of US women from various states, because we actually would expect some trends in the smoking habits of people from various countries. But overall, we would expect the mean of these p-hats to be close to our unknown population proportion. So, this is very similar to the diagram we drew before, so it's slightly repetitive, but we're basically trying to make sure that it is very clear what we mean by a sampling distribution. Something that's actually different here is that initially we started with a categorical variable: is the person a smoker or not a smoker? Then for each one of our samples, we calculated a summary statistic, the proportion of smokers. And now we are dealing with a distribution of numerical data, where the data items are the proportions of smokers in each country. So we started with a categorical variable, but we're once again talking about the distribution of a numerical variable, because we're focusing on the distribution of sample statistics. So what is the sampling distribution going to look like? Well, the central limit theorem tells us about that. It says that the distribution of sample proportions is going to be nearly normal, and, just as sample means were centered at the population mean, it's going to be centered at the population proportion. Generically, it's centered at the population parameter, with a standard error that is inversely proportional to the square root of the sample size, which is also something we've seen before. So the central limit theorem tells us about the shape of the distribution, the center of the distribution, as well as the spread of the distribution. And we can calculate the standard error as the square root of p, the proportion of success, times one minus p, divided by n.
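As a small sketch of that formula (the numbers here are arbitrary): because n sits under a square root, quadrupling the sample size only halves the standard error.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# Four times the sample size gives half the standard error.
se_100 = se_proportion(0.5, 100)   # sqrt(0.25 / 100) = 0.05
se_400 = se_proportion(0.5, 400)   # sqrt(0.25 / 400) = 0.025
print(se_100, se_400)
```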
Just like with any rule we introduce, there are conditions for the central limit theorem as well. The first condition is very similar to what we've seen before: independence of observations. Our sampled observations must be independent, and to achieve that we want either random sampling or random assignment, depending on the type of study we have. In addition, if we are sampling without replacement, we want to make sure that our sample size is less than 10% of our population. We also have a condition about the sample size. This time we're not just coming up with a threshold sample size per se, but we're looking at the balance of the sample size and the proportion of success. We're saying that there should be at least 10 successes and 10 failures in the sample; so, n times p and n times 1 minus p must both be at least 10. This rule should sound familiar to you, because we actually talked about it when we were dealing with the binomial distribution and looking for the normal approximation to it. The same idea holds here: we want our sample proportion to be nearly normally distributed, and therefore we need to meet the success-failure condition one more time. However, if your p is unknown, we usually use our sample proportion instead, and this goes for both the calculation of the standard error as well as the count of successes and failures. Again, if you don't know your population parameter, your best guess is going to be the sample statistic that you're using as a point estimate for that parameter. So let's do a quick example. We're told that 90% of all plant species are classified as angiosperms; these are flowering plants. If you were to randomly sample 200 plants from the list of all known plant species, what is the probability that at least 95% of the plants in your sample will be flowering plants? Let's take a look at what this question is telling us first and parse through the information that we're given.
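The success-failure check described above can be written out directly; here's a short sketch (the function name is just for illustration):

```python
def success_failure_ok(p, n):
    """Success-failure condition: we expect at least 10 successes
    and at least 10 failures, i.e. n*p >= 10 and n*(1-p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

print(success_failure_ok(0.9, 200))  # 180 expected successes, 20 failures
print(success_failure_ok(0.9, 50))   # only 5 expected failures
```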
We're told that 90% of all plant species are classified as angiosperms, so our proportion of success is 0.9, or 90%. We're also told that our sample size is 200, so n is 200. And we're asked for the probability of at least 95% successes; we're calling a sampled angiosperm plant a success here. So for "at least 95%," we're looking for the probability that our sample proportion will be greater than 0.95. If we knew something about the distribution of p-hat, we should be able to easily calculate this probability. In fact, if we knew that p-hat is distributed nearly normally, we know that we could calculate this probability using the normal distribution, z-scores, and percentiles. Well, the central limit theorem tells us that it may be distributed nearly normally, so let's check to see if the conditions for the central limit theorem hold, and if they do, then we can proceed with that. The first condition is about independence. We're told we have a random sample, and 200 is certainly less than 10% of all plants, so we can assume that whether or not one plant in our sample is an angiosperm is independent of another. Number two is about the success-failure condition. Our sample size is 200 and our proportion of success is 0.9, so n times p, 200 times 0.9, is 180, and n times 1 minus p, that's 200 times 0.1, is 20. Both of these are at least 10, so our success-failure condition holds as well, which tells us that the distribution of the sample proportion is going to be nearly normal. In fact, it's going to be nearly normal with mean at the population parameter, 0.90, and standard error equal to the square root of 0.9 times 0.10 divided by 200, which gives us roughly 2.12%. Now we have a normal distribution, we know its mean, we know its variability, and we're looking for a probability associated with this distribution. The first thing we need to do is draw our curve.
We mark our mean at 0.90, and then we shade the area of interest, anything beyond 0.95. To calculate this probability, we can refer to a z-score. So let's calculate our z-score as the observation minus the mean divided by the standard deviation of that observation. And because in this case the observation is a sample proportion, its standard deviation is measured by the standard error, and that gives us a z-score of 2.36. We can see that we are more than two standard deviations away from the mean at this point, so it's going to be a pretty small probability. By this time, hopefully, you are comfortable with finding these probabilities. Remember we talked about using the table, using R, or using an applet, so for practice you could try one of these methods and check your solution against what I'm about to reveal. The probability that we're interested in here should be roughly 0.0091. One thing we should mention is that we were looking for the probability of at least 95%, so it seems like we should have used the notation p-hat greater than or equal to 0.95. However, remember that under a continuous distribution, and the normal distribution is one, the probability of the random variable being exactly equal to a number is 0, because that would be like finding the area of a line, a sliver under the normal distribution, which doesn't really make sense. To answer this question we used the central limit theorem, a technique that we just recently learned, but we could also do this using the binomial distribution. Remember, our sample size is 200 and our overall proportion of success is 90%, or 0.9. We're basically being asked for the probability of obtaining at least 95% successes, or in other words, 95% of 200, at least 190 successes in 200 trials where the probability of success is 0.9. We could do this easily using R; we're going to use the dbinom function to calculate the binomial probabilities.
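Both calculations, the normal approximation we just walked through and the binomial sum we're about to do with dbinom, can be sketched together; this is an illustrative stdlib Python equivalent of the R approach, not the course's own code.

```python
import math
from statistics import NormalDist

p = 0.90       # true proportion of angiosperms
n = 200        # sample size
cutoff = 0.95  # we want P(p_hat >= 0.95)

# Normal approximation via the central limit theorem.
se = math.sqrt(p * (1 - p) / n)   # about 0.0212
z = (cutoff - p) / se             # about 2.36
prob_normal = 1 - NormalDist().cdf(z)   # upper tail, about 0.009

# Exact binomial: P(X >= 190) with X ~ Binomial(200, 0.9);
# this mirrors R's sum(dbinom(190:200, 200, 0.9)).
prob_binom = sum(
    math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(190, 201)
)

print(round(prob_normal, 4), round(prob_binom, 4))
```

The two answers differ slightly, as the transcript notes, because the normal curve is only an approximation to the discrete binomial distribution.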
And since we're looking for a range, we're going to calculate a bunch of binomial probabilities and add them up. So we're looking for the sum of all probabilities under the binomial distribution with n equal to 200 and p equal to 0.9, for anything between 190 and 200. This probability comes out to be roughly 0.008. That is not exactly the probability that we calculated before, but it's awfully close to it. So, before we wrap up our discussion of the sampling distribution of proportions, let's talk about a what-if scenario: what if the success-failure condition is not met? The center of the sampling distribution will still be around the true population proportion, and the spread of the sampling distribution can still be approximated using the same formula for the standard error. However, the shape of the distribution will depend on whether the true population proportion is closer to 0 or closer to 1. Let's take a look at this. Here's our number line, and remember that distributions of proportions have natural boundaries around them: they can only be between zero and one. So we know that the sample proportion cannot be below zero and cannot be greater than one. Let's think about a situation where the success-failure condition is not met, but where our true population proportion is at 0.2, a value that's closer to 0 than to 1. We said that the center of the distribution is still going to be around the true population parameter, but we're going to end up with a shorter tail to the left of the distribution and a much longer tail to the right. This is because, among samples taken from this population where the true population proportion is 20%, we would expect the majority to have sample proportions close to 20%, but we'll still get some that are different from 20%, and we might get proportions all the way down to 0 or all the way up to 1.
But it's going to be much less likely to get a sample proportion of 100% in a random sample from a population where the true population proportion is 20% than something like, let's say, 5 or 10%. So the tail to the left is short because we have the natural boundary at 0, but the tail to the right is much longer because the natural boundary on the higher end doesn't appear until 1, and that yields a right-skewed distribution. Similarly, if we had a population where the true population proportion is 80%, we would see the opposite effect, and our sampling distribution would then be expected to be left-skewed. This is if the success-failure condition is not met. If the success-failure condition is met, remember, that usually means that the sample size is larger, which is going to yield a smaller standard error. So the curves are going to be much denser around the true population parameter, and they're going to look more and more symmetric as the sample size increases.
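As an illustration of this what-if scenario, here's a small simulation sketch (all the numbers are made up): with a true proportion of 0.2 and a sample size small enough that the success-failure condition fails, the simulated sampling distribution stays centered near 0.2 but comes out right-skewed.

```python
import random

random.seed(7)

p_true = 0.2   # true proportion, closer to 0 than to 1
n = 15         # small sample: n * p = 3, success-failure condition fails

# Simulate many sample proportions.
p_hats = [sum(1 for _ in range(n) if random.random() < p_true) / n
          for _ in range(10_000)]

# Center, spread, and a standard skewness measure of the simulated
# sampling distribution (positive skewness means a longer right tail).
mean = sum(p_hats) / len(p_hats)
sd = (sum((x - mean) ** 2 for x in p_hats) / len(p_hats)) ** 0.5
skewness = sum((x - mean) ** 3 for x in p_hats) / (len(p_hats) * sd ** 3)

print(round(mean, 3), round(skewness, 2))  # centered near 0.2, right-skewed
```

Rerunning this with p_true set to 0.8 flips the sign of the skewness, matching the left-skew case described above.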