Hi, in this module I'm going to talk about clustered sampling. This is the second of our three modules on probability sampling methods. We just talked about simple random sampling, we're going to talk about clustered sampling right now, and in the next module we'll talk about stratified sampling.
Multistage clustered sampling is often used in contexts where simple random sampling, like we talked about in the previous module, is simply too expensive or too logistically complex. Let me explain why. Say we're thinking about a nationally representative household survey with in-person visits. In a large country, many of the households in a simple random sample may be the only sampled household in their city, town, or village. So we may end up with hundreds of towns or villages, each with a single household that has to be visited for an in-person interview. The travel budget could be immense, and the time and budget requirements could easily become overwhelming.
Let me try to clarify how this works. Say that we have a total population of 25,000 households in some hypothetical country, and these households are distributed across towns and villages of 10,000 households, 1,000 households, or 100 households. Now, if they're all adjacent to each other, that may not be much of a problem, but imagine they're spread over a very large geographic area, perhaps a continent or just a very large country. We take our list of 25,000 households, which is our sampling frame for this population, and we draw 250 households at random. Just by the laws of probability, we're going to have households to visit in almost every single one of our towns and villages. Our sample is 250 out of 25,000, so one out of 100.
So on average a town of 10,000 households will have 100 households in the sample. Now that isn't much of a problem. But each of the villages with just 100 households will, on average, have one household that we have to visit. Of course, in practice some of them may have zero and some may have two. But it's entirely possible that we would have to physically visit nearly every one of the administrative units, the towns and villages, in the population that we want to study. For example, if we look at the villages of only 100 households, there are ten such villages. We might have to visit possibly every single one of them, and if they're remote, maybe each requires a long drive or even a flight. Our travel budget is going to go up very, very rapidly. It might become unwieldy.
The most common approach to dealing with this problem is multistage cluster sampling. Clusters are geographic or administrative units in which the sampling units, the individuals or households that we're interested in, are nested. The basic idea is that we sample clusters, and then within clusters we sample our individuals or households. Or, if we're truly multistage, we may sample further clusters within clusters until we get all the way down to individuals and households.
"Stage" in multistage refers to the level of aggregation. So, as we'll see in an example, we might have multiple stages. At the first stage we define states as clusters and draw some states at random. Then at the next stage, within each sampled state, we define counties as clusters and randomly draw a selection of counties. And then perhaps we go down to further stages, such as city blocks and so forth. Clusters are sampled at each level, at each stage. They could be states, they could be counties.
And then within each cluster, either lower-stage clusters are sampled, or we actually sample the units that are the focus of our analysis, perhaps individuals or households. So, for example, we might have a sample of towns within a state, and then within each town we might draw a sample of households.
Now let me go through an example. Consider a situation in the United States where we'd like to conduct a household survey with a sample size of 2,500, and where we're going to conduct in-person visits to every single household. For the reasons that we just discussed, this could be expensive and time consuming. We might have a lot of households in New York and L.A., where we could just fly a team of interviewers into one of those cities and they could visit 50 or 100 households fairly straightforwardly. But we might also have dozens or hundreds of households that were, in every case, the only sampled household in their own town. Think of places like Montana, or small towns in Utah or Tennessee, where we might have to fly an interview team into the nearest city with an airport, and then they might have to rent a car and spend an entire day getting to a particular town to conduct just one interview. Again, this would be very expensive and very time consuming, and we almost never have the budget for a simple random sample like that.
So suppose we took a multistage clustered sampling approach, just hypothetically, to cut down on the number of places that we have to physically visit. We might first select five states. Then within each of these five states we select five counties, within each county we select 10 residential blocks, and on each block we randomly select 10 households to visit. That gives us a total sample size of 2,500. And, done properly, it still turns out to be nationally representative.
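The four-stage design just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame sizes (50 states, 20 counties per state, 30 blocks per county, 40 households per block) and the function name `multistage_sample` are hypothetical, and clusters are drawn with equal probability at every stage purely to keep the sketch short.

```python
import random

def multistage_sample(n_states=50, n_counties=20, n_blocks=30, n_households=40,
                      take=(5, 5, 10, 10), seed=0):
    """Draw a four-stage cluster sample from a hypothetical nested frame.

    Frame sizes are made-up illustration values. Every stage here uses
    equal-probability selection without replacement.
    """
    rng = random.Random(seed)
    sample = []
    for state in rng.sample(range(n_states), take[0]):                # stage 1: states
        for county in rng.sample(range(n_counties), take[1]):         # stage 2: counties
            for block in rng.sample(range(n_blocks), take[2]):        # stage 3: blocks
                for hh in rng.sample(range(n_households), take[3]):   # stage 4: households
                    sample.append((state, county, block, hh))
    return sample

sample = multistage_sample()
print("households sampled:", len(sample))
print("states visited:    ", len({s for s, _, _, _ in sample}))
print("counties visited:  ", len({(s, c) for s, c, _, _ in sample}))
print("blocks visited:    ", len({(s, c, b) for s, c, b, _ in sample}))
```

The point of the sketch is the last three counts: all 2,500 interviews fall in 5 states, 25 counties, and 250 blocks, instead of being scattered across the whole frame.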
But we've really cut down on the amount of travel that we have to do in order to visit each of the households in our survey. Again, 5 x 5 x 10 x 10 is 2,500, so we have the same sample size, but at much, much reduced cost.
Now, one thing that we have to deal with is that clusters, which could be states, counties, or provinces depending on what country we're studying, are likely to vary in size. And this affects what we do when we sample clusters. Basically, we need to make sure that the probability of a cluster being selected, perhaps a state or a province if that's our highest level, should normally be proportional to its size, that is, proportional to its population. If clusters are equally likely to be selected but the number of people in each cluster varies, then it turns out, and I'll illustrate this in just a second, that individuals living in smaller clusters will be overrepresented in our final sample. Sampling with probability proportional to size addresses this issue, and it can be repeated at multiple levels. So if we're first sampling states, then counties within states, and then residential blocks within counties, we can make sure that at each level the probability that a cluster is drawn, a state, a county, a block, is proportional to the total number of people living there.
One piece of terminology I should mention: the first-stage units are referred to as primary sampling units (PSU) or first-stage units (FSU), and then we have secondary sampling units (SSU) and tertiary sampling units (TSU). You may see these abbreviations, PSU, FSU, and so forth, in papers.
Now let me clarify what the problem is with selecting clusters with equal likelihood. Say we start with a population made up of towns of 10,000, 1,000, or 100 people. And we decide that we're going to sample five of these clusters at random, and then within each of them sample 50 people.
If each of these units, whether it has 100, 1,000, or 10,000 people, is equally likely to be selected, and we select five of them at random, then based on the laws of probability we're likely to end up with a fairly large number of small clusters, the villages with just a hundred people, in our final sample. Each of them contributes 50 people, and then maybe, by luck of the draw, we get just one larger town. So we end up with a sample of 250 people out of a population of 25,000, but 200 out of those 250 people in this hypothetical example live in villages of 100 people each, which account for only a small fraction of the overall population.
Now suppose we redo this with probability proportional to size, where we adjust our sampling mechanism so that the probability of a cluster, a city or a town or a village, being selected is proportional to its size. We're much more likely to get a sample of clusters that produces a sample resembling the population as a whole. So we might end up with, say, both of the largest cities, the 10,000-person cities, in the final sample, perhaps a town of 1,000, and then maybe just two villages of 100, with 50 people in each of these clusters for a total of 250 people. If you look at how the 250 people in our sample are distributed across cities, towns, and villages, that distribution will more closely resemble what we would see in the larger population. For the mathematics of this, you have to take a more advanced class in survey sampling. I can't go into that much detail here. I just want to alert you to this issue.
So let's talk about some examples of surveys that make use of multistage clustered sampling. One is the China Family Panel Survey, which I'll be talking about in a later lecture. They first sample urban districts or rural counties across the entire country.
And then, within each of these urban districts or rural counties, they sample urban neighborhoods or rural villages. And then within each of these, they sample households. Again, this makes it possible to stay within a reasonable budget while conducting a survey that is nationally representative for all of China.
The China General Social Survey has a very similar strategy: urban districts or rural counties, townships for rural areas, urban neighborhoods or rural villages, and then finally, again, down to households. So multiple stages, or multiple levels.
Finally, the General Social Survey starts with standard metropolitan statistical areas, as defined by the Census Bureau, or rural counties. Within these, it samples block groups or enumeration districts. These are technical terms that I can't get into in much detail, but they come from the census. A block group is a collection of city blocks, and an enumeration district might be an area within a rural area. Then it samples actual blocks, and finally it gets down to individuals.
One thing that we have to keep in mind if we're conducting a clustered sample is that there are some implications for statistical inference. Units within the same cluster may resemble each other. That is, households or individuals living in the same village or the same town may have more in common with each other than they do with households and individuals elsewhere in the country. So units drawn from sampled clusters may not vary as much as units would if they were drawn evenly from the population at large. If we drew a simple random sample from an entire country, we would get a lot of variation between the households and individuals in our sample. But if we are drawing from clusters, people living together in the same town, the same village, the same city may have more things in common than they do with, say, random people from elsewhere in the country.
So we might not get quite as much variation within a clustered sample as we would get from a simple random sample. Technically, and this requires more study in a class focused on sample survey design, this implies that clustering increases sampling variance and, therefore, standard errors. What this means is that from one survey to another using a clustered approach, we might see our estimates bounce around more than if we were conducting simple random samples. This effect is more pronounced when clusters are fewer. If we have a lot of clusters, each with a small number of units, individuals or households, it's less of a problem. This will affect our calculations of statistical significance in our statistical tests. Typically, when people make use of data that come from multistage clustered samples, they apply sample weights or other clustering adjustments to essentially fix the issues that arise with tests of statistical significance.
I would like to recap some of the main issues that come up with multistage clustered sampling. Again, it's most relevant for in-person interviews, that is, where we have to send a team out and it's expensive and time consuming for every single household or person that we want to visit. It can save a lot of time and money by reducing the number of physical locations that we have to send a crew to. It does lead to greater sampling error, increased variance, and reduced statistical power, and it may require adjustments during analysis, such as the application of sampling weights.
Another issue that I really couldn't get into here, but which you will learn about if you take a more advanced class, is that sampling with probability proportional to size may be more difficult in settings where the populations of clusters are not known.
So PPS may be straightforward in the United States, where you have pretty good census data and pretty good estimates of the numbers of people living in states, counties, and so forth, and then it's fairly easy to draw samples with probability proportional to size. It can be a real problem, though, in countries that don't have well-developed statistical systems, where we may not actually know how many people are living in any given province or any given town. Then we may not know what weight to give those towns when we're sampling clusters. Again, that's an issue for a much more advanced class.
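To make the earlier contrast between equal-probability and PPS selection of clusters concrete, here is a small simulation sketch. Several things in it are assumptions rather than part of the lecture: the population is taken to be two cities of 10,000, four towns of 1,000, and ten villages of 100 (one split that adds up to 25,000), PPS is done with replacement to keep the code simple (real surveys often use systematic PPS without replacement), and the names `village_share`, `equal_probability`, and `pps_with_replacement` are hypothetical.

```python
import random

# Hypothetical population adding up to 25,000 people (an assumed split):
# 2 cities of 10,000, 4 towns of 1,000, 10 villages of 100.
SIZES = [10_000] * 2 + [1_000] * 4 + [100] * 10
N_CLUSTERS = 5     # clusters drawn per survey
PER_CLUSTER = 50   # people interviewed per selected cluster

def equal_probability(rng):
    # Every cluster is equally likely, regardless of its size.
    return rng.sample(SIZES, N_CLUSTERS)

def pps_with_replacement(rng):
    # Selection probability proportional to cluster size.
    # With replacement, a big city can be hit more than once;
    # each hit contributes another PER_CLUSTER interviews.
    return rng.choices(SIZES, weights=SIZES, k=N_CLUSTERS)

def village_share(draw, reps=20_000, seed=1):
    """Average share of interviewees living in 100-person villages."""
    rng = random.Random(seed)
    from_villages = total = 0
    for _ in range(reps):
        for size in draw(rng):
            total += PER_CLUSTER
            if size == 100:
                from_villages += PER_CLUSTER
    return from_villages / total

population_share = sum(s for s in SIZES if s == 100) / sum(SIZES)
print(f"village share of population:        {population_share:.3f}")
print(f"village share, equal-probability:   {village_share(equal_probability):.3f}")
print(f"village share, PPS:                 {village_share(pps_with_replacement):.3f}")
```

Under these assumptions, villages hold 4 percent of the population. Equal-probability selection puts well over half of the interviews in villages, since 10 of the 16 clusters are villages, while PPS brings the village share of the sample close to the population share.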