In this module, I'm going to talk about stratification and oversampling. This is the third of our modules related to probability sampling methods. We started with simple random sampling, moved to clustered sampling, and now we're going to talk about stratified sampling. In stratified sampling, the sampling design can specify that specific types of units will account for a fixed proportion of the sample. This could apply to sampling units at the level of the individual or the household. We might want to make sure that people of a certain race are represented in a certain proportion in the sample. Or it could refer to ensuring that higher-level sampling units, the clusters, are included in certain proportions in the sample. When we do this, that's what is called a stratified sample. We'll talk a bit about how it's done in just a second. Basically, a stratified sample reduces the chances that luck of the draw will produce a sample with no units of the specified type, or too few, or perhaps too many. We can guarantee the distribution of the clusters in the sample, or of the units within the sample, on at least one variable. And again, I'll give an example in just a second to help clarify that. This is especially common when there are substantial differences across units in key characteristics, whether at the level of individuals or households or at higher levels of the clustering. So it's very typical to use stratified sampling in situations where the clusters that are the basis of a clustered sample really differ from each other in key characteristics. So, for example, surveys in China often use stratification to ensure that specified shares of the sample come from major cities, coastal provinces, interior provinces, etc. There are huge differences within China between the major cities, such as the provincial-level cities, the coastal provinces, and the interior provinces.
People, when they are doing a multistage cluster sample design, will use stratification to make sure that these different types of geographic units appear in the sample in specified proportions. The procedure is that the units in a sampling frame, which could be our clusters, are categorized according to the variables that are the basis for the stratification. So if we're talking about a multistage cluster sample in which the highest level of clustering is provinces or provincial-level cities in China, we might categorize them into, again, provincial-level cities, coastal provinces, and interior provinces. That becomes the basis for stratification. In this case we have three categories, and we refer to the set of units in each category as a stratum. Sticking with the example of China, one stratum might consist of the coastal provinces, one of the interior provinces, and the third of the provincial-level cities. Each of these three strata might be assigned a certain number of units to be included in the final sample. So basically, we force the final sample to include a certain number of clusters from each stratum. If we want a total of six province or provincial-level city clusters at our highest level, we might decide to take two coastal provinces, two interior provinces, and two provincial-level cities. Sampling is then conducted separately for each stratum. So we make up our list of coastal provinces and draw two from that list. Then from our list of interior provinces we draw two, and from our list of provincial-level cities we draw two. In a multistage cluster design, stratification can be carried out at any stage, and it's possible to carry it out at every single stage.
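To make the procedure concrete, here is a minimal sketch in Python of the two-per-stratum draw just described. The lists of provinces are partial and only for illustration, the names `frame`, `quota`, and `stratified_sample` are my own, and the simple random draw within each stratum is an assumption; a real design might select units within a stratum some other way.

```python
import random

# Hypothetical sampling frame: clusters grouped into strata.
# These lists are partial and illustrative, not the full official lists.
frame = {
    "provincial_level_cities": ["Beijing", "Shanghai", "Tianjin", "Chongqing"],
    "coastal_provinces": ["Guangdong", "Fujian", "Zhejiang", "Jiangsu", "Shandong"],
    "interior_provinces": ["Sichuan", "Henan", "Hunan", "Gansu", "Shaanxi"],
}

# Quota: how many clusters the design forces the sample to take per stratum.
quota = {stratum: 2 for stratum in frame}


def stratified_sample(frame, quota, seed=None):
    """Draw quota[stratum] clusters from each stratum, separately.

    Assumption: a simple random draw without replacement within each stratum.
    """
    rng = random.Random(seed)
    return {
        stratum: rng.sample(units, quota[stratum])
        for stratum, units in frame.items()
    }


sample = stratified_sample(frame, quota, seed=1)
for stratum, chosen in sample.items():
    print(stratum, chosen)
```

Whatever the seed, the result always has exactly two clusters from each of the three strata, six in total, which is the guarantee that stratification is meant to provide.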
Stratifying at more than one stage can make sense because we may care, at several different levels, about making sure that our final sample includes units of different types. If we move to the United States as an example, and we are using a multistage cluster sample and drawing a sample of states, we might want to make sure that by luck of the draw we don't end up with all of our states in one region. What we can rely on is the fact that the United States Census Bureau has a standard definition of four regions, West, Midwest, Northeast, and South, and within those, nine geographic divisions: Pacific, Mountain, West North Central, East North Central, and so forth. Each division represents a group of states. So if we wanted to make sure that a sample of states at our highest level includes states from each of these nine geographic divisions, we can stratify by division. Or if we're conducting a survey in which our clusters are counties, we can make up a list of counties for the whole country and then stratify it by the division of the state in which each county is located. We give each division a quota in terms of the number of counties that we want to include in the final sample. And then we select counties at random within each division with probability proportional to size, like we talked about in the previous module.
For a concrete example, take the China General Social Survey, Cycle II. Urban districts or rural counties were the primary sampling unit, that is, the highest-level sampling unit, and there were 2,762 of these to sample from. Beijing, Shanghai, Tianjin, Guangzhou, and Shenzhen were each defined as a stratum, and within each of these, a certain number of urban districts or rural counties were sampled.
The other county-level units were divided into a total of 50 strata, based on economic and demographic variables, and for each of these strata a certain number of counties or districts were selected. Basically, they picked four county-level units from each of the five cities, Beijing, Shanghai, Tianjin, Guangzhou, and Shenzhen, which gives 20. Then they picked two county-level units from each of the remaining 50 strata, which gives another 100, for a total of 120 urban districts or rural counties. These were, on the one hand, representative of the country as a whole. But they were also guaranteed to represent a cross-section of the country, in that the largest and most important cities were included and there was a broad range of county-level units for the rest of the country.
Now I want to talk a little bit about oversampling. Often, we are especially interested in specific populations. If they account for a small share of the population, a simple random sample may not contain enough of them for a comparison. So if the sampling frame includes the relevant individual characteristics, we can oversample the people that we're especially interested in. In multistage clustered sampling, higher-level units that are known to contain large numbers of such individuals can be oversampled. Now that's all very abstract, so let me make it concrete with an example: oversampling of African-Americans. Most surveys in the United States oversample African-Americans and, increasingly, members of other underrepresented groups. That's because in a typical simple random sample, there might not be enough observations of them to really allow for meaningful statistical tests and comparisons with other groups. African-Americans make up approximately 13% of the population of the United States.
If we just conducted a simple random sample, or a multistage cluster sample without any further adjustment, we might end up with a sample that included something like 260 African-Americans out of 2,000. That is 13%, and it probably wouldn't be enough to allow for really detailed comparisons between African-Americans and whites. So what surveys do is oversample African-Americans. Now this can be difficult at the individual level, because the sampling frames that we typically have, like lists of residential addresses, do not record race. A lot of administrative data doesn't record race. In the United States, race is self-reported; we don't put it on people's driver's licenses and so forth. So if we want to oversample African-Americans, we have to come up with a different approach. What people can do is oversample at the level of the block group or a higher-level unit. Typically, surveys borrow an approach from stratification. On the one hand, they might use stratification to ensure that blocks, counties, or other administrative units known to have large numbers of African-Americans are included in the sample. And then within each of these units, the sampling might be adjusted so that the number of respondents surveyed is higher than in units dominated by groups that we're not trying to oversample. So, for example, the clusters that are known to have larger African-American populations, perhaps city blocks, block groups, or counties, might themselves be oversampled. That is, the sampling of clusters might be adjusted through stratification to ensure that these clusters make up a larger share of the final sample than they do of the overall population.
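As a rough illustration of why shifting the sample toward such clusters helps, here is a small Python sketch that compares the expected number of African-American respondents under two designs. Everything in it is hypothetical: the two strata of block groups, their population counts (chosen only so that the overall share works out to 13%), the half-and-half split of the oversampled design, and the function name `expected_group_count`. It also assumes that individuals are drawn at random within each stratum, so the expected share in the sample equals the share in the stratum.

```python
# Hypothetical strata of block groups; all counts are made up for illustration.
# Overall: 130,000 African-Americans out of 1,000,000 people, that is 13%.
strata = {
    "high_share_block_groups": {"population": 200_000, "african_american": 100_000},
    "other_block_groups": {"population": 800_000, "african_american": 30_000},
}


def expected_group_count(strata, allocation):
    """Expected number of African-American respondents.

    allocation maps each stratum to the number of respondents drawn from it.
    Assumption: respondents are drawn at random within each stratum.
    """
    return sum(
        n * strata[name]["african_american"] / strata[name]["population"]
        for name, n in allocation.items()
    )


total_n = 2_000
total_pop = sum(s["population"] for s in strata.values())

# Proportional allocation: each stratum's share of the sample
# equals its share of the population (no oversampling).
proportional = {
    name: total_n * s["population"] // total_pop for name, s in strata.items()
}

# Oversampled allocation: half of the sample comes from the high-share
# stratum, although it holds only a fifth of the population (arbitrary choice).
oversampled = {"high_share_block_groups": 1_000, "other_block_groups": 1_000}

print("proportional:", proportional, expected_group_count(strata, proportional))
print("oversampled: ", oversampled, expected_group_count(strata, oversampled))
```

With these made-up numbers, the proportional design gives an expected 260 African-American respondents out of 2,000, the same 13% as in the population, while the oversampled design gives roughly twice that many from the same total sample size.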
The other option is that, within each of the clusters known to have larger African-American populations, we sample a larger number of individuals or households than in the other clusters. So oversampling, again, is typically used to make sure that we reach groups that account for a small share of the overall population, and so would normally be under-represented in the sample, but that are extremely important. African-Americans in the United States are one example, other examples increasingly include other minority groups, and there may be other types of oversampling in other countries. Surveys in China may oversample poor counties, or they may oversample particular provinces in the interior that are judged to be of specific interest. In some cases, people that are helping to support a survey may contribute extra money to cover the costs of an oversample. In the United States, some surveys oversample veterans because of money from the US military, which has a strong interest in understanding the population of veterans. They basically provide additional support for the survey in return for an oversample of veterans.
Now, we've covered sampling very quickly here, in just three modules. Sampling is actually a fairly complex topic. You not only need to take a statistics class, but if you really want to start designing surveys of your own, what people typically do to learn sampling is to work on an existing survey and then take classes in sampling. And I encourage you, if you're interested in participating in surveys, when you're looking for graduate programs, to look for programs that offer instruction and training in sampling.