Hi, now we're going to talk about probability sampling. In fact, we're going to talk about probability sampling for three modules. In this module, we'll talk about simple random sampling. Then we move to cluster sampling in the next module, and then stratified sampling. They're all related, and we'll be clarifying the differences between them in the coming modules.
What is probability sampling? From the early 20th century, probability-based or random sampling has been advocated as an alternative to purposive sampling, quota sampling and some of the other techniques that we talked about that were used in the late 19th century or early 20th century. In a probability sample, the members of a population are selected at random to make up the sample. In contrast with quota or purposive sampling, the probability of being selected is known for everyone in the population. Random sampling became dominant after the failure of the Gallup and other polls in 1948.
I'd like to go through definitions first. The population, we'll be talking about that a lot. By population, we mean the entity about which we are going to generalize from our sample. So a population could consist of people. It could be registered voters, it could be the residents of a particular country, or it could be a population of firms or other organizations. It's basically the larger set of units from which we intend to draw our sample, and about which we want to make a statement.
A parameter of the population is the property that we're trying to estimate. So, for example, if our population consists of registered voters, and we're conducting a survey of a sample of voters, we might want to estimate the percentage of registered voters that favor one party or another by making a measurement in the sample, and then generalizing that to the population from which the sample is drawn.
A sampling frame is a complete list of the members of the population from which the sample will be drawn.
So whereas the population is an abstraction, such as registered voters, the citizens of a particular country, or the residents of a particular country, a sampling frame is an actual list that will be the basis of our sample. It's an actual concrete or real thing that we work with. The sample consists of the members of the population that are actually drawn from the sampling frame.
Now, to provide some examples of sampling frames. Households. A sampling frame might be a list of residential addresses obtained from the post office or some other government agency, or perhaps from a utility company that provides service to all of the households in a particular area. Telephone users. A sampling frame might be a list of active phone numbers obtained from a phone company. Professors. We might make use of a list of names of members of organizations associated with a particular academic discipline in which we're interested. Firms in an industry. We might make use, for our sampling frame, of a list of members of a trade association, or a list of companies that have registered with the government in connection with doing business in a particular area.
Now, I want to talk about simple random sampling. This refers to the case where every unit in the population is equally likely to be selected for the sample. Measurements in the sample provide direct estimates of population parameters. So if we have a sample of registered voters drawn from the population of registered voters, using a good sampling frame consisting of a list of registered voters, the proportion that we measure in the sample should be an estimate of the same proportion in the larger population. Statistical inference, including hypothesis testing and confidence intervals, is fairly straightforward to work with if we have simple random sampling.
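To make the idea of a direct estimate concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the frame of 100,000 registered voters, the 40% who favor a party, the function names `draw_simple_random_sample` and `proportion_with_ci`, and the use of the usual normal approximation for a 95% confidence interval.

```python
import math
import random


def draw_simple_random_sample(frame, n, seed=None):
    """Pick n units from the frame without replacement, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def proportion_with_ci(values, z=1.96):
    """Return the sample proportion and a normal-approximation interval.

    z=1.96 corresponds to roughly 95% confidence.
    """
    n = len(values)
    p = sum(values) / n                    # True counts as 1, False as 0
    se = math.sqrt(p * (1 - p) / n)        # standard error of a proportion
    return p, p - z * se, p + z * se


# Hypothetical sampling frame: 100,000 registered voters, 40% favor Party A.
frame = [True] * 40_000 + [False] * 60_000

sample = draw_simple_random_sample(frame, 1_000, seed=1)
p_hat, lower, upper = proportion_with_ci(sample)
print(f"Estimated share: {p_hat:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

The proportion measured in the 1,000 sampled voters is used directly as the estimate for all 100,000, and the interval around it depends only on the sample itself.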
One issue is that if we are going to carry out a survey with a sample based on simple random sampling, contacting respondents normally needs to be straightforward. That's, of course, easy with a mail survey or a telephone survey. It can actually get more difficult if we're thinking about a household survey that includes in-person visits. So one easy example, something that we can do with random sampling, would be a household survey via mail, where a survey is mailed to a sample of residential addresses. We get a sampling frame consisting of a list of all valid residential addresses in a particular city, and we pick a certain number of them at random, and we mail out a household survey. That's straightforward.
What's the procedure? Well, first we have to obtain a sampling frame. Now, that can actually be one of the most difficult parts of conducting a survey. It's fairly straightforward for certain things, like household surveys, where we can get lists of valid residential addresses, or surveys of voters, where we have lists of registered voters. But it can be much more difficult for more specialized populations, such as professors, people working in a particular profession, or people who are actually trying to hide themselves, or perhaps engaging in a behavior that they haven't made public, and where there may not be a comprehensive list. We'll talk about some of those issues in a later lecture.
Once we have our sampling frame, we randomly select units from the frame to make up our sample. This may be done with software. So we can program a computer to generate random numbers, and use that to pick the units within the sampling frame that will be part of our sample. It may also be done by going down a list, if we can come up with a comprehensive list of every element within our sampling frame, for example, a complete list of all residential addresses.
And then we can just select units at intervals defined by the ratio of the population size to the intended sample size. So, for example, if a list has 100,000 addresses, and our intended sample size is 1,000, that is, 1 out of 100, then we could go through our list of addresses and simply select every 100th address.
Let's work through a simple example. So imagine that we have a complete list of addresses for a city. And here we have an extract which consists of 21 addresses on Main Street, including some apartment buildings, so apartment 1, 2, 3, 4. Now, if we wanted to construct a sample that consisted of one out of every four households in the city, we could actually number the addresses, 1, 2, 3, 4, 1, 2, 3, 4, and so forth, as a first step to drawing the sample. So here we've done that numbering, 1, 2, 3, 4, 1, 2, 3, 4, etc. So once we have that in place, we can simply go ahead and select every fourth address like this. We started with an offset of two, and then picked every fourth address after that. And it turns out that that would produce a random sample of addresses that consisted of one-fourth, or one out of four, of the addresses in the city.
I'm going to talk a bit about some considerations related to sample size. Sample size is selected on the basis of considerations of statistical power. Statistical power is something you'll have to learn about in a more advanced statistics class. But it relates to the ability of a sample to let an analysis actually detect an association or a difference. More statistical power reduces the chances of failing to observe a relationship or a difference that actually does exist in the population. We refer to that sort of mistake or error as a Type II error. Statistical power depends mainly on the strength of the relationship that's actually in the population, the sample size, and then the criterion for statistical significance that we're going to set.
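Going back for a moment to the every-fourth-address example, the selection can be sketched in a few lines of Python. The function name `systematic_sample` and the address labels are made up for illustration. The fixed start of 2 mirrors the offset used in the example. Drawing the start at random when none is given is an assumption of this sketch, not something stated in the lecture, though it is the usual way to give every address the same chance of selection.

```python
import random


def systematic_sample(frame, interval, start=None, seed=None):
    """Select every `interval`-th unit, beginning at 1-based position `start`.

    If `start` is not given, it is drawn at random from 1..interval
    (an assumption of this sketch).
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    if start is None:
        start = random.Random(seed).randint(1, interval)
    if not 1 <= start <= interval:
        raise ValueError("start must be between 1 and interval")
    # Python lists are 0-based, so position `start` is index start - 1.
    return frame[start - 1::interval]


# Hypothetical extract: 21 addresses, labeled by their position on the list.
addresses = [f"Address {i}" for i in range(1, 22)]

selected = systematic_sample(addresses, interval=4, start=2)
print(selected)
# ['Address 2', 'Address 6', 'Address 10', 'Address 14', 'Address 18']
```

The same function covers the larger case: with 100,000 addresses and an interval of 100, it returns 1,000 addresses whatever the starting position.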
So if we have a stringent criterion for statistical significance, then we're going to need a large sample to get the statistical power that we need. Sample size as a share of the population is relatively unimportant. So that's why, even for very large countries, like, say, China, typical surveys may just have a sample of 5,000 or 10,000 people. Surveys there are not much larger than they would be for the United States, or even a much smaller country, because you don't get much bang for the buck by shooting for a particular percentage of the population making up your sample. Statistical power is driven by the absolute size of the sample, the number of cases.
So I'm going to review and talk a little bit about the advantages of probability sampling. It's representative on all characteristics. So when we have a genuinely random sample from a population, whatever we measure in our sample will generalize to the larger population. There's no discretion on the part of the interviewer in terms of picking who they want to interview, at least not if they've been trained properly. Now, there are issues with response rates, and so forth, that we'll talk about in a later module. But if everything goes as planned, the interviewer doesn't get to pick and choose the way they would with a quota or purposive sampling approach. We can construct confidence intervals and conduct statistical tests and hypothesis tests with no problem.
Now, probability sampling can involve some challenges. Sometimes it's hard to find sampling frames, especially if we're looking for a more specialized population, a subset consisting of people in a particular profession, people with particular interests, or people engaged in a particular hobby. Non-response can be an issue. We'll come back to this in a later module. And then it can be logistically difficult and expensive to have a simple random sample over a large geographic area.
And we'll talk about that in the next module when we talk about multi-stage cluster sampling, which is one remedy.
Now, I want to come back to the issue of sample size versus representativeness to highlight it. And I want to emphasize that large sample sizes do not compensate for problems with representativeness. Basically, a small representative sample is always preferred over a large, unrepresentative one. So that's why, when you look at typical surveys done for research, they rarely have more than a few thousand respondents. As long as the sampling is done properly, a few thousand respondents will give you a good insight into the population that you're trying to study. By contrast, online surveys or mail surveys that are done in an ad hoc fashion, and that may have hundreds of thousands of responses, are rarely used in serious research because it's not clear that they are a sample that's actually representative of the larger population.
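A small simulation can illustrate why size does not rescue an unrepresentative sample. This is only a sketch under invented assumptions: a hypothetical population of 1,000,000 people, exactly half of whom hold some opinion, and an ad hoc survey in which people who hold the opinion respond at twice the rate of those who do not (30% versus 15%). None of these numbers come from real data.

```python
import random

rng = random.Random(42)

# Hypothetical population: 1,000,000 people, exactly half hold the opinion.
population = [True] * 500_000 + [False] * 500_000
true_share = sum(population) / len(population)

# Small simple random sample: every person is equally likely to be selected.
srs = rng.sample(population, 1_000)
srs_estimate = sum(srs) / len(srs)

# Large ad hoc survey: people decide for themselves whether to respond,
# and holders of the opinion are assumed to respond twice as often.
RESPONSE_RATE = {True: 0.30, False: 0.15}
ad_hoc = [person for person in population if rng.random() < RESPONSE_RATE[person]]
ad_hoc_estimate = sum(ad_hoc) / len(ad_hoc)

print(f"True share:             {true_share:.3f}")
print(f"Random sample, n={len(srs)}: {srs_estimate:.3f}")
print(f"Ad hoc survey, n={len(ad_hoc)}: {ad_hoc_estimate:.3f}")
```

Under these assumptions the ad hoc survey collects well over 100,000 responses, yet its estimate settles near two thirds rather than one half, while the random sample of 1,000 stays within a few percentage points of the true share. Collecting more ad hoc responses only makes the wrong answer more precise.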