Hi. In this module, I'm going to talk about surveys and sampling in an introductory way. I'm going to talk about where surveys came from and some of the approaches to sampling that were used before the modern era of random or probability based sampling started, which I'm going to talk about in later modules. So, surveys seek to generalize to some population from a sample. Sampling refers to the process of selecting the units within the population that we're interested in to include in a sample that we are going to survey. The goal is to produce a sample that is representative of the larger population. In other words, we would like the characteristics of our sample to resemble those of the larger population from which the sample is drawn. That way, if we do some calculations based on the sample, we should be able to generalize to the population as a whole. Sampling's actually also relevant for studies that are not surveys, but in some cases make use of very large amounts of data, too much data to, say, analyze, in which case some subset has to be extracted. I won't be getting much into that in the course of this lecture. Now, the origins of surveys were in the 19th century, in Victorian England. Social reformers sought to understand the conditions of the urban poor. These surveys typically focused on income and expenditure, and things like housing and so forth. People were very interested in the plight of the people living in the crowded neighborhoods of London and other major cities, and essentially understanding how they fared or managed when they were earning relatively low wages and then had to spend much of their income on basic necessities. The samples were large, because at that time, people thought that a large sample was better, and they typically consisted of all the residents of a selected community that had been somehow identified as, quote, typical.
The limitations, however, of trying to generalize from, quote, typical communities quickly became apparent. So people began looking for ways to draw larger samples from broader cross sections of a country, or at least a large region. One early approach, whose limitations were again recognized fairly quickly, was convenience sampling. One example: starting in 1916, the Literary Digest, at the time a famous magazine in the United States, began conducting polls to predict the outcomes of the U.S. Presidential elections. They tabulated responses from millions of cards mailed to subscribers, automobile owners, and telephone users across the United States. So they were able to achieve great geographic coverage of the United States, but, as we'll see later, at something of a cost. The reason that they used automobile owners and telephone users was that at that time in the United States, they were among the small number of populations for which there were comprehensive listings that could be used to contact people. Now the results, amazingly enough from this rather haphazard approach to sampling, basically, sending out postcards to millions of people and asking them to indicate a preference and mail them back, were actually correct, until 1936, when they incorrectly predicted a victory for Alf Landon over Roosevelt. Now, you probably know that this prediction was wrong, because most of you have probably never heard of Alf Landon. Roosevelt, obviously, remained as president for some years afterward. The sample turned out to be biased and unrepresentative. It turned out that automobile owners, telephone users, and subscribers to the Literary Digest were among the, on average, wealthier people in the United States at the time, and more likely to vote Republican and vote for Alf Landon. So I refer to this as a convenience sample, because essentially the emphasis was on ease of contact.
That is, a sample that was constructed in a way that maximized the ease of reaching out and recruiting respondents, without much care to ensuring that they were representative in the way we think about representative nowadays. Another approach to sampling, which was a bit more systematic, was purposive sampling. Many, if not most, of the late 19th or early 20th century surveys used what we now think of as purposive sampling. For example, I mentioned earlier the studies in the 19th century that focused on particular urban districts. So urban districts or rural communities were selected because they were thought to be, in some way, typical or representative of a class of such communities or districts. This turns out, we now know, to be fairly problematic, because almost no neighborhood or village is ever truly typical. Every village or community is special. It has features that distinguish it and make it difficult to generalize from its experience to, for example, an entire country. But at the time, there were logistical and other reasons to take this approach. In other situations, especially when it came to surveys in the early 20th century, individual subjects were chosen based on observed characteristics, with the goal of composing a representative sample. Let me give you an example. So, one form of purposive sampling was quota sampling. George Gallup successfully predicted the winner of the U.S. Presidential elections in 1936, 1940 and 1944 using very small samples, much smaller than the ones in the Literary Digest polls. But he used quota sampling, which again was being advocated as certainly an alternative to convenience sampling, and in general, an acceptable way of drawing a sample. So in quota sampling, targets were set for the number of respondents with specified characteristics in the hope that they would then compose a sample that was representative of the larger population.
So the focus was on age, sex and race, and other observed characteristics. The resulting sample, in principle, should match the larger population in terms of these specified characteristics. Let's look at a hypothetical example of how quota sampling works to help give some clarity. So, imagine that we have a population and we're interested in composing a sample where the quotas are set on three characteristics of the population, age, sex and race. So in this population, half of the population is age 21 to 50. The other half is 50 and above. Half the population is male, half the population is female. 90% of the population is white, 10% is black. So if those are the three variables upon which we want to compose our quotas, we would take these shares and then use them to work out the shares of the population that were in each combination of categories. So, for example, if we're looking at white females age 21 to 50, 50% times 50% times 90% would give you 22.5%. Or black females age 50 and above, 2.5%, which comes out of 50% times 50% times 10%. So we have the shares of the population, you might say, in each cell, each combination of age, sex, and race. These are percentages. Now, assuming that we want a sample with 1,000 people, we can apply these percentages to 1,000 to figure out what our quota would be in terms of going out to compose our sample. So, for example, 22.5% of 1,000 would be 225 white females who are aged 21 to 50, and 2.5% of 1,000 would be 25 black females aged 50 and above. So these quotas would then be given to interviewers and they would go out into the field and start looking for people that matched the characteristics for each set of quotas until they filled up each cell. So, they might keep looking for white males, age 21 to 50, and once they had 225 that they had interviewed, they would declare that cell complete, that the quota had been satisfied, and then perhaps move on. So this leaves a lot of discretion to the interviewer and it can lead to problems.
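As an aside, the quota arithmetic in this hypothetical example can be sketched in a few lines of Python. This is only an illustration: the names `shares` and `quota_table` are made up here, and multiplying the shares together the way the example does assumes that age, sex and race are independent of one another in the population, which may not hold in a real population.

```python
from itertools import product

# Population shares from the lecture's hypothetical example.
shares = {
    "age": {"21-50": 0.5, "50+": 0.5},
    "sex": {"male": 0.5, "female": 0.5},
    "race": {"white": 0.9, "black": 0.1},
}


def quota_table(shares, sample_size):
    """Return {(age, sex, race): quota} for every combination of categories.

    Assumption (not from a real survey design): the characteristics are
    treated as independent, so each cell share is the product of the
    individual shares.
    """
    variables = list(shares)
    quotas = {}
    for combo in product(*(shares[v] for v in variables)):
        cell_share = 1.0
        for variable, category in zip(variables, combo):
            cell_share *= shares[variable][category]
        # Round to whole respondents for the interviewer's target.
        quotas[combo] = round(cell_share * sample_size)
    return quotas


quotas = quota_table(shares, 1000)
for cell, target in quotas.items():
    print(cell, target)
```

With a sample of 1,000, the cell for white females aged 21 to 50 comes out at 225 and the cell for black females aged 50 and above at 25, matching the numbers in the example. The sketch only sets the targets; it says nothing about how interviewers go about filling them.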
For example, the easiest way to fill a quota might be to go to some place where you might, for example, find a lot of white males aged 21 to 50, perhaps outside the gates of a factory if it were the United States in the 1940s. So perhaps you could fill your quota in a few hours, but of course, you might be looking at a very homogeneous group of white males aged 21 to 50 if you did that. So, this is an example of setting a quota and then how the instructions were actually given. So, quota sampling turned out to have problems. So in 1948, Gallup, who'd used quota sampling with relatively small samples to correctly predict the Presidential elections before 1948, incorrectly predicted Dewey as the winner of the 1948 Presidential election in the United States. You can probably guess that that prediction was wrong because you've probably never heard of Dewey. In fact, Truman won that election. It turned out that the sample that Gallup was using was heavily biased and it partly reflected the problems with quota sampling. One problem was that quota sampling left, again, too much discretion to the interviewer. Within the specified targets, the interviewer would have discretion to pick people that they felt were easy to approach. So if they were told to, for example, find a certain number of adult white males, they might look for the ones that were dressed the best, or perhaps in particular neighborhoods, because they might feel that they would be easiest to approach and interview, and they might neglect people that were dressed in a more working class fashion or perhaps in neighborhoods that they didn't want to visit. So, you can easily end up with bias on other characteristics like, for example, socioeconomic status or something else, even if you satisfied the dictates of the quota. There were other issues as well. One of them in 1948 was that the census information that was used to compose the quotas was badly out of date.
For 1948, to figure out the shares of the population in each of the categories, they used census data from 1940. Now, the United States had changed a lot between 1940 and 1948. It had been in a war. There had been a massive migration of people from rural areas to urban areas to take up jobs in wartime production, and these people stayed in the cities. There were also improvements in education, a lot of other changes. So the shares that were calculated from the 1940 data were not necessarily relevant anymore in 1948. So, rural groups were heavily overrepresented in the quotas for 1948. So, Gallup and other pollsters got 1948 wrong. Truman won the election. And this, in a way, marked the end of quota sampling as a widely accepted approach to sampling for surveys that were intended to generalize to a population. It paved the way for the eventual rise of probability or random sampling, which we're going to talk about in the remaining modules, and which essentially dominated starting in the 1950s.