A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

135 ratings


From the lesson

Module 3A: Sampling Variability and Confidence Intervals

Understanding sampling variability is the key to defining the uncertainty in any given sample-based estimate from a single study. In this module, sampling variability is explicitly defined and explored through simulations. The resulting patterns from these simulations will give rise to a mathematical result that is the underpinning of all statistical interval estimation and inference: the central limit theorem. This result will be used to create 95% confidence intervals for population means, proportions, and rates from the results of a single random sample.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So thus far in the course, we have estimated quantities and taken them as being our best estimate for some unknown truth. For example, we used the sample mean to estimate the mean of the larger population from which the sample was taken. But we don't know the true population mean. So I've consistently alluded to the idea that we would have to grapple with the uncertainty that comes from using such sample based estimates as our best guess for some larger population quantity. And today's the day we're going to get started with formalizing this.

In this course, we are espousing what is called a frequentist philosophy of statistics, which boils down to the idea that any sample we get randomly from a population is one of many possible random samples we could have gotten just by chance because the process we're using involves chance, the random sampling.

And getting a handle on the potential variability in the characteristics of a sample, across different samples from the same population, is central to that. For example, how does the sample mean based on a sample of size 50 from some population vary across different samples of size 50? This will help us understand the uncertainty in sample-based estimates, like sample means, proportions, and incidence rates. So to do this, we're going to need to establish and characterize what is called the sampling distribution for our statistic or statistics of interest. So in this first section we work to define the notion of a sampling distribution.

Okay, in this next set of lectures we're going to talk about a very important idea in statistics, something that will help us take the estimates and associations we've developed at the sample level and relate them to the unknown truth we're trying to estimate. And we're going to be talking about this thing called the sampling distribution.

So to set it up, we're going to use this lecture section simply to define the sampling distribution of a sample statistic.

So let's take stock: we've laid out ways to summarize information in single samples of continuous data, binary data, and time-to-event data, and also how to compare samples of such data types by looking at differences in means, risk differences, relative risks, and incidence rate ratios.

We have discussed how to do this, and we have discussed how these sample estimates are not necessarily the truth that we want to get at, the population truth, but they're the best we can do based on the imperfect sample we have from our population of interest.

So ultimately it is important to recognize the potential uncertainty in a sample-based estimate of our quantity as it relates to the unknown truth it is estimating. Understanding sample-based estimates, and how they vary across random samples of the same size from the same population, will give us a framework for taking the estimate we have and coupling it with some measure of uncertainty, to ultimately make a statement about the unknown truth.

This set of lectures starts with this section, where we define the sampling distribution. Through lecture sections B through D, we'll also characterize and estimate the theoretical sampling distribution of a sample statistic, for example, a sample mean, proportion, or incidence rate. And ultimately, what this will allow us to do is create an interval describing a plausible range of values for the unknown truth, which we can only estimate using the results from a single random sample, or random samples if we're making a comparison. This type of interval, which we'll get into in detail shortly, is called a confidence interval.

So I've been talking about and alluding to this idea of uncertainty in sample-based estimates. But what do we really mean when we talk about uncertainty in sample-based estimates? It's also commonly referred to as sampling variability, and we'll use that term throughout the course and you'll hear it in other settings. Let's just take an example. Suppose I was studying something about the one-year-olds in Nepal, and one of the things I wanted to characterize about this population was the distribution of the heights of one-year-old children in Nepal.

But certainly because of budget, time, and logistical limitations, there's no way I could actually measure all one-year-olds in Nepal at any given time to collect all these data. So what I'm going to have to do is take a sample. Suppose I'm doing a small study, and I can only afford to recruit ten children and measure their heights. So I take a sample of ten children, and the mean height I get for these ten children is 68 centimeters. Suppose somewhere else in the same region, a colleague of mine, unbeknownst to me, has the same idea. He or she also has a limited budget and limited time, so takes a random sample of size ten from this population, and ends up with ten different children than I did, just by chance.

Meanwhile, in another part of the region, there's another researcher that neither of the first two of us know. He's doing a study to estimate the distribution of heights for one-year-old Nepali children. He gets a sample of ten children, and when he takes their mean height, it's 66 centimeters. And so on; suppose this is happening all over Nepal. Now, this isn't how real research is done, usually there'd be one researcher taking one sample, but this illustrates the principle of sampling variability. You can see that these mean estimates are not identical across the samples, nor would we expect them to be, because we're taking random samples of small size from a much larger population. We don't necessarily expect to get the same children in each of the samples.

So this illustrates the principle of sampling variability in an estimate based on sample data.

Suppose we had done this again, or another group of researchers did the same thing, but they took larger samples. Well, we would still expect to see variability in their estimated means across the samples. But if we compared the variability in the means based on samples of 50 children to the variability in the means based on samples of ten children, the variability in the means based on 50 would tend to be smaller. And we'll demonstrate that in another section.
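A quick simulation makes this concrete. The sketch below, in Python with NumPy, assumes a hypothetical height population with mean 68 cm and standard deviation 4 cm (illustrative values, not the lecture's data) and compares how much sample means vary across samples of size 10 versus size 50:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population values: mean height 68 cm, SD 4 cm for
# one-year-olds. These numbers are assumed for illustration only.
pop_mean, pop_sd = 68.0, 4.0

def simulate_sample_means(n, n_samples=5000):
    """Draw n_samples random samples of size n; return the sample means."""
    samples = rng.normal(pop_mean, pop_sd, size=(n_samples, n))
    return samples.mean(axis=1)

means_n10 = simulate_sample_means(10)
means_n50 = simulate_sample_means(50)

# Both sets of means center on the population mean, but the means from
# samples of 50 vary less across samples than the means from samples of 10.
print(round(means_n10.std(), 2), round(means_n50.std(), 2))
```

The spread of the means from samples of 50 comes out noticeably smaller, which is exactly the pattern the lecture promises to demonstrate by simulation.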

Here's another example. Let's talk about the Baltimore mayoral election. Suppose there are two candidates; I'll just refer to them as A and B. We're interested in candidate A and his or her chance of winning the election, and we're working for her campaign. Suppose we only have limited resources at the beginning of the campaign, because donations haven't come in. So we take a poll of ten persons who are registered voters, and ask them whether they plan to vote for candidate A. The proportion we get who say yes is 60%, six out of ten. That sounds good for candidate A. I could go to him or her and say, look, you're polling favorably: 60% of Baltimore voters say they'd vote for you, based on a random sample. But when he or she found out that sample is based on ten people, they wouldn't be particularly excited. Why not? Well, maybe another pollster from the newspaper does a study on ten randomly selected Baltimore residents, and their results show that only four of the ten in the sample plan to vote for candidate A.

Now, you can expect these proportions, which are based on ten people at a time, to have a fair amount of variation, just because they're not very stable estimates. One person changes their vote and the proportion goes up or down by 10 percentage points; each voter has a large influence over the estimated sample proportion. Contrast that with the situation where some donations roll in and we're able to go out and poll 500 people.

Well, there may be variation in the estimates based on different samples of size 500, but we'll show that they are systematically less variable than the results based on samples of size ten. Maybe we do this poll and estimate that 46% of the people will vote for candidate A, which isn't so good. But maybe somebody from the newspaper does a sample and estimates 49%, and so on and so forth. There's still going to be variation in these estimates, just because we don't have the same 500 voters in each sample. And note what's varying: we're not measuring variation in individual responses to the question, but variation in the summaries of individual responses across different samples of the same size.
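This behavior is also easy to check by simulation. The hypothetical sketch below assumes that 50% of voters favor candidate A (an assumed value, not a poll result) and compares the spread of sample proportions from many polls of 10 versus many polls of 500:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assume, hypothetically, that 50% of registered voters favor candidate A.
true_p = 0.5
n_polls = 5000

# Simulate many polls of 10 voters and many polls of 500 voters;
# each poll's result is the proportion answering "yes".
props_n10 = rng.binomial(10, true_p, size=n_polls) / 10
props_n500 = rng.binomial(500, true_p, size=n_polls) / 500

# One voter moves a 10-person poll by 10 percentage points, but a
# 500-person poll by only 0.2, so the larger polls are far more stable.
print(round(props_n10.std(), 3), round(props_n500.std(), 3))
```

The proportions from polls of 500 cluster much more tightly around the assumed 50% than the proportions from polls of 10.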

So how can we formalize this? How can we actually formalize this definition of sampling variability? Well this quantity that I promised we would define, the sampling distribution of a sample statistic provides the answer.

The sampling distribution of a sample statistic is a theoretical distribution, one that we'll never actually observe or be able to create by brute force, that describes all possible values of a sample statistic from random samples of the same size, taken from the same population.

So let's talk about the theoretical sampling distribution of sample mean heights of random samples of, say, 50 Nepali children who are 12 months old. In reality, any researcher studying this population who wanted to study 50 children would take one random sample of size 50. But the theoretical sampling distribution represents the process of taking all possible random samples of size 50 from this large population of 12-month-old Nepali children. So maybe sample one has 50 children, and the mean height in this group is 69 centimeters. Sample two has 50 children, and the mean height in this group of 50 is 70.4 centimeters. In sample three, that's 50 children, and the mean height in this group is 67.6 centimeters, and so on and so forth. If we were to exhaust all possible random samples, of which there might be close to an infinite number, compute the sample means for all possible unique random samples of size 50, and then plot those mean values in a histogram, that histogram would show the distribution, not of the individual heights of children in any one of the samples, but of the summary measures, the sample means, across the different samples of size 50. And that would be our theoretical sampling distribution.

And we'll see by computer simulation in the next set of lectures, what the resulting distributions tend to look like. Obviously, we would never do this process in real life. This is where we'd end up if we did.
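A brute-force approximation of that process can be sketched as follows; the population values are assumed for illustration, and a few thousand samples stand in for the "all possible samples" we could never actually enumerate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in population of one million heights (cm); the mean and SD are
# assumed for this sketch, not taken from any real Nepali data.
population = rng.normal(68.0, 4.0, size=1_000_000)

# Draw 2000 samples of size 50 (with replacement, which is negligible
# against a population of a million) and record each sample's mean.
indices = rng.integers(0, population.size, size=(2000, 50))
sample_means = population[indices].mean(axis=1)

# Histogram of the sample means: the summaries, not individual heights.
counts, bin_edges = np.histogram(sample_means, bins=20)

# The approximate sampling distribution centers on the population mean.
print(round(sample_means.mean(), 1), round(population.mean(), 1))
```

The resulting histogram describes how the sample mean itself varies from sample to sample, which is the picture the next lecture's simulations fill in properly.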

Similarly, we can think about generating the theoretical sampling distribution of the sample proportion of people voting for candidate A, from random samples of 100 Baltimore City residents. That would mean enumerating all possible random samples of 100 people, surveying the 100 people in each sample as to whether they'd vote for candidate A or not, and getting differing estimates depending on the sample we are looking at.

And if we did this for all possible random samples of 100 people from the population of Baltimore City residents, and then generated a histogram of those sample proportions across the probably hundreds of thousands of unique random samples we could take, that distribution would be the theoretical sampling distribution of the sample proportion of persons who would vote for candidate A, based on a sample of size 100.

So again, we're just getting started here. The sampling distribution I've been speaking of is a theoretical entity; it can't be observed directly or exactly specified. And in real-life research, we're only ever going to take one sample from each population under study. We would never take thousands and thousands of samples of 100 people at a time to understand the variability in our sample statistic.

So lecture sections B through E will serve to further demonstrate and define sampling distributions, first by detailing the results of some computer simulations, which do a better job of drawing the pictures I tried to draw in the previous slide. With these simulations, which are just to illustrate a point, we'll empirically show some consistent properties of these sampling distributions, regardless of which sample statistic's behavior we're looking at: mean, proportion, or incidence rate. And then we'll unveil a mathematical property called the central limit theorem that will allow us to generalize the results we've seen in some specific examples.

In other words, this will allow us to say, in advance, without any data, what the sampling behavior of the statistic we're using to estimate some quantity would look like across all possible random samples. We can take that knowledge and couple it with the results from any single random sample from our population to estimate the characteristics of this distribution. And this will allow us to specify the sampling distribution from a single sample of data, and then ultimately use that, coupled with our estimated quantity, to make an interval statement about the truth we're trying to study.
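As a preview of where this is headed, here is a hedged sketch of the CLT-based 95% confidence interval for a mean computed from a single sample; the underlying population mean of 68 cm and SD of 4 cm are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# One random sample of 50 hypothetical heights (cm); the generating
# mean of 68 and SD of 4 are assumed values for this sketch.
sample = rng.normal(68.0, 4.0, size=50)

xbar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(sample.size)  # estimated standard error

# CLT-based 95% confidence interval: estimate +/- 1.96 standard errors.
ci_low, ci_high = xbar - 1.96 * se, xbar + 1.96 * se
print(f"mean {xbar:.1f} cm, 95% CI ({ci_low:.1f}, {ci_high:.1f})")
```

The key point is that everything in the interval comes from the one sample in hand: the estimate, plus a measure of its sampling variability supplied by the central limit theorem.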
