A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From the lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay.

Now let's take a look at some exercises intended to review

some of the salient important points of lectures A through D.

So what I'm going to have you do here,

I'm going to show you some information and then give

you some questions, and I'll suggest you pause this

lecture and work the questions out on your own.

Then hit play again, and I'll go through my answers,

or my take on them, so you can compare.

So, let me show you, let's go back to

this Philadelphia data that we looked at before, and what I want to look at

here is the distribution of the daily

particulate matter, measured in micrograms per meter cubed,

for all days between 1974 and 1988.

This histogram here shows the distribution of the TSP data.

And the mean of this sample of days is 67.3 micrograms

per meter cubed.

The standard deviation is 26.9 and the median is 63.

Here's another representation in the box plot format of the same data.

Now let's also look at the death counts per day over this period from 1974

to 1988. So this is a histogram of the distribution

of the number of deaths across these days, and the mean is 46.7 deaths.

The standard deviation is 8.4 deaths. And the median is 46

deaths. And here is a box plot representation

of these deaths data.

So some questions I'd like you to think about.

And then come back to me with your answers.

And see how they compare to what I've thought of.

How would you characterize the distributions of the daily TSP

(total suspended particulate) readings and the daily

death counts from Philadelphia for 1974 to 1988?

Based on the information at hand, can you give an estimate of the 25th and

75th percentiles for TSP?

Suppose you want to measure the association

between death and TSP using these data.

To start, perhaps you wish to create four

categories of TSP based on the original continuous variables.

You want these four categories to have similar numbers of observations.

Can you suggest a way to do this?

Why does the mean tend to be larger

than the median for samples of right-skewed data?

And finally, suppose you were only allowed to use 300 randomly selected TSP

measurements from the total sample of values for the days from 1974 to 1988.

How would you expect the histogram of these 300 values to

compare to the original histogram

presented, which contains over 4,000 values?

Okay, let's take a look at the questions

I posed and my suggested takes on the answers.

So, I asked you, how would you characterize the distributions of

the daily TSP values and the daily death counts for Philadelphia

from 1974 to 1988?

Well let's look at TSP, total suspended particulate levels.

Well, first of all, we can see from these data, just from the numerical summaries,

that the sample mean of 67.3 micrograms per meter cubed, is

larger than the sample median of 63 micrograms per meter cubed.

Furthermore, if we look at the histogram presentation

or the box plot, it's pretty clear, in my opinion,

that there is evidence of a right or positive

skew in both pictures, but especially in the box plot.

You can see there's a fair amount, or what seems to

be a fair amount, of outlying values, and

they're all larger than the rest of the data,

which would be indicative of a positive or right skew.

So I would say that all things considered, this daily

TSP distribution is pretty clearly

a right-skewed or positively skewed distribution.

How 'bout for deaths?

Well, this is a little more subtle, and we may have different opinions on this.

And there's not exactly one right answer. If

you look at a comparison of the sample mean and the median,

the sample mean is larger than the median, albeit slightly:

46.7 deaths is the mean versus 46 deaths, which is the median.

If we look at the histogram, there are probably

several different opinions across your classmates, myself, and you.

[LAUGH]

If you look at the histogram, some

people may say this is a relatively symmetric distribution.

If you look carefully, and it is very hard to

see, especially because of the coloring and sizing, there is a

bit of a right tail, but it's not nearly

as evident visually as it was with the TSP data.

So the histogram is a little bit of a mixed message; it

depends on how much you can see and what your interpretation is.

So anything

from skewed to symmetric would fall within the range of answers.

I think in this case, the box plot is a little

more informative in terms of

characterizing whether there's any skewness to

this, because we can see that the middle 50%,

the 25th to 75th percentiles, is relatively symmetric about the median.

And the non-outlying values, largest

and smallest, are relatively symmetric around the sides of the box.

We do have some positive outliers, which make it appear

to be a little more right-skewed than symmetric.

But certainly there are a bevy of opinions on this, and it's

not as clear cut as it was, in my opinion, with the TSP.

So then I asked you to give an estimate

of the 25th and 75th percentiles of the TSP distribution.

The only way to do that from what I've given you in

these slides was to look at the box plots, and of course, your

estimates are going to be approximations based on the visual cue here,

and it's very hard to see, given the size and detail of this.

But we do know that on the box plot the box in the middle, the

lower-valued side if you will, corresponds to the 25th percentile.

And the upper side, or higher-valued

side of the box corresponds to the 75th percentile.

And so if you're eyeballing this, and I'm not very

good at doing that on this scale, it looks like

the 25th percentile is about 50 micrograms per meter cubed

and the 75th is on the order of maybe 80.

This is where, if we actually wanted

to answer this question unequivocally, we could

appeal to the computer. If you actually did pull this from the computer, the 25th percentile for

these data is slightly lower than what I had eyeballed:

it's 47.5 micrograms per meter cubed, and the 75th

percentile is 82 micrograms per meter cubed.
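
To make the appeal to the computer concrete, here is a minimal Python sketch. The eleven readings are hypothetical, made up for illustration, and are not the Philadelphia data. The call to `statistics.quantiles` uses its default "exclusive" method, which is only one of several conventions for computing percentiles, so other software may report slightly different cut points.

```python
from statistics import quantiles

# Hypothetical daily TSP readings (micrograms per meter cubed), illustration only.
tsp = [38, 44, 47, 52, 58, 63, 66, 74, 82, 95, 130]

# n=4 asks for the three cut points that split the sorted data into four parts:
# the 25th, 50th, and 75th percentiles. On a box plot these are the lower edge
# of the box, the line inside it, and the upper edge.
q1, q2, q3 = quantiles(tsp, n=4)

print(f"25th percentile: {q1}")
print(f"50th percentile (median): {q2}")
print(f"75th percentile: {q3}")
```

Reading the box edges off a plot gives an approximation of `q1` and `q3`; the computed values are the ones to report.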

Suppose you want to measure the association

between death and TSP using these data.

To start, perhaps you wish to create four categories of TSP based

on the original continuous values.

You want these four categories to have similar numbers of observations.

Can you suggest a way to do this?

Well one possibility would be to take these continuous

measures and break them into categories based on their percentiles.

And if we wanted roughly equal numbers in the

four categories, we would break this into four equal-sized

percentile groups. So one way to do this would be to put these into what are

called quartiles, categorizing them based on their relative

position to the 25th, 50th, and 75th percentiles in these data.

So, going back to what I talked about

before, we know that the 25th percentile is 47.5,

the median, or 50th percentile, is

63, and the 75th percentile is 82. So, what we could do is categorize each of

the individual TSP measurements by its membership in one of these four quartiles.

So, for example,

for days in which the value was less than or equal to 47.5

micrograms per meter cubed, we'd put them in category 1, the first quartile.

For days in which the values were greater than 47.5 micrograms

per meter cubed but less than 63, we'd put them in the

second quartile, et cetera. And so roughly 25% of the observations, or

a quarter, would be in this first quartile, another 25%, going from the

25th to the 50th percentile, would be in the second quartile, et cetera.
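
The categorization just described can be sketched in a few lines of Python. The cut points 47.5, 63, and 82 are the ones reported in the lecture. The function name `tsp_quartile`, the sample readings, and the choice to put a reading that exactly equals a cut point into the lower category are all assumptions made for this illustration.

```python
from bisect import bisect_left
from collections import Counter

# Cut points reported in the lecture: 25th, 50th, and 75th percentiles of TSP.
CUT_POINTS = [47.5, 63.0, 82.0]

def tsp_quartile(value):
    """Return 1-4, the TSP quartile category for one day's reading."""
    # bisect_left counts the cut points strictly below `value`, so a reading
    # exactly equal to a cut point lands in the lower category (an assumption).
    return bisect_left(CUT_POINTS, value) + 1

# Hypothetical readings, for illustration only.
readings = [30.0, 47.5, 50.0, 63.0, 70.0, 82.0, 90.0, 120.0]
categories = [tsp_quartile(r) for r in readings]
counts = Counter(categories)

print(categories)
print(counts)
```

With the real data, each category would hold roughly a quarter of the days, because the cut points are the quartiles of that same sample.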

So let me ask you this.

Let's go

back and think of the TSP data for example.

Why does the mean tend to be larger

than the median for samples of right-skewed data?

Well let's think about this. Let's think about a situation.

Suppose we started with a distribution. Suppose we measured TSP incorrectly.

And we graphed the values we had for

these days, and it was, it was roughly symmetric.

I'm going to even make it bell-shaped

here, and it looked very symmetric around its sample mean, and then, suppose

somebody came in and said, well, John,

actually, these measurements over here are wrong.

You've actually gotten lesser values than you should've.

I'm going to put in the proper values.

And it stretched our tail out to here

such that now we had a right-skewed distribution.

This value that we had highlighted before corresponded to the

mean; it also corresponded to the median when the data were symmetric.

What happened to the relative position of the median

when we redrew this to include the right tail?

Well the middle value stays the middle value.

Right, so it's hanging out there.

However, the mean, which I actually haven't shown visually, is going to be affected by

this increase in the right tail values.

So the mean is actually going to be pulled

up further to the right while the median remains unaffected.

So in other words, another way to say this without going

to my visual is that the mean tends to be larger than

the median for samples of right-skewed data because the mean is

more heavily influenced by the larger, positive values that occur in the right tail.
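
The tail-stretching argument can be checked with a tiny numerical sketch. Both lists are hypothetical values chosen for illustration: the second is the first with its two largest readings "corrected" upward, which is the scenario described in the drawing.

```python
from statistics import mean, median

# Hypothetical, roughly symmetric sample.
symmetric = [40, 50, 60, 70, 80]

# The same sample after the two largest readings are corrected upward,
# which stretches the right tail.
stretched = [40, 50, 60, 90, 140]

for name, sample in (("symmetric", symmetric), ("stretched", stretched)):
    print(f"{name}: mean = {mean(sample)}, median = {median(sample)}")
```

The middle value is still the middle value after the correction, so the median does not move, while the mean is pulled up by the larger values.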

Now, finally, suppose you have categorized the TSP values

into four quartiles, like I talked about, where category one

has the values less than or equal to the

25th percentile, et cetera, like we detailed two slides back.

You now want to compare the distributions of deaths

across these different TSP categories.

How could you do this visually, and how could you do this numerically?

That's what I want you to think about now.

Well, how could you do this visually? Well, you would certainly need a computer

to do this, especially with this much data.

But here are some possibilities. Stacked histograms:

So what I have here are the histograms of the total daily deaths by

TSP quartile. This histogram here on top, and I've

done this through the computer, is the

distribution of death counts for days in the lowest TSP quartile.

This second graph here is the distribution of death counts for days whose

TSP values were in the second quartile, between the 25th and 50th percentiles.

Et cetera. And it's hard to tell what's going on in these pictures.

These histograms look relatively similar, but there's

a lot of data in each and the scaling is small.

So, another way to compare these head-on, one that maybe gives

a little more insight, is by looking at box plots

of these death values side by side across the four TSP quartiles.

So here is a box plot showing the distribution of death

counts on the lowest TSP days, those in the first quartile.

Here's a box plot showing the distribution

for the second quartile, those between the 25th and 50th percentiles

of TSP. And these are the box plots of the distributions of deaths

for the third quartile and the fourth.

So what do you see in these pictures? Well, of course pictures are subject to

interpretation, but at least to some degree, you probably noticed

the increased variability in the

upper or largest values of deaths as we increase across

the TSP quartiles. You can see a slight increase

visually perhaps in the medians as well, but that's more difficult to know.

Well, how could you then quantify this numerically?

Well, again, you'd certainly need a computer to do this,

but some possibilities include comparing the medians of death across

the TSP quartile groups, comparing other percentiles, like the 95th

percentile or the 15th percentile or something of that nature.

But the one that is used so often in the literature, and that we're going to spend

more time on in this course, is actually comparing the means.

So for example,

I'm going to report the mean number of deaths in each of

those four samples, the death distributions for each TSP quartile category.

So in the first category of TSP, the lowest TSP days,

those with values between the lowest value and the 25th percentile,

the average number of deaths was 46.2.

In the next quartile, the 25th to 50th percentile,

the average number of deaths on those days was 45.9.

In that third quartile, those with TSP values between the 50th and the 75th, the

average number of deaths was 46.6, and then finally on

the highest particulate matter days, those with values between the 75th percentile

and the largest value in the data set, the average death count was 48.

So what we might do to present

these comparisons and quantify the difference in values

is we might choose one of these four

groups as the reference, and then compute the

difference between the mean for the other groups

and the same reference, so that these differences

are comparable. And what we're driving toward here

is how we might present these in a publication,

how we might put uncertainty values on them, which we'll get to next, and

how we might present the comparison after adjusting for

other factors that differ between these days other than TSP.

So, this is just setting things up as we go along in the course.

So, one way to do this might be to declare the

lowest TSP days as the reference, and then we'll

compute mean differences in the number of deaths

between each of the other TSP day categories and this same reference.

So difference number one might be the difference between the

mean deaths in TSP category two, those days

between the 25th and 50th percentiles of TSP, minus

the average deaths in the lowest category of TSP.

So that would be 45.9 minus 46.2, which is a mean difference of negative

0.3 deaths. So on average, those days between the 25th

and 50th percentiles of TSP have slightly fewer

deaths, by about 0.3 deaths. We'd then do this

for the third category compared to the same reference.

For some reason I have trouble writing S on that second round here.

It would be 46.6 minus that same 46.2, the reference:

0.4 deaths.

So it suggests that, on average, those

days with TSP levels between the 50th and 75th percentiles have 0.4

more deaths than days in the lowest TSP quartile.

And then finally, if we compared TSP 4, the

highest TSP levels, to the lowest (I won't write out the entire thing here),

the difference is 1.8: those highest particulate matter

days had 1.8 more deaths on average than the lowest.

These differences are all filtered through the same reference group,

the lowest TSP quartile.

So these differences are comparable to each other as well.

So for example, using these differences only, I could get

the estimated difference in average number of deaths for the TSP

4 days, the highest TSP, compared to the third quartile, by

taking that 1.8, which is the difference between TSP 4 and TSP 1, and

subtracting that 0.4, which is the difference between TSP 3 and TSP 1.

And what would fall out of that is

the difference between the 4th quartile and the 3rd: 1.4 deaths.
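
The reference-group arithmetic can be written out as a short Python sketch. The four group means are the ones reported in the lecture. The variable names and the rounding to one decimal place (used only to hide floating-point noise) are choices made for this illustration.

```python
# Mean daily deaths by TSP quartile, as reported in the lecture.
group_means = {1: 46.2, 2: 45.9, 3: 46.6, 4: 48.0}

# Use the lowest TSP quartile as the common reference group.
reference = 1

# Mean difference of every other group versus the reference.
diffs = {
    group: round(m - group_means[reference], 1)
    for group, m in group_means.items()
    if group != reference
}
print(diffs)

# Because every difference shares the same reference, subtracting two of them
# recovers the direct comparison between those two groups.
q4_vs_q3 = round(diffs[4] - diffs[3], 1)
direct = round(group_means[4] - group_means[3], 1)
print(q4_vs_q3, direct)
```

The last two numbers agree, which is the sense in which differences against a shared reference are comparable to each other.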

So one last thing I wanted to ask you

and review was, and this is a question that comes

up often, and you may have already asked it or

heard somebody else ask it, and it's a reasonable question.

Both the formula for the sample mean and the formula for

the sample standard deviation include the sample size n in the denominator.

Given that this is the case, why does neither

the mean nor the standard deviation

systematically decrease with increasing sample size?

So in other words, we know that the formula for the mean is the

sum of all the values in our sample divided by the sample size.

And you might say, well, John, the denominator

here is increasing as our sample size gets larger.

So why is this quotient not going down in value?

Well it turns out, you have to remember something that's easy

to forget, is that as the sample size increases, we're also increasing

the number of things we add up in the

numerator, so that's increasing as well with increased sample size.

So this ratio isn't necessarily decreasing

systematically, because both parts are going up.

And the same logic applies to the estimated sample standard deviation.

I'll just write out the formula here. And I refer you to the better-looking

typed versions in previous lectures. But if I do this, yes,

as our sample size increases, the denominator

of the square root ratio is increasing.

But again, we're also increasing the number

of differences we add in the numerator. So

this ratio, in terms of sample size, is being kept in somewhat of a

steady state, and it won't necessarily go

down just because the denominator is getting

larger, because the numerator is increasing as well.
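
A small simulation can illustrate this steady state. It is only a sketch under stated assumptions: the samples are drawn from a normal population with mean 67.3 and standard deviation 26.9 (the lecture's TSP summary values, borrowed here for illustration; the real TSP data are right-skewed, not normal), and the seed and sample sizes are arbitrary.

```python
import random
from statistics import mean, stdev

# Fixed seed so the run is repeatable; the population below is an assumption.
rng = random.Random(0)
POP_MEAN, POP_SD = 67.3, 26.9

results = {}
for n in (100, 1000, 10000):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    # mean() divides a sum of n values by n; stdev() divides a sum of n
    # squared differences by n - 1 before taking the square root.
    results[n] = (mean(sample), stdev(sample))
    print(f"n = {n:>5}: mean = {results[n][0]:.1f}, SD = {results[n][1]:.1f}")
```

As n grows, both statistics hover near the population values instead of shrinking, because the numerator and the denominator grow together.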

Okay well hopefully you found these helpful and some things to think

about as we move on in our quest for statistical domination.
