Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

From the lesson

Discrete Data Settings

In this module, we'll discuss testing in discrete data settings. This includes the famous Fisher's exact test, as well as the many forms of tests for contingency table data. You'll learn the famous, broadly applicable formula: observed minus expected, squared, over expected.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

So consider this next example, which I took, and modified a little bit, from Agresti's wonderful book on categorical data analysis. Here we're looking at babies' birth weights cross-classified by maternal age.

Let's assume that the way in which it was sampled was that, say, 400 people were sampled, so that by design neither of the two margins was fixed. We're interested, then, in treating the cell counts as if they were a four-dimensional multinomial count with n equal to 400 total observations. What we would like to know is whether the variable birth weight is independent of maternal age, versus the alternative that birth weight is not independent of maternal age.

So let's think about the problem this way, and see if we can logic our way to the expected cell counts.

Okay, so let's first note that, not just under the null hypothesis but regardless of the hypothesis, our estimate of the probability of young maternal age is always going to be 100 over 400, and of older maternal age 300 over 400. Those come from the margins of the table, where we disregard birth weight. Doing the same thing for birth weight, disregarding maternal age, low birth weight would be 50 over 400 and normal birth weight would be 350 over 400.

The cell probabilities then give you the specific combined probabilities. If we want to talk about younger maternal age and lower birth weight, our estimate of that probability regardless of the hypothesis, that is, under the alternative hypothesis, would be 20 over 400.

But if we're under the null hypothesis, where we're assuming maternal age and birth weight are independent, then we logically construct this probability as the product of the two marginal probabilities, because we're probably estimating the margins a little better. So that would be 0.25 times 0.125, because we're multiplying the marginal probability of young maternal age times the marginal probability of low birth weight. We can only do that under the null hypothesis.

So let's work on our expected counts. The expected count for the (1,1) cell, low birth weight and young maternal age, is this probability times the sample size of 400, and we get 12.5 as our expected count. You can then follow through for the other three cells in the same way to get their expected counts.
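As a rough sketch of that bookkeeping in Python: only the count of 20 and the margins (100 and 300, 50 and 350) are stated in the lecture, so the other three observed cells below are filled in from those margins. The variable names are illustrative, not from the course.

```python
# Observed 2x2 table: rows = maternal age (young, older),
# columns = birth weight (low, normal).
# Only the 20 and the margins are stated in the lecture; the remaining
# three cells are derived from those margins.
observed = [[20, 80],
            [30, 270]]

n = sum(sum(row) for row in observed)              # 400
row_totals = [sum(row) for row in observed]        # [100, 300]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 350]

# Under independence:
# expected count = n * (row total / n) * (column total / n)
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(("young", "older"), expected):
    print(label, row)
```

The (1,1) entry reproduces the 12.5 worked out above, and each row and column of expected counts adds up to the same margin as the observed counts.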

Then we compare them using the same formula, observed minus expected squared over expected, and we get our chi-squared statistic, which in this case is 6.86. We then compare it to a chi-squared critical value, which is around four. We talked about the chi-squared with one degree of freedom being the square of the Z statistic, so it makes sense that it would be around four (1.96 squared is about 3.84). Or we could just calculate our chi-squared P value, which would be the probability of getting a test statistic as large as 6.86 or larger.
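A minimal Python sketch of this comparison, using only the standard library. It assumes the 2x2 table implied by the stated margins and relies on the fact that a chi-squared with one degree of freedom is a squared standard normal; the variable names are made up for illustration.

```python
import math
from statistics import NormalDist

# Observed and expected counts, cell by cell
# (young/low, young/normal, older/low, older/normal).
observed = [20, 80, 30, 270]
expected = [12.5, 87.5, 37.5, 262.5]

# Sum of (observed - expected)^2 / expected over the four cells.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With 1 degree of freedom the chi-squared is Z^2, so the 5% critical
# value is the square of the 97.5th percentile of the standard normal.
critical = NormalDist().inv_cdf(0.975) ** 2

# P(chi-squared_1 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(round(chi_sq, 2), round(critical, 2), p_value)
```

The statistic lands well above the critical value, so the P value comes out below 0.05.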

So I hope everyone can follow these calculations. The idea is that we're basically calculating how distant our observed counts are from our best estimate of what we would expect the counts to be under the null hypothesis of independence between the row and column variables.

I should also add, before I leave this slide, that the answer we get from this test of independence is identical to the answer we get from the test of equality of proportions. You get the same chi-squared value and the same P value.
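One way to check that claim numerically is sketched below. It assumes the test of proportions meant here is the two-sample Z test with a pooled variance estimate, treating the maternal age rows as fixed, and it again uses the 2x2 counts implied by the stated margins.

```python
import math

# Low birth weight counts and group sizes by maternal age.
low_young, n_young = 20, 100
low_older, n_older = 30, 300

p_young = low_young / n_young
p_older = low_older / n_older

# Pooled proportion under the null hypothesis of equal proportions.
p_pooled = (low_young + low_older) / (n_young + n_older)

# Standard error of the difference, using the pooled estimate.
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_young + 1 / n_older))
z = (p_young - p_older) / se

# Two-sided P value for the Z statistic.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z ** 2, 2), p_value)
```

Squaring the Z statistic gives back the chi-squared statistic of about 6.86, with the same P value.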

Now, the interpretation might be very different, as we talked about before. If in one case there was randomization as to which of the rows you fall in, and in this case it was a multinomial sample, the interpretation of the results can be dramatically different. We don't really cover a lot of things like epidemiological-style sampling designs in this class, but suffice it to say, the interpretations can be very different.

Nonetheless, the actual number, the P value we see, is identical regardless of which of the sets of assumptions you make. So that's interesting, and what's even more interesting is that you can formulate Poisson models that again yield the same conclusions.

So here's another example from Agresti's Categorical Data Analysis book, which is a book I highly recommend. I think it's a real classic in the area, but I should disclose a conflict of interest: Agresti is a great friend and close colleague of mine.

Anyway, in this example he is looking at a collection of different occupations, cross-classified by alcohol use. Suppose the investigators in this study went out and found 300 clergy, 250 educators, 300 executives, and 350 retailers, and then asked them a question about their alcohol use. We'd like to know whether alcohol use differs by occupation.

So interest lies in testing whether or not the proportion of high alcohol use is the same in the four occupations. If we label P1 the proportion of high alcohol use among the clergy, and so on, we want to test whether they are all equal, but we don't want to specify what that proportion is, so let's say P is the common proportion across all occupations. The alternative would be the opposite of that: that at least two are unequal.

Our estimate of P, this common unknown proportion, the obvious estimate of it, would be 233 over 1200, and then of course the estimate of 1 minus P would be 967 over 1200.

Our observed first count is 32, so what would be our expected first count? Well, under the null hypothesis, where occupation is irrelevant, we expect a proportion of about 233 over 1200 to be high alcohol users. So we multiply that by the row count, 300, and we get the expected cell count, which is 58.25.

Then for the low alcohol usage cell, where the observed count is 268: both the observed and the expected counts have to add up to the margins, by the way. So you could take 300 minus the expected (1,1) cell count to get it. Otherwise, you could just say 300 times the probability of low alcohol usage, which is 1 minus 233 over 1200, or 967 over 1200. And you would repeat that down the table for each occupation.
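The row-by-row calculation can be sketched as follows. Only the row totals and the overall count of 233 high alcohol users are taken from the lecture; the dictionary layout and names are illustrative.

```python
# Number sampled in each occupation (fixed by design in this example).
row_totals = {"clergy": 300, "educators": 250, "executives": 300, "retailers": 350}

high_total = 233               # high alcohol users across all occupations
n = sum(row_totals.values())   # 1200
p_high = high_total / n        # pooled estimate of the common proportion P

# Expected (high, low) counts in each row under the null hypothesis
# that the proportion of high alcohol use is the same in every occupation.
expected = {
    occupation: (size * p_high, size * (1 - p_high))
    for occupation, size in row_totals.items()
}

for occupation, (high, low) in expected.items():
    print(f"{occupation}: high={high:.2f}, low={low:.2f}")
```

Each pair of expected counts adds up to its row total, and the expected high counts add up to 233, which is the "add up to the margins" property mentioned above.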

We calculate our chi-squared statistic, observed minus expected squared over expected, summed over all of the cells, and you get 20.6. We need to compare it to a chi-squared critical value, but the degrees of freedom change. It turns out that the general rule for the degrees of freedom of the chi-squared in these settings is rows minus 1 times columns minus 1. In this case there are four rows and two columns, so the degrees of freedom is three.

So then here's our P value: pchisq(20.59, 3, lower.tail = FALSE). It's about zero. It's pretty clear that some of them are different.
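The lecture computes this tail probability with R's pchisq. As a rough Python equivalent that needs only the standard library, the sketch below uses the closed-form upper tail of a chi-squared with exactly 3 degrees of freedom; the helper name chi2_sf_3df is made up for this illustration and does not generalize to other degrees of freedom.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability P(X >= x) for a chi-squared with 3 df.

    Uses the closed form erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2),
    which is valid only for 3 degrees of freedom.
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

rows, cols = 4, 2
df = (rows - 1) * (cols - 1)   # (4 - 1) * (2 - 1) = 3

p_value = chi2_sf_3df(20.59)
print(df, p_value)
```

The result is on the order of one in ten thousand, which is what "about zero" refers to.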

Okay, here's another example. This is from Rice's book, Mathematical Statistics and Data Analysis. First of all, I have no affiliation with Rice and have never met him, so it's easier for me to say this: I love this book. I think it's wonderful. So if you're looking for a book recommendation, I like that one.

In addition, I really like Agresti's book, but I'm willing to stipulate my conflict of interest in recommending it. I do really like it, though. I read it all the time.

Anyway, in this book he has this interesting example, where a bunch of words were taken from some novels. Two of them were known to be Jane Austen novels, and one of them was in question as to whether or not it was written by that author. Let's say they found it later.

There are maybe other ways you would want to analyze this data for this reason, but we want to use it as an example for the chi-squared. So don't think too hard about specifically how you would analyze this data, because I doubt this would be what you would arrive at immediately. But it's not unreasonable, by the way.

So imagine, let's say, that book three (I'm just spitballing here) is this book where you don't know whether or not it's from the same author, in this case Jane Austen. You want to test whether the distribution of these words is equivalent across the three books, and you sampled so many words from each of the books. Okay?

So that's the setting, and let's see if we can figure out some expected cell counts to do a chi-squared test.
