An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Now that we've gotten this far in the course,

it's about time that we actually start calculating some statistics.

So these statistics can be used for a variety of different reasons, but

they are often used for performing inference.

And so for example, suppose that we want to know,

is the average height of a child equal to 70 inches?

So if we wanted to check that out, what we could do is we could take the observations

that we've collected for a set of children.

We could take their average, and then we could just take the difference between

that average that we've got in our sample and the value that we care about, 70.

And so this, in some ways, quantifies how close we are.

But the problem is, it quantifies it on a scale that's not standardized.

So in this case, the distance X bar - 70 might be one thing for

inches, but for centimeters, or something else, it might be different.

So we want to put things on a common scale.

And the reason why we want to put things on a common scale

is so that we can interpret them well.

Just like with temperature, you want it to be on a scale that's interpretable to you.

You want to similarly put statistics on an interpretable scale.

So one way that you can do that,

is you can measure things in units of variability.

And so one way that you could do that is, again, we take this difference, X bar - 70.

So that's the distance from the value that we think it might be.

So this might be, for example, the null hypothesis value.

In that case, we can divide this difference by some standardization unit.

And so in this case, we're going to divide by the square root of the sample

variance divided by the square root of n.

And that tells you something about how many sort of variance units,

how many standard deviation units you see the difference to be.

And so once you standardize,

you can make comparisons to sort of standard distributions.

And you can make it easier to actually calculate probabilities of observing say,

a statistic that extreme,

if the data really are drawn from a distribution that has a mean of 70.
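
As a rough sketch of the standardization just described, the snippet below computes (X bar - 70) divided by the sample standard deviation over the square root of n, once in inches and once in centimeters. The height values and the function name `one_sample_statistic` are made up for illustration and are not from the lecture.

```python
import math
import statistics


def one_sample_statistic(values, null_mean):
    """Return (mean - null_mean) / (s / sqrt(n)) for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return (mean - null_mean) / (s / math.sqrt(n))


# Hypothetical heights in inches, made up for illustration.
heights_in = [68.0, 71.5, 69.0, 72.0, 70.5, 67.5, 73.0, 69.5]
heights_cm = [h * 2.54 for h in heights_in]

# Raw differences from the null value depend on the unit of measurement.
raw_diff_in = statistics.mean(heights_in) - 70.0
raw_diff_cm = statistics.mean(heights_cm) - 70.0 * 2.54

# The standardized statistic is the same in either unit (up to rounding).
t_in = one_sample_statistic(heights_in, 70.0)
t_cm = one_sample_statistic(heights_cm, 70.0 * 2.54)

print("raw difference:", raw_diff_in, "in vs", raw_diff_cm, "cm")
print("standardized:  ", t_in, "vs", t_cm)
```

The raw difference changes by a factor of 2.54 when the unit changes, while the standardized value does not, which is the point of putting the statistic on a common scale.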

When you have multiple samples, you can have a similar relationship.

So suppose that you have one sample that has X values,

and one sample that has Y values.

You can take the average of each of the two samples.

And you can get an estimate of the variability in each of the two samples.

You can then calculate what's probably the most commonly used statistic,

which is the T-statistic.

But again you want to standardize,

because recall this example where you're looking for differences in

a gene's values between two groups that are denoted by colors.

So here, you see that there might be a difference between the two groups, but

the variability makes it difficult to decide.

Whereas here,

the two groups are definitely different from each other for Gene 2.

And for Gene 3, they might be different from each other, but

the difference might be small.

So again, what you might do is calculate a statistic that's the average in group one minus

the average in group two.

And then you standardize by something like their average variability.

So by doing this, you're again trying to put things on the standardized scale where

you can compare things.

And so this is the T-statistic that's widely used across a large

number of statistical analyses.
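
A minimal sketch of the two-group statistic follows. It assumes the unequal-variance (Welch) form of the denominator, which is one common way to combine the two groups' variability; the lecture does not say which form is used. The expression values and the function name `two_sample_statistic` are hypothetical.

```python
import math
import statistics


def two_sample_statistic(x, y):
    """Difference in group means divided by its estimated standard error.

    Assumption: the Welch (unequal-variance) denominator is used here;
    a pooled-variance denominator is another common choice.
    """
    mean_diff = statistics.mean(x) - statistics.mean(y)
    se = math.sqrt(
        statistics.variance(x) / len(x) + statistics.variance(y) / len(y)
    )
    return mean_diff / se


# Hypothetical expression values, made up for illustration.
# Both genes have the same difference in group means (1.0),
# but very different variability within each group.
noisy_gene_a = [3.0, 7.0, 5.0, 9.0, 1.0]
noisy_gene_b = [2.0, 6.0, 4.0, 8.0, 0.0]
tight_gene_a = [5.1, 4.9, 5.0, 5.2, 4.8]
tight_gene_b = [4.1, 3.9, 4.0, 4.2, 3.8]

t_noisy = two_sample_statistic(noisy_gene_a, noisy_gene_b)
t_tight = two_sample_statistic(tight_gene_a, tight_gene_b)

print("noisy gene:", t_noisy)
print("tight gene:", t_tight)
```

The same difference in means gives a much larger statistic for the gene with little within-group variability, which is why the raw difference alone is hard to interpret.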

But something that happens a lot when you're doing a regression model

is that you don't necessarily calculate the T-statistic directly;

instead, what you do is you fit a regression model.

And you get an estimate for this b value.

So you get b hat, which is the estimate for that value from our regression model,

and then you divide by an estimate of the variability of that.

So again, we're doing the same sort of thing that we did before.

We're calculating an estimate of the variability, and

dividing the estimate of the distance by it.

So here, what is the distance?

Well, you have b hat, and it looks like you should be subtracting something, but

since we're not subtracting anything, you're effectively subtracting 0.

So this statistic is quantifying the difference from 0 of this b hat value,

in the units of variability that are appropriate for that scale.
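
To make the regression version concrete, here is a small sketch that fits a simple linear regression by least squares and divides b hat by its estimated standard error. It assumes a single covariate with an intercept; the data and the function name `slope_statistic` are hypothetical.

```python
import math
import statistics


def slope_statistic(x, y):
    """Fit y = a + b*x by least squares.

    Returns (b_hat, se_b_hat, b_hat / se_b_hat).
    """
    n = len(x)
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sigma2_hat = sum(r ** 2 for r in residuals) / (n - 2)
    se_b_hat = math.sqrt(sigma2_hat / sxx)
    return b_hat, se_b_hat, b_hat / se_b_hat


# Hypothetical data: a 0/1 group label and an expression value per sample.
group = [0, 0, 0, 0, 1, 1, 1, 1]
expression = [4.0, 5.0, 6.0, 5.0, 7.0, 8.0, 6.0, 7.0]

b_hat, se_b_hat, t_stat = slope_statistic(group, expression)
print("b hat:", b_hat)
print("standard error:", se_b_hat)
print("statistic:", t_stat)
```

With a 0/1 covariate like this, b hat is the difference between the two group means, and the statistic matches the pooled-variance two-sample T-statistic.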

But something that often happens in genomic experiments is you get things like

this, where you get differences that might be real, but

there's a super tiny variability.

So when you get that, you sometimes get the statistic to blow up.

So if you just divide it by this estimate of variability, that estimate can be very,

very tiny, if you have results with very little variability.

That often happens when you're near the limit of detection, for example.

And so one way that people have tried to deal with that is by using

various different kinds of empirical Bayes procedures.

The simplest one of those is just to add a small positive

constant to the variability estimate in the denominator of every statistic.

There are different ways, and clever ways, of calculating

the constant to add to the denominator there.

But what you can see that does is, even if you have a very small

estimate of the standard error of the coefficient, you still get some value,

a positive value here.

And so this doesn't allow the T-statistic to get too large, so

you don't see really big statistics for

things that actually have relatively small effects between the two groups.
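
The effect of adding a constant to the denominator can be sketched as follows. The per-gene effects and standard errors are made up, and using the median standard error as the constant is an arbitrary choice for illustration only; it is not the procedure from the paper, which estimates the adjustment in a more careful way.

```python
import statistics


def moderated_statistics(effects, standard_errors, offset):
    """Return effect / (standard_error + offset) for each gene."""
    return [b / (se + offset) for b, se in zip(effects, standard_errors)]


# Hypothetical per-gene effect estimates and standard errors.
# The first gene has a tiny effect and an even tinier standard error.
effects = [0.05, 2.0, 0.5]
standard_errors = [0.001, 0.4, 0.5]

# Assumption for illustration: use the median standard error as the constant.
offset = statistics.median(standard_errors)

plain = [b / se for b, se in zip(effects, standard_errors)]
moderated = moderated_statistics(effects, standard_errors, offset)

print("plain:    ", plain)
print("moderated:", moderated)
```

Without the constant, the first gene has by far the largest statistic even though its effect is the smallest. With the constant added, the gene with the largest effect ranks first.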

You can learn a lot more about T-statistics and linear model statistics,

sort of Wald statistics for linear regression,

from this paper on linear regression models for microarray data.

The moderated statistics and the empirical Bayes estimates of

how you actually make that denominator bigger are also covered in that paper.

You can also learn a lot more from the class Statistics and R for

the Life Sciences.
