An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 4

In this week we will cover a lot of the general pipelines people use to analyze specific data types like RNA-seq, GWAS, ChIP-Seq, and DNA Methylation studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

As we've seen in many of the analyses we've talked about throughout this class,

there are a large number of steps that are involved in

doing a statistical genomics project from pre-processing and normalization,

to statistical modeling, to post hoc analyses of the results that you get.

So I wanted to talk a little bit about Researcher degrees of freedom.

This is an idea that was originally proposed in psychology, and there was this

paper that said, basically, undisclosed flexibility in data collection and

analysis allows for presenting anything as statistically significant.

And, so, what are they talking about here?

They're talking about how there's a large number of steps in the

data analytic pipeline.

They go from experimental design, through the raw data, to

the summary statistics, and then finally there's a p-value at the end.

Now usually when people are talking about statistical significance,

they talk about p-values or multiple testing corrected p-values, and

often a lot depends on that p-value being

sort of small enough that a journal will publish the paper, or something like that.

That dependence is going down a little bit over time,

but historically there's been a lot of focus on it.

But there are a lot of steps underneath that

process, before you get to a p-value, that could change what the p-value is.

So, for example, if you throw out a particular outlier, or if you normalize

the data a little bit differently, you might get different results.

And so, there's lots of different ways you can analyze data.

And the danger here is that, when they were talking about it in this paper,

they were sort of talking about a nefarious case where you

keep doing everything you can until you get a p-value that's significant, but

you could imagine doing this just sort of by accident.

You make a large number of choices when doing a genomic data analysis,

and once you've made those choices, you get some result.

And maybe you don't like that result so you redo the analysis.

So one thing that you have to be very careful about when doing

genomic analysis is redoing the analysis too many times.

It makes sense when there's new updated software or there's sort of new biological

or scientific knowledge that's been brought to bear to redo the analysis.

But if you keep redoing it over and over again you sort of fall into this trap.
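
To make the danger concrete, here is a minimal simulation sketch, not something from the lecture. It assumes two groups with no true difference, and a hypothetical analyst who tries four reasonable-looking analyses (the planned one, one with an "outlier" dropped from each group, and one on each half of the samples) and keeps whichever p-value is smallest. The function names, the normal-approximation test, and the particular analysis variants are all illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def two_sample_p(x, y):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y)) ** 0.5
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def drop_extreme(values):
    """Hypothetical 'outlier removal': drop the point farthest from the mean."""
    m = mean(values)
    farthest = max(values, key=lambda v: abs(v - m))
    kept = list(values)
    kept.remove(farthest)
    return kept


def analysis_variants(x, y):
    """P-values from a few defensible-looking analyses of the same data."""
    half_x, half_y = len(x) // 2, len(y) // 2
    return [
        two_sample_p(x, y),                              # the planned analysis
        two_sample_p(drop_extreme(x), drop_extreme(y)),  # after dropping outliers
        two_sample_p(x[:half_x], y[:half_y]),            # first half of samples
        two_sample_p(x[half_x:], y[half_y:]),            # second half of samples
    ]


def simulate(n_sims=1000, n=40, alpha=0.05, seed=0):
    """Return (false positive rate of planned analysis, rate if any variant counts)."""
    rng = random.Random(seed)
    planned_hits = 0
    any_hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: every "hit" is a false positive.
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        p_values = analysis_variants(x, y)
        if p_values[0] < alpha:
            planned_hits += 1
        if min(p_values) < alpha:
            any_hits += 1
    return planned_hits / n_sims, any_hits / n_sims


if __name__ == "__main__":
    planned_rate, any_rate = simulate()
    print(f"planned analysis only: {planned_rate:.3f}")
    print(f"best of four variants: {any_rate:.3f}")
```

Because the smallest of several p-values can never be larger than the planned one, the "best of four" rate can only match or exceed the planned rate, and under these assumptions it comes out noticeably higher.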

And so, you can imagine how that would happen with different teams.

So, this comes from sort of a recent analysis.

This isn't an analysis in genomics, but

it kind of illustrates the point that 29 different research teams were asked to

see if referees were more likely to give red cards to dark-skinned players.

And so each team analyzed the data a little bit differently.

And here you can see the dots represent the different effect sizes that they

estimated for the different studies, and so you can see that they're all different.

And then the confidence intervals,

or uncertainty intervals, for

each of these different estimates are also different from each other.

And so, while they're comfortingly sort of similar for

many of the estimates here in the middle, you can get quite big variability

just by changing the way that you analyze the data.

And so, you have to be careful to make sure that you don't do this over and

over and over again until you find just the one case where you get a large

estimate of the effect, when that large estimate is probably due to

nothing other than the way that you analyzed the data.

And so, the difficult thing about thinking about that is if

you do a different analysis, particularly if you adjust for

different covariates, you actually are answering different questions.

So the question is going to be conditional on what your model is.

And so if you have a whole bunch of extra covariates in the model,

then you're asking, is there a difference in gene expression after I account for

all of these other variables?

That's a very different question than, is there just a gene expression difference

overall, which might mean something totally different.

And so you have to be a little bit careful about this idea of researcher degrees of

freedom as related to knowing what question it is that you're answering.
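
As a small illustration of how the question changes with the model, the sketch below uses made-up toy data (not from the lecture) in which expression depends only on batch, and batch happens to be unbalanced between cases and controls. The "adjusted" estimate here simply averages the case-versus-control difference within each batch, which is a simplified stand-in for adding batch as a covariate in a regression model. All names and values are hypothetical.

```python
from statistics import mean

# Hypothetical toy data: (group, batch, expression).
# Expression is 5.0 in batch 0 and 7.0 in batch 1, regardless of group,
# but most cases were run in batch 1 and most controls in batch 0.
samples = (
    [("control", 0, 5.0)] * 8 + [("case", 0, 5.0)] * 2 +
    [("control", 1, 7.0)] * 2 + [("case", 1, 7.0)] * 8
)


def overall_difference(rows):
    """Question 1: is there an expression difference overall?"""
    case = [expr for group, _, expr in rows if group == "case"]
    control = [expr for group, _, expr in rows if group == "control"]
    return mean(case) - mean(control)


def batch_adjusted_difference(rows):
    """Question 2: is there a difference after accounting for batch?"""
    batches = sorted({batch for _, batch, _ in rows})
    within_batch = [
        overall_difference([row for row in rows if row[1] == batch])
        for batch in batches
    ]
    return mean(within_batch)


if __name__ == "__main__":
    print("overall difference:       ", overall_difference(samples))
    print("batch-adjusted difference:", batch_adjusted_difference(samples))
```

In this toy case the overall difference is clearly non-zero while the within-batch difference vanishes. Neither number is wrong; they answer different questions, which is why the choice between them should be made before seeing the result.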

And so this whole idea was sort of summarized in this paper by Andrew Gelman

and Eric Loken when they talk about The garden of forking paths.

What they mean by that is basically that you start off doing an analysis where you

just haven't seen the data, and maybe you have an analysis plan in mind.

Then once you collect the data you realize,

oh, there's a problem of a particular type.

This happens all the time in genomic data.

And then you start making decisions based on the data that you've observed, and

once you start doing that you start playing into this researcher degrees of

freedom idea.

You're basically changing the way that you're analyzing the data based on

the data, and you can end up with a little bit of trouble.

So the key is to be thinking ahead right from the beginning, how am I going to

analyze these data, what decisions am I going to make before looking at the data,

so that you're not sort of driven by those,

and sort of end up chasing a false positive.

So the key take-home message here:

have a specific hypothesis that you're looking for.

So with genomic data there's this sort of tendency to just sort of do discovery for

the sake of doing discovery without a specific hypothesis.

And that can often lead towards this sort of garden of forking paths or

these researcher degrees of freedom.

Another thing that you can do is pre-specify your analysis plan.

Even if it's just internal to you, say:

this is the way we're going to analyze the data and we're going to stick to it.

And then even if you end up adapting it later,

it's good to just analyze the data once exactly how you planned on analyzing it,

even if it has problems, just so you know what would have happened, and

see if there's big differences and why those differences might be.
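
One lightweight way to pre-specify a plan, even if only for internal use, is to write it down in a structured form and record a fingerprint of it before looking at the data. The sketch below is only one possible approach. The plan fields and their values are hypothetical placeholders, not a recommended pipeline.

```python
import hashlib
import json

# Hypothetical analysis plan, written down before any data are examined.
analysis_plan = {
    "hypothesis": "expression of the candidate genes differs between cases and controls",
    "normalization": "quantile",
    "covariates": ["batch", "age", "sex"],
    "outlier_rule": "no samples removed",
    "test": "two-sided t-test per gene",
    "multiple_testing": "Benjamini-Hochberg, FDR 0.05",
}


def plan_fingerprint(plan):
    """SHA-256 of a canonical JSON encoding, so any later change is detectable."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Record this value (lab notebook, commit message) before the analysis starts.
    print(plan_fingerprint(analysis_plan))
```

If the plan is adapted later, the fingerprint changes, which makes it easy to tell which results came from the analysis exactly as planned and which came from a revised one.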

Another thing that you can do if you have enough data,

although it's often not the case in genomics, is use training and test sets,

so the idea that you can split your data up into a first analysis data set and

then you can validate the results that you get in the remaining data.
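
The split itself can be sketched in a few lines. This is a minimal example, assuming samples are identified by ID and that the split is made once, with a fixed seed, before any outcomes are examined. The function name, the 30% validation fraction, and the sample IDs are illustrative choices.

```python
import random


def split_discovery_validation(sample_ids, validation_fraction=0.3, seed=42):
    """Randomly split sample IDs once into (discovery, validation) lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_validation = max(1, round(len(ids) * validation_fraction))
    return ids[n_validation:], ids[:n_validation]


if __name__ == "__main__":
    all_samples = [f"sample_{i:02d}" for i in range(20)]  # hypothetical IDs
    discovery, validation = split_discovery_validation(all_samples)
    print(len(discovery), "discovery samples;", len(validation), "validation samples")
```

All exploratory choices are then made on the discovery set only, and the validation set is used a single time to check the final result.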

And then analyze your data once.

So a very common temptation with genomics is to add increasingly complicated models

until you find more and more things, and that often leads to false positives.

The other thing that you could do, if you're going to do many analyses,

is report all of those analyses. It will give people the opportunity to

understand if maybe there's potential for

data dredging or researcher degrees of freedom in your analysis.

So this is sort of a cautionary note that genomic data is complicated, and if you

add complicated analysis on top, you can often run into extra false positives.
