An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science

93 ratings

Johns Hopkins University


Course 7 of 8 in the Specialization Genomic Data Science


From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

One of the most common causes of false positives in genomics is batch effects, or other confounders. So I'm going to show you a little example of how you remove those in R. First I set up my plotting parameters the way I like, and then I load the libraries we need. In this case it's the sva package and the snpStats package that are the main drivers of most of the analysis we're going to be doing here. So I load those packages, and then I load this bladder data set.

In this case I have an expression set, a bladder expression set. I'm going to extract the phenotype information and the expression information from it. So now I have an expression object with 57 samples in it, and I have some information on the phenotypes: the outcome, a batch variable, and whether each sample is a cancer sample or not.
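A sketch of that setup in R, assuming the bladderbatch Bioconductor package, which provides a bladder expression set like the one described here:

```r
# Load the packages driving the analysis
library(Biobase)
library(sva)
library(snpStats)
library(bladderbatch)  # assumed source of the bladder expression set

# Load the bladder data and pull out the pieces we need
data(bladderdata)
pheno = pData(bladderEset)   # phenotype info: outcome, batch, cancer status
edata = exprs(bladderEset)   # expression matrix, 57 samples in the columns
```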

The first thing you might want to do to deal with a batch effect, if it's labeled like this, if you actually have the batch variable, is to adjust for it directly. The way you do that is to build a model matrix for the cancer variable and then just add in the batch variable, telling it where the data come from, namely the phenotype data. Now I've got this model matrix, which I can fit with lm.fit, like we learned in the last lecture. So now I have coefficients that have been adjusted for batch effects; I've adjusted for them directly with the lm.fit function. I can now look at the coefficients for the variable I care about across all of the different samples. That works out pretty well. The one thing you have to be a little bit careful of is a tight correlation between the batch variable and the outcome variable you care about. Ideally you'll have balance: within each batch you'll have observations from multiple different groups. In this case it's a little bit tricky, because, for example, the fourth batch only has biopsy samples. Fortunately they're not perfectly confounded, in the sense that in the fifth batch you actually had some biopsy and some cancer samples, and similarly you have some comparison between cancer and normal within the second batch, so you have some information to gain there. But if these are perfectly correlated, in other words if all the cancer samples are run in batch 1 and all the normal samples are run in batch 2, you can't use any of the adjustment methods I'm going to be talking about here. So you have to be very careful about that at the experimental design stage.
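The direct adjustment might be sketched like this, assuming the pheno and edata objects from the setup described above:

```r
# Model matrix with both the outcome and the batch variable
mod = model.matrix(~ as.factor(cancer) + as.factor(batch), data = pheno)

# Fit all genes at once; rows of t(edata) are samples, columns are genes
fit = lm.fit(mod, t(edata))

# Coefficients for the cancer variable, now adjusted for batch
hist(fit$coefficients[2, ], breaks = 100, col = 2)

# Check for confounding before trusting this: each batch should
# contain observations from more than one outcome group
table(pheno$cancer, pheno$batch)
```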

Another way you can adjust for batch effects, if you actually know them in advance, is to use ComBat. It's a similar approach to direct linear adjustment, but it does some of the shrinkage we talked about. That can have some advantages when you have especially small sample sizes. Here I have to build a model matrix for the ComBat function; the model matrix in this case specifies any variables other than batch that ComBat should adjust for.

So then when I'm running ComBat, I basically run it to get a cleaned expression data set. I call ComBat and tell it what data to look at, in this case the expression data, what the batch variable is, and whether there's anything else it should be adjusting for. It then fits a set of models, and you can have it produce plots or not. What ends up happening is that it fits this model and basically regresses out the batch effect. Then you can go through and do lm.fit on the cleaned data: you get the ComBat fit on this data, and then you fit the lm.fit model with our cancer model matrix, now using the data set that was cleaned with ComBat.
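That workflow might look like this sketch, again assuming the pheno and edata objects from the setup described above:

```r
# Model matrix of any extra adjustment variables for ComBat;
# here just an intercept, since batch is passed separately
modcombat = model.matrix(~ 1, data = pheno)

# Regress out the batch effect, with empirical Bayes shrinkage
combat_edata = ComBat(dat = edata, batch = pheno$batch,
                      mod = modcombat, par.prior = TRUE, prior.plots = FALSE)

# Now fit the cancer model on the cleaned data
modcancer = model.matrix(~ cancer, data = pheno)
combat_fit = lm.fit(modcancer, t(combat_edata))
hist(combat_fit$coefficients[2, ], breaks = 100, col = 2)
```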

So now I have the ComBat fit coefficients, which I can look at as well.

I can see here that they seem to be a little bit smaller than the coefficients I got when I fit the direct linear adjustment, and in fact that tends to be the case. So here I'm going to make a plot. It has a few parameters, so I'm not going to type the whole thing out, but I'm going to plot the coefficients from the original linear model fit versus the ComBat model fit. Actually, here's what I'm going to do: I'm going to set a one by one plot layout, make that plot, and add the 45 degree line. You can see, for example, that the ComBat values tend to be a little bit smaller, so the y axis is a little bit compressed relative to the x axis, because we've done that shrinkage. The other thing that can happen is that you either don't know the batch variable, or the batch variable doesn't encompass all the different potential technical artifacts. Then you might want to infer the batch variables, and you can do that with the sva function from the sva package. So here I'm going to build the model matrix with the cancer outcome that I care about, and then build the model that I'm going to compare it to. In this case, I'm going to compare it to a model with no covariates in it, so it's the null model.
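A sketch of that comparison plot and of the two model matrices sva needs, assuming the fit and combat_fit objects from the direct adjustment and ComBat steps described above:

```r
# Compare direct linear-adjustment coefficients to the ComBat ones
par(mfrow = c(1, 1))
plot(fit$coefficients[2, ], combat_fit$coefficients[2, ],
     col = 2, xlab = "Linear model", ylab = "ComBat",
     xlim = c(-5, 5), ylim = c(-5, 5))
abline(c(0, 1), lwd = 3)   # 45 degree line

# Model matrices for sva: full model with the outcome, and the null model
mod  = model.matrix(~ cancer, data = pheno)
mod0 = model.matrix(~ 1, data = pheno)
```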

Then what I can do is actually create some covariates by running the sva function on this data set. I pass it the expression data set, the model including the variable I care about, and the model not including the variable I care about, and then I tell it how many surrogate variables, or how many surrogate batch effects, it should estimate. It's going to go through and estimate those batch effects through an iterative procedure. Now I can see that the sva object that's returned has this sv component, which is basically the new covariates that have been created; those are the potential batch effects. So, for example, I can correlate those estimated batch effects with the actual observed batch variable, and I can see that the second estimated surrogate variable is super highly correlated with the batch variable, while the first one isn't as correlated with it as the second one. Then I can make a plot of that second surrogate variable versus the batch variable, and I can see that there's a relationship between the inferred surrogate variable, the inferred batch effect, and the observed batch effect. I can plot the data points on top of that to see if that's true. What we've done here with sva is not actually cleaning the data set; we've just identified new covariates that we now need to include in our model fit. So we combine them with our original model that had the cancer term in it, and we end up with a model fit that includes both our estimated surrogate variables and our cancer status variable. Then I can use lm.fit, or any of the other methods we talked about in the previous lecture, to get the model fit after adjusting for those variables. So now I can see what the model fits look like, comparing SVA to ComBat, for example. Here, I'm plotting SVA versus ComBat.
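A sketch of the sva step, assuming the edata and pheno objects and the mod and mod0 model matrices described above:

```r
# Estimate two surrogate variables (inferred batch effects)
sva1 = sva(edata, mod, mod0, n.sv = 2)

# Correlate the estimated surrogate variables with the observed batch
cor(sva1$sv, as.numeric(pheno$batch))

# Plot the second surrogate variable against the observed batch,
# with the individual data points on top
boxplot(sva1$sv[, 2] ~ pheno$batch)
points(sva1$sv[, 2] ~ jitter(as.numeric(pheno$batch)),
       col = as.numeric(pheno$batch))

# sva does not clean the data; instead, add the surrogate
# variables as covariates to the cancer model and refit
modsv = cbind(mod, sva1$sv)
fitsv = lm.fit(modsv, t(edata))
```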

On the x axis are the SVA-adjusted values, and on the y axis are the ComBat-adjusted values. You can see that they're actually pretty highly correlated with each other.

Then you can do the same thing for SVA versus the plain linear model adjustment, without the ComBat shrinkage. You can see here that these aren't shrunken, and so they're maybe a little bit more correlated without the shrinking. So that's basically how you can run sva on a data set to infer the batch effects if you don't know what they are.
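Those two comparison plots might be sketched like this, assuming the fitsv, combat_fit, and fit objects from the steps described above:

```r
par(mfrow = c(1, 2))

# SVA-adjusted vs ComBat-adjusted cancer coefficients
plot(fitsv$coefficients[2, ], combat_fit$coefficients[2, ],
     col = 2, xlab = "SVA", ylab = "ComBat",
     xlim = c(-5, 5), ylim = c(-5, 5))
abline(c(0, 1), lwd = 3)

# SVA-adjusted vs direct linear-adjustment coefficients (no shrinkage)
plot(fitsv$coefficients[2, ], fit$coefficients[2, ],
     col = 2, xlab = "SVA", ylab = "Linear model",
     xlim = c(-5, 5), ylim = c(-5, 5))
abline(c(0, 1), lwd = 3)
```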
