And so the idea here is that we want to identify whether this enrichment is

statistically significant, that is, whether it's more than we would expect to see by chance.

So one way that people do that is they again permute the sample labels.

We've permuted the responders and the non-responders.

And now we get the new set of labels.

And so, once we get the new set of labels, we can recalculate the statistics and

reorder them.

And so now we see that the genes that belong to the gene set are a little bit

more scattered throughout this profile, and so

you see that the profile goes down and then up and then down and then up.

It wiggles a little bit more, but it doesn't deviate from zero as far, and so

there appears to be less of an enrichment of those values.

So you can recalculate the value of this gene set statistic for several

permutations, and then you can calculate a P-value for each gene set category

based on how often the permuted values are at least as extreme as the observed value.
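
As a rough sketch of that permutation procedure, here is one way it could look in plain Python. Everything named here is an assumption for illustration: the per-gene statistic is just a difference in group means, the gene set statistic is a simplified, unweighted running sum (not necessarily the exact statistic on the slide), and the function names and toy data are made up.

```python
import random
from statistics import mean


def gene_stats(expr, labels):
    """Per-gene difference in mean expression: label 1 minus label 0."""
    stats = []
    for row in expr:
        group1 = [x for x, lab in zip(row, labels) if lab == 1]
        group0 = [x for x, lab in zip(row, labels) if lab == 0]
        stats.append(mean(group1) - mean(group0))
    return stats


def enrichment_score(stats, in_set):
    """Largest deviation from zero of a running sum over the ordered genes.

    Genes are ordered from largest to smallest statistic. The sum steps up
    for genes in the set and down for the others, so it ends back at zero.
    Assumes the set is non-empty and does not contain every gene.
    """
    order = sorted(range(len(stats)), key=lambda i: stats[i], reverse=True)
    n_in = sum(in_set)
    n_out = len(in_set) - n_in
    running, best = 0.0, 0.0
    for i in order:
        running += 1.0 / n_in if in_set[i] else -1.0 / n_out
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(expr, labels, in_set, n_perm=1000, seed=0):
    """Permute the sample labels and compare permuted scores to the observed one."""
    rng = random.Random(seed)
    observed = enrichment_score(gene_stats(expr, labels), in_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        permuted = enrichment_score(gene_stats(expr, shuffled), in_set)
        if abs(permuted) >= abs(observed):
            hits += 1
    # Add one to numerator and denominator so the P-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (made up): 4 genes x 4 samples; the first two genes form the set.
expr = [[5, 6, 1, 2], [4, 5, 0, 1], [1, 2, 1, 3], [2, 1, 2, 2]]
labels = [1, 1, 0, 0]  # 1 = responder, 0 = non-responder
in_set = [True, True, False, False]
score, p = permutation_p_value(expr, labels, in_set, n_perm=200)
print(score, p)
```

With only a handful of samples there are very few distinct relabelings, so the P-value from a toy example like this is not meaningful; the point is only the shape of the calculation.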

And so you can calculate a P-value for each of the gene sets and

then again do a false discovery rate correction and identify the gene sets that

are statistically significant.
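
To make the correction step concrete, here is a minimal sketch that assumes the Benjamini-Hochberg procedure is the false discovery correction being used; the lecture doesn't name a specific method, so that choice, the function name, and the example P-values are all assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, so each adjusted value
    # is no larger than the adjusted values of the P-values ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up P-values, one per gene set.
set_p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(set_p_values)
print(adjusted)  # approximately [0.02, 0.04, 0.04, 0.02]
```

You would then call a gene set significant if its adjusted value falls below whatever false discovery rate you chose ahead of time.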

So what are the gene sets you can look at?

The Gene Ontology Consortium has a large ontology of gene sets that are based on

their function, on their spatial location within the cell, and so forth.

You can also look at molecular signatures that have been curated,

for example the set of molecular signatures that you can get from the

MSigDB database.

Or you can look at things like interactions between proteins and

then see whether there is an enrichment for a particular set of interactions among

the genes that you found to be differentially expressed.

Really, any previously defined set of genes that has some

function that you care about can be used for a gene set enrichment analysis.

So one thing to keep in mind is that this can be very hard to interpret, especially if

the categories are broad or vague.

So for example, if you get a category that comes out as transcriptional regulation,

that's a very broad category; there are lots of different subcategories of that.

And so if that's enriched, it's not clear how much added value it's giving you.

It's better if you can find specific, concrete categories that are enriched.

Here, if you're not very careful, you can end up telling stories that the data don't really support, so

again you have to correct for the multiple testing problem, and

you have to be very aware of your own implicit biases.

This incurs a second multiple testing problem, like I said, on top of

the multiple testing problem involved in identifying differentially

expressed genes.

Now you're also testing multiple gene sets, and so you have to account for

that as well.

This idea can actually be simplified.

The gene set enrichment statistic I showed you here

can be simplified into basically a very simple t-statistic

comparing the genes that are in the set to the genes that are out of the set, and so

you can read about that here in this paper.
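
As a sketch of what that simplification could look like, the snippet below computes a two-sample t-statistic on the per-gene statistics, comparing genes in the set with genes outside it. The unequal-variance (Welch-style) form, the function name, and the example numbers are assumptions for illustration; the paper may define the statistic differently.

```python
from math import sqrt
from statistics import mean, variance


def set_t_statistic(stats, in_set):
    """Welch-style t-statistic: per-gene statistics in the set vs. out of the set.

    Assumes at least two genes on each side so both variances are defined.
    """
    inside = [s for s, member in zip(stats, in_set) if member]
    outside = [s for s, member in zip(stats, in_set) if not member]
    standard_error = sqrt(
        variance(inside) / len(inside) + variance(outside) / len(outside)
    )
    return (mean(inside) - mean(outside)) / standard_error


# Made-up per-gene statistics; the first three genes are in the set.
per_gene = [3, 4, 5, 0, 1, 2]
membership = [True, True, True, False, False, False]
t_value = set_t_statistic(per_gene, membership)
print(round(t_value, 2))  # 3.67
```

A large positive value means the genes in the set tend to have larger statistics than the rest; you'd still need a P-value for each set and a correction across sets, as before.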