Next we're going to look at how to perform a differential expression and splicing analysis using Cuffdiff between our two conditions, test and control. You might recall that we have three replicates for each. Let's start again by looking at the basic command-line usage for Cuffdiff.
Actually, before even doing so: it is often very useful to first reconcile the gene names and gene organization across the multiple samples. This is because in some of the samples there might not be enough reads to assemble a transcript end to end. Cuffmerge, which is also part of the Tuxedo suite, is designed to do exactly this: to reconcile the gene structures across the samples, to combine them and compare them with the reference annotation, and to do so seamlessly. So what we're going to do in the first stage is run cuffmerge.
Let's see how to use cuffmerge, just like we've done before, in cuffmerge.log. Okay, so it's cuffmerge, options, and then a file that contains a list of the transcript annotations for each of the samples. There are a number of options. There is -o, just like before, for the directory where we would like to write the merged assembly. Then, if you want to give it a reference annotation, you might remember the annotation from our TopHat command. There's also the minimum isoform fraction, the number of threads, and so on. We're going to use a few of these options here: the output directory, the reference annotation, and the number of threads.
First let's make a directory, cuffmerge, where we would like to put the output. Then we're going to write, similarly, a command file called com. I'm going to make it a combined one for Cuffdiff and Cuffmerge, com.cuffdiff. The first thing we're going to write is cuffmerge, and we're going to give it a reference file. I'm going to find it by grepping, so let's look here. There's the annotation.
Let's note this annotation. So it's cuffmerge, options, and here I give it the annotation. Let's say that we want to use eight threads. We want the output to go into the directory WORKDIR/cuffmerge. This is where we'd like the output to go.
Now we have to create a little text file that lists the transcript files created by Cufflinks for these samples. It's going to be GTFs.txt, and let's put it in the directory WORKDIR/cuffmerge, so that we keep everything encapsulated there. So let's create the file cuffmerge/GTFs.txt. The first file was in the Cufflinks directory; let's start with test, Test1: transcripts.gtf. It is one file per line. We're listing the Cufflinks transcripts for test samples one, two, and three, followed by control samples one, two, and three. Essentially, what Cuffmerge does is call Cufflinks to assemble these together with the reference annotation and create full transcript models. So we've created the file. Okay, let me show it again.
Now we're going to go back to com.cuffdiff, and we have everything we need. The results are going to be reported in the directory cuffmerge. Just like before, you can simply run the command file, perhaps with nohup, as sh com.cuffdiff, and save the standard error to cuffdiff.log, and so on. I'm not going to run it; I'm going to show you what the result is.
The cuffmerge command in fact creates a single file, merged.gtf, which is the result of merging the transcripts and reconciling them across these samples. The result is in cuffmerge/merged.gtf, and a look at this file shows that it's a GTF file. Basically, every locus is assigned a new gene ID, so now we have the same identifiers across the samples, and the transcripts receive new IDs as well.
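The Cuffmerge setup described above can be sketched as a small shell script. This is only a sketch under stated assumptions: the paths behind WORKDIR, CUFFLINKS_DIR, and ANNOT are placeholders, and the control directory names (Control1 to Control3) are guesses patterned on Test1. It writes the list file and the command file but does not run cuffmerge.

```shell
#!/bin/sh
# Sketch only: every path below is a placeholder, not the lecture's real location.
WORKDIR="${WORKDIR:-$(mktemp -d)}"          # scratch directory unless WORKDIR is already set
CUFFLINKS_DIR="$WORKDIR/cufflinks"          # assumed location of the per-sample Cufflinks output
ANNOT="${ANNOT:-/path/to/annotation.gtf}"   # reference annotation (hypothetical path)

mkdir -p "$WORKDIR/cuffmerge"

# One transcripts.gtf per line: the three test replicates, then the three controls.
: > "$WORKDIR/cuffmerge/GTFs.txt"
for sample in Test1 Test2 Test3 Control1 Control2 Control3; do
    echo "$CUFFLINKS_DIR/$sample/transcripts.gtf" >> "$WORKDIR/cuffmerge/GTFs.txt"
done

# Write the cuffmerge step into the command file instead of running it here.
# -g: reference annotation, -p: number of threads, -o: output directory.
cat > "$WORKDIR/com.cuffdiff" <<EOF
cuffmerge -g $ANNOT -p 8 -o $WORKDIR/cuffmerge $WORKDIR/cuffmerge/GTFs.txt
EOF

cat "$WORKDIR/cuffmerge/GTFs.txt"
```

To actually launch it in the background, as mentioned in the lecture, something like `nohup sh com.cuffdiff 2> cuffdiff.log &` would do, run from the directory that holds the command file.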
In the merged file you also have, for every line and for every transcript, the corresponding gene name and the old ID. In this case the old ID comes from the annotation, but it could also have been one of the labels from among the Cufflinks transcripts. There are class codes and so on as well, from the comparison to the reference annotation. So it's a GTF file that contains the reconciled genes and transcripts across the six samples.
Now, once we have that, we're going to run Cuffdiff, and Cuffdiff will run in two stages. In the first stage it will assign reads to this set of transcripts for each of the samples to quantify them, that is, to compute FPKM expression levels for transcripts and genes in each of the six samples. In the second stage it will perform the differential expression and differential splicing analyses.
So let's continue writing our command file. Once we create the merged file, we're going to use it in cuffdiff. Let's create a directory called cuffdiff as well; that's where we're going to send the output. And, yes, we should have defined WORKDIR up here. It is our directory on data one, Coursera/L4. Okay, let's write it there. Now we don't need one subdirectory per sample, because the results are all going into one directory.
So, cuffdiff. We want the output to be sent to WORKDIR/cuffdiff. We want to use, say, ten threads. Actually, let's look at the cuffdiff usage to see what we need. That's a very long list again. The basic format is cuffdiff, a list of options, the transcripts GTF (these are the reference transcripts, the merged file that we just produced), and then the alignment files. So this would be WORKDIR/cuffmerge/merged.gtf, and that's enough for that line. Now we have two parameters that specify the alignment files corresponding to the replicates in the two conditions, so let's start with the test. We know that those are in the TopHat directory. Let's make that clear here: the TopHat directory, THDIR, is tophat.
So we're going to write that here: THDIR/Test1/accepted_hits.bam. We'll do the same for the other two replicates, replicate two and replicate three, and then for the control set.
Let's take a look at the command as a whole. We're calling cuffdiff. The output will be saved in the directory cuffdiff under the working directory. It will use ten threads. It will use as the reference file the merged file that was produced by Cuffmerge. Then we have two sets of replicate files, the first corresponding to the test condition and the second corresponding to the control condition. And again, we're going to run that with nohup sh com.cuffdiff and save the standard error to cuffdiff.log, so that in case any errors occur we would know what they are. Just like before, because this takes a while, sometimes a long while, we're going to cheat again and go to the results to see what the output looks like.
In cuffdiff there's a set of files. There are the .diff files, which show the differential analyses. The most important of them, and the most frequently used, is gene_exp.diff. You'll see that there are two types of files here. There are the ones that refer to expression: for instance, whether a gene's expression is significantly different between the two conditions, or whether an isoform's expression is significantly different between the two conditions, and the same for groups of transcripts. Then we also have cds.diff, promoters.diff, and splicing.diff, which refer to splicing: whether the percentages of the various transcripts within a group, for instance within a gene in the case of splicing, are the same or significantly different between the conditions.
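The cuffdiff invocation assembled above can be sketched as follows. Again this is a sketch under assumptions: WORKDIR and THDIR are placeholder paths, the control directory names are hypothetical, and the command is appended to the command file rather than executed. Replicates of one condition are joined with commas, and the two conditions are separated by a space.

```shell
#!/bin/sh
# Sketch only: placeholder paths, hypothetical Control1..Control3 directory names.
WORKDIR="${WORKDIR:-$(mktemp -d)}"   # scratch directory unless WORKDIR is already set
THDIR="$WORKDIR/tophat"              # assumed location of the TopHat output
mkdir -p "$WORKDIR/cuffdiff"

# One comma-separated list of replicate alignments per condition.
TEST_BAMS="$THDIR/Test1/accepted_hits.bam,$THDIR/Test2/accepted_hits.bam,$THDIR/Test3/accepted_hits.bam"
CTRL_BAMS="$THDIR/Control1/accepted_hits.bam,$THDIR/Control2/accepted_hits.bam,$THDIR/Control3/accepted_hits.bam"

# Append the cuffdiff step to the command file; nothing is executed here.
# cuffmerge writes merged.gtf into its output directory.
cat >> "$WORKDIR/com.cuffdiff" <<EOF
cuffdiff -o $WORKDIR/cuffdiff -p 10 $WORKDIR/cuffmerge/merged.gtf $TEST_BAMS $CTRL_BAMS
EOF

cat "$WORKDIR/com.cuffdiff"
```

Because the test list comes first, it becomes sample 1 in the output tables and the control list becomes sample 2.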
As I said, the most important one, the most frequently used, is the expression analysis, so let's take a look at gene_exp.diff, with the understanding that the other .diff files have a similar organization. We have the locus ID, which was the one assigned by Cuffmerge; the gene ID, which is the same; the name of the gene, as taken from the reference annotation; the locus, that is, the chromosome and location within the genome; and sample one and sample two, which carry default labels unless specified. We have one line for every locus, for every gene.
Then we have the status column. It tells us whether the test could be performed or not. NOTEST means that it could not be performed. There are other values, such as OK, which is probably what you're going to see most: the test could be performed and there are enough reads. HIDATA means that there are too many reads for the test to be precise. LOWDATA means that there were not enough reads for the test to be precise. And there's also a label called FAIL, which indicates a numerical exception in the calculation.
Following this column we have value one; let's pick one line here. That is the average FPKM for the gene in the first data set, the test, then the average in the control data set, then the log2 fold change. That's log2 of the second over the first, so control over test. This is followed by another column that shows the test statistic, and then its interpretation in terms of p-value and q-value. Finally, the last column tells us whether the difference is statistically significant or not.
So a very simple way to obtain the list of genes that are significantly differentially expressed between the two conditions is to grep for yes. We're grepping for yes in the last column of gene_exp.diff, and we can look at them.
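As a small illustration of this filtering step, the sketch below builds a tiny made-up table with the gene_exp.diff column layout and selects the rows whose last column is yes. All gene names and values are invented for the example. It uses awk to test the last column explicitly, because a plain `grep yes` would also match the string anywhere else on the line.

```shell
#!/bin/sh
# Sketch only: a tiny, made-up table in the gene_exp.diff layout (tab-separated).
DIFF="$(mktemp)"
printf 'test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n' > "$DIFF"
printf 'XLOC_000001\tXLOC_000001\tGENE_A\tchr9:100-900\tq1\tq2\tOK\t6.0\t0.75\t-3\t-4.1\t0.0001\t0.004\tyes\n' >> "$DIFF"
printf 'XLOC_000002\tXLOC_000002\tGENE_B\tchr9:2000-2900\tq1\tq2\tOK\t10.0\t10.0\t0\t0\t1\t1\tno\n' >> "$DIFF"
printf 'XLOC_000003\tXLOC_000003\tGENE_C\tchr9:5000-5400\tq1\tq2\tNOTEST\t0\t0\t0\t0\t1\t1\tno\n' >> "$DIFF"
printf 'XLOC_000004\tXLOC_000004\tGENE_D\tchr9:7000-7800\tq1\tq2\tOK\t2.0\t8.0\t2\t3.9\t0.0002\t0.005\tyes\n' >> "$DIFF"

# Gene names (column 3) of rows whose last column is exactly "yes"; skip the header.
SIG_GENES="$(awk -F '\t' 'NR > 1 && $NF == "yes" { print $3 }' "$DIFF")"

# Count the same rows.
N_SIG="$(awk -F '\t' 'NR > 1 && $NF == "yes" { n++ } END { print n + 0 }' "$DIFF")"

echo "$SIG_GENES"
echo "significant genes: $N_SIG"
```

On a real run, the same two awk commands can be pointed at the gene_exp.diff file in the cuffdiff output directory.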
We can even count them. In this case we have 169 genes that were identified as significantly differentially expressed. We can perform the same analysis at the isoform level, and so on.
There's one difference when looking at splicing pattern differences. In that case we're looking at the expression values as percentages of the total for a gene. For instance, we might have three different transcripts whose relative FPKM expression values are 50%, 30%, 20% in one condition and 80%, 10%, 10% in the control data set. That's what the splicing.diff file will show us.
I'm not going to go through that. I would just like to select one gene for further analysis and for illustration. I'm going to look at chromosome 9, because it has a small number of genes and I can pick one. So let's keep this in mind: let's look at gene 215 on chromosome 9. In the test we have 6.8776 as the average FPKM; in the controls we only have 0.18. So that's a substantial fold change, and it is statistically significant. We're going to illustrate that next and visualize it in IGV.
So this is, in a nutshell, how we can use Cuffmerge and Cuffdiff to perform a differential expression analysis and how we can interpret the basic results. Next we're going to look at how we can visualize and manually annotate, or understand and curate, the results of these analyses using IGV, the Integrative Genomics Viewer.