Let's talk about TopHat first. TopHat is a read mapper for RNA-seq. Let’s see how it works.
First, reads are mapped to the genome by Bowtie.
Then reads are piled up and assembled to generate consensus.
A consensus can be a real exon, or consist of several exons, or something else.
We know that canonical junction sites have sequence symbols such as GT and AG.
Next, TopHat will search through the consensus to find and get all possible junction sites and their combinations.
For unmapped reads, TopHat will build their index and map them to these assembled junction sites.
In this way, junction reads are mapped to the reference genome accurately.
Let’s talk about some important command-line options of TopHat. First is the insert length option.
In RNA-Seq, mRNAs are split into small fragments, filtered by length, and sequenced. If the expected fragment length is 300bp,
and it will have 75bp sequenced from each of both ends, then the insert size should be set to 150bp.
The option below is for the standard deviation of the insert length. If the length distribution is concentrated on a certain value,
you can set a small value, otherwise you should set it larger.
The -G option is to provide an annotation file. If your genome has been well annotated, it’s better to provide this file.
TopHat will map reads to the transcriptome first, and then map unmapped reads to the genome. Finally it will merge the two results.
We know that most transcriptomes are much smaller than their corresponding genomes, and junction reads can be mapped to transcriptome
directly. In this way, mapping can be more efficient and accurate.
Let’s talk about library type.
Standard Illumina platform is strand-inspecific. In other words, we can’t tell which one of paired reads and the transcript
are in the same direction, and which one reverse-complements the transcript.
For strand-specific data, there are two cases. In the fr-firststrand method,
the second read is in the same direction as the transcript and the first read reverse-complements the transcript.
The fr-secondstrand is just the opposite.
So you should be careful about whether your data is strand-specific or not, and how strand-specific it is.
This is a demonstration of TopHat. The dataset used in this video is GSE32038. You can download it from the GEO website.
This is a simulated paired-end RNA-Seq dataset. It has two conditions, and both conditions have 3 replicates.
Here C represents the condition, and R represents replicates. I will use C1 R1 for demonstration.
We need some preparations before running TopHat.
First, we need a reference sequence, or the genome sequence, in fasta format.
This is the RNA-Seq data for Drosophila melanogaster.We have downloaded its genome in advance, which is this file.Let’s take a look.
Since TopHat uses Bowtie2 to map reads, we need to build index files for Bowtie2.
We can build it by this command.
This process takes some time, so we skip the running and check the result.
These files ending with “bt2” are the index files for Bowtie2.
We also need the paired-end reads files in fastq format.The two fastq files are ended with “_1” and “_2”, respectively.
This is what the files looks like.
Every 4 consecutive lines represent a read.
The first line is its name.The second line is its sequence.The third line is a plus symbol.
The fourth line is a series of symbols describing the quality for each base above.
In simulated data they are all “I”.
In practice, we need to do for the raw data some very important preprocessing steps, including quality evaluation and filtering.
Then we need to prepare this annotation file, though it is not a must.
This annotation in GTF or GFF3 format, if annotated well enough, can be downloaded from UCSC.
Let’s take a closer look.
Let’s check the first line.For example, this is an exon, and it is on this chromosome.
It starts from here and ends at here.It resides on the plus strand and belongs to this gene and this transcript, and it is also the first exon.
Such annotations can provide for TopHat information of splicing sites.
We can run TopHat once these are ready.
Here -p, -G, and -o denote the number of threads, the annotation file, and the output directory, respectively.
Following the options is the indices of reference sequence, and finally there are the two reads fastq files.
Now let’s type this command to run TopHat.
Because it really takes some time to finish, we will just skip the running and check the result we have produced before.
Let’s take a look at the C1_R1 output directory of TopHat.
First let’s check the logs.
The tophat.log file describes how TopHat ran in the whole.
Now let’s check the align_summary file which describes some statistics of reads mapping.
We can see that all reads have been mapped back to the genome, 1.6% of which are mapped to multiple loci.
Since the data is made from simulation, the proportion of mapped reads is 100%.
This can hardly be true in practice, and it would be very nice to have such proportion reaching 90%.Of course, 60%~70% is also acceptable.
Now let’s check the bam file.
This file describes how reads are mapped back to the genome.
We need to use samtools to view this binary file.
Let’s take a closer look.
Each line of this bam file describes a mapping of a read to the reference genome.
For example, let’s check this line.
This number describes various aspects of the mapping.For example, whether this read is mapped or not,
to which strand this read is mapped, whether its paired read is also mapped, etc.
This is the name of the reference sequence to which the read is mapped.
This is the position of the leftmost base of the read on the reference sequence.
This is the quality of mapping.
The 75M tells us that all the 75 bases of the read have been mapped back to the genome.
Here’s another different case.
The first 66 bases of the read are mapped to the genome, following by a 76-bp gap
and then another consecutive mapping of the last 9 bases of the read.
This read has been mapped to a junction site.
This is the position of the read paired with the previous read on the reference genome.
This is the length of fragment from whose two ends the paired reads are sequenced.
That’s all for TopHat.Now let’s talk about Cufflinks.
Cufflinks is a toolkit for transcript assembly, expression estimation, and differential expression analysis.
Let’s talk about the transcript assembly and expression estimation first.
As mentioned before, reads are generated from transcripts.
If a gene has multiple transcript structures, such as the red, blue, and yellow transcripts depicted in this figure,
then we cannot tell from which transcript each read comes from, except for those reads colored in red, blue, or yellow.
Then how can we reveal the truth as much as possible?
Cufflinks aims at assembling the most likely transcript structure and estimate its expression level.
Now let’s explain some important command-line options of Cufflinks.
The -G option provides an annotation file for Cufflinks to assemble transcripts that exist ONLY in this annotation.
The -g option also provides an annotation file, but Cufflinks will assemble new transcripts guided by the known transcripts in this annotation.
The -u option tells Cufflinks to use a method that is more accurate to process multi-mapped reads.
Without the -u option specified, Cufflinks will only simply distribute the reads uniformly.
For example, a read that is mapped to 10 positions will make Cufflink assign to each position 0.1 read.
With the -u option specified, Cufflinks can distribute the reads more accurately.
For example, the reads will be distributed evenly first,
and the probability of each read being distributed to each position will be estimated by the expression levels of the 10 positions.the expression levels of the 10 positions.
In fact, Cufflinks will use EM to iterate and get the read distribution that is most likely given the current observed data.
The “library type” is similar as TopHat’s.
Below is a demonstration of Cufflinks. These options have been introduced before, and this bam file is the result of TopHat we have just run.
Let’s run this command.
This will take some time to finish, so we skip this and check the result directly.
Let’s take a look at this output directory by Cufflinks.
This file describes the transcripts assembled by Cufflinks. Let’s take a closer look.
Last section we have introduced Tophat and Cufflinks, now we will show how to navigate Cuffmerge, Cuffdiff and CummeRbund.
When will process several samples using Cufflinks, we need to merge all the transcripts data into a more comprehensive transcript.
Cuffmerge is a tool to merge gtf files into a gtf files with more comprehensive gene annotation.
As the figure shows, the 6 transcripts were merged into one.
At the same time, the output can be more accurate with the genome annotation available.
This merged assembly (transcript) provides a uniform basis for calculating gene and transcript expression in each condition.
Now let’s make acquaintance with the options of Cuffmerge.
-g:An optional "reference" annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.
-p:Use this many threads to align reads. The default is 1.
-s:This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig.
It is a simple example of Cuffmerge
These options have been mentioned below. The last item is a list which consists of the file paths of assembled transcripts from Cufflinks.
Now let’s run Cuffmerge.
First of all we need input “cat” command to create a list, which consists of file paths of all assembled transcripts.