Hi, welcome back to the last unit of this week’s lectures.
In the last three units, we looked at three large centralized bioinformatics resources.
There are also thousands of individual bioinformatics resources.
In Unit 1, I showed you several tables of individual resources for different analysis purposes.
In this unit, let’s look at a few examples of them in a little bit more detail.
Let’s start with resources for protein three-dimensional structures.
If you want to find experimentally determined three-dimensional structures of proteins, the Protein Data Bank, or PDB, is the place to go.
It contains the three-dimensional coordinates of atoms in a protein structure
determined by X-ray crystallography or other technologies as well as lots of other useful annotations.
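To make the idea of "coordinates of atoms" concrete, here is a minimal sketch of reading one atom record from a PDB-format text file, where each field sits in fixed columns. The `Atom` type, the `parse_atom_line` helper, and the sample record are illustrative assumptions (the coordinate values are made up), and only a few of the fields are extracted.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM/HETATM record of a PDB-format file."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11: atom serial number
        name=line[12:16].strip(),      # columns 13-16: atom name
        res_name=line[17:20].strip(),  # columns 18-20: residue name
        chain_id=line[21],             # column 22: chain identifier
        res_seq=int(line[22:26]),      # columns 23-26: residue number
        x=float(line[30:38]),          # columns 31-38: x in angstroms
        y=float(line[38:46]),          # columns 39-46: y in angstroms
        z=float(line[46:54]),          # columns 47-54: z in angstroms
    )


# Illustrative record with made-up values, laid out in the PDB column format.
SAMPLE = (
    "ATOM      1  N   MET A   1    "  # columns 1-30
    "  27.340  24.430   2.614"        # columns 31-54: x, y, z
    "  1.00  9.67"                    # columns 55-66: occupancy, B-factor
    "           N"                    # columns 67-78: element symbol at the end
)

atom = parse_atom_line(SAMPLE)
print(atom.name, atom.res_name, atom.chain_id, atom.res_seq, atom.x, atom.y, atom.z)
```

For real work you would normally rely on an established parser rather than slicing columns by hand; the sketch is only meant to show what a coordinate record looks like.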
As of December 2013, PDB contains about 85 thousand X-ray structures
including 80 thousand protein structures,
15 hundred nucleic acid structures, and over four thousand protein-nucleic acid complexes.
PDB also contains over 10 thousand NMR structures including 9000 protein structures,
1000 nucleic acid structures, and about 200 protein-nucleic acid complexes.
There are also about 1000 structures determined by electron microscopy or other technologies.
This figure shows the growth of PDB over the years. The red bars show the number of structures in PDB over the years.
And the blue bars show the number of newly determined structures in PDB every year.
The total number of protein structures determined is reaching 100,000.
As a comparison, you may remember from Unit 3 that the number of curated protein sequences in UniProt/Swiss-Prot
is 541 thousand, and the number of un-curated protein sequences in UniProt/TrEMBL is 48 million.
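As a rough back-of-the-envelope check of this gap, the approximate counts quoted in this lecture can simply be added up and compared. The variable names are my own, and the result is only an order-of-magnitude estimate, since the PDB total also includes nucleic acid structures and many proteins have more than one entry.

```python
# Approximate counts quoted in this lecture (December 2013).
pdb_xray = 85_000
pdb_nmr = 10_000
pdb_other = 1_000          # electron microscopy and other technologies
swissprot = 541_000        # curated protein sequences
trembl = 48_000_000        # un-curated protein sequences

pdb_total = pdb_xray + pdb_nmr + pdb_other
uniprot_total = swissprot + trembl
sequences_per_structure = uniprot_total / pdb_total

print(f"PDB entries: about {pdb_total:,}")
print(f"UniProt sequences: about {uniprot_total:,}")
print(f"Sequences per structure: about {sequences_per_structure:.0f}")
```

So for every experimentally determined structure there are on the order of several hundred known sequences.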
What can we do about all these extra proteins with only the sequences known?
Fortunately there are bioinformatics methods that can predict the 3D structure of a protein from its sequence.
SWISS-MODEL is such a prediction web server maintained by the Swiss Institute of Bioinformatics.
As I mentioned in Unit 1, there are three main types of methodologies for protein structure prediction.
The boundaries between the types are getting blurry as many of the new methods incorporate two or all three of the methodologies.
Nevertheless, SWISS-MODEL is based primarily on homology modeling.
You can input a protein sequence and SWISS-MODEL searches for homologous proteins with known structure and builds a model for you.
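The template search step can be illustrated with a toy sketch: given a target sequence that has already been aligned to a few candidate templates, pick the template with the highest percent sequence identity. This is not SWISS-MODEL's actual procedure, which I am not reproducing here; the function name, the template names, and the sequences are all hypothetical, and a real server would also weigh factors such as alignment coverage and template quality.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned (non-gap) columns where the two residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Hypothetical pre-aligned pairs: (aligned target, aligned template).
# Names and sequences are made up for illustration.
alignments = {
    "template_A": ("MKTAYIAKQRQISFVK", "MKSAYLAKQR-ISFIK"),
    "template_B": ("MKTAYIAKQRQISFVK", "MRTGWIGRQKQLTFAR"),
}

scores = {name: percent_identity(target, template)
          for name, (target, template) in alignments.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.1f}% identity")
print("best template:", best)
```

The closer the template is to the target in sequence, the more reliable the resulting homology model tends to be.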
Because protein structure prediction can be computationally intensive, the good people at the Swiss Institute of Bioinformatics created
a SWISS-MODEL repository that contains structure predictions that they made for you from protein sequences in UniProt.
As of December 2013, the SWISS-MODEL repository contains over 144 thousand structural models for human proteins,
about 130 thousand models for mouse proteins, and many structural models for other model organisms.
Another structure prediction method with good prediction accuracy is I-TASSER, developed by Zhang’s group at the University of Michigan.
I-TASSER also uses known protein structures as templates.
The same group also developed a prediction program called QUARK that builds ab initio models
by assembling small fragments 1 to 20 residues long using a knowledge-based force field.
In addition to individual prediction servers, there are also meta-servers such as the Protein Model Portal,
which integrates six template-based prediction methods in one user interface.
With so many prediction methods, how do you know which ones have better accuracy?
This can be difficult to assess when everybody evaluates their methods on different datasets.
Wouldn’t it be great for comparison if all the different methods were evaluated objectively against the same datasets?
That’s exactly what CASP does. CASP stands for Critical Assessment of protein Structure Prediction.
It has been running every other year since 1994. The 10th CASP was run in 2012. The 11th CASP will start in May, 2014.
Each time, the CASP organizers solicit from structural biologists protein structures that have been determined but not yet published.
They then provide the amino acid sequences of these proteins online and invite all interested researchers from around the world
to predict the three-dimensional structures using their own methods.
The organizers then compare the predicted structures against the experimentally determined structures,
and find out which methods have the best accuracy. It’s a very interesting and useful assessment.
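To give a feel for what "comparing a predicted structure against an experimental one" means numerically, here is a minimal sketch that computes the root-mean-square deviation (RMSD) between matched atoms. It assumes the two structures are already superimposed and that atoms are listed in the same order. The CASP assessors rely on more elaborate scores such as GDT_TS, so this is a simplified stand-in rather than their procedure, and the coordinates are made up.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (px, py, pz), (ex, ey, ez) in zip(predicted, experimental):
        total += (px - ex) ** 2 + (py - ey) ** 2 + (pz - ez) ** 2
    return math.sqrt(total / len(predicted))


# Made-up coordinates in angstroms for three matched atoms.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (3.8, 0.0, -1.0), (7.6, 1.0, 0.0)]

print(f"RMSD = {rmsd(predicted, experimental):.2f} angstroms")
```

A smaller RMSD means the prediction lies closer to the experimental structure; an RMSD of zero would be a perfect match.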
In CASP10, which was run in 2012, the top servers for template-based modeling are listed here.
They include homology modeling methods, fold recognition methods, and hybrid methods that use known structures as templates.
The top servers for free modeling, that is, ab initio prediction without known structures as templates, are listed here.
I hope that the above few slides have provided you with useful pointers on protein structure analysis for your future research.
Now let’s change gear to look at a few other resources.
You may remember the ENCODE project that I told you about in the previous unit.
ENCODE aims to identify and analyze all functional and regulatory elements in the human genome.
The UCSC Genome Browser provides a data portal for it. But what about model organisms?
Fortunately, there is also a modENCODE project that aims to identify all of the sequence-based functional and regulatory elements
in C. elegans, Drosophila melanogaster, and related species.
As Dr. Gao told you in earlier lectures, some genomic regions are transcribed into RNAs but not translated into proteins.
These RNAs may play diverse functional roles.
Rfam is a database of over 2000 RNA families, including non-coding RNA genes, structured cis-regulatory elements and self-splicing RNAs.
Each RNA family is represented by multiple sequence alignments.
Furthermore, because many of the RNA families have a more conserved secondary structure than sequence,
Rfam also built a consensus secondary structure for each RNA family.
Finally, Rfam built covariance models to describe each family;
these are a slightly more complicated relative of the profile hidden Markov models (HMMs) used by Pfam.
Covariance models can simultaneously model RNA sequence and structure.
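Here is a small sketch of why structure can be more conserved than sequence. Two made-up RNA sequences share a consensus hairpin written in simple dot-bracket notation; they differ at most positions, yet every paired position can still form a base pair because the changes are compensatory. This is only an illustration of the signal that covariance models capture, not a covariance model itself, and the function names, sequences, and structure are hypothetical.

```python
# Base pairs allowed in this toy example: Watson-Crick plus G-U wobble.
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(dot_bracket: str) -> list:
    """Return sorted (i, j) index pairs for matching '(' and ')' characters."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def paired_fraction(seq: str, pairs: list) -> float:
    """Fraction of structural pairs whose two bases can actually pair."""
    return sum((seq[i], seq[j]) in CAN_PAIR for i, j in pairs) / len(pairs)


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Made-up consensus structure and two made-up family members.
structure = "((((....))))"
seq1 = "GGCAUUCGUGCC"
seq2 = "GCCGAAAACGGC"

pairs = parse_pairs(structure)
print(f"sequence identity: {identity(seq1, seq2):.0%}")
print(f"seq1 pairs formed: {paired_fraction(seq1, pairs):.0%}")
print(f"seq2 pairs formed: {paired_fraction(seq2, pairs):.0%}")
```

A model that looks only at each column separately would see two rather different sequences, whereas a model that looks at pairs of columns together sees the same hairpin.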
Transcription factors play key roles in expression regulation.
PlantTFDB is the most comprehensive database for plant transcription factors.
It covers 83 species, and can be browsed by either species or transcription factor family.
You can also search PlantTFDB using either keywords or sequence. It’s a very useful resource for the plant biology community.
In Unit 1 I listed lots of resources for next-generation sequencing data analysis.
We have also taught you several of the methods in detail in earlier weeks.
So here I will just briefly mention two as examples.
SOAPdenovo is a short-read assembly method that can build a de novo draft assembly for large animal and plant genomes from Illumina short reads.
It was developed by BGI.
The SOAP suite of software also includes several other useful tools for next-generation sequencing analysis that you may want to explore.
We talked about methods to call SNPs from next-generation sequencing data in earlier lectures.
There are also lots of large structural variations and copy number variations in individual genomes.
CNVnator is one of the methods to call structural variations and copy number variations from next-generation sequencing data.
Mark Gerstein’s lab has also developed several other software tools and databases for structural variations, which you can explore from here.
Last but not least, I’d like to tell you about three very useful resources that can make your software programming much easier.
Bioconductor has over 700 open-source software packages and modules in the R statistical programming language
that you can download and embed into your own software programs.
They can analyze gene expression data, genetic variation data, and many other types of high-throughput data.
Similarly, BioPerl shares lots of Perl modules for bioinformatics analysis on Linux, Windows, Mac OS X, etc.
Biopython is a similar collection of Python libraries and applications in bioinformatics.
These three resources are very useful. Many of us have used them in our bioinformatics programming.
I would like to end this week’s lectures by saying again that bioinformatics is a young, fast-growing field.
I hope that in a couple of years’ time, we will be teaching something that YOU develop.
In the meantime, now that you have listened to all these lectures, please do go onto the web and use these resources.
You will see that they give you a new power for your research.
I hope that you will enjoy this new power.
At the same time, as the movie “Spider-Man” said, “With great power comes great responsibility.”
Please remember that, as Dr. Gao and I have emphasized again and again,
every powerful method also has its underlying assumptions and limitations.
So please use the power with caution and responsibility.
That’s all for this week. See you next week!