Hello, everyone, I'm Liu Fenglin from Peking University. Today I'll introduce phylogeny estimation to you on behalf of our group.
Our report is mainly based on the paper that Mark Holder and Paul O. Lewis published in Nature Reviews Genetics in 2003,
which gives a fairly systematic introduction to the traditional and Bayesian approaches to phylogeny estimation.
We want to answer three questions. First, what is phylogenetic estimation? Second, why do we do phylogenetic estimation? And third, how do we do it?
First, what is phylogenetic estimation?
Phylogenetics studies the evolutionary relationships among species or genes by building phylogenetic trees
based on molecular sequence and morphological data.
Here is an example of a phylogenetic tree.
This tree was constructed from RFX genes in mammalian species. It shows the evolutionary relationships among these genes.
Each ortholog group is shown in a different color.
Why do we need phylogeny estimation?
There are several main motivations. First, we can detect orthology and paralogy by phylogeny estimation,
and we can estimate divergence times.
We can also reconstruct ancient proteins, find the residues that are important to natural selection,
detect recombination points, identify mutations likely to be associated with disease, and determine the identity of new pathogens.
How do we estimate phylogeny?
We start from one assumption: after two sequences diverge from their last common ancestor,
they should become more and more different as time passes.
So a basic idea for estimating phylogeny is to count the number of differences between sequences.
We also assume that the fewer differences there are, the closer the relationship between the sequences is.
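As a minimal sketch of this counting idea, the snippet below counts mismatched positions between aligned sequences. The function names and the three toy sequences are hypothetical and only serve as an illustration; real data would first need a multiple sequence alignment.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites (the uncorrected 'p-distance')."""
    return count_differences(seq_a, seq_b) / len(seq_a)


# Hypothetical toy alignment, not taken from the talk.
seqs = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACTTAT",
}

for first, second in [("human", "chimp"), ("human", "mouse"), ("chimp", "mouse")]:
    print(first, second, count_differences(seqs[first], seqs[second]))
```

In this toy data the pair with the fewest differences would be treated as the most closely related, which is exactly the naive assumption that the following points qualify.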
However, this basic assumption alone is far from enough, because the problem is complex.
First, the rate of sequence evolution is not constant over time,
so we cannot estimate the time of divergence simply by comparing how different two sequences are.
What's more, natural selection introduces bias: some sequences evolve slowly even though they are distantly related,
so some of their residues may still be similar.
Also, some sites in a DNA sequence are not helpful for phylogeny estimation, because some sites are conserved in evolution and some change quickly.
So we need better methods to solve this complex problem.
The main process of phylogeny estimation is as follows. First, we get a sequence.
Then we collect its related sequences and apply a multiple sequence alignment method.
Then we choose between the traditional and Bayesian approaches to build the phylogenetic tree.
With the traditional approaches, we have to calculate the confidence of the tree separately;
with the Bayesian approaches, we get the confidence at the same time as we get the tree.
At this point we have an estimate of the "best" tree, and then hypothesis testing can be applied.
The traditional approaches include the neighbor-joining (NJ) algorithm, the parsimony method and the maximum likelihood (ML) method.
The basic idea of NJ is to use "distances" to summarize the differences between sequences, and to join the closest points into the phylogenetic tree.
As the picture shows, the newly joined points are merged into a single point, and the distances between this new point and the remaining points are recalculated.
By repeating this process, a phylogenetic tree is built.
The advantage of this method is its speed.
Its disadvantage is that much information is lost when the sequences are compressed into distances.
It is also difficult to obtain reliable estimates of pairwise distances when dealing with divergent sequences.
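The join, merge and recalculate loop can be sketched as follows. This is only an illustration under stated assumptions: the function name `nj_step` and the five-point distance matrix are hypothetical. The pair is chosen with the standard NJ criterion, which corrects each raw distance by how far the two points are from all the others; that detail comes from the usual description of the algorithm, not from the talk. Branch lengths are not computed here.

```python
from itertools import combinations


def nj_step(names, dist):
    """One neighbor-joining merge.

    names: list of point labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (joined pair, new list of labels, new distance dict).
    """
    n = len(names)
    # Sum of distances from each point to all the other points.
    total = {a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names}

    def q(pair):
        # Standard NJ criterion: small raw distance, corrected by the totals.
        a, b = pair
        return (n - 2) * dist[frozenset(pair)] - total[a] - total[b]

    a, b = min(combinations(names, 2), key=q)
    merged = f"({a},{b})"
    rest = [c for c in names if c not in (a, b)]

    # Keep the old distances among untouched points.
    new_dist = {frozenset(p): dist[frozenset(p)] for p in combinations(rest, 2)}
    # Recalculate the distances from the merged point to every remaining point.
    for c in rest:
        new_dist[frozenset((merged, c))] = (
            dist[frozenset((a, c))] + dist[frozenset((b, c))] - dist[frozenset((a, b))]
        ) / 2
    return (a, b), rest + [merged], new_dist


# Hypothetical distance matrix for five sequences.
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {
    frozenset((names[i], names[j])): matrix[i][j]
    for i in range(len(names))
    for j in range(i + 1, len(names))
}

# Repeat the step until only two points are left.
joins = []
cur_names, cur_dist = names, dist
while len(cur_names) > 2:
    pair, cur_names, cur_dist = nj_step(cur_names, cur_dist)
    joins.append(pair)
print(joins)
```

Note that the whole computation only ever sees the distance matrix, never the sequences themselves, which is where the loss of information mentioned above comes from.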
Next we will introduce parsimony.
The basic idea of parsimony is to find the tree that requires the fewest mutations.
Its advantage is that it is fast enough to analyze hundreds of sequences.
When the sequences are similar, which means they are closely related to each other, the parsimony method is rather stable.
Its disadvantage is that when the branch lengths vary a lot, which means the sequences
are not all closely related, some being close and some being distant, the result of this method is not as satisfactory.
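To make "fewest mutations" concrete, here is a small sketch that scores a given tree with Fitch's algorithm, a standard way to count the minimum number of changes a tree requires. The talk does not name a specific algorithm, so this choice is an assumption, and the tree encoding (nested pairs of leaf names), the function names and the four toy sequences are all hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum mutation count) for one site.

    tree:   a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping each leaf name to its base at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra mutation needed.
        return common, left_cost + right_cost
    # No shared state: at least one mutation on the branches below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum mutation counts over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical toy alignment and two candidate trees.
alignment = {"A": "AAG", "B": "AAG", "C": "CTG", "D": "CTA"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

Under parsimony, the tree with the lower score would be preferred. A full analysis would also have to search over many candidate trees, which this sketch does not attempt.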
Next we will talk about the maximum likelihood method.
The basic idea of maximum likelihood is to search for the topology of the tree that best explains the data.
T represents the topology and t represents the other parameters.
When T and t are given, we can calculate the probability of the data.
We look for the topology and the parameters that make this probability as high as possible.
One advantage of this method is that it makes full use of the data at hand.
Its disadvantage is that it can be slow, and it can be unstable when there are many parameters.
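A very reduced sketch of the idea: with only two sequences there is a single topology, so the only parameter left to estimate is the branch length between them. The example below assumes the Jukes-Cantor substitution model, which the talk does not mention, and finds the branch length that maximizes the probability of the data by a simple grid search. The function names and the numbers (100 sites, 10 differences) are hypothetical.

```python
import math


def jc69_log_likelihood(n_sites, n_diff, d):
    """Log-probability of two aligned sequences given branch length d.

    d is the expected number of substitutions per site; the model assumed
    here is Jukes-Cantor (all substitutions equally likely).
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # a site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * decay   # a site shows one specific different base
    # 0.25 is the probability of the base observed in the first sequence.
    return (n_sites - n_diff) * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_branch_length(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid search for the branch length with the highest likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(n_sites, n_diff, d))


def jc69_closed_form(n_sites, n_diff):
    """Known closed-form estimate under the same model, for comparison."""
    p = n_diff / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


print(ml_branch_length(100, 10), jc69_closed_form(100, 10))
```

With a real tree, the same kind of search has to run over every branch length and over the topologies as well, which is why the method can be slow.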
We have just introduced the traditional ways to get a tree. So how do we test the credibility of a tree?
We mainly use the bootstrap method.
As shown in the picture,
we sample sites six times with replacement, so six sites are picked; for instance, we might pick the third site the first time and the second site the second time.
These sites are used to build new sequences, and the new sequences are used to build a tree.
If the new tree is the same as the tree built from the original sequences,
it suggests that if we collected more samples we would be likely to get the same tree again.
Note that this does not mean the tree we have built is right: the result is a necessary condition, not a sufficient one.
This process has to be repeated many times, for example about 100 times.
One major disadvantage is that the bootstrap can be very slow if the tree is built with a slow method such as maximum likelihood.
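The resampling procedure can be sketched as below. The names `bootstrap_alignment` and `bootstrap_support` are hypothetical, and `closest_pair` is only a stand-in for a real tree-building method: it reports which two sequences are closest, so that the example stays short and runnable. Any of the methods above could be plugged in as `build_tree`, as long as its result can be compared for equality.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample the alignment columns (sites) with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, replicates=100, seed=0):
    """Fraction of resampled data sets that give the same result as the original."""
    rng = random.Random(seed)
    original = build_tree(alignment)
    same = sum(
        build_tree(bootstrap_alignment(alignment, rng)) == original
        for _ in range(replicates)
    )
    return same / replicates


def closest_pair(alignment):
    """Toy stand-in for tree building: the two sequences with the fewest differences."""
    def differences(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return frozenset(min(combinations(sorted(alignment), 2), key=differences))


# Hypothetical six-site alignment, matching the six picks in the talk.
toy = {"A": "ACGTAC", "B": "ACGTAT", "C": "TCGAAT"}
print(bootstrap_support(toy, closest_pair, replicates=100))
```

Because `build_tree` is called once per replicate, the cost of the bootstrap is roughly the number of replicates times the cost of building one tree.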
After we have tested the credibility of the tree, we can do hypothesis testing.
Here is an example. Let's say we have to identify a virus and find out whether it belongs to group A or group B.
If the tree we have already built places it in group A,
our null hypothesis will be that it belongs to group B.
Then we build the best tree under this hypothesis and compare it with the tree that places the virus in group A.
If the difference is significant, we can conclude that the virus belongs to group A.
Those are the traditional algorithms. Next we will introduce the Bayesian algorithm and how it builds a phylogenetic tree.
The core of the Bayesian algorithm is to maximize the posterior probability,
which is the probability of a phylogenetic tree given the observed data.
The formula follows from Bayes' theorem.
"T" represents the topology of the phylogenetic tree, "t" represents the parameters and "x" is the data we observed.
Then we can see that there are two more probabilities,
P(T,t) and P(x), in addition to the probability used in the maximum likelihood method. P(T,t) is the prior probability.
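The formula itself was shown on the slide and is not part of the transcript. Written with the symbols defined here, Bayes' theorem would read as follows; this is a reconstruction from the surrounding description, not a copy of the slide.

```latex
P(T, t \mid x) = \frac{P(x \mid T, t)\, P(T, t)}{P(x)}
```

Here P(x | T, t) is the probability of the data that the maximum likelihood method maximizes, P(T, t) is the prior, and P(x) is a normalizing constant that does not depend on the tree.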
The Bayesian algorithm has several advantages. First, it has a strong connection to the maximum likelihood method.
It gives both the optimal tree and measures of its uncertainty.
It also allows complex models of sequence evolution to be implemented.
In addition, to estimate divergence times it does not rely on the molecular clock assumption, that is, the assumption that the rate of mutation does not change over time.
The Bayesian phylogenetic model has its own parameters, so it does not rely on the molecular clock.
The nuisance parameters are integrated out, and we can obtain a marginal posterior probability.
But why should we integrate these parameters out? Let's look at the diagram on the slide.
The x axis shows the value of the parameter, and the y axis is its likelihood, or the Bayesian probability density.
With the maximum likelihood method, we would consider tree A better than tree B,
because at the optimal value of x, about 0.5, tree A is better than tree B.
But with the Bayesian algorithm, we need to calculate the area between the curve and the x axis.
This area includes the probability at all the different values of x,
which means that the Bayesian method is more comprehensive than the maximum likelihood method.
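The difference between comparing peaks and comparing areas can be sketched numerically. The two curves below are hypothetical stand-ins for the ones on the slide: tree A is given a tall, narrow peak at 0.5 and tree B a lower, broader one. The shapes, heights and widths are assumptions chosen only to show that the two criteria can disagree, and a flat prior on the parameter over [0, 1] is assumed.

```python
import math


def bump(x, height, center, width):
    """A bell-shaped curve used as a made-up likelihood profile."""
    return height * math.exp(-((x - center) / width) ** 2 / 2)


def likelihood_a(x):
    # Hypothetical tree A: high but narrow peak.
    return bump(x, height=1.0, center=0.5, width=0.03)


def likelihood_b(x):
    # Hypothetical tree B: lower but much broader peak.
    return bump(x, height=0.6, center=0.5, width=0.15)


def area(f, lo=0.0, hi=1.0, steps=1000):
    """Area under f between lo and hi by the trapezoid rule."""
    h = (hi - lo) / steps
    inner = sum(f(lo + i * h) for i in range(1, steps))
    return h * (f(lo) / 2 + inner + f(hi) / 2)


print("peak:", likelihood_a(0.5), likelihood_b(0.5))
print("area:", area(likelihood_a), area(likelihood_b))
```

With these made-up curves, tree A wins on the peak value while tree B has the larger area, so the two approaches would rank the trees differently.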
However, the Bayesian method also has several disadvantages.
Critics point out that the prior distribution for the parameters must be specified, and that this is too subjective.
It can also be difficult to decide how long to run the MCMC approximation, which is the way the optimal value is obtained.
MCMC first chooses a starting tree and model, then adjusts a parameter and proposes a new tree.
If the new tree is better, it is accepted; if it is worse, it may still be accepted with some probability. Then the next step starts.
For example, MCMC can approach the optimal value step by step along the red line.
It is very difficult for us to determine whether the MCMC approximation has run long enough.
Stopping too early can give a wrong optimal value, while running too long wastes time.
It is also hard to choose the proper size of the adjustment to the parameters.
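The propose and accept loop can be sketched with the standard Metropolis rule, under which a better proposal is always accepted and a worse one is accepted with probability equal to the ratio of the posterior values. To keep the sketch short it moves a single continuous parameter instead of a tree, and the target `toy_log_posterior`, the starting point and the step sizes are all hypothetical.

```python
import math
import random


def metropolis(log_post, start, step_size, n_steps, seed=0):
    """Run a Metropolis chain; return (list of samples, acceptance rate)."""
    rng = random.Random(seed)
    current = start
    current_lp = log_post(current)
    samples = []
    accepted = 0
    for _ in range(n_steps):
        # Adjust the parameter by a random amount to get a proposal.
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples.append(current)
    return samples, accepted / n_steps


def toy_log_posterior(x):
    # Hypothetical unnormalized log posterior with its peak at 0.5.
    return -((x - 0.5) ** 2) / (2 * 0.1 ** 2)


samples, rate = metropolis(toy_log_posterior, start=3.0, step_size=0.2, n_steps=20000)
late = samples[len(samples) // 2:]       # drop the first half as burn-in
print(sum(late) / len(late), rate)

# A much smaller adjustment is accepted more often but moves the chain slowly.
_, small_step_rate = metropolis(toy_log_posterior, start=3.0, step_size=0.005, n_steps=2000)
print(small_step_rate)
```

The chain starts far from the peak on purpose: the early samples reflect the starting point rather than the posterior, which is why the length of the run and the size of the adjustment both matter.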
We have introduced different methods of estimating phylogenies, which has become a regular step in the analysis of new gene sequences.
These new techniques will have a tremendous impact on the study of molecular genetics.
These are our references.
And these are our group members. Thanks.