Therefore, we usually introduce logarithms to replace multiplication with addition.
Specifically, we take the base-10 logarithm of the transition probabilities
and emission probabilities in advance.
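As a quick sanity check, here is a minimal Python sketch (using 0.8 and 0.3, two of the probability values that appear later in this walkthrough) showing that a product of probabilities becomes a sum of base-10 logarithms:

```python
import math

# Two example probabilities; 0.8 and 0.3 correspond to the log10
# scores -0.097 and -0.523 used later in this walkthrough.
p, q = 0.8, 0.3

product = p * q                             # multiply in probability space
log_sum = math.log10(p) + math.log10(q)     # add in log space

print(round(math.log10(product), 3))        # -0.62
print(round(log_sum, 3))                    # -0.62, i.e. the same value
```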
Okay, let’s start with a sequence!
First, let’s draw the dynamic-programming matrix for the recursion.
There are two states: coding state c and non-coding state n.
Next, we need to set the boundary condition,
which is the initial probability distribution over the two states.
After the log10 transformation, these probabilities become -0.097 for the
non-coding state and -0.699 for the coding state (0.8 and 0.2 before taking logs).
Next, we will fill in the cells one by one.
First, we meet the first token: C.
From the emission probability matrix, we can see that
its emission probability under the non-coding state is -0.523.
Adding it to the non-coding initial value -0.097 gives -0.62.
Similarly, under the coding state this value becomes
-0.699 + (-0.699) = -1.40.
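In code, this first column is just each state's initial log probability plus the emission log probability of the first token. A minimal sketch, assuming the toy values quoted above (0.8/0.2 for the initial distribution; 0.3 and 0.2 for emitting C):

```python
import math

log10 = math.log10

# Boundary condition: initial distribution over the two states.
init = {"n": log10(0.8), "c": log10(0.2)}     # -0.097 and -0.699

# Emission probabilities of the first token C (values quoted above).
emit_C = {"n": log10(0.3), "c": log10(0.2)}   # -0.523 and -0.699

# First column of the dynamic-programming matrix.
first_col = {s: init[s] + emit_C[s] for s in ("n", "c")}
print({s: round(v, 2) for s, v in first_col.items()})  # {'n': -0.62, 'c': -1.4}
```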
Next, we need to move forward by one residue, which requires one step of state transition.
From the transition matrix we can see that the transition probability here is -0.097.
Adding the previous value -0.62, this transition probability, and the emission probability
of residue G under the non-coding state, -0.523, gives -0.62 + (-0.097) + (-0.523) = -1.24.
Similarly, we can compute the probability of transitioning from the coding state
to the non-coding state, which is -1.40 + (-0.398) + (-0.523) = -2.32.
Please note that this value is smaller than the -1.24
coming from the non-coding state, and will thus be discarded.
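Each later cell is filled the same way: for every possible previous state, add the previous cell's value, the transition log probability, and the new emission log probability, then keep only the maximum. A sketch of this single step for the non-coding cell of residue G, using the numbers quoted above:

```python
import math

log10 = math.log10

prev = {"n": -0.62, "c": -1.40}                  # previous column's values
trans_to_n = {"n": log10(0.8), "c": log10(0.4)}  # n->n = -0.097, c->n = -0.398
emit_G_n = log10(0.3)                            # G under non-coding = -0.523

# Candidate scores for reaching state n at this position.
candidates = {s: prev[s] + trans_to_n[s] + emit_G_n for s in prev}
print({s: round(v, 2) for s, v in candidates.items()})  # {'n': -1.24, 'c': -2.32}

# Keep the maximum; the -2.32 path from the coding state is discarded.
best = max(candidates, key=candidates.get)
print(best, round(candidates[best], 2))          # n -1.24
```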
Similarly, we can carry out the succeeding recursions and fill in all the remaining cells one by one.
OK. Next we will do the backtracking.
First, pick the cell in the last column with the largest probability.
Starting from this cell, we backtrack step by step...
OK, once we have the backtracking path, we can read off the final result.
In other words, we divided the input sequence CGAAAAAATCG
into non-coding regions N and coding regions C.
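Putting the recursion and the backtracking together, here is a self-contained Python sketch of the whole procedure. Only the numbers quoted in this walkthrough (the 0.8/0.2 initial distribution, the n-to-n and c-to-n transitions, and the emissions given for C and G) come from the example; the remaining matrix entries are assumptions filled in for illustration, so treat the exact output as specific to these choices:

```python
import math

log10 = math.log10

states = ("n", "c")                     # non-coding, coding
seq = "CGAAAAAATCG"

# Initial distribution (0.8/0.2, i.e. -0.097 and -0.699 as above).
init = {"n": log10(0.8), "c": log10(0.2)}

# Transitions: n->n (0.8) and c->n (0.4) are quoted above; the other
# two entries then follow because each row must sum to 1.
trans = {
    "n": {"n": log10(0.8), "c": log10(0.2)},
    "c": {"n": log10(0.4), "c": log10(0.6)},
}

# Emissions: the values quoted above (0.3 for C and G under n, 0.2 for
# C under c) are kept; the remaining entries are assumptions, chosen so
# that the coding state favors A, as the lecture's result requires.
emit = {
    "n": {"A": log10(0.2), "C": log10(0.3), "G": log10(0.3), "T": log10(0.2)},
    "c": {"A": log10(0.4), "C": log10(0.2), "G": log10(0.2), "T": log10(0.2)},
}

# Fill the dynamic-programming matrix, remembering each cell's best
# predecessor for the backtracking step.
score = [{s: init[s] + emit[s][seq[0]] for s in states}]
back = []
for token in seq[1:]:
    col, ptr = {}, {}
    for s in states:
        prev = max(states, key=lambda p: score[-1][p] + trans[p][s])
        ptr[s] = prev
        col[s] = score[-1][prev] + trans[prev][s] + emit[s][token]
    score.append(col)
    back.append(ptr)

# Backtracking: start from the best cell in the last column, then
# follow the stored predecessor pointers.
path = [max(states, key=lambda st: score[-1][st])]
for ptr in reversed(back):
    path.append(ptr[path[-1]])
path.reverse()

print(seq)                              # CGAAAAAATCG
print("".join(path).upper())            # NNCCCCCCNNN with these numbers
```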
Due to time limitations, we have only created a very simple MSGP.
However, it can easily be extended by adding more states.
The only requirement here is that the emission probabilities of the different states,
which here correspond to the residue composition, differ substantially.
Only in this case can we infer the state path from the observed token sequence.
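To make the "adding more states" point concrete: in the sketch above, extending the model only means enlarging the state set and the two matrices; the fill-and-backtrack loop itself never changes. Every number below is a hypothetical placeholder, not taken from any real predictor:

```python
import math

log10 = math.log10

# Hypothetical three-state variant; all values are made-up placeholders,
# chosen only so that each row sums to 1.
states = ("n", "c", "u")                # e.g. add a UTR-like state u
trans = {
    "n": {"n": log10(0.7), "c": log10(0.2), "u": log10(0.1)},
    "c": {"n": log10(0.2), "c": log10(0.7), "u": log10(0.1)},
    "u": {"n": log10(0.1), "c": log10(0.1), "u": log10(0.8)},
}
# For the inference to work, the emission profiles must differ clearly:
emit = {
    "n": {"A": log10(0.2), "C": log10(0.3), "G": log10(0.3), "T": log10(0.2)},
    "c": {"A": log10(0.4), "C": log10(0.2), "G": log10(0.2), "T": log10(0.2)},
    "u": {"A": log10(0.1), "C": log10(0.4), "G": log10(0.4), "T": log10(0.1)},
}
# The fill-and-backtrack loop from the previous sketch runs unchanged:
# it only iterates over whatever `states`, `trans`, and `emit` contain.
```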
For example, Chris Burge developed a gene prediction algorithm, GenScan, in 1996.
This algorithm sets different states for exons, introns, and UTRs,
which improves its accuracy of prediction considerably,
making it one of the most successful gene prediction tools.
However, it does not differ much from the MSGP we have just discussed with respect to basic ideas;
it just introduces more states.
You can also use similar methods to predict 5' splice sites and so on...
By separating hidden states from observable tokens, the HMM provides
an effective probabilistic framework for data analysis in bioinformatics.
It is one of the most widely used models in current bioinformatics research.
We will provide you with supplementary materials
concerning more applications of the Hidden Markov Model in bioinformatics.
Thank you, and that's all for this week. See you next week!