All right, in this video,

we will see variational Bayes EM,

and also summarize the methods that we have seen so far.

Let me remind you what the Expectation Maximization (EM) algorithm is.

We try to maximize the marginal likelihood,

the logarithm of it actually.

It is the logarithm of P of the data, given the parameters.

We also derived the variational lower bound for it,

and it is an expectation over some distribution Q of T of the logarithm

of the ratio between the joint distribution over the latent variables and the data,

given the parameters, and this distribution Q of T,

which is a distribution over the latent variables.
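Written out in symbols (my notation, not taken from the slide), this lower bound is:

```latex
\mathcal{L}(q, \theta)
  \;=\; \mathbb{E}_{q(T)} \log \frac{p(X, T \mid \theta)}{q(T)}
  \;\le\; \log p(X \mid \theta)
```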

And we try to maximize the variational lower bound,

and we do this in an iterative way.

On the E-step, we maximize the lower bound with respect to Q and,

on the M-step, we maximize it with respect to Theta.

We also proved that on the E-step,

the maximization of the lower bound with respect to Q is equivalent to

minimizing the KL divergence between Q and the posterior over the latent variables,

given the parameters and the data.

And on the M-step,

we can maximize the expected value of the logarithm of the joint distribution.

We can do this because the denominator of the variational lower bound,

the Q of T, does not depend on Theta, and so we can drop it.
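As a concrete illustration (my own sketch, not code from the lecture), here is a minimal EM loop for a one-dimensional Gaussian mixture with fixed unit variances; `em_gmm` and its deterministic initialization are assumptions for this example. The E-step sets Q to the exact posterior over the latent assignments, and the M-step maximizes the expected log joint in closed form:

```python
import numpy as np

def em_gmm(X, K, n_iter=50):
    """Toy EM for a 1-D Gaussian mixture with unit variances (sketch)."""
    # Initialize the means by spreading them over the data range (a
    # simple deterministic choice for this sketch); uniform weights.
    mu = np.linspace(X.min(), X.max(), K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: q(t_i = k) is the exact posterior p(t_i = k | x_i, theta),
        # proportional to pi_k * N(x_i | mu_k, 1).
        logp = np.log(pi) - 0.5 * (X[:, None] - mu[None, :]) ** 2
        logp -= logp.max(axis=1, keepdims=True)   # numerical stability
        q = np.exp(logp)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: maximize E_q log p(X, T | theta) over theta = (mu, pi);
        # for a Gaussian mixture this has a closed-form solution.
        Nk = q.sum(axis=0)
        mu = (q * X[:, None]).sum(axis=0) / Nk
        pi = Nk / len(X)
    return mu, pi
```

Because Q here is the exact posterior, each iteration is guaranteed not to decrease the marginal log-likelihood.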

Let's look at the E-step more carefully.

We minimize the KL divergence between the variational distribution Q of T

and the posterior distribution over the latent variables,

given the data and parameters,

and we do this minimization with respect to Q.

In week two, we showed that the minimum of

this KL divergence is attained at the posterior distribution itself,

that is, Q of T equals the probability of T given X and Theta.

However, for many models,

we cannot compute this posterior exactly.

For example, in the next module,

we will see a model called Latent Dirichlet Allocation,

and for it, computing the full posterior exactly seems to be impossible.

So, we can try to use variational inference in this case.

The new E-step would be as follows:

we minimize the KL divergence over some variational family.

We can use the mean field approximation here, for example.

And this method, where on the E-step

we perform variational inference, is called variational EM.
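To make the mean field idea concrete, here is a classic textbook toy (not a model from this course): approximating a correlated two-dimensional Gaussian "posterior" by a fully factorized product of two independent Gaussian factors, using the known closed-form coordinate updates. All names here are my own:

```python
import numpy as np

# Target "posterior": a correlated 2-D Gaussian N(mu, inv(Lam)), which we
# approximate by the factorized family q(t1) q(t2) (mean field).
mu = np.array([1.0, -1.0])          # true posterior mean
Lam = np.array([[2.0, 0.8],
                [0.8, 2.0]])        # precision (inverse covariance) matrix

# Coordinate ascent on the lower bound: each optimal factor is Gaussian
# with fixed variance 1 / Lam[i, i], so only the factor means m[i] change:
#   m[i] = mu[i] - (Lam[i, j] / Lam[i, i]) * (m[j] - mu[j])
m = np.zeros(2)
for _ in range(20):
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])
```

The factor means converge to the true posterior means, while the factor variances 1 / Lam[i, i] underestimate the true marginal variances, which is the well-known price of the factorized approximation.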

All right, we've seen a lot of models.

Now, let's summarize them.

Here's our notation again.

We have the data, which is known and which we denote as X;

we have the parameters, Theta,

which are unknown; and also

the latent variables T, which are also unknown.

I have two criteria here.

The first is accurate versus inaccurate:

the methods that are higher are more

accurate, and the methods that are lower are less accurate.

Meanwhile, on the right,

we have another criterion:

slow versus fast.

The first methods are quite slow,

and the last ones are really fast.

You have to compromise between accuracy and speed.

The first algorithm is full inference:

we try to find the full distribution,

the full joint distribution over the latent variables and parameters, given the data.

It is very accurate, since it is exact inference;

however, for many models, it is really slow.

Sometimes, we can do mean field.

In mean field, we approximate

the posterior distribution by a product of two distributions:

the distribution over the latent variables and the distribution over the parameters.
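In symbols, the mean field approximation just described factorizes the posterior as:

```latex
p(T, \theta \mid X) \;\approx\; q(T)\, q(\theta)
```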

We can also use the EM algorithm.

In the EM algorithm, we find only a point estimate of the parameters.

We will still have a distribution over T; however,

Theta is set to its maximum a posteriori (MAP) estimate.

All right, but it turns out that in many cases,

we still cannot even use the EM algorithm.

And then we can use a variational EM.

In variational EM, we apply, for example,

the mean field approximation and factorize the distribution over

the latent variables into a product of distributions for each dimension.

And so, we find a factorized distribution over the latent variables

and a point estimate of the parameters.

All right, and finally,

if we cannot do anything else, we can use crisp EM.

In crisp EM, we approximate both the latent variables and the parameters with point estimates.

We do this in an iterative way:

we find the maximum probability values of the latent variables, and on the next step,

we find the maximum probability estimates of the parameters.

You should be familiar with this kind of method:

it is used in k-means clustering.
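Since k-means is the familiar example here, a minimal sketch (my own code, with a simplified deterministic initialization, not the lecture's implementation) makes the two point-estimate steps explicit:

```python
import numpy as np

def kmeans(X, K, n_iter=20):
    """k-means as 'crisp' (hard) EM: both the latent assignments and the
    parameters are point estimates rather than distributions."""
    # Simplified deterministic init for this sketch: pick K spread-out rows.
    centers = X[np.linspace(0, len(X) - 1, K).astype(int)]
    for _ in range(n_iter):
        # Hard E-step: point estimate of each latent variable t_i, i.e.
        # assign each point to its single closest (most probable) cluster.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        t = d.argmin(axis=1)
        # Hard M-step: point estimate of the parameters (cluster means).
        centers = np.array([X[t == k].mean(axis=0) for k in range(K)])
    return t, centers
```

Compared with the EM algorithm for a mixture model, the distribution over T has collapsed to a single most likely value per point, which is exactly the extra approximation crisp EM makes.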

So, we've seen the methods,

we've seen the variational inference,

and in the next module,

we'll see an application of variational inference to Latent Dirichlet Allocation.