So far we've talked about MRF and CRF learning purely in the context of maximum
likelihood estimation, where our goal is to optimize the likelihood function or the
conditional likelihood function. But as for Bayesian networks, maximum
likelihood estimation is not a particularly good regime, and it's very susceptible
to over-fitting of the parameters to the specifics of the training data.
So what we'd like to do is utilize some ideas that we exploited
also in the context of Bayesian networks, which are ideas such as parameter priors
to smooth out our estimates of the parameters, at least in the initial phases,
before we have a lot of data to really drive us into the right region of the
space. So in the context of Bayesian networks
that was all great, because we could have a conjugate prior on the parameters,
such as the Dirichlet prior, that we could then integrate in with the
likelihood to obtain a closed-form conjugate posterior, and it was all
computationally elegant. But in the context of MRFs and CRFs,
even the likelihood itself is not computationally elegant and can't be
maintained in closed form. And so, therefore, the posterior is also
not something that's going to be computationally elegant.
And so the question is, how do we then incorporate ideas such as priors into MRF
and CRF learning, so as to get some of the benefits of regularization?
So the idea here, in this context, is to use what's called MAP estimation, where
we have a prior, but instead of maintaining a posterior in closed form,
we're computing what's called the maximum a posteriori estimate of the
parameters. And this, in fact, is the same notion of MAP that we saw when we did
MAP inference in graphical models, where we were computing a single MAP
assignment. Here, continuing in the thread of viewing Bayesian learning as a form of inference,
we're computing a MAP estimate of the parameters.
So concretely, how is MAP estimation implemented in the context of MRF or
CRF learning? A very typical solution is to define a
Gaussian distribution over each parameter theta i separately, and that is usually a
zero-mean, univariate Gaussian with some variance sigma squared.
And the variance sigma squared dictates how firmly we believe that the
parameter is close to zero. So for small variances, we are very
confident that the parameter is close to zero and are going to be unlikely to be
swayed by a limited amount of data, whereas as sigma gets larger, we're going to
be more inclined to believe the data early on and move the parameter away
from zero. So we have such a parameter prior over
each theta i separately, and they're multiplied together to give us a joint
parameter prior. So the parameter prior over each parameter is going to look like this.
This sigma squared is called a hyperparameter,
and it's exactly the same kind of beast as we had for the Dirichlet
hyperparameters in the context of learning Bayesian networks.
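For reference, here is the standard form of a zero-mean univariate Gaussian prior on each theta i and the joint prior it induces; this is a sketch of the usual notation rather than a copy of the slide:

```latex
p(\theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\theta_i^2}{2\sigma^2}\right),
\qquad
p(\theta) = \prod_i p(\theta_i),
\qquad
\log p(\theta) = -\frac{1}{2\sigma^2} \sum_i \theta_i^2 + \text{const}.
```

So in log-space, this prior contributes a quadratic penalty on the parameters, which is where the connection to L2 regularization comes from.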
The alternative prior that's also in common use is what's called the
Laplacian parameter prior. And the Laplacian parameter prior looks
kind of similar to the Gaussian, in that it has an exponential that decays as
the parameter moves away from zero. But, in this case, the decay depends on
the absolute value of theta i and not on theta i squared, which
is the behavior that we would have with the Gaussian.
And so this function looks as we see over here, with a much sharper peak
around zero, which effectively corresponds to a discontinuity in the derivative
at theta i equals zero. And we have, again, such a
Laplacian prior over each of the parameters theta i, which are
multiplied together. Just like the Gaussian, this distribution
has a hyperparameter, which in this case is often called beta.
And the hyperparameter, just like the variance sigma squared in the Gaussian
distribution, dictates how tight this distribution is around zero,
where tighter distributions correspond to cases where the model is going to be less
inclined to move away from zero based on a limited amount of data.
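Likewise, the Laplacian prior with hyperparameter beta has the following standard density, again written out here just for reference:

```latex
p(\theta_i) = \frac{1}{2\beta} \exp\!\left(-\frac{|\theta_i|}{\beta}\right),
\qquad
p(\theta) = \prod_i p(\theta_i),
\qquad
\log p(\theta) = -\frac{1}{\beta} \sum_i |\theta_i| + \text{const}.
```

In log-space this is a penalty on the absolute values of the parameters, that is, an L1 penalty, which is what the sharp peak at zero reflects.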
So now, let's consider what MAP estimation would look like in the context
of these two distributions. So here we have these two parameter
priors rewritten, the Gaussian and the Laplacian.
And now, MAP estimation corresponds to the arg max over theta of the joint
distribution, P of D comma theta; so we're
trying to find the theta that maximizes this joint distribution.
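To make the connection to regularized likelihood concrete, here is a minimal Python sketch of this objective, anticipating the likelihood-times-prior decomposition spelled out just below. The function names and the use of scipy.optimize.minimize are illustrative assumptions, not part of the lecture, and log_likelihood(theta) stands in for the (conditional) log-likelihood, which for an MRF or CRF would itself require inference to evaluate:

```python
import numpy as np
from scipy.optimize import minimize

def map_estimate(log_likelihood, theta0, prior="gaussian", sigma2=1.0, beta=1.0):
    """Sketch of MAP estimation: maximize log P(D | theta) + log P(theta)."""
    def neg_objective(theta):
        if prior == "gaussian":
            # Zero-mean Gaussian prior in log-space: an L2 penalty on theta.
            log_prior = -np.sum(theta ** 2) / (2.0 * sigma2)
        else:
            # Laplacian prior in log-space: an L1 penalty on theta.
            log_prior = -np.sum(np.abs(theta)) / beta
        # Minimize the negative of (log-likelihood + log-prior).
        return -(log_likelihood(theta) + log_prior)

    return minimize(neg_objective, np.asarray(theta0, dtype=float)).x
```

Note that with the Laplacian prior the objective is not differentiable at zero, so in practice a subgradient or proximal method is the more natural choice; the sketch only illustrates what is being optimized.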
And by the simple rules of probability theory, this joint distribution is the
product of P of D given theta, which is our likelihood,