0:00

We defined the notion of Markov chain Monte Carlo sampling as a general paradigm for generating samples from a distribution from which it is otherwise difficult, perhaps intractable, to generate samples directly. Markov chains give us a general way of approaching this problem, but the framework leaves open

the question of where the Markov chain comes from.

That is how do we design a Markov chain that has the desired stationary

distribution. We talked about the Gibbs chain as a

general solution to this problem in the context of graphical models,

but we also mentioned that the Gibbs chain has limitations, in terms of its

convergence rates for certain types of graphical models.

So what happens if we have a graphical model for which the Gibbs chain doesn't have good convergence properties? How do we design a Markov chain for that? Well, it turns out that there's a general

class of methods called Metropolis-Hastings, and the

Metropolis-Hastings algorithm gives us a general approach for designing a Markov

chain with a desired stationary distribution.

In fact for a given stationary distribution we can construct a whole

family of different Markov chains that explore the space differently, and then,

try and select among those or construct one among those that has good convergence

properties for our distribution. So the key insight behind the

Metropolis-Hastings method is the notion of a reversible chain.

So what does it mean for a chain to be reversible?

Imagine that we have a chain with a particular stationary distribution pi and

we're going to consider two different experiments.

In the first experiment, we're going to pick a point in the state space according to the probability distribution pi, so say we land on this one.

And then we're going to pick a random edge emanating from the state that we

picked according to the transition model defined by the chain T.

So say we pick this edge, that's the first experiment.

The second experiment is exactly the same thing,

we're going to basically do this experiment twice.

So now we're going to pick a state and, once again, we're going to pick an outgoing

transition from that state. A chain is reversible if, when I do this

experiment, the probability of traversing this edge, this red edge over here, is in

this process is exactly the same as the probability of traversing this green edge

in the other direction. That is, if picking I and then traversing

to J occurs with the same probability as picking J and then traversing to I.

So, written in mathematical notation, this tells us that pi(x), which is the red state, times T(x → x'), the transition along the red edge, is equal to pi(x'), which was my green experiment, times T(x' → x), the probability of transitioning in the opposite direction. That is, pi(x) T(x → x') = pi(x') T(x' → x).

Now, it's very easy to construct Markov chains that are not reversible, so this

is by no means a universal property of Markov chains. But it turns out that

reversible Markov chains have an elegant property that allows them to be

constructed so as to give rise to particular stationary distributions.

And specifically, we can show, and it's not a very difficult proof, we'll show it

in a moment, that if this equation, which is called the detailed balance equation.

3:44

If this equation holds and T is a regular Markov chain,

then T has a unique stationary distribution, which we know from the fact

that it's regular, but furthermore, that stationary

distribution is pi. Let's go ahead and prove this.

And it turns out to be actually a one-line proof.

So, here we take the two sides of the detailed balance equation and put one of them on this side and the other one on that

side. And because the two sides are equal,

if we sum up over x, then the two summations are also equal.

And now we notice, by looking at the right-hand side, that pi(x') doesn't depend on the summation variable x. And so we can move pi(x') out of the summation, and this can now be rewritten as pi(x') times the sum over x of T(x' → x).

Okay? And this summation over here, because T is a probability distribution over the edges that are outgoing from x', is necessarily equal to one.

And so, rewriting what we just proved in this one-line derivation: the sum over x of pi(x) T(x → x') is equal to pi(x'), which is exactly the definition of the stationary distribution. So, this is a one-line proof that a chain that is reversible relative to a particular distribution pi necessarily has pi as a stationary distribution. Now, why is this a useful transition

between the definition of the stationary distribution pi in terms of this equation

down at the bottom versus this alternative definition of the stationary

distribution pi? Because this definition down at the bottom involves a summation over what is in general an exponentially large space, whereas this one is, as we'll see in a minute, much more constructive, in that it allows us to construct T, so

as to match pi. And so, that is exactly the idea behind the definition of the Metropolis-Hastings chain.
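As an aside, the one-line proof above is easy to check numerically. Here is a minimal sketch, not from the lecture, with the target pi and the symmetric base matrix chosen arbitrarily for illustration: we build a small chain T satisfying detailed balance with respect to pi and verify that pi is then stationary.

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])  # target stationary distribution (chosen arbitrarily)

# Construct T satisfying detailed balance pi[i] * T[i, j] == pi[j] * T[j, i]:
# start from a symmetric matrix S and set T[i, j] = S[i, j] * min(1, pi[j] / pi[i]),
# putting the leftover mass on the self-loop T[i, i] (rejected moves stay put).
S = np.full((3, 3), 1.0 / 3.0)  # symmetric proposal matrix
T = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            T[i, j] = S[i, j] * min(1.0, pi[j] / pi[i])
    T[i, i] = 1.0 - T[i].sum()

# Detailed balance holds: pi(x) T(x -> x') == pi(x') T(x' -> x) ...
assert np.allclose(pi[:, None] * T, (pi[:, None] * T).T)
# ... and therefore pi is stationary: sum_x pi(x) T(x -> x') == pi(x').
assert np.allclose(pi @ T, pi)
```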

So how does the Metropolis-Hastings chain work?

It starts out by saying, well, we want to move around broadly in the state space.

And so, we're going to have a distribution Q, which is kind of like a transition model. So this is my proposal distribution Q. Q is going to roam freely over the

state space, unlike the Gibbs chain, it's not going to necessarily just take little

local steps from one state to a state that is very nearby.

It can look very far away and go into very distant parts of the

state space. Now, that by itself of course, is just

going to give me a stationary distribution that has absolutely no

relationship to the stationary distribution pi that I care about and so

what I'm going to have is a critic. The critic is the guy who says, whoa,

wait a minute, you can't go there right now, because that's not

going to give you the right stationary distribution.

So what does the critic do? The critic listens to the proposal that

was made by the proposal distribution Q, so it says, oh, you want to go from x to

x prime. Well, what is the probability with which I'm going to let you do that?

That's the acceptance probability, and that's a binary random variable for each

x and x prime. What is the probability that we accept that proposal?

So, that gives rise to the following process.

I haven't told you how to pick Q or A or anything yet,

but let's just understand the process. At each state x, we sample a potential

successor state x prime from my proposal distribution Q.

So we're at x, and we have a proposal to go to x prime, and I've denoted this using a [INAUDIBLE] line.

And now the acceptor says, do I accept that?

And if the proposal is accepted, then I actually make that transition.

And if the proposal is rejected, I just stay where I am.
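The propose / accept / stay process just described can be sketched in a few lines. This is my own illustration, not code from the lecture; the function names and the toy always-accept random walk are invented.

```python
import random

def mh_step(x, propose, accept_prob, rng=random):
    """One step of the process: draw x' ~ Q(x -> .), accept with probability A(x -> x')."""
    x_prime = propose(x)                      # proposal from Q
    if rng.random() < accept_prob(x, x_prime):
        return x_prime                        # proposal accepted: make the transition
    return x                                  # proposal rejected: stay where we are

# Toy usage: a random walk on {0, ..., 9} whose critic accepts everything.
random.seed(0)
x = 5
for _ in range(100):
    x = mh_step(x,
                propose=lambda s: max(0, min(9, s + random.choice([-1, 1]))),
                accept_prob=lambda s, t: 1.0)
assert 0 <= x <= 9
```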

So what transition model does that give rise to?

Because, ultimately, this just gives us a transition model from x to x prime.

So there's two cases. Case number one is if x prime is not x, that is, x prime is a state other than x. Well, in this case, the only chance that

I have of moving to x prime is if I proposed to move to x prime and if that

proposal was accepted. So I need to multiply these two

probabilities and that gives me the expression, this expression for the

transition. What about transition from x to x?

Well, there are two different cases. Either I proposed to stay at x, or x also gets the benefit of all the transitions to other states that were proposed but where the move wasn't

accepted. And so, what we have here is Q of x to x

which we assume is always accepted, that's sort of one of the actions in this

model. And then we sum up, over all possible other transitions, the

probability that they were proposed times the probability that they were rejected,

which is one minus the probability that they were accepted.

That gives rise to a transition model. So this is a general definition of a

Markov chain, but as stated, there is no reason to expect it to have a particular

stationary distribution. And so, now this is where we bring back

the intuition from the detailed balance equation.

And we're going to use the detailed balance to construct an acceptance

probability that has the property that we want.

And so, here is the detailed balance equation rewritten, and we're going to assume in this particular example that x is not equal to x prime, because detailed balance is trivial for the case x equal to x prime: the equality pi(x) T(x → x) = pi(x) T(x → x) always holds.

And so let's look at the case x is not equal to x prime and our goal is to

construct A. So the goal is to construct A such that,

10:50

such that detailed balance holds. So the setup is: I'm given a proposal distribution Q, and now I'm going to pick A so as to make detailed balance hold. And if detailed balance holds for Q and pi, then that is going to guarantee that the process overall has the correct stationary distribution.

So, this is the equality that we want to have happen.

And, now, let's go ahead and just define this as a constraint on the acceptance

probability. So we have acceptance probabilities A

occurring on both sides of the equation. So I'm going to move both A's to one side and everything else to the other side.

And so I get a constraint on the ratios between the acceptance probability from x

to x prime, and the acceptance probability from x prime to x,

and that has to be equal to this ratio over here.

Notice that we can set this ratio up in any way that we want: we can consider either the acceptance probability from x to x prime as the numerator, or the one from x prime to x. I'm going to assume, without loss of generality, that this ratio is equal to some value rho which is less than one. And so now we need to pick the values of the two acceptance probabilities so as to make this equality hold, and one simple way to do that is to take the smaller of the two, which we assumed was the numerator, and make A of x to x prime equal to rho and A of x prime to x equal to one, and that gives me immediately that this equality holds.

Now, of course, we could have picked these probabilities otherwise; we could have picked, for example, A of x to x prime to be half of rho and A of x prime to x to be equal to one half, but notice what that would do. That would

just reduce the probability that our proposals are accepted, which would just

make the chain move half as quickly, because you'd end up staying at the same

state twice as often as you would if the acceptance probability was higher.

So these are in fact the highest numbers that we could possibly pick while still

making this ratio constraint hold.

13:46

And so, when you actually put it together as a general formula for the acceptance probability, it gives us this expression, A of x to x prime equals the minimum of one and pi of x prime Q of x prime to x, divided by pi of x Q of x to x prime, which, if you look at it, you will see gives the right values both for x to x prime and for x prime to x. So that solves the problem of picking A

given Q and the question is what about Q? How do I pick Q?

Well, that turns out to be the $64 million question,

picking Q is an art, and it's not an easy choice in many

applications. But let's think about some conditions the
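To make the construction concrete, here is a minimal sketch of Metropolis-Hastings on a three-state target; the state space, weights, seed, and sample count are all my own choices, not from the lecture. With a symmetric proposal Q, the Q terms in the acceptance ratio cancel, and the empirical frequencies approach pi.

```python
import random
from collections import Counter

random.seed(1)
w = [1.0, 2.0, 7.0]                      # unnormalized target pi (invented weights)
states = range(len(w))

def mh_acceptance(x, x_prime):
    # A(x -> x') = min(1, pi(x') Q(x' -> x) / (pi(x) Q(x -> x')));
    # with a symmetric Q the proposal terms cancel, leaving w[x'] / w[x].
    return min(1.0, w[x_prime] / w[x])

x, counts = 0, Counter()
for _ in range(50_000):
    x_prime = random.choice(states)      # symmetric uniform proposal over all states
    if random.random() < mh_acceptance(x, x_prime):
        x = x_prime                      # accepted: transition
    counts[x] += 1                       # rejected proposals count as staying at x

freqs = [counts[s] / 50_000 for s in states]
# freqs should be close to the normalized target [0.1, 0.2, 0.7]
```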

Q must satisfy before we think about how to pick one.

So, one condition on Q is derived simply by looking at the form of the expression for A. We can see that it involves a ratio, and if you have a ratio, then one condition is that you'd better not have a denominator that's equal to zero.

And so, one of the things we must guarantee regarding Q, in order for this to be a valid set of acceptance probabilities, is that Q itself must be reversible in the following sense: if the numerator is greater than zero, then the denominator is greater than zero, and vice versa.

to x prime, you'd better be able to propose a step

from x prime to x with nonzero probability which guarantees that this

ratio is always well-defined. Now, the problem with this constraint on Q is that it actually induces a tension in the design of Q.

On the one hand, you'd like Q to be very broad.

So if you were at this state over here, you'd like Q to sort of go

really far away from the current state, because the further away you can go, the

faster you move around in the state space.

You don't want to stay local. You don't want to make the same mistakes

that the Gibbs chain does by staying very near to the current assignment.

You want to move around quickly and that improves mixing.

On the other hand, look at the implications that this formula has when the proposal distribution is very broad, so that you have one state that, for example, has a low pi and another state that has a very high value of pi, which is what you get when you move from a hilly part of the space to a low part of the space: you have very big differences between the two stationary probabilities. So what I'm drawing here is the height of

pi and this is my state space x, and if I move around very far, I'm liable

to get into situations where pi of one state x might be considerably larger than

pi of x prime. And if you think about what that implies

regarding the acceptance probability, the acceptance probability can get very, very

low, which in turn, implies slow mixing again,

because you might be taking very global steps, but you're

taking very few of them. And so this is a real tension in the design of proposal distributions Q.

Let's look at an example of a Markov chain where proposal distributions

can make a very big difference. And, this is a problem that you wouldn't

necessarily immediately think of as a graphical model, but it can be formulated

as one and it's a matching problem and we'll see it in other settings as well.

So here we have a bunch of red dots as we see over here and we have a bunch of blue

dots, and we want to match the red dots and the blue dots and there's some sort

of preference function. So, these edges over here, which I'm going to mark in green, are annotated with some kind of weights that tell us how happy this red dot is to be matched with this blue dot.

And this happens a lot, for example, in various correspondence problems where we're trying to match sensor readings to objects, for example in a tracking application. It also happens in image problems where you want to match features in one image to features in another,

so it's a very common problem. So, one formulation of the matching

problem as a graphical model is that we have a variable Xi for each red dot.

So the red dots are going to be the variables, and the blue dots are going to represent values. And we're going to have that Xi is equal to j if I match red dot i to blue dot j.

So that can now be written as a probability distribution over this space.

And so, we can consider the probability of an assignment of values to variables, and we're going to say, first of all, that it has to be a matching. So if we don't have that every red dot is matched to a different blue dot, that's a probability-zero assignment; this would be the case if you had two red dots matched to the same blue dot. And otherwise, the probability is going to be an exponential of the sum of the qualities of the matches between the red dots and the blue dots. So, we're summing up the quality of the

edges that participate in the matching. So for example, if this is my matching over here, in which I matched all the red dots, then for each red dot we assigned a blue dot as a value, the overall quality is the total sum of the qualities of the matching, and that can be exponentiated to define a probability distribution.

And we're going to represent these qualities visually in this example as distances, so we're going to assume, for example, that this red dot prefers to be matched to this blue dot more than it prefers to be matched to this further

away blue dot. So one could imagine running Gibbs sampling on this model by taking a variable, say one of the red dots, and picking a new value for it, that is, a new blue dot that it wasn't matched

to. This is going to change things very, very

slowly, because most of the blue dots are going

to be taken by a different red dot. And so, those assignments, where a red

dot is matched to a previously taken blue dot, are going to have very low

probability, zero probability in fact,

and so those are going to be impossible. So the only thing I can do is make the red dot match one of the unmatched blue dots, and that's going to take a very long time for things to change.

Here is a different way of doing this, which is one way of constructing a proposal distribution and using Metropolis-Hastings, and it's called an augmenting path proposal distribution.

let's assume that this is my current matching,

and I'm going to define the following proposal.

I'm going to pick one of the variables, Xi,

say this one. And I'm going to pick another value for it, pretending that everything is available.

So I'm going to ignore the fact that these guys are already matched, and I'm going to pick this one. I'm picking it probabilistically, so I could have picked a different one: I could have picked the same one that I was matched to before, or I could have picked a further-away one. Let's say I pick this one.

Well, now this red guy over here is unmatched, so he has to pick a new partner, and say he picks this one. Now this red guy is unmatched, and say he picks this one. And that leaves me with this red guy, and

say this red guy picks this one, and notice that now I have a new legal matching. I've closed the loop and defined what's called an alternating cycle, where basically I can switch all of the black edges to green edges, and that gives me a new legal matching, and that's my proposal. And with a little bit of numerical

calculation, one can figure out the acceptance probability for this proposal

distribution and it turns out that it's actually really a good proposal

distribution. So let me show you some examples on the

next slide. So these are the results if I just apply Gibbs sampling. And you can see over here four chains; the four colors represent the four chains.

And what you see on the x-axis is the number of steps, and on the y-axis is some

kind of statistic that I'm tracking in order to gauge whether the chain is

mixing. As it happens, this is the probability

that the first red dot is matched to the first blue dot,

but it doesn't matter, any statistic would have illustrated the

point here. And you can see that there's a lot of

long range fluctuations in the chain, that is the probabilities change over

time. The probabilities, by the way, are

computed over windows of size 500, from the samples in that window.
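The window-based estimates described here, the probability of a statistic computed within consecutive windows of 500 samples, can be sketched with a hypothetical helper; the toy 0/1 stream below stands in for the chain's actual samples.

```python
def windowed_estimates(samples, indicator, window=500):
    """Estimate P(indicator) within consecutive, non-overlapping windows of samples."""
    return [
        sum(indicator(s) for s in samples[k:k + window]) / window
        for k in range(0, len(samples) - window + 1, window)
    ]

# Toy stream: the statistic holds throughout the first window and never in the
# second, mimicking a slowly mixing chain whose estimates fluctuate across windows.
samples = [1] * 500 + [0] * 500
assert windowed_estimates(samples, lambda s: s == 1) == [1.0, 0.0]
```

In the matching example, `indicator` would be something like "red dot 1 is matched to blue dot 1"; a well-mixed chain shows nearly constant estimates across windows.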

And you can see that the chain takes a long time to move from

one region of the space to the other. The proposal distribution that I just

showed you by contrast is totally different.

You can see that the probabilities are almost totally flat, that all four chains are almost the same, and that there is very little change across windows over time.

And there's an even better proposal distribution that I didn't show you that

modifies the augmenting path by just a little bit and that gives you effectively

perfect mixing across the entire time. If we look at the other metrics that we defined for evaluating mixing, we can compare the estimates of different statistics across the two chains. So, for example, this might be the red chain and this might be the green chain.

And what we see here are the statistics that you get for different kinds of probabilities, for example the probability that this dot is matched to that dot.

And you can see that for the Gibbs chains, the estimates that you get are very noisy

and the different chains give you divergent estimates.

The first of the Metropolis-Hastings proposals gives you things that are almost on the diagonal, and here, things are effectively exactly on the diagonal: perfect mixing.

But to summarize, Metropolis-Hastings is a very general framework for building

Markov chains, so that they are designed to have a particular stationary

distribution. It requires that you come up with a

proposal distribution, and that proposal distribution needs to have

certain properties in order to be valid. But once you give me such a proposal

distribution, the detailed balance equation, which is this nice simple

equation, tells me how I can construct the acceptance distribution so as to enforce that we get the right stationary distribution.

So, this is great in some ways because it gives us this huge flexibility in

designing proposal distributions that explore the space quickly and move around

to far away parts of the space. But as we saw, picking the proposal

distribution is actually an important design point and it makes a big

difference to performance. And it's not, there isn't like a science

of how you should do this effectively. And so, there are and so this is

something that is really an art and requires nontrivial amount of thought and

engineering.