0:01

So over the course of the last few lectures,

we've talked about a number of different inference strategies and

how they might be applied to a range of different graphical models.

One question that comes up is whether the same strategies can be used for

the kinds of template-based models that we defined earlier in the course,

models that are defined via, say, dynamic Bayesian networks or via plate models,

or one of the other representations that uses repeated structure.

So, as we'll see, the answer to that is both yes and no.

That is, the same strategies can be applied, but

there are also some interesting new challenges that arise from this.

0:39

So, first, let's remind ourselves of how, for example,

a dynamic Bayesian network is specified.

So as a reminder,

a dynamic Bayesian network is specified using an initial time 0 distribution, which is

usually a Bayesian network, as well as the 2-TBN that gives us the transition model.

And it's certainly the case that one can take those two pieces and

construct an unrolled or ground Bayesian network.

And once we've done that,

that ground Bayesian network is a Bayesian network just like any other.

And so whatever inference techniques we have developed for answering queries for

any Bayesian network can be used in this context as well.

So, specifically, once we unroll the DBN for a given trajectory length, we can run

your favorite inference algorithm over the ground network to answer questions,

such as what is the location of the car at time 2,

given whatever evidence we might like to give it, so

observations up to time 2 or observations for the entire duration of the sequence.
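
As a concrete illustration of the unroll-and-query idea, here is a minimal sketch: a toy two-state DBN is written out by hand, and a posterior over the state at a chosen time is computed by brute-force summation over the unrolled network. All the probability tables are illustrative assumptions, not a model from the lecture.

```python
import itertools

# Toy 2-state DBN; the transition and observation numbers below are
# illustrative assumptions, not the lecture's model.
p_x0 = {0: 0.5, 1: 0.5}                                  # initial P(X0)
p_trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}     # p_trans[x][x']
p_obs = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}       # p_obs[x][o]

def query_state(t_query, observations):
    """P(X(t_query) | o(1:T)) by brute-force summation over the unrolled
    network -- exponential in T, so only viable for short trajectories."""
    T = len(observations)
    scores = {0: 0.0, 1: 0.0}
    for xs in itertools.product((0, 1), repeat=T + 1):   # assignments to X0..XT
        p = p_x0[xs[0]]
        for t in range(T):
            p *= p_trans[xs[t]][xs[t + 1]] * p_obs[xs[t + 1]][observations[t]]
        scores[xs[t_query]] += p
    z = sum(scores.values())                             # P(o(1:T))
    return {x: s / z for x, s in scores.items()}
```

For instance, `query_state(2, [1, 1, 0])` answers "what is the state at time 2, given the observations over the whole sequence" — the temporal analogue of the car-location query above.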

2:18

What are some of the challenges though?

So part of the difficulty, or one of the new dimensions offered by some of

these models, is the fact that they give rise to new inference tasks.

This is particularly true in the context of temporal models,

where we might often be interested in what you might call belief state tracking.

That is, keeping track of the state of the system as it evolves.

Now in some ways,

this is just a traditional probabilistic inference task because it corresponds

to asking what is our probability distribution over the state at time t,

given the observations that the agent has had access to up to time t.

So this over here, what we see as o(1:t), is

just a shorthand for o(1), o(2), up to o(t).

And so this is an inference task that can be answered using standard

probabilistic inference techniques over the unrolled network.

But there is a challenge that needs to be addressed

if you want to keep track of this probability distribution over

the course of a long trajectory without having a potentially unboundedly

large network that we have to keep running inference over.

So it turns out, fortunately, that one can do this in a dynamic

way without keeping track of the huge unrolled network at all points in time.

And this just comes directly out as a consequence of the Markov properties of

the graphical model.

So, specifically, we're going to do this as a two stage process,

where the first thing we're going to compute is this sigma(dot t + 1).

And the position of the dot indicates that although it's

a distribution over the state at time t + 1,

it's a distribution that doesn't take into account the time t + 1 observation.

And so this is defined to be, as written here,

the probability of the state at time t + 1, given the observations up to time t.

And so now we can do fairly straightforward

manipulations of this probabilistic expression.

So the first thing is we inject S(t) into the right

hand side of the conditioning bar and sum out over it, so

this is now a probability of S(t + 1), given S(t) and observations up to

time t times the probability of S(t) given the observations up to time t.

And now we can apply

independencies that arise from the structure of the graphical model.

And specifically, the fact that the state at time t + 1 is independent of everything

that occurred before, given the state at time t, which is going to allow us to

simplify the first term to the transition model, the probability of S(t + 1) given S(t).

5:08

And on the second term, which is probability of S(t) given o(1:t),

this is just a previously computed belief state, as it's called,

the sigma(t) that we are trying to keep track of.

And so this allows us to take sigma(t) and produce sigma(dot t + 1).

The second step is now to account for the observation model at time t + 1.

So, let's look at what that would involve.

So, now, we have sigma(dot t + 1), which we defined on the previous slide,

and now we want to condition that on the new piece of evidence,

the observation at time t + 1,

in order to derive the belief state, sigma(t + 1).

And so here we're just going to take that definition, probability of S(t + 1)

given all of the observations, the ones that went through t and the new one.

And we're going to apply Bayes' rule to define this and to reformulate this,

and that gives us probability of o(t + 1), given S(t + 1) and

the previous observations, times the probability of S(t + 1),

given o(1:t), divided by the probability of o(t + 1), given o(1:t).

This is a straightforward application of Bayes' rule, and once again, we can

now examine each of the terms in this expression and see what it corresponds to.

And looking first at this first term in the numerator,

we can see that this is the probability of the observation at time t + 1,

given the state at time t + 1 and a bunch of things that happened in the past.

And once again, conditional independence tells us that the stuff that happened in

the past is irrelevant and can be removed because of conditional independencies.
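
The two-step recursion derived above (predict with the transition model, then correct with the observation model) can be sketched as follows; the two-state transition and observation tables are illustrative assumptions:

```python
def predict(sigma_t, p_trans):
    """Step 1: sigma_dot(t+1)[x'] = sum over x of P(x' | x) * sigma_t[x]."""
    return {x2: sum(p_trans[x][x2] * sigma_t[x] for x in sigma_t)
            for x2 in sigma_t}

def correct(sigma_dot, p_obs, o):
    """Step 2: multiply in P(o(t+1) | x) and renormalize (Bayes' rule)."""
    unnorm = {x: p_obs[x][o] * sigma_dot[x] for x in sigma_dot}
    z = sum(unnorm.values())          # this is P(o(t+1) | o(1:t))
    return {x: v / z for x, v in unnorm.items()}

# Two-state example; the numbers are assumptions made up for illustration.
p_trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}
p_obs = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
sigma = {0: 0.5, 1: 0.5}              # initial belief state
for o in [1, 1, 0]:                   # fold in one observation at a time
    sigma = correct(predict(sigma, p_trans), p_obs, o)
```

Note that `sigma` is always a single distribution over the current state, so memory stays bounded no matter how long the trajectory gets.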

8:08

So let's look at how this might manifest in the context of an actual belief state tracking problem.

So, here is a situation where this model might be applied, so

let's first explain what we're seeing before we see a demo.

What you see in the green circle is the robot's true position,

which is not known to the robot, because the robot's trying to localize itself.

These blue lines are the observed sonar readings that the robot

gets at each point in the process.

And so you can see that the returned lengths of the sonar

readings correspond roughly to the distance of the robot from the wall.

But there is some noise here.

So for example, sometimes for whatever reason, the sonar appears to go

through the wall even though it should have been giving us a reading that

is considerably closer because it should have been reflected from the surface.

So sonars are quite noisy, and the visualization that we see reflects that.

The red dots that you see are a visualization of the robot's beliefs over

where it is.

In the initial stage, this is a uniform probability distribution,

because the robot doesn't know where it is.

And as we'll see, this gets clumpier and clumpier over time as the robot's belief

state over where it's likely to be becomes more and more definitive.

So let's look at what this looks like over time.

So what we're going to see is the robot's beliefs as it gets more and

more observations.
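
A toy version of this kind of localization can be sketched with the same predict-correct recursion on a one-dimensional corridor; the motion model, the sensor noise, and the readings below are all assumptions made up for illustration.

```python
# 1-D corridor localization sketch (an assumed toy model, not the demo):
# the robot starts with a uniform belief over N cells, moves right one
# cell per step, and senses its distance to the right wall noisily.
N = 10

def move_right(belief):
    """Deterministic motion for simplicity; mass at the wall stays put."""
    new = [0.0] * N
    for i, p in enumerate(belief):
        new[min(i + 1, N - 1)] += p
    return new

def sense(belief, reading, noise=0.2):
    """Likelihood: correct distance with prob 1-noise, else uniform error."""
    weighted = []
    for i, p in enumerate(belief):
        true_dist = N - 1 - i
        like = (1 - noise) if reading == true_dist else noise / (N - 1)
        weighted.append(p * like)
    z = sum(weighted)
    return [w / z for w in weighted]

belief = [1.0 / N] * N                 # uniform prior: robot is "lost"
for reading in [7, 6, 5]:              # consistent readings sharpen belief
    belief = sense(move_right(belief), reading)
```

Starting from the uniform prior, three consistent readings are enough to make the belief "clumpy", concentrating almost all mass on one cell.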

10:25

The second challenge associated with these template-based

models is computational.

So, sure, we can produce an unrolled Bayesian network and compute a posterior

over any subset of the variables using standard inference techniques.

But one of the consequences of the template-based representation is

that a fairly small template can give rise to

a very large probabilistic graphical model.

And large probabilistic graphical models can pose new inference challenges

in terms of the ability to scale inference to models of that size.

So, specifically,

if we look at the unrolled model that arises from a DBN and

think back to some of the analysis that we did regarding the complexity

of probabilistic inference for a particular probabilistic graphical model,

we remember, for example, what happens if we want to run

exact inference using, say, a clique tree over this unrolled network.

Suppose that we want a clique tree where,

say, the time 0 variables are in one clique of the model and

the time t variables, for some future t, are in some other clique,

so that they're not all together in one big clique.

Then the minimal sepset needs to separate these variables over here,

say the blue variables, the time 0 variables, from the green variables.

So the sepset must separate them.

And if you think about what separation imposes on us in this setting,

we can see that the only way to separate, for example,

these variables, the blue variables over here,

from the green variables over here is to put in the sepset at least

the persistent variables, that is the ones where there is an edge

from the variable at time t to its incarnation at time t + 1.

And so the minimal sepset in this context, the smallest sepset that we can

construct that would allow us to have different cliques for, say, time 0 and

time 2, is something that involves all of these variables in the middle.

And so, that can potentially involve a significant computational cost for

exact inference, especially when we have a large number of persistent variables.

A different way of understanding this is via the concept of entanglement,

if we're thinking of belief state tracking.

So if our goal is to maintain a belief state over, say, the variables at time 3,

and we're trying to think about how we can maintain this probability distribution,

the sigma of time 3, in a way that doesn't involve

maintaining a full explicit joint distribution over the variables at time 3,

we quickly realize that, really, we have no choice, because there are no

conditional independencies within time 3.

Now you might say, how is that?

This is a nicely structured model.

Why wouldn't there be conditional independencies?

Well, look at what's going on here.

Can we say that weather at time 3

is conditionally independent of failure at time 3?

Well, locally, they're not connected to each other within the time slice,

but there's certainly an active trail between them that goes,

for example, like this: from weather time 3, to weather time 2,

to weather time 1, to failure time 2, to failure time 3.

And so, this is an active trail between these two variables, which means that they

are, when you consider only the variables at time 3, not conditionally independent

of each other, given any of the other variables at time 3.

And so, this entanglement process occurs very rapidly over

the course of tracking a belief state in a dynamic Bayesian network,

which eventually means that the belief state, if you want to maintain

the exact belief state, is just fully correlated, in most cases.
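
To see this entanglement concretely, we can build a tiny two-variable DBN by brute force and check whether the two time 3 variables are independent; every CPD number below is an illustrative assumption, not the lecture's model:

```python
import itertools

# Two persistent binary variables: weather W and failure F. W evolves on
# its own, while F(t+1) depends on both W(t) and F(t).
pW0 = {0: 0.5, 1: 0.5}
pF0 = {0: 0.9, 1: 0.1}
pW = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}            # pW[w][w']
pF = {(0, 0): {0: 0.95, 1: 0.05}, (0, 1): {0: 0.3, 1: 0.7},
      (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.1, 1: 0.9}}  # pF[(w, f)][f']

# Joint over (W3, F3), summing out the earlier slices of the unrolled net.
joint = {(w, f): 0.0 for w in (0, 1) for f in (0, 1)}
for w0, w1, w2, w3, f0, f1, f2, f3 in itertools.product((0, 1), repeat=8):
    p = (pW0[w0] * pF0[f0]
         * pW[w0][w1] * pF[(w0, f0)][f1]
         * pW[w1][w2] * pF[(w1, f1)][f2]
         * pW[w2][w3] * pF[(w2, f2)][f3])
    joint[(w3, f3)] += p

pW3 = {w: joint[(w, 0)] + joint[(w, 1)] for w in (0, 1)}
pF3 = {f: joint[(0, f)] + joint[(1, f)] for f in (0, 1)}
# If W3 and F3 were independent, joint would equal the product of these
# marginals; it does not, because of the active trail through the
# earlier time slices.
```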

14:57

And so, this is a computational consequence of the fact that we have

a very large Bayesian network.

And if you want to think about just how large it can get, we can go back to

the real vehicle tracking example that we talked about in a previous lecture,

and note that all of these variables over here are all persistent variables.

And so, belief state tracking, if it were to be done exactly,

would have to maintain a full joint distribution over all of these variables,

which is likely to be intractable, certainly if you want real-time performance,

and probably even without that constraint.

And so, there has been a lot of work on approximate methods in the context of,

say, temporal models, because of the computational issues that they raise.

16:18

Consider, for example, a bipartite MRF, which is what we really have here, where the variables on the left

are connected to the variables on the right.

And in general, assuming that the graph is fairly dense,

we mentioned before that for exact inference,

the lowest cost that we can expect, if you have m variables on the left and

n variables on the right, is exponential in the minimum of m and n,

so, the minimum of the number of courses and the number of students.

And again, you can come up with cases where this is going to be lower than that.

But in general, for bipartite MRF like this, that's what you're going to get.

And assuming that the university has a reasonably high number of courses and

presumably even more students, this can become intractable very quickly,

giving rise to the need for approximate inference methods.
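
This min(m, n) behavior can be checked mechanically by simulating greedy variable elimination on the complete bipartite graph K_{m,n} and recording the largest clique it creates; this is a small sketch, not an inference implementation:

```python
def max_elim_clique(m, n):
    """Run greedy min-degree variable elimination on the complete
    bipartite graph K_{m,n}; return the size of the largest clique
    created along the way (eliminated node plus its neighbors)."""
    left = [f"L{i}" for i in range(m)]
    right = [f"R{j}" for j in range(n)]
    adj = {v: set(right) for v in left}      # every L connected to every R
    adj.update({v: set(left) for v in right})
    remaining = set(left + right)
    largest = 0
    while remaining:
        # pick a lowest-degree node (sorted() makes ties deterministic)
        v = min(sorted(remaining), key=lambda u: len(adj[u] & remaining))
        nbrs = adj[v] & remaining
        largest = max(largest, len(nbrs) + 1)
        for a in nbrs:                       # add fill edges among neighbors
            adj[a] |= nbrs - {a}
        remaining.remove(v)
    return largest
```

For K_{m,n}, the largest clique that elimination creates has min(m, n) + 1 variables, so the corresponding factor is exponential in min(m, n).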

18:03

We can also improve the classification of web pages by considering the connections between the web pages.

So specifically, by considering edges, in this case undirected edges,

between the categories of pages that are linked to each other.

So, for example, we see, and this is a model that was learned from data,

that students are quite likely to point to faculty, and

faculty are somewhat less likely to point to students.

And faculty are even less likely to point to other faculty.

And so that gives us some information about the correlations between labels

of web pages that are linked to each other, and gives rise to improvements

in performance that are actually quite considerable.

So, for example, a reduction in error from

18% to a little bit over 12%.

To summarize, inference in template and temporal models can be done, in principle,

by constructing the unrolled ground network and using standard inference methods.

However, temporal models specifically raise new inference tasks, such as real-time

belief state tracking, which require certain adaptations of our methods.