One of the most elegant properties of probabilistic graphical models is the intricate connection between the factorization of the distribution as a product of factors and the independence properties that the distribution must satisfy. We're now going to talk about how that connection manifests in the context of directed graphical models, or Bayesian networks.

First, let's remind ourselves why independence and factorization are related to each other. For example, the statement that P(X, Y) is the product of the two factors P(X) and P(Y) is the definition of independence, and at the same time it is a factorization of the joint distribution as a product of two factors. Similarly, one of the definitions we gave for conditional independence, that the joint distribution over X, Y, and Z is a factor over X and Z times a factor over Y and Z, is the definition of conditional independence, and once again it is a factorization. So we see that factorization of the distribution corresponds to independencies that hold in that distribution. The question is: if we know that a distribution P factorizes over a graph G, can we say something about the independencies that P must satisfy just by looking at the structure of the graph G?

So, what independencies might hold in a probabilistic graphical model? We talked about the notion of flow of influence, where we have, for example, an active trail that goes up through I, down through G, and back up through D when G is observed, that is, when G is in Z. That gave us an intuition about where probabilistic influence might flow. We can now turn this notion on its head and ask: what happens when we tell you that there are no active trails in the graph, so influence cannot flow? We make that notion formal using d-separation. We say that X and Y are d-separated in a graph G given a set of observed variables Z if there is no active trail between them given Z. The intuition we'd like to demonstrate is that this notion of "influence cannot flow" corresponds, much more formally, to the rigorous notion of conditional independence.

So let's prove that this is in fact the case. The theorem we'd like to state is this: if P factorizes over the graph G, and a d-separation property holds in the graph, that is, X and Y are d-separated given Z so there is no active trail between them, then P satisfies the corresponding conditional independence statement, X is independent of Y given Z. In short, d-separation implies independence whenever the distribution factorizes over G.

We're not going to prove the theorem in its full glory; we're going to prove it by example, because the example really does illustrate the main points of the derivation. The assumption is that here is our graph G, and here is the factorization of the distribution according to the chain rule for Bayesian networks, a factorization we've seen before. We'd now like to prove that a d-separation statement, and the independence that follows from it, is implied by this assumption. The statement we'd like to prove is that D is independent of S.
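For reference, here are those statements written out in symbols. The example factorization is the student network used earlier in the course (D for difficulty, I for intelligence, G for grade, S for SAT, L for letter), carried over as an assumption from the earlier lectures rather than restated here:

$$
\begin{aligned}
&\text{Independence as factorization:} && P(X, Y) = P(X)\,P(Y) \\
&\text{Conditional independence as factorization:} && P(X, Y, Z) = \phi_1(X, Z)\,\phi_2(Y, Z) \\
&\text{Theorem:} && P \text{ factorizes over } G \ \text{and}\ \text{d-sep}_G(X; Y \mid Z) \;\Rightarrow\; P \models (X \perp Y \mid Z) \\
&\text{Example factorization:} && P(D, I, G, S, L) = P(D)\,P(I)\,P(G \mid I, D)\,P(S \mid I)\,P(L \mid G)
\end{aligned}
$$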
First, let's convince ourselves that D and S are in fact d-separated in this graph. We see that there is only one possible trail between D and S. Since G is not observed in this case, and neither is L, the v-structure on that trail is not activated, so the trail is not active and the two nodes are d-separated. So we'd like to prove that the corresponding independence holds.

What is the joint probability distribution of D and S? Well, it's the sum over the joint distribution, marginalizing out the three variables that are not D and S, namely G, L, and I. So we have the sum over G, L, and I of the big product defined by the chain rule. As we've previously seen, when we have a summation like that, one of the things we can do is push the summations into the product, as long as we don't push them through terms that involve the variable being summed over. Here, for example, we have a summation over L, and we can push it in because the only term that involves L is P(L | G). If we push in the summations in this way, we end up with an expression in which the summation over L is innermost, then the summation over G, and finally the summation over I at the very beginning.

What do we know about the summation over L of P(L | G)? We know that this is a probability distribution over L, and therefore it necessarily sums to one, because we're summing over all possible values of L. Once this term is one, we can look at the next term and ask: what is the sum over G of P(G | D, I)? That, too, is one. By the time we're done, we end up with an expression in which one part is a factor phi_1 of D and the other is a factor phi_2 of S. Because we've partitioned the joint distribution as a product of two factors, one over D and one over S, we end up with something that corresponds directly to the definition of marginal independence, proving that P satisfies this independence statement.
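Collecting the steps of that argument in one place, using the student-network factorization assumed above:

$$
\begin{aligned}
P(D, S) &= \sum_{G, L, I} P(D)\,P(I)\,P(G \mid I, D)\,P(S \mid I)\,P(L \mid G) \\
&= P(D) \sum_{I} P(I)\,P(S \mid I) \sum_{G} P(G \mid I, D) \underbrace{\sum_{L} P(L \mid G)}_{=\,1} \\
&= P(D) \sum_{I} P(I)\,P(S \mid I) \underbrace{\sum_{G} P(G \mid I, D)}_{=\,1} \\
&= \underbrace{P(D)}_{\phi_1(D)}\; \underbrace{\sum_{I} P(I)\,P(S \mid I)}_{\phi_2(S)},
\end{aligned}
$$

which is exactly the product-of-two-factors form required by the definition of marginal independence.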
Having convinced ourselves that d-separation is an interesting notion, let's look at some other d-separation statements that might hold in a graph. One general statement we can show is that any node in the graph is d-separated from its non-descendants given its parents. Let's look at the variable Letter and see which non-descendants it has that are not its parents. It has only two descendants; let's mark those in blue, this one and that one. What about the non-descendants? Here's one non-descendant, here's another, another, and another. So those are the four non-descendants in the graph that are not parents of Letter.

Now let's convince ourselves that Letter is d-separated from all of its non-descendants given its parents. Take an arbitrary one of them, say SAT, and let's see whether we can find a trail from SAT to Letter that is active given the parents, which in this case is just the variable Grade. What's one potential trail? We can go up, and then back down into Letter. But is that trail active? No, it's not, because Grade is observed, and so it blocks the trail; in fact it will block any trail that comes in from the top. So any trail into Letter that comes in through its parent Grade is going to be blocked by the fact that Grade is observed. That means we have to come in through the bottom.

So let's look at the trail from SAT through Job and up into Letter. Is that trail active? No. The reason it's not active is that only Grade is observed, and Job, since it is a descendant of Letter, cannot be one of Letter's parents and so is necessarily not observed; neither Job nor Happy is observed, so this v-structure here cannot be activated. So trails that come in from the bottom don't work either. Again, this is a proof by demonstration, but it shows the general property in a very simple way.

By the theorem that we proved on the previous slide, this tells us that if P factorizes over G, then any variable is independent of its non-descendants given its parents, which is a nice intuition when you think about it. When we motivated the structure of Bayesian networks, we basically said that what makes us pick the parents for a node is that they are the only variables on which that node depends directly. This gives a formal semantics to that intuition: the variable now depends only on its parents.

Now that we've defined a set of independencies that hold for any distribution that factorizes over a graph, we're going to define a general notion, which is just a term for this. We defined d-separation in a graph G, and we showed that when the d-separation property holds, P satisfies the corresponding independence statement. So now we can look at the independencies that are implied by the graph G. I(G), the set of independencies implied by the graph G, is the set of all independence statements "X is independent of Y given Z" that correspond to d-separation statements in the graph. This is the set of all independencies derived from d-separation statements, and those are the ones that the graph implies hold for a distribution P that factorizes over it, as we just showed in the previous demonstration.

So now we can give this a name. We say that if P satisfies all the independencies of G, then G is an I-map of P, where I-map stands for independency map. The reason it's an independency map is that, by looking at G, we get a map of independencies that hold in P; it maps out independencies that we know to be true.

Let's look at examples of I-maps. Consider two distributions: here is one distribution, P1, and here is a second distribution, P2. Now consider two possible graphs. One is G1, which has D and I as separate, unconnected variables, and the other is G2, which has an edge between D and I. What independencies does G1 imply, that is, what is I(G1)? I(G1) is the single independence statement that D is independent of I, because there are no edges between them. What is I(G2)? I(G2) implies no independencies, because D and I are connected by an edge and there are no other independencies that can be implied, and so I(G2) is the empty set. So what does that tell us? Let's look at the connection between these graphs and these two distributions. Is G1 an I-map of P1? Well, if you look at P1, you can convince yourself that P1 satisfies "D is independent of I", so G1 is an I-map of P1.
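As a quick illustration of checking an I-map by examination, here is a minimal sketch in Python. The joint tables below are made-up numbers standing in for P1 and P2 (the actual tables are on the slide), chosen so that the first factorizes as a product of its marginals and the second does not:

```python
import numpy as np

# Hypothetical stand-ins for the two joint distributions over (D, I), as 2x2 tables P[d, i].
# P1-style table: built as a product of its marginals, so (D ⊥ I) holds.
P1 = np.outer([0.6, 0.4], [0.7, 0.3])
# P2-style table: same marginals as P1, but D and I are correlated, so (D ⊥ I) fails.
P2 = np.array([[0.5, 0.1],
               [0.2, 0.2]])

def satisfies_D_indep_I(P, tol=1e-9):
    """Check whether the joint table P(D, I) factorizes as P(D) * P(I)."""
    P_D = P.sum(axis=1)   # marginal over D
    P_I = P.sum(axis=0)   # marginal over I
    return np.allclose(P, np.outer(P_D, P_I), atol=tol)

print(satisfies_D_indep_I(P1))  # True:  P1 satisfies (D ⊥ I)
print(satisfies_D_indep_I(P2))  # False: P2 does not satisfy (D ⊥ I)
```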
What about the other direction: is G1 an I-map of P2? Well, if we look at P2, we can again convince ourselves by examination that P2 does not satisfy "D is independent of I". So the answer is no: G1 is not an I-map of P2. What about G2? G2 has no independence assumptions, and both P1 and P2 satisfy the empty set of independence assumptions. As it happens, P1 also satisfies additional independence assumptions, but the definition of an I-map doesn't require that the graph exactly represent the independencies in a distribution, only that any independence implied by the graph holds for the distribution. And so G2 is actually an I-map of both P1 and P2.

So now we can restate the theorem that we stated before. Before, we talked about d-separation properties; now we can give it one concise statement. We can say that if P factorizes over G, that is, if it is representable as a Bayesian network over G, then G is an I-map of P, which means I can read from G independencies that hold in P, regardless of the parameters, just by knowing that P factorizes over G.

Interestingly, the converse of this theorem also holds. That version of the theorem says that if G is an I-map of P, then P factorizes over G. What does that mean? It means that if the distribution satisfies the independence assumptions that are implicit in the graph, then we can take that distribution and represent it as a Bayesian network over that graph.

So once again, let's do a proof by example. Here is the little graph that we're going to use. Now, what do we know? We're going to write down the joint distribution over these five random variables using the chain rule for probabilities, not the chain rule for Bayesian networks, because we don't yet know that this distribution is representable as a Bayesian network. The chain rule for probabilities has us writing the joint as P(D) times P(I | D), which, if you multiply the two together using the definition of conditional probability, give us P(I, D). If we multiply that by P(G | I, D), we end up with P(G, I, D), and so on. We can construct this telescoping product, and by multiplying all of these conditional probabilities together we reconstruct the original joint probability distribution.

Great, now what? We haven't used any notion of independence yet, so where do we use it? Well, we're going to look at each one of these terms and see how we can simplify it.
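Written out, the chain-rule expansion we are starting from, using the variable ordering D, I, G, S, L described above, is:

$$
P(D, I, G, S, L) \;=\; P(D)\,P(I \mid D)\,P(G \mid I, D)\,P(S \mid D, I, G)\,P(L \mid D, I, G, S).
$$

This holds for any distribution; the independencies in I(G) are what will let us shrink each conditioning set down to the parents alone.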
P(D) is already as simple as it gets. Now let's look at the next term, which is P(I | D). Looking at this graph, D is a non-descendant of I, and I has no parents, so we are conditioning on all of I's parents, namely none; and as we've already shown, in this graph I is independent of D. So, using one of the definitions we have of conditional independence, we can erase D from the right-hand side of the conditioning bar and just get P(I). The next term, P(G | I, D), is already in the form that we want, so we don't need to do anything with it. Now let's look at P(S | D, I, G). Here we have the probability of a variable S given one of its parents, I, and two of its non-descendants, D and G. We know that S is independent of its non-descendants given its parent I, so we can erase both D and G and end up with P(S | I). Finally, we have the last term, P(L | D, I, G, S). Here again we have the probability of a variable L given its parent G and a bunch of non-descendants, so we can again erase the non-descendants and get P(L | G). Altogether, that gives us exactly the expression that we wanted: P(D) P(I) P(G | I, D) P(S | I) P(L | G). And so, again by example, we've shown that from these independence statements we can derive the fact that P factorizes according to the chain rule for Bayesian networks.

To summarize, we have provided two equivalent views of the graph structure. The first is the graph as a factorization, as a data structure that tells us how the distribution P can be broken up into pieces; in this case, the fact that G allows P to be represented as a set of factors, or CPDs, over the network. The second is the I-map view: the fact that the structure of G encodes independencies that necessarily hold in P. What we've shown is that each of these actually implies the other. Which means that if we have a distribution P that we know is representable as a Bayesian network over a certain graph, we can just look at the graph and, without even delving into the parameters, know certain independencies that have to hold in the distribution P. And independence is a really valuable thing, because it tells you something about what influences what, and, if I observe something, what else might change. That helps a lot in understanding the structure of the distribution and the consequences of different types of observations.
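Finally, as a numerical sanity check of the factorization-to-independence direction, here is a small sketch with made-up CPD tables for the student network. None of these numbers come from the lecture; you can change them freely, and the independence of D and S continues to hold, because it depends only on the graph structure, not on the parameters:

```python
import numpy as np

# Illustrative (assumed) CPDs for the student network D -> G <- I, I -> S, G -> L.
# D, I, S, L are binary; G takes three values. These numbers are not from the lecture.
P_D = np.array([0.6, 0.4])                       # P(D)
P_I = np.array([0.7, 0.3])                       # P(I)
P_G_ID = np.zeros((3, 2, 2))                     # P(G | I, D), indexed [g, i, d]
P_G_ID[:, 0, 0] = [0.3, 0.4, 0.3]
P_G_ID[:, 0, 1] = [0.05, 0.25, 0.7]
P_G_ID[:, 1, 0] = [0.9, 0.08, 0.02]
P_G_ID[:, 1, 1] = [0.5, 0.3, 0.2]
P_S_I = np.array([[0.95, 0.05],                  # P(S | I), rows indexed by I
                  [0.2, 0.8]])
P_L_G = np.array([[0.1, 0.9],                    # P(L | G), rows indexed by G
                  [0.4, 0.6],
                  [0.99, 0.01]])

# Joint distribution via the chain rule for Bayesian networks:
# P(D, I, G, S, L) = P(D) P(I) P(G|I,D) P(S|I) P(L|G)
joint = np.einsum('d,i,gid,is,gl->digsl', P_D, P_I, P_G_ID, P_S_I, P_L_G)
assert np.isclose(joint.sum(), 1.0)

# Marginalize out I, G, L and check the d-separation claim (D ⊥ S):
P_DS = joint.sum(axis=(1, 2, 4))                 # keep axes d and s
P_Dm = P_DS.sum(axis=1)
P_Sm = P_DS.sum(axis=0)
print(np.allclose(P_DS, np.outer(P_Dm, P_Sm)))   # True: P(D, S) = P(D) P(S)
```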