So, the reason we cannot estimate the average treatment effect using the difference between the treatment-group and control-group means when there are confounders is that the distribution of the confounders differs between the two groups. In sub-classification, we formed blocks using the propensity score, took the average treatment-control difference within each block, and then weighted each block by the proportion of the sample falling in that block. We can rewrite the sub-classification estimator, written at the first equality, as in the second equality. The second form shows that the sub-classification estimator essentially re-weights the treated observations within a block by the inverse of the treatment probability within the block, and the control observations by the inverse of the within-block probability of not receiving treatment. Now, in the special case of a block-randomized experiment, the treatment-group weights are just the inverse of the propensity score. This suggests, more generally, weighting the treated observations by the inverse of the propensity score and the untreated observations by the inverse of the probability of not receiving treatment, and forming the corresponding weighting estimator. Clearly, propensity score weighting reduces to propensity score sub-classification when the propensity score within block s is taken to be n_1s/N_s, from which it is evident that the sub-classification estimator is a crude version of weighting, with the weights replaced by an approximate propensity score applied to all observations in block s. This demonstrates that weighting by the inverse of the propensity score, that is, inverse probability weighting, creates confounder distributions in the treatment group and the control group that are the same. It also demonstrates that weighting is theoretically superior to sub-classification, as the latter generally leads to a biased estimate of the average treatment effect.
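To make the comparison concrete, here is a minimal simulation sketch (not from the lecture; the data-generating process and the choice of five quantile blocks are illustrative assumptions) contrasting the inverse-probability-weighting estimator with the sub-classification estimator when the true propensity score is known:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)                 # a single confounder
e = 1.0 / (1.0 + np.exp(-x))           # true propensity score e(x)
z = rng.binomial(1, e)                 # treatment assignment
y = 2.0 * z + x + rng.normal(size=n)   # outcome; true ATE = 2

# Inverse probability weighting: weight treated by 1/e, controls by 1/(1-e)
ate_ipw = np.mean(z * y / e) - np.mean((1 - z) * y / (1 - e))

# Sub-classification: five quantile blocks on the score; each block's
# within-block mean difference is weighted by the block's sample share
blocks = np.digitize(e, np.quantile(e, [0.2, 0.4, 0.6, 0.8]))
ate_sub = 0.0
for s in range(5):
    m = blocks == s
    diff = y[m & (z == 1)].mean() - y[m & (z == 0)].mean()
    ate_sub += m.mean() * diff
```

Both estimates land near the true effect of 2, but the sub-classification estimator carries a residual within-block bias because the score still varies inside each block.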
Now, to see a more formal justification for weighting, we can reason as follows. The first line just notes that Z_i Y_i is actually Z_i Y_i(1), and thus in the second line what we want to show is that the expectation of Z_i Y_i divided by the propensity score equals the expected value of the potential outcome under treatment. To see that, the first step is just iterated expectations, conditioning on the confounders. The second step uses the fact that treatment assignment is unconfounded, so Z_i and the potential outcome Y_i(1) are independent given X_i. Then we notice that the expected value of Z_i given X_i, divided by e(X_i), is just e(X_i)/e(X_i), or one, and the result follows: the weighting estimator is unbiased for the average treatment effect. That said, in practice a number of issues arise. First, for any given sample, the weights will generally not add to one; they do in expectation, but not in a particular sample. We can fix that by normalizing the weights, as we have done below, so that is not a real problem. A more serious problem with weighting is that the propensity score is typically unknown and must be estimated, so in practice we replace it with an estimated propensity score. Now, if the model for the propensity score is misspecified, this can create severe bias in the estimated average treatment effect, particularly at large and small values of the propensity score. For example, if the propensity score is 0.05 but we estimate it as 0.01, the treated observations are weighted five times more heavily than they ought to be. Similarly, if the score is 0.95 and is estimated as 0.99, the control observations are weighted five times more heavily than ought to be the case. So, in a certain sense, this reduces the advantage of weighting over sub-classification.
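The normalization and the misspecification arithmetic above can be sketched as follows (again a hypothetical simulation, not the lecture's own example; the normalized estimator here is the usual ratio form, sometimes called the Hajek estimator):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))           # true propensity score
z = rng.binomial(1, e)
y = 2.0 * z + x + rng.normal(size=n)   # true ATE = 2

# Raw weights: in a finite sample, sum(z/e)/n generally differs from 1,
# so divide each weighted sum by the realized sum of its weights
w1 = z / e
w0 = (1 - z) / (1 - e)
ate_norm = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Misspecification at the extremes: a true score of 0.05 estimated as
# 0.01 up-weights those treated observations roughly five-fold
ratio = (1 / 0.01) / (1 / 0.05)
```

The five-fold ratio is exactly the distortion described in the lecture: small absolute errors in the estimated score translate into large multiplicative errors in the weights near 0 and 1.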
That said, in both cases, sub-classification and weighting, when the propensity score is very small or very large there are often an insufficient number of treatment or control group observations, which creates additional uncertainty in estimation. That leads researchers who use either method to often trim the sample, removing all observations with estimated propensity scores above or below user-specified cut-offs. So the critique of weighting one might make above is mitigated somewhat when one trims the sample in any case. The weighting estimator is easily adapted to estimating the effect of treatment on the treated. How do we do that? First, note that the treatment-group mean is unbiased and consistent for the expected value of the treated potential outcome given treatment. So we only need to estimate the expected control potential outcome given treatment, and for that we re-weight the control group observations to the distribution of the propensity score in the treatment group. Assuming, as before, that the propensity score is estimated, this gives the estimator for the effect of treatment on the treated. Now, the weighting estimators can be implemented using least squares, with weights equal to the estimated propensity score to the minus one half for the treated observations, and one minus the estimated propensity score to the minus one half for the untreated observations. As with sub-classification, one might also add covariates to the regression. A seeming disadvantage is that it is then necessary to estimate two models, one for the propensity score and one for the regression. However, Robins and Ritov showed that if either the model for the propensity score or the regression function is specified correctly, the estimator of the average treatment effect is consistent. They refer to this property as double robustness; we will take that up in a few modules. Weighting estimators that use the propensity score also feature prominently in the literature on longitudinal causal inference.
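A sketch of the effect-of-treatment-on-the-treated estimator described above (hypothetical data-generating process; the key step is re-weighting controls by the odds e/(1-e), which maps the control distribution of the score onto the treated distribution):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))           # propensity score (known here)
z = rng.binomial(1, e)
y = 2.0 * z + x + rng.normal(size=n)   # homogeneous effect, so ATT = 2

# Treated mean estimates E[Y(1) | Z = 1] directly
treated_mean = y[z == 1].mean()

# Controls re-weighted by the odds e/(1-e) estimate E[Y(0) | Z = 1]
w = e / (1 - e)
control_reweighted = np.sum(w[z == 0] * y[z == 0]) / np.sum(w[z == 0])

att = treated_mean - control_reweighted
```

With an estimated score one would simply substitute it for `e`; trimming observations with extreme estimated scores before this step, as discussed above, stabilizes the odds weights.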
Here, the goal is to estimate the effect of a treatment regimen, that is, a whole sequence of treatments, versus an alternative regimen, where Z_t denotes the treatment assignment at time t and Z*_t denotes an alternative treatment assignment at time t. As an example of what one might want to do in longitudinal causal inference, one might ask whether an outcome is improved by taking a pill twice daily versus once a day in the morning over a period of t/2 days; you can work that out. In the sequel, we shall briefly consider this important literature on longitudinal causal inference. For now, we just note that the inverse of the propensity score features prominently in this literature as well.