The propensity score is used in the estimation of treatment effects in observational studies in essentially four ways: one, regression on the propensity score; two, subclassification on the propensity score; three, weighting using the propensity score (we haven't talked about weighting yet); and four, matching on the propensity score. Recall that, even assuming unconfoundedness, using linear regression in an observational study will lead to biased and inconsistent estimates of the average treatment effect when the regression function is not correctly specified. While it might be difficult to estimate a high-dimensional regression nonparametrically, the results from the previous lesson indicate that it would suffice to regress the outcome on the propensity score and the treatment assignment, and then average over the distribution of the propensity score. If the regression model has been correctly specified, this will yield a consistent estimate of the average treatment effect. But in practice the propensity score is unknown, so this strategy requires specification and estimation of two models: one for the outcome and one for the propensity score. Later we shall see that it is more common to use the propensity score as an additional regressor along with the covariates X, or in conjunction with the regression function to obtain a so-called doubly robust estimator. So let me move on to subclassification. Subclassification forms blocks s based on the covariate values, and then the data are analyzed as if they came from a block randomized experiment. Within each block there are n_s units, n_0s of which are controls and n_1s of which are treated. Paralleling the usual notation, Y-bar_1s will denote the average value of the outcome for the treated units in stratum s, and Y-bar_0s will be defined accordingly.
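To make the two-model strategy concrete, here is a minimal simulated sketch in Python (NumPy only; the data-generating process, sample size, and coefficient values are my illustrative assumptions, not from the lecture). The outcome is generated to be linear in the propensity score, so the outcome model is correctly specified here, and the propensity-score model is fit by Newton's method:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-0.5 * x))              # true propensity score
w = rng.binomial(1, e)                      # treatment assignment
y = 2.0 * w + 3.0 * e + rng.normal(size=n)  # outcome is linear in e, so the outcome
                                            # model below is correct; true ATE = 2

# Model 1: logistic regression for the propensity score, fit by Newton's method.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (w - p))
e_hat = 1 / (1 + np.exp(-X @ beta))

# Model 2: regress the outcome on treatment and the estimated propensity score.
Z = np.column_stack([np.ones(n), w, e_hat])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
tau_hat = coef[1]                           # estimated average treatment effect
```

With both models correctly specified, tau_hat should land close to the true effect of 2; misspecifying either model is exactly the fragility that motivates the doubly robust estimator mentioned above.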
We'll take the usual estimator in stratum s, the difference Y-bar_1s minus Y-bar_0s, and then estimate the average treatment effect by weighting these up by the stratum proportions n_s/n. If we instead use the weights n_1s/n_1, then we are weighting by the distribution of the treated observations, and that of course will give us an estimate of the effect of treatment on the treated. Rosenbaum and Rubin showed that for two units with the same value of the propensity score, one treated and one not, the difference in outcomes is unbiased for the average treatment effect at that value of the propensity score. This justifies subclassification based on the propensity score, and since the propensity score is a many-to-one function of the original covariates, this may help to address some of the problems noted above. So let's talk about subclassification with the propensity score first. To implement this, recall that the propensity score is typically unknown, so we replace it with an estimate. We then form S strata, where each unit i belongs to the stratum containing its estimated propensity score. Now we are grouping together observations with different propensity scores, and that introduces bias. But if the number of strata is large, the estimator ought to have minimal bias. The trouble is that by making S large, each subclass gets fewer observations, which increases the variability of the estimator, so we encounter the classic bias-variance trade-off. It's difficult to give general analytical results; these will depend on the form of the true regression functions in both the treatment and control groups, and also on how balanced the covariates are in the subclasses. But in practice the use of five to ten subclasses is often recommended, and for some reasonable examples this can reduce the bias relative to the unadjusted estimator by more than 90%.
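The subclassification estimator just described can be sketched as follows (simulated data; for brevity the strata are formed on the true propensity score rather than an estimate, and the sample size, five-quintile design, and effect size are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))               # propensity score; known here for brevity,
w = rng.binomial(1, e)                 # in practice substitute an estimate e_hat
y = 2.0 * w + x + rng.normal(size=n)   # simulated outcome, true ATE = 2

# Form S = 5 strata at the quintiles of the propensity score.
edges = np.quantile(e, [0.2, 0.4, 0.6, 0.8])
s = np.digitize(e, edges)              # stratum label 0..4 for each unit
n1 = (w == 1).sum()

tau_hat = 0.0                          # ATE: weight strata by n_s / n
att_hat = 0.0                          # ATT: weight strata by n_1s / n_1
for k in range(5):
    m = s == k
    y1 = y[m & (w == 1)].mean()        # Y-bar_1s
    y0 = y[m & (w == 0)].mean()        # Y-bar_0s
    tau_hat += (m.sum() / n) * (y1 - y0)
    att_hat += (w[m].sum() / n1) * (y1 - y0)
```

Note that the within-stratum grouping leaves a little residual confounding, which is the bias that a larger S, or the within-block regression adjustment discussed later, is meant to shrink.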
There are many possible ways to implement estimation by subclassification. For example, suppose one starts by estimating the propensity score model using logistic regression or a probit model. In practice, you might include not only main effects of covariates but a number of interactions as well; you can get fancier and use something like generalized boosted models if you want. Once you've estimated the model, you will often find that at higher values of the estimated propensity score there are no control matches, or relatively few, and that causes a problem. In the literature this is referred to as insufficient overlap. One way to deal with insufficient overlap is to choose the first cutpoint e-hat_1, and/or the last one e-hat_{S-1}, so that the lowest and highest intervals contain an adequate number of observations from both the treatment group and the control group. But then the first and/or last subclasses might include cases from the treatment and control groups that are quite different, and that will result in increased estimation bias. So another approach is to trim the sample by excluding observations with estimated values below and above some thresholds, and to estimate the treatment effect of interest on this region of so-called common support. For example, Imbens and Rubin exclude observations with an estimated propensity score less than the smallest value in the treatment group, or greater than the largest value in the control group. Next, a pre-specified number of subclasses is formed using propensity score intervals of equal length, and in each interval a test is conducted to assess whether or not the mean propensity score is different in the treatment and control groups. In other words, how good a job did we do in that subclass? In those intervals where the null hypothesis of no difference is rejected, you just split the interval until you fail to reject the null hypothesis.
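The common-support trimming rule just described can be sketched as follows (simulated data; the logistic form of the score and the sample size are illustrative assumptions, and the true score stands in for an estimated one):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
e_hat = 1 / (1 + np.exp(-2 * x))   # stand-in for an estimated propensity score
w = rng.binomial(1, e_hat)

# Trim to the region of common support: keep only units whose estimated score
# lies between the smallest treated value and the largest control value.
lo = e_hat[w == 1].min()
hi = e_hat[w == 0].max()
keep = (e_hat >= lo) & (e_hat <= hi)
```

The subsequent subclassification and estimation steps are then carried out on the retained units only, so the estimand becomes the treatment effect on the common-support region rather than on the full population.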
You stop splitting once a further split would result in an interval that doesn't contain both treatment and control observations. Now that you've determined the number and spacing of the intervals, the covariate distributions ought to be balanced across the treatment and control groups within each subclass. You can check this: Imbens and Rubin recommend using a normalized difference, but there are certainly other ways to check. Eyeballing it, of course, is a crude way, but there are others as well. The basic methodology admits a number of obvious refinements; for example, the propensity score model might be estimated using nonparametric logistic regression or machine learning methods for classification. Recall that in a completely randomized experiment, we saw that using linear regression to adjust for differences between the treatment and control groups resulted in an unbiased estimator of the average treatment effect with smaller variance than the unadjusted estimator. Since the randomized block experiment is a randomized experiment within blocks, and subclassification is an attempt to mimic a block randomized experiment, this suggests using linear regression within blocks to adjust for differences in covariate balance between the treatment and control groups. Thus, consider the following regression, where S_i is the subclass to which unit i has been allocated, and make the usual assumption that the expected value of the disturbance given the regressors and the subclass is zero. We then have an intercept and regression coefficients in each subclass, and tau in that subclass will be the average treatment effect. As before, using the fact that the OLS residuals and weighted residuals sum to zero, it follows that the difference between means in stratum s is equal to the treatment effect plus the difference between the covariate means multiplied by the regression coefficients for that stratum.
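The balance check mentioned above can be sketched as follows. This assumes the usual normalized-difference formula, the difference in covariate means divided by the square root of the average of the two sample variances; the simulated single-covariate setup is my own illustration:

```python
import numpy as np

def normalized_difference(x, w):
    """Difference in covariate means between treated and controls, scaled by
    the square root of the average of the two sample variances."""
    x1, x0 = x[w == 1], x[w == 0]
    return (x1.mean() - x0.mean()) / np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)

rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=n)                 # one confounding covariate
e = 1 / (1 + np.exp(-x))
w = rng.binomial(1, e)

# Imbalance in the full sample versus within propensity-score quintiles:
edges = np.quantile(e, [0.2, 0.4, 0.6, 0.8])
s = np.digitize(e, edges)
nd_full = normalized_difference(x, w)
nd_within = [normalized_difference(x[s == k], w[s == k]) for k in range(5)]
```

If the subclassification is doing its job, the within-stratum normalized differences should be much smaller in magnitude than the full-sample value.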
Now, the average treatment effect we would estimate as the sum over strata of the stratum proportions n_s/n times the estimated average treatment effect within subclass s. And if we weight instead by the proportion of the total number of treated observations falling in each stratum, n_1s/n_1, we get an estimate of the average treatment effect on the treated. We also want variances, but those are pretty easy to get out of this, because the weights are fixed within each stratum. For the average treatment effect, the variance is just the sum over strata of (n_s/n) squared times the variance of tau-hat*_s. And similarly for the ATT, we just use the different weights, (n_1s/n_1) squared.
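Putting the within-block regression adjustment and the weighting together, here is a hedged sketch (simulated data; the quintile strata, single covariate, and homoskedastic OLS variance are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))
w = rng.binomial(1, e)
y = 2.0 * w + x + rng.normal(size=n)      # simulated outcome, true ATE = 2

edges = np.quantile(e, [0.2, 0.4, 0.6, 0.8])
s = np.digitize(e, edges)                 # five propensity-score strata

tau_hat, var_hat = 0.0, 0.0
for k in range(5):
    m = s == k
    n_s = int(m.sum())
    # Within-stratum regression: outcome on intercept, treatment, covariate.
    Z = np.column_stack([np.ones(n_s), w[m], x[m]])
    coef, *_ = np.linalg.lstsq(Z, y[m], rcond=None)
    resid = y[m] - Z @ coef
    sigma2 = resid @ resid / (n_s - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    tau_hat += (n_s / n) * coef[1]         # weight the stratum estimate by n_s / n
    var_hat += (n_s / n) ** 2 * cov[1, 1]  # and its variance by (n_s / n)^2
se = np.sqrt(var_hat)
```

Swapping the weights n_s/n for n_1s/n_1 (and squaring them in the variance) gives the corresponding ATT estimate and standard error.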