Now, I'd like to briefly examine the use of machine learning methods to estimate treatment effects. This is a very new area of causal inference, with a lot of work being done right now. One of the earliest papers in this genre is Hill's paper, which used Bayesian additive regression trees, which I've already referred to, to estimate the regression function; she didn't estimate the propensity score, just the regression function. This is then used to estimate average treatment effects. A nice feature of this approach is that it generalizes readily to the case of a continuous treatment, and repeated draws from the posterior distribution of the estimand lead to interval estimates. Simulations suggest this approach is competitive with others across a variety of response surface specifications. Now, more recent work considers estimation of both the regression function and the propensity score, and takes into account the regularization-induced bias in estimating treatment effects; also, by doing both, you're going to get double robustness. There are a couple of approaches that have been pretty systematic. There's other work as well, but I'm just going to give you a flavor of the kinds of issues that arise when you want to use machine learning methods. I'm going to describe double/debiased machine learning as an approach that nicely illustrates some of the points I want to make. As a thorough understanding of this approach, or other approaches that use machine learning, requires more statistical training than is assumed for this course, I'm only going to give an elementary, illustrative example. Just like I said, to give you the flavor. So, following Chernozhukov et al. (and I'm mispronouncing that, for sure), consider estimating the average treatment effect using both the propensity score and the regression function in the partially linear regression model.
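For reference, the partially linear model from Chernozhukov et al. that the rest of this discussion refers to can be written as follows (I'm writing it out here since the lecture refers to it only verbally; the notation follows the lecture, with Z as the binary treatment indicator):

```latex
\begin{align}
Y &= Z\theta_0 + g_0(X) + U, &\quad \mathbb{E}[U \mid X, Z] &= 0,\\
Z &= m_0(X) + V, &\quad \mathbb{E}[V \mid X] &= 0,
\end{align}
```

where $\theta_0$ is the parameter of interest (the average treatment effect under unconfoundedness), and $g_0$ and $m_0$ are unspecified nuisance functions: $g_0$ is the regression adjustment and $m_0$ is the propensity score.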
We'll come back to the use of both momentarily, but I'm going to first talk about what happens if you just try to use the regression function. So, m naught of X and g naught of X are at this point not specified. It's not a linear model; it's not parametric. But of course, we'll make the usual assumptions about the errors (to distinguish them from the sample residuals). So, we'll make those usual kinds of assumptions. We're going to assume treatment assignment is unconfounded, and under this assumption, in the observed regression, the first one for Y, theta naught is the average treatment effect. So, the first thing to note is that we don't need to use the propensity score equation to estimate theta naught; we can skip directly to the regression function. But we still have to estimate, and do something about, the g naught part to get our hands on theta naught in a good way. So, now let's suppose that the data, consisting of N observations, is split into two groups, and g naught is estimated using the second group. You could use something like random forests, or some other machine learning procedure, or some ensemble of procedures; it doesn't matter. That gives me an estimate g naught "hat", and then I'm going to regress Y minus g naught "hat" on Z in the first group, and the first group is going to have size N, just like we've always been having. So now I'm back in my first group, with N one treated and N naught control units. Now I'm going to obtain an estimate theta naught hat of theta naught. I'm going to do the regression in the usual way, and that's going to give me the following estimator theta naught hat. What is that? We can rewrite it as follows, and I'm doing this because I'm going to want to say something about the asymptotic distribution of this, so that's where we're headed.
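The sample-splitting scheme just described can be sketched in code. This is a hypothetical simulation, not anything from the lecture: the data-generating functions, the sample size, and the use of a simple k-nearest-neighbour regression as a stand-in for "random forests or some other machine learning procedure" are all my own illustrative assumptions. Since treatment is unconfounded given X, g naught can be estimated by regressing Y on X among the untreated units in the auxiliary sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical partially linear model: Y = Z*theta0 + g0(X) + U,
# with a binary treatment Z that is unconfounded given X.
n = 4000
theta0 = 1.0
X = rng.uniform(-2, 2, size=n)
g0 = np.sin(2 * X) + X**2 / 2            # unspecified nonlinear nuisance function
pscore = 1 / (1 + np.exp(-X))            # treatment probability depends on X only
Z = (rng.uniform(size=n) < pscore).astype(float)
Y = Z * theta0 + g0 + rng.normal(size=n)

# Split the sample: estimate g0 on the second (auxiliary) group,
# then estimate theta0 on the first (main) group.
aux, main = np.arange(n // 2), np.arange(n // 2, n)

def knn_predict(x_train, y_train, x_new, k=25):
    # Crude k-nearest-neighbour regression, standing in for any ML learner.
    order = np.argsort(np.abs(x_train[None, :] - x_new[:, None]), axis=1)[:, :k]
    return y_train[order].mean(axis=1)

# Among untreated units, E[Y | X, Z=0] = g0(X), so fit on auxiliary controls.
ctrl = Z[aux] == 0
g_hat = knn_predict(X[aux][ctrl], Y[aux][ctrl], X[main])

# Naive estimator: regress Y - g_hat on Z (no intercept) in the main group.
Zm, Ym = Z[main], Y[main]
theta_naive = np.sum(Zm * (Ym - g_hat)) / np.sum(Zm**2)
print(theta_naive)
```

In this toy one-dimensional example the naive estimator happens to land near the truth, but, as the next part of the lecture explains, in general the slow convergence of the machine-learning estimate of g naught contaminates theta naught hat with regularization bias.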
So, I'm going to rewrite theta "hat" in the following way, and you can just do the arithmetic and see: I'm just subtracting g naught and adding g naught, so I get g naught compared to g naught hat. Now, the next thing I want to do is, remember that Y minus g naught was equal to Z theta plus U, the error, and so now I'm going to substitute in all that. Now I've got these two terms, and if you're familiar with regression, the first term is going to look very familiar, but the second term you normally wouldn't see. Now, I'm going to look at the asymptotic distribution, and we're going to see what happens. Is this estimator root-N consistent? The answer is going to be no. The first term, that's great; I mean, that's going to be very well behaved. You can see that the first part of the first term is going to converge in probability to the probability that Z equals one, inverted, and the second part is, if you look at it, a bunch of i.i.d. stuff with mean zero and finite variance, so we're going to get a normal limiting distribution out of that. So, the first term on the right-hand side has a limiting normal distribution. So far, that's great. The second term, however, is problematic. Remember, because of the bias that we introduced by regularization, this term is not going to have mean zero, and it can actually diverge to infinity, so that's where we have problems. So, what does that do? The above is supposed to tell you that naive applications of machine learning procedures will lead to biased estimates of the treatment effects, and we care about that. In the paper, Chernozhukov et al. go on to show that estimating the propensity score using the auxiliary sample, then subtracting this estimate from Z to form an estimate V hat (that's the estimated residual), and then using V hat as a regressor in place of Z, results in so-called debiased estimates of theta naught.
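The debiasing step just described can be sketched by extending the earlier toy simulation. Again, everything here is my own illustrative assumption (the data-generating process and the k-nearest-neighbour stand-in for a machine learning method): the propensity score is estimated on the auxiliary sample, subtracted from Z to form the residual V hat, and V hat is used as the regressor in place of Z.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical partially linear simulation as before.
n = 4000
theta0 = 1.0
X = rng.uniform(-2, 2, size=n)
g0 = np.sin(2 * X) + X**2 / 2
pscore = 1 / (1 + np.exp(-X))
Z = (rng.uniform(size=n) < pscore).astype(float)
Y = Z * theta0 + g0 + rng.normal(size=n)

aux, main = np.arange(n // 2), np.arange(n // 2, n)

def knn_predict(x_train, y_train, x_new, k=25):
    # Simple k-nearest-neighbour regression as a stand-in for any ML learner.
    order = np.argsort(np.abs(x_train[None, :] - x_new[:, None]), axis=1)[:, :k]
    return y_train[order].mean(axis=1)

# Both nuisance functions are estimated on the auxiliary sample only:
ctrl = Z[aux] == 0
g_hat = knn_predict(X[aux][ctrl], Y[aux][ctrl], X[main])   # estimate of g0(X)
m_hat = knn_predict(X[aux], Z[aux], X[main])               # estimated propensity score

# Debiased estimator: residualize the treatment, V_hat = Z - m_hat,
# and use V_hat as the regressor in place of Z itself.
V_hat = Z[main] - m_hat
theta_dml = np.sum(V_hat * (Y[main] - g_hat)) / np.sum(V_hat * Z[main])
print(theta_dml)
```

The point of the orthogonalization is that errors in g naught hat now enter multiplied by V hat, which has conditional mean zero given X, so first-order regularization bias cancels out; in the full procedure the roles of the two sample halves are also swapped and the estimates averaged ("cross-fitting"), which this one-fold sketch omits.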
Now, in general, your propensity score may also be estimated using machine learning methods, hence the label double machine learning. I'd also like to mention a related approach, targeted maximum likelihood. In the first step of targeted maximum likelihood, any number of machine learning methods can be used to predict the regression function, and combined into an estimator called the "super learner". The super learner can be used to estimate treatment effects, but the estimates are biased. So, in the second step, the propensity score is estimated and used in conjunction with the super learner from the first stage to create a doubly robust estimator of the treatment effect. I hope this gives you a little bit of a feel for the kinds of issues that arise when you use machine learning methods to estimate causal effects.
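To make the idea of combining an outcome model with a propensity score concrete, here is a sketch of one classical doubly robust construction, the augmented inverse-probability-weighted (AIPW) estimator. To be clear, this is not targeted maximum likelihood itself (TMLE's targeting step is more involved), just a simpler estimator that shares the doubly robust property: it is consistent if either the outcome regressions or the propensity score are estimated well. The simulation and the k-nearest-neighbour learners are, as before, my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical simulation: constant treatment effect, so ATE = theta0.
n = 4000
theta0 = 1.0
X = rng.uniform(-2, 2, size=n)
g0 = np.sin(2 * X) + X**2 / 2
pscore = 1 / (1 + np.exp(-X))
Z = (rng.uniform(size=n) < pscore).astype(float)
Y = Z * theta0 + g0 + rng.normal(size=n)

aux, main = np.arange(n // 2), np.arange(n // 2, n)

def knn_predict(x_train, y_train, x_new, k=25):
    # k-nearest-neighbour regression as a stand-in for a super-learner ensemble.
    order = np.argsort(np.abs(x_train[None, :] - x_new[:, None]), axis=1)[:, :k]
    return y_train[order].mean(axis=1)

# Step 1: outcome regressions by treatment arm, fit on the auxiliary sample.
trt, ctrl = Z[aux] == 1, Z[aux] == 0
mu1 = knn_predict(X[aux][trt], Y[aux][trt], X[main])    # E[Y | X, Z=1]
mu0 = knn_predict(X[aux][ctrl], Y[aux][ctrl], X[main])  # E[Y | X, Z=0]

# Step 2: propensity score, clipped away from 0 and 1 for stability.
e_hat = np.clip(knn_predict(X[aux], Z[aux], X[main]), 0.05, 0.95)

# Doubly robust (AIPW) estimate of the ATE on the main sample.
Zm, Ym = Z[main], Y[main]
ate_dr = np.mean(
    mu1 - mu0
    + Zm * (Ym - mu1) / e_hat
    - (1 - Zm) * (Ym - mu0) / (1 - e_hat)
)
print(ate_dr)
```

The structure mirrors the two steps described for targeted maximum likelihood above: a flexible outcome model first, then a propensity-score correction that removes the first-step bias.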