In the previous lesson, we discussed some randomized experiments: notably, the Bernoulli, the completely randomized, the block randomized, and a special case of the block randomized, the paired randomized experiment. In this lesson, I want to discuss randomization-based inference for these randomized experiments. Okay. So, you may or may not be familiar with randomization-based inference, but the key ideas are as follows. First of all, the potential outcomes are fixed constants. They're not random variables. Therefore, I'm going to write these using a lowercase y. If you've ever had a course on sampling, you should be familiar with this, because it's the same idea. Second, we do have probability, but it's not because these y's are random variables; it's because the assignment rule is probabilistic. So, probability enters only through the assignment rule, not through the outcomes themselves, which are fixed constants. Using this randomization-based inference, we're going to take up the sharp null hypothesis that there is absolutely no effect of treatment for any unit. Now, this is much stronger than an average effect of zero, because pluses and minuses could cancel out and average to zero. This is zero for every unit. Okay. We're going to need some more notation, of course. Let y_i(z) be the response of unit i under treatment assignment z in Omega, where z, you recall, is the vector of treatment assignments for everybody. Under that particular assignment, we can write out all the data that we get: y(z) = (y_1(z), ..., y_n(z)). Now, our null hypothesis is that for every subject i = 1, ..., n, and for every pair of assignments z and z' in Omega, the response vector y(z) is equal to y(z'), which is just (y_1, ..., y_n). Okay. So, this is very special. A lot of simplification occurs. So now, we want to test this hypothesis, or rather, we want to compute a p-value.
You are pretty used to doing that. When you normally do it, when the y's are random variables, you choose a reference distribution like the normal distribution, and you compute the p-value with respect to that. But here, we're not going to do it that way. Here, we're going to use a different reference distribution. The idea is quite ingenious and very simple, because under the null hypothesis of no effect for any unit, each subject's data is the same for every assignment. So, essentially, we can write down the data for the treatment group and for the control group under every possible assignment. Once we've chosen a test statistic, we can use that as a reference distribution. So, let's formalize this idea, though you can see the intuition is very clear, and it's very clever. Consider a test statistic t(Z, y). Capital Z, remember, is the random assignment vector for all the subjects, and the little vector y is the fixed constants for all the subjects. We're going to calculate the value of this test statistic under the observed assignment. But then, because we know the data under every assignment under the null hypothesis, we're going to calculate the test statistic for all the other assignments, and then we're going to use this to determine a p-value. In this case, one-sided: p* is the probability that we would observe a result as extreme as, or more extreme than, the result we actually observed, under the null hypothesis. So, this is the p-value for the test. You're familiar with that. It's the same thinking; you're just using a different reference distribution, one chosen from inside the experiment rather than imposed from outside. Now, for all the randomized experiments we considered in the previous lesson, all the assignment vectors in Omega are equally likely.
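The recipe just described, compute the statistic under the observed assignment, then under every other equally likely assignment, can be written as a short generic routine. This is a minimal sketch in Python; the names `randomization_p_value` and `diff_in_means` are mine, not from the lecture, and I assume a completely randomized design with a fixed number treated.

```python
from itertools import combinations
from fractions import Fraction

def randomization_p_value(y, treated, stat):
    """One-sided randomization p-value: the fraction of the equally
    likely assignments whose statistic is >= the observed value."""
    n, m = len(y), len(treated)
    t_obs = stat(y, frozenset(treated))
    # under the sharp null, y is the same for every assignment,
    # so we can evaluate the statistic on all n-choose-m assignments
    all_t = [stat(y, frozenset(z)) for z in combinations(range(n), m)]
    return Fraction(sum(t >= t_obs for t in all_t), len(all_t))

def diff_in_means(y, treated):
    """Treatment-group mean minus control-group mean."""
    t = [y[i] for i in treated]
    c = [y[i] for i in range(len(y)) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)
```

Note that the reference distribution is built entirely from the experiment itself; no normal or t distribution is imposed from outside.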
So, when we want the probability of the set of assignments giving a value as extreme or more extreme, since each assignment z in Omega is equally likely, each has probability equal to the cardinality of Omega to the minus one, and then we just add over all the assignments whose t values are greater than or equal to the observed t. So, that's very, very straightforward. So, let's actually look at this in practice. Just for fun, I'm going to give you this example. If you've ever had a course on categorical data, you might have been introduced to Fisher's exact test, which is exactly where this is coming from. Okay. So, here's the story. R.A. Fisher, who no doubt you've heard of, had a colleague, Dr. Bristol, who worked in his lab. Dr. Bristol claimed she could tell whether the milk in a cup of tea with milk was poured into the cup before or after the tea. So, Fisher says, "let's conduct an experiment to test this." In this experiment, there are eight cups in total: four received milk first, and four received tea first. Bristol then took a drink from each cup and responded "tea first," coded zero here, or "milk first," coded one. So here, with one stratum, Omega is the set of all column vectors of size n = 8 such that the number of cups with milk added first is four, and there are 70 such vectors. Bristol knew that there were four cups with milk added first and four cups with tea added first; she was told that. So, her response vector should also contain four zeros and four ones. Okay. Now, under the null hypothesis, Bristol would give the same sequence of responses no matter what the actual truth is. So, for step one, let's choose a test statistic; to keep it simple, let it be the number of correct responses. Then, we see how many Bristol got right. That's the observed value of the test statistic: Bristol got six right. Pretty good.
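The tea-tasting enumeration can be checked directly by brute force. This is an illustrative sketch: the particular set of cups Bristol labels "milk first" is hypothetical (under the sharp null her labels are fixed, so only the overlap with the truth matters), and the key observation coded below is that each extra agreement on a milk cup forces a matching agreement on a tea cup.

```python
from itertools import combinations
from fractions import Fraction

# Bristol's fixed responses under the sharp null: she calls a specific
# set of four cups "milk first" no matter what the truth is.
bristol_says_milk = frozenset({0, 1, 2, 3})  # hypothetical labeling

def n_correct(true_milk):
    """Number of correct responses out of eight cups."""
    agree_milk = len(bristol_says_milk & set(true_milk))
    # with four milk cups and four tea cups, each milk agreement
    # is mirrored by a tea agreement, so correct = 2 * overlap
    return 2 * agree_milk

# all 70 ways to choose which 4 of the 8 cups truly had milk first
assignments = list(combinations(range(8), 4))
t_obs = 6  # Bristol got six right
extreme = sum(n_correct(z) >= t_obs for z in assignments)
p_value = Fraction(extreme, len(assignments))
```

Running this confirms the counts quoted in the lecture: 70 assignments in total, 17 of them giving six or more correct (16 ways to agree on exactly three milk cups, plus the one way to agree on all four), so p = 17/70 ≈ 0.243.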
Now, for step three, we count how many correct responses Bristol would have gotten, given her fixed responses, under the other 69 possible assignment vectors. We find that there are 17 ways to get six or more correct, so the p-value is 17/70 ≈ 0.243. Can you see how there are 17 ways to get six or more correct? Now, this next example is less fun, but it's certainly more straightforward. Let's have a completely randomized experiment with four subjects, two of whom are assigned to the treatment group. All right. Simple permutations: there are six possible assignments, and each has probability one-sixth because it's a completely randomized experiment. The assignment that was actually chosen put the first two units in the control group and the second two units in the treatment group. The observed data were 1, 3, 4, 6. As a test statistic, let's choose the treatment group mean minus the control group mean, pretty standard stuff. All right. For the actual assignment, we get three. Now, we can compute this under the other five possible assignments as well. If we do so, we get the values two, zero, zero, minus two, and minus three. So, the probability under the null hypothesis of getting a result equal to or greater than three is one-sixth. I'm not going to deal with two-sided p-values, but they can be dealt with. Okay. So, let's look at some other test statistics. The median test statistic computes the median of the n responses and then counts the number of responses in the treatment group that exceed the median under the assignment vectors. Pretty intuitive sort of thing. Of course, if you were interested in some other quantile, you could certainly look at that. Now, maybe you've heard of the Wilcoxon rank sum test or, equivalently, the Mann-Whitney test statistic. Here, what you do is take the responses, the y's, and transform them to their ranks arranged in ascending order. Let's not worry about ties here.
You have to do some fiddling if there are ties, but we're not going to worry about that here. The test statistic is the sum of the ranks in the treatment group under the different assignment vectors. Again, you can use that to get a one-sided p-value p*. Now, consider the case where S is greater than one; remember, that's the block randomized experiment. And here, n_s = 2 for every block is the special case of the block randomized experiment, the paired randomized experiment. If you have binary data, just zeros and ones, there's a well-known test called McNemar's test, which uses the number of ones in the treated group. For that same case, if you had continuous data, the Wilcoxon signed-rank test statistic is often used. Here's how it works. Without loss of generality, we label the units so that the first unit in each pair is the one that receives treatment. Then, for each pair, compute the absolute value of the difference between the treated and the control observation, and convert these differences into ranks. Then, calculate the sum of the ranks for the subset of pairs where the response of the treated unit is higher than that of the control unit. That's what the little equation and the set there say. Okay? We can generalize these kinds of things further, and Rosenbaum (2002) has some nice work on that, with some nice examples. Then, there's the well-known book by Hollander and Wolfe on nonparametric statistics. Now, we've been assessing the null hypothesis of absolutely zero effect. But if you think about it for a moment, we can assess a much more general null hypothesis where the difference for each unit is T_i, where T_i is just something we specify. Because if we do that, for each observation we see either a y_i(0) or a y_i(1), and under the null hypothesis, we know the value of the potential outcome we don't see. So, again, for any assignment vector, you can therefore write down the values for all the responses.
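The signed-rank computation for a paired experiment can be sketched in a few lines. The pair data below are made up purely for illustration, and I assume no ties among the absolute differences, as in the lecture.

```python
# Wilcoxon signed-rank statistic for a paired randomized experiment.
# Hypothetical data: one treated and one control observation per pair.
treated = [12.1, 9.4, 15.0, 8.2, 11.3]
control = [10.0, 9.9, 12.5, 8.0, 10.1]

# step 1: treated-minus-control difference within each pair
diffs = [t - c for t, c in zip(treated, control)]

# step 2: rank the absolute differences in ascending order (1 = smallest)
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0] * len(diffs)
for r, i in enumerate(order, start=1):
    ranks[i] = r

# step 3: sum the ranks of pairs where the treated unit responded higher
signed_rank_stat = sum(r for d, r in zip(diffs, ranks) if d > 0)
```

As with the earlier statistics, the reference distribution comes from recomputing this under the other within-pair assignments (flipping which unit in each pair is treated), which is what the set in the lecture's equation ranges over.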
Now, as it turns out, it might be a pretty demanding task to specify T_i for each case. So, what usually happens is that a constant effect T_i = T is specified. This is the null hypothesis of a constant effect; T = 0 is just what we've been talking about before. Okay. This null hypothesis can be tested for different values of T the same way as before: we basically shift the outcomes, subtracting T from the treated responses, and then we're in business. So, we can obtain a confidence interval as the set of values of T for which the null hypothesis is not rejected at level alpha. So, let's go back to our toy example. Suppose we want to test the null hypothesis that T equals one versus the alternative hypothesis that T is greater than one. Under the null hypothesis, we can write down the values y_1(0), y_2(0), y_3(0), and y_4(0), and the values y_i(1) for each of the cases. So, if z is the assignment where the first two units are assigned to treatment and the second two units are assigned to the control group, we get a difference of minus one. Continuing, we get, under this particular null hypothesis, that the probability that t is greater than or equal to three is one-third. So, now we confront a few other questions, now that we've introduced this. What test statistic should we use? Well, we could use several in different situations. Some considerations are, first, sensitivity to the departures of interest from H_0: we may be interested in detecting some kinds of departures more than others, and we can choose a test statistic accordingly. Second, robustness to outliers: for continuous data, for example, with observations way out there, this is one reason why sometimes people use ranks or medians rather than means. Then, another consideration is power. Now, I want to talk about large experiments. In a completely randomized experiment with n units, m of them treated, there are n-choose-m assignments.
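The constant-effect test on the toy example works by filling in both potential outcomes for every unit, which the null makes possible. This sketch (names are mine) tests H0: T = 1 on the same data as before and recovers the one-third p-value quoted in the lecture.

```python
from itertools import combinations

# Toy example, now testing the constant-effect null H0: T = 1.
# Under H0 each unit's two potential outcomes differ by exactly T,
# so both can be filled in from the observed data.
y_obs = [1, 3, 4, 6]
observed_treated = {2, 3}   # second two units actually got treatment
T = 1

# impute y_i(0) by subtracting T from the treated units' responses
y0 = [y - T if i in observed_treated else y for i, y in enumerate(y_obs)]
y1 = [v + T for v in y0]    # y0 = [1, 3, 3, 5], y1 = [2, 4, 4, 6]

def diff_means(treated):
    """Difference in means using the imputed potential outcomes."""
    t = [y1[i] for i in treated]
    c = [y0[i] for i in range(4) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)

t_obs = diff_means(observed_treated)  # 3.0 for the actual assignment
vals = [diff_means(set(z)) for z in combinations(range(4), 2)]
p_value = sum(v >= t_obs for v in vals) / len(vals)  # 2/6 = 1/3
```

Inverting this test, collecting all T not rejected at level alpha, is exactly how the confidence interval mentioned above is obtained.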
So, even with modern computers, this set of assignments can be too large to handle easily. For example, if n equals 500 and half of the subjects are assigned to the treatment group, there are more than 10 to the 149 assignments. It's incredible. You're going to break the computer enumerating that one. So, what can you do? One, you can sample the assignments. The second thing you can do, and this is often done, is derive the mean and the variance of the test statistic under the null hypothesis, which you can do for some of these test statistics, and use a normal approximation, a normal-theory test. So, that's another possibility. All right. So, let's have a general look at this randomization-based inference. There are some very nice things about it. It makes minimal assumptions, as you could see. We use an internal reference distribution instead of trying to impose some distribution from the outside, like a normal or a t or something. It's pretty easy to understand and quite intuitive, especially once you get the hang of it. It's got some disadvantages too. Sometimes the sample itself is of interest, but often, you want to get outside the sample and extrapolate beyond it. That's of course going to require more assumptions, but you might be willing to trade off those assumptions for the greater generality that you will get. Second of all, so far, it doesn't handle heterogeneity well. Remember, we talked about the case where the effect is T_i, varying by unit; it's kind of hard to deal with that case. There are some things you can do; Rosenbaum has done some very nice work, and you might see his 2010 book. Of course, this is of interest because we probably think that treatment effects are heterogeneous in most cases of interest to us. So, we really need something that can deal with that. So, actually, now, I'm going to turn to estimation, focusing primarily on randomization-based inference for the sample average treatment effect.
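The first remedy for large experiments, sampling the assignments, amounts to a Monte Carlo approximation of the randomization p-value. This is a sketch under made-up data (the responses are simulated just to have something to permute); the add-one correction, which counts the observed assignment itself as one of the draws, is a common convention rather than something from the lecture.

```python
import random
from statistics import mean

random.seed(0)
n, m = 500, 250
y = [random.gauss(0, 1) for _ in range(n)]  # fixed responses under H0
observed_treated = set(range(m))            # pretend the first half was treated

def diff_means(treated):
    """Treatment-group mean minus control-group mean."""
    t = [y[i] for i in treated]
    c = [y[i] for i in range(n) if i not in treated]
    return mean(t) - mean(c)

t_obs = diff_means(observed_treated)

# instead of all C(500, 250) assignments, draw a few thousand uniformly
draws = 2000
hits = sum(diff_means(set(random.sample(range(n), m))) >= t_obs
           for _ in range(draws))
p_hat = (hits + 1) / (draws + 1)  # add-one correction: include the observed draw
```

The second remedy, a normal approximation using the null mean and variance of the statistic, avoids sampling altogether when those moments can be derived in closed form.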