So, here we are in RStudio with our coursera.r file, and we're moving on to the scenario where we compare the number of distinct pages visited in an A/B test. We're going to go through a few analyses to do that. As the comment indicates, what we'll be doing is an independent-samples t-test, and we'll talk more about that as we go.
As is our usual procedure, we'll read in the data file that goes with this work, which is pgviews.csv, for page views. And as is our typical process, we'll take a view of it so we can be comfortable with it. We have a subject column, and we can see that each subject is measured just once, it seems. Then there's a site column, which tells us which site they were shown, A or B. As I scroll down, the view refreshes, so I'll go all the way to the bottom. And we do have 500 subjects, as we said in the description. Then there's a column called pages, and it looks to hold mostly single-digit counts of how many pages were viewed. One would be the minimum, we would guess, and I saw maybe a 10 or an 11 in there as a maximum. We can find out more formally what those are, but that gives us a sense of what we're dealing with.
We'll go ahead and recode that subject column as a factor. Since it's just a number, R thinks it's a numeric variable, but as we've now talked about variable types, we know we want it to be a factor. It won't be used directly in this analysis, but we're going to keep doing this good practice because, as our analyses grow more sophisticated, we'll end up using the subject later. Then let's go ahead and take a look at the summary. We can see that there are 500 distinct levels of subject, these six plus 494 others. That's just the subject identifier. It looks like 245 of those subjects were exposed to site A and 255 to site B.
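In code, the steps just described might look roughly like the sketch below. It is not the course's actual script. The file name pgviews.csv and the column names Subject, Site, and Pages are assumptions based on the narration, and because the real file isn't available here, the sketch first writes a small simulated stand-in to a temporary file so that it runs on its own.

```r
# Simulated stand-in for the course data (hypothetical values, NOT the real
# pgviews.csv). Column names Subject, Site, Pages are assumed from the narration.
set.seed(1)
fake <- data.frame(
  Subject = 1:500,
  Site    = rep(c("A", "B"), times = c(245, 255)),
  Pages   = c(rpois(245, 2.4), rpois(255, 3.5)) + 1  # page counts of at least 1
)
path <- file.path(tempdir(), "pgviews.csv")
write.csv(fake, path, row.names = FALSE)

# Read the data in; with the real file this would be read.csv("pgviews.csv").
pgviews <- read.csv(path)
# View(pgviews)  # spreadsheet-style viewer in RStudio (interactive sessions only)

# Recode Subject as a factor: it is an identifier, not a quantity.
pgviews$Subject <- factor(pgviews$Subject)

# Assumption: also make Site a factor so summary() reports counts per site
# (R 4.0 and later read strings as character by default).
pgviews$Site <- factor(pgviews$Site)

summary(pgviews)
```

With the simulated data, the numbers in the summary will differ from the ones read out in the lecture; only the group sizes of 245 and 255 are copied from it.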
So it's very nearly a 50/50 test, and certainly a realistic outcome, as is often the case. And then, because pages is a numeric response variable, the summary computes a min and a max for us, 1 and 11, and some other statistics. We can see the mean is right near four and the median is four.
We'll also look a little more at some descriptive statistics using the plyr library. This function, ddply, allows us to apply a function over subsets of the table. And I'll remind you, you can always type a question mark and then a function name, assuming that the library for it is loaded, and it'll bring up the help for that name. So ddply splits the data frame, applies a function, and returns the results in a data frame. What we see as input here is the data table itself, which is the page views. We want to split by site and apply this inline function, where we are summarizing the pages by site. When we do that, we can see for each site, A and B, the same statistics that we saw before overall, but now split by site. The mean for site A is 3.4, and the mean for site B is almost 4.5. That suggests there may be a difference, but we've learned that comparing means directly is not the full story. We need to know something about the variance.
So this other call allows us to summarize and get the mean number of pages, which we have here, but also the standard deviation, which is of interest. We can see that the standard deviation of pages viewed in the site A condition was about half that in the site B condition. So there were more pages viewed in site B, but also with greater deviation around that mean. One way to view that is with a histogram. We can call the hist function and look at the number of pages for site A. So we can just graph that, and we can see a couple of things about this.
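A rough sketch of these descriptive statistics follows. The data frame is a simulated stand-in, not the course file, and the column names Site and Pages, along with the output names Pages.mean and Pages.sd, are assumptions. The per-site mean and standard deviation are computed in base R first so the sketch runs without extra packages; the ddply call that the lecture uses runs only if plyr is installed.

```r
# Simulated stand-in data (hypothetical values; column names are assumed).
set.seed(1)
pgviews <- data.frame(
  Subject = factor(1:500),
  Site    = factor(rep(c("A", "B"), times = c(245, 255))),
  Pages   = c(rpois(245, 2.4), rpois(255, 3.5)) + 1
)

# Mean and standard deviation of Pages for each Site, in base R.
by_site <- aggregate(Pages ~ Site, data = pgviews,
                     FUN = function(x) c(mean = mean(x), sd = sd(x)))
print(by_site)

# The same summary with plyr's ddply, as in the lecture, if plyr is available.
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::ddply(pgviews, ~ Site, plyr::summarise,
                    Pages.mean = mean(Pages), Pages.sd = sd(Pages)))
}

# Histogram of Pages for site A only (swap "A" for "B" to see the other site).
hist(pgviews$Pages[pgviews$Site == "A"], main = "Site A", xlab = "Pages")
```

Typing ?aggregate or ?hist at the console brings up the help pages for the base functions, and ?ddply does the same once plyr is loaded.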
We can see the range, from about one to six. And we can see that for site A it looks to be roughly a normal distribution, a bell curve or Gaussian curve. Let's go ahead and look at a histogram of site B. Here we can see something a little different. There are very few observations up above, at seven, eight, and ten pages, and quite a few down lower. It doesn't quite look like a bell curve. It doesn't look normally distributed, and those kinds of considerations will come up as we go forward in the course. For now, we're going to ignore those differences, but they are relevant and we will talk about them more in the future.
Another way to look at the data is a box plot. So with the plot command, we can see pages by site. And now we understand that notation a little better: pages being the y variable, the outcome, by site, which is our independent variable, or x variable if you will.
In the meantime, then, we're going to execute our independent-samples t-test. Why is it independent samples? What does that mean? Remember that factors can be between subjects or within subjects. Between subjects is the type of factor that site would be, because each visitor gets either website A or B, but not both. So it's an independent-samples t-test. In the future we'll see a paired-samples t-test, which is appropriate for within-subjects situations. You can see this parameter at the end of the call to t.test, var.equal. That's saying the variances are equal. We can see in this box plot that that's obviously not true, and we'll formalize that consideration in the future, as I said, but for now we'll just do a basic uncorrected t-test assuming that the variances are equal. In reality, t-tests are fairly robust to deviations in variance. The variances don't have to be exactly equal anyway. So let's go ahead and execute that, and we can see that we have the t-test here. Well, what does this output mean?
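The box plot and the test might be written as in the sketch below. Again the data frame is a simulated stand-in with assumed column names, so the printed statistics will not match the values discussed in the lecture.

```r
# Simulated stand-in data (hypothetical values; column names are assumed).
set.seed(1)
pgviews <- data.frame(
  Subject = factor(1:500),
  Site    = factor(rep(c("A", "B"), times = c(245, 255))),
  Pages   = c(rpois(245, 2.4), rpois(255, 3.5)) + 1
)

# Box plot of Pages (y, the outcome) by Site (x, the between-subjects factor).
plot(Pages ~ Site, data = pgviews)

# Independent-samples t-test assuming equal variances (Student's t-test).
# Leaving var.equal at its default of FALSE would run Welch's t-test instead.
result <- t.test(Pages ~ Site, data = pgviews, var.equal = TRUE)
print(result)

# Individual pieces of the output can be pulled out by name.
result$statistic  # the t value
result$parameter  # degrees of freedom: 500 subjects minus 2 groups = 498
result$p.value    # the p-value
result$estimate   # the two group means
```

The formula Pages ~ Site is the same notation in both calls: the response on the left, the factor on the right.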
So, the data line confirms we're looking at pages by site, and that's in fact exactly the design we talked about. The t value is the t statistic, so just like with the chi-squared statistic in the previous analyses we went through, the t statistic is the value in the t distribution that we are getting from this data. The degrees of freedom is 498, obviously related to the 500 subjects that we have: it's 500 minus 2, one for each group. And then the p-value is very, very small, far less than 0.0001. That's about all we care about, that it's very near zero. There are some other results as well: we can see the means for groups A and B are just what we saw before in those summary statistics. So the bottom line here is we have a significant difference between the number of pages visited in website conditions A and B. Okay. So that is the t-test for our simple website A/B test. And it might suggest to us that, because people visit more distinct pages in website B, maybe we go with that.
Let's return now to our table of analyses and see where this has brought us. As you know from before, we completed the top table, the tests of proportions, previously, and now we've come down to the analysis of variance table. We're in that first row, and what's turned red there is the independent-samples t-test that we just did. If we look at the left column, it has one factor, and that was site. It had two levels, and it was a between-subjects factor, which is what the B in the third column means. And we're in a parametric test. Next time we talk we'll get more into what the difference between parametric tests and non-parametric tests is. But you can see the table sets up a sort of equivalence relationship, where if we're in a parametric situation we have certain tests, and if we're in a non-parametric situation we have others. For now you can think of the difference as whether or not we can make certain assumptions about the data, which are required for parametric tests.
For example, that the data is normally distributed is a common assumption we'll have to contend with, and for many measures the data is. We can see in these box plots, however, that for site B the data is clearly not normal, and we saw that in the histogram as well. So that's the difference between those columns, and we'll formalize that more as we go. But we've done the independent-samples t-test, and that's where we'll leave it for now.
Let's see how we would report that t-test result from our website A/B test in writing. So we analyzed page views, and our result was a t-test, which we indicate here. It has one parameter for its degrees of freedom, and that was 498. So this is its degrees of freedom. This is the test type, obviously, and the test statistic was 7.21. In our case it came out as negative 7.21. You can put the sign in or not, it's up to you, and really it just reflects which order the two levels of the website factor were in. If you compare A to B, then you'll get negative 7.21. If you flip that and compare the mean of B to the mean of A, then it will be positive 7.21, so it really doesn't matter whether you have the minus sign or not. So that's the statistic. And then the p-value, and we've talked about how to report those, so we can use p is less than 0.0001, given how small the p-value was. Put together, that reads t(498) = 7.21, p < 0.0001. So that's how we would report our independent-samples t-test for page views.
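As a rough illustration of the two points about reporting, the sketch below shows that reversing the order of the site levels flips only the sign of t, and it formats a result in the t(df) = value, p style. The data are simulated stand-ins with assumed column names, and report_t is a hypothetical helper written for this example, not a function from the course or from any package.

```r
# Simulated stand-in data (hypothetical values; column names are assumed).
set.seed(1)
pgviews <- data.frame(
  Subject = factor(1:500),
  Site    = factor(rep(c("A", "B"), times = c(245, 255))),
  Pages   = c(rpois(245, 2.4), rpois(255, 3.5)) + 1
)

# Default level order: A first, so the test compares A to B.
a_vs_b <- t.test(Pages ~ Site, data = pgviews, var.equal = TRUE)

# Reverse the level order so the test compares B to A instead.
flipped <- transform(pgviews, Site = factor(Site, levels = c("B", "A")))
b_vs_a <- t.test(Pages ~ Site, data = flipped, var.equal = TRUE)

# Same magnitude, opposite sign.
c(A_vs_B = unname(a_vs_b$statistic), B_vs_A = unname(b_vs_a$statistic))

# Hypothetical helper that formats a t-test result for a write-up.
report_t <- function(res) {
  p <- if (res$p.value < 0.0001) {
    "p < 0.0001"
  } else {
    sprintf("p = %.4f", res$p.value)
  }
  sprintf("t(%d) = %.2f, %s", as.integer(res$parameter), res$statistic, p)
}

report_t(a_vs_b)
```

Whether to keep the minus sign in the write-up is a matter of choice, as long as the group means are reported so the direction of the difference is clear.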