Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.


From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

34 ratings


From the lesson

Techniques

This module is a bit of a hodgepodge of important techniques. It includes methods for discrete matched-pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

So here's another example that was on Wikipedia, which is a wonderful example. It's kind of a famous example in this area, just to describe how this can happen numerically. It's about batting averages in American baseball. If you've never seen or heard of baseball, you know, look it up. The player goes up with a bat and the pitcher throws the ball really hard. It's really quite difficult to hit a baseball, especially in the professional leagues, where they throw the ball a hundred miles an hour. So the player swings the bat and tries to hit the ball. The percentage of the time that the player hits the ball, basically, not exactly, is going to be their so-called batting average, right? And really good players can do this, let's say, 30% of the time. Those are excellent players; most players are worse than that.

Okay. So here are two players and their batting averages. Player 1 had 10 at bats in the first half of the season and got 4 hits. Player 2 had 100 at bats and got 35 hits. So Player 1's batting average was 40% and Player 2's batting average was 35%. In the second half of the season, Player 1 got 25 hits out of 100, a 25% batting average, and Player 2 got 2 hits out of 10, a 20% batting average. So in both the first half and the second half of the season, Player 1 had a better batting average than Player 2, right? But if you just add up these numbers, you get 29 hits out of 110 at bats for the whole season for Player 1 and 37 hits out of 110 at bats for the whole season for Player 2. That's about 26% for Player 1 and 34% for Player 2, so Player 2 has the better batting average overall.

So it seems paradoxical that a person can have a better batting average for both the first half and the second half of the season, but have a worse batting average overall. But of course the numbers work out, right? You can see it. And I put in the number of at bats here because that is what's coming into play: Player 1 had their very good batting average when they had relatively few at bats and their modest batting average when they had lots of at bats, and vice versa for the other player. So that, I think, is really the culprit in this case.
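The batting-average arithmetic can be checked directly. Here is a minimal Python sketch using only the hits and at bats quoted in the lecture; the names `PLAYER_1`, `PLAYER_2`, `batting_average`, and `season_totals` are made up for illustration.

```python
# (hits, at bats) for each half of the season, as quoted in the lecture.
PLAYER_1 = {"first half": (4, 10), "second half": (25, 100)}
PLAYER_2 = {"first half": (35, 100), "second half": (2, 10)}


def batting_average(hits, at_bats):
    """Fraction of at bats that ended in a hit."""
    return hits / at_bats


def season_totals(halves):
    """Pool hits and at bats over both halves of the season."""
    hits = sum(h for h, _ in halves.values())
    at_bats = sum(n for _, n in halves.values())
    return hits, at_bats


if __name__ == "__main__":
    # Within each half, Player 1 has the higher average.
    for half in PLAYER_1:
        p1 = batting_average(*PLAYER_1[half])
        p2 = batting_average(*PLAYER_2[half])
        print(f"{half}: Player 1 {p1:.1%}, Player 2 {p2:.1%}")

    # Pooled over the season, the ordering flips.
    p1 = batting_average(*season_totals(PLAYER_1))
    p2 = batting_average(*season_totals(PLAYER_2))
    print(f"whole season: Player 1 {p1:.1%}, Player 2 {p2:.1%}")
```

The pooled average is a weighted average of the two halves, weighted by at bats, so each player's season figure sits close to whichever half had 100 at bats.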

Another very famous example of Simpson's Paradox is the so-called Berkeley admissions data. Conveniently, it's in R, so I'll cover it a little bit and you can explore it yourself. To get it, you can just do help(UCBAdmissions), and that will describe the data set; data(UCBAdmissions) will load it up. Then here I give a little command, apply(UCBAdmissions, c(1, 2), sum), which gets the appropriate margin. So here we look at admitted versus rejected by gender, and we get that the acceptance rate was higher for males than it was for females, disregarding everything else.

Okay. But I give another command here, and it now shows the admission rates. I'm not showing the counts, because at that point it's getting a little bit [UNKNOWN]. So I'm showing admission rates by department: departments A, B, C, D, E, and F. And you can see that for Department A, males got admitted a smaller percentage of the time; for Department B, males got admitted a smaller percentage of the time; for Department C, males got admitted a slightly larger percentage of the time; lower for D, slightly larger for E, and lower for F. So clearly the evidence of gender imbalance in admissions depends on whether or not you're conditioning on department.

And there are different application rates, which you can explore yourself: there are different application rates, by gender, for each of the departments.
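For readers without R at hand, the marginal rates, the by-department rates, and the application rates can be sketched in Python. This is only an illustration: the counts below are an assumption in the sense that they were transcribed by hand from R's built-in UCBAdmissions table and should be checked against R, and the names `COUNTS`, `marginal_rate`, `department_rates`, and `application_shares` are made up for this sketch.

```python
# (admitted, rejected) counts by department and gender.
# Assumption: transcribed by hand from R's built-in UCBAdmissions table;
# verify against R before relying on them.
COUNTS = {
    "A": {"Male": (512, 313), "Female": (89, 19)},
    "B": {"Male": (353, 207), "Female": (17, 8)},
    "C": {"Male": (120, 205), "Female": (202, 391)},
    "D": {"Male": (138, 279), "Female": (131, 244)},
    "E": {"Male": (53, 138), "Female": (94, 299)},
    "F": {"Male": (22, 351), "Female": (24, 317)},
}
GENDERS = ("Male", "Female")


def rate(admitted, rejected):
    """Fraction of applicants admitted."""
    return admitted / (admitted + rejected)


def marginal_rate(gender):
    """Admission rate for one gender, ignoring department."""
    admitted = sum(COUNTS[dept][gender][0] for dept in COUNTS)
    rejected = sum(COUNTS[dept][gender][1] for dept in COUNTS)
    return rate(admitted, rejected)


def department_rates():
    """Admission rate for each gender within each department."""
    return {dept: {g: rate(*COUNTS[dept][g]) for g in GENDERS} for dept in COUNTS}


def application_shares(gender):
    """Share of one gender's applications that went to each department."""
    applicants = {dept: sum(COUNTS[dept][gender]) for dept in COUNTS}
    total = sum(applicants.values())
    return {dept: n / total for dept, n in applicants.items()}


if __name__ == "__main__":
    for g in GENDERS:
        print(f"{g}: overall admission rate {marginal_rate(g):.1%}")
    shares = {g: application_shares(g) for g in GENDERS}
    for dept, rates in department_rates().items():
        print(
            f"Dept {dept}: admitted M {rates['Male']:.1%}, F {rates['Female']:.1%}; "
            f"share of applications M {shares['Male'][dept]:.1%}, "
            f"F {shares['Female'][dept]:.1%}"
        )
```

With these counts, about half of the male applications go to departments A and B, which admit most of their applicants, while only a small share of the female applications do.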

And look here, in fact, you don't even have to explore it yourself; I apparently have it on the next slide: gender, male and female, by department. At any rate, you can explore this a little bit more, because the data is just in R. I don't think you need any other packages or data installed; it's just data(UCBAdmissions), and you can play around with it.

So let me talk a little bit about what in the world is going on here, because it always seems confusing. To me, there are a couple of things that help in understanding Simpson's Paradox. The first thing is the math. There's no problem with the math, right? Say that a over b is less than c over d, and e over f is less than g over h, but a plus e over b plus f is greater than c plus g over d plus h, where b has to be greater than a, f has to be greater than e, and so on. If you can find integers that satisfy these relationships, then you've found an example of Simpson's Paradox; you just have to put the context around it. But it doesn't seem at all paradoxical when you state it as a couple of relationships between integers. It's the context that adds the paradox.

From a statistical standpoint, it says that the apparent relationship between two variables can change in the light or absence of a third, which again doesn't sound that paradoxical. It's only paradoxical when we conflate the probabilistic statements, and the evidence for them vis-à-vis the data, with causal statements, right? So the problem is that we are trying to get at a causal truth by virtue of the probabilistic associations obtained from the data, and that's quite a hard thing.
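The point that Simpson's Paradox is just a set of relationships between integers can be written as a small check. This is a sketch under the assumption that all counts are positive, so the fractions can be compared by cross-multiplying; the function name `is_simpson_reversal` is made up for illustration.

```python
def is_simpson_reversal(a, b, c, d, e, f, g, h):
    """True if a/b < c/d and e/f < g/h, yet (a+e)/(b+f) > (c+g)/(d+h).

    All denominators are assumed positive, so each comparison is done by
    cross-multiplying integers rather than dividing floats.
    """
    first_group = a * d < c * b
    second_group = e * h < g * f
    pooled = (a + e) * (d + h) > (c + g) * (b + f)
    return first_group and second_group and pooled


if __name__ == "__main__":
    # Baseball numbers from the lecture: Player 2 (35/100, then 2/10) is
    # below Player 1 (4/10, then 25/100) in each half but above when pooled.
    print(is_simpson_reversal(35, 100, 4, 10, 2, 10, 25, 100))

    # With equal denominators everywhere, no reversal is possible.
    print(is_simpson_reversal(1, 10, 2, 10, 3, 10, 4, 10))
```

The first call prints True and the second prints False: the reversal needs the group sizes to differ between the two sides being compared.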

The question in all of these cases is: what's the right answer? What should you condition on or not condition on? That's a hard problem, and we're not really going to cover it in this class.

To me, the real answer is that it's quite hard to figure out exactly when you've conditioned enough. Right? In some cases no conditioning is exactly the right answer, and in some cases conditioning is exactly the right answer. To really handle this formally, you have to do something called causal inference, basically, which is a sub-discipline of statistics that is entirely designed towards addressing this question in a formal manner.

In the meantime, what can you do? The idea is to not decouple the statistics from the scientific discussion.

Have a discussion about hypotheses for the causal mechanisms behind the various associations. You know, it doesn't make sense to be conditioning on the race of the victim. In the Berkeley admissions data, you would want to talk about: are there very different acceptance rates by department? Are there different application rates by gender to each department? And, say, are women applying to departments that are harder to get into? That would explain the marginal association quite well. And is that really the driver? That's a discussion that is informed by the statistics but is, in a sense, extra-statistical. At this point, I think this kind of interplay between the data and a scientific discussion is the best solution I can offer you for dealing with confounding. When you take further statistics courses, you can learn some of the formal mechanisms for trying to account for confounding. But suffice it to say, it's one of the harder issues in statistics: knowing how to balance over-adjustment with under-adjustment in terms of confounding is one of the central problems in observational data analysis. And it's what makes observational data analysis so hard compared to, for example, studies where you randomize treatment or something like that.
