Hi, welcome to our crash course in causality. I'm Dr. Jason Roy, associate professor of biostatistics at the University of Pennsylvania. This video is on confusion over causality. We're going to focus on why causality is sometimes controversial or confusing, and use that to motivate why causal inference is an important field of study and to motivate the rest of the course. So, our first topic is spurious correlation, and we're going to use this as one example of why causal effects can be confusing and why it can be unclear whether one thing really does cause another. What we mean by spurious correlation is that unrelated variables might just coincidentally be highly correlated over some period of time. There are a lot of examples of this, but one is illustrated in this figure, where we have divorce rates in Maine from 2000 to 2009, which is the red curve, and per capita consumption of margarine, which is the black curve. And you see that they follow each other very closely. So if you didn't know what these variables were, you might assume that they are related to each other or affecting each other. But given that these are divorce rates in Maine and consumption of margarine, we suspect that they are actually causally unrelated. In this case it's fairly obvious that this is spurious correlation; we don't really think that divorce is affecting consumption of margarine or the other way around. But you can imagine two variables where it's not quite as obvious, and then we would be left wondering: is it a spurious correlation, or is it actually causation?
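To see how easily two unrelated series can look "related," here is a small sketch in Python. The numbers below are made up for illustration (they are not the actual Maine divorce or margarine figures); they just mimic two series that both happen to trend downward over ten years, which is enough to produce a very high Pearson correlation.

```python
# Illustrative only: two made-up, causally unrelated series that both
# happen to drift downward, in the spirit of the divorce/margarine example.

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

divorce_rate = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]   # hypothetical
margarine_lbs = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]  # hypothetical

print(round(pearson_r(divorce_rate, margarine_lbs), 2))  # very close to 1
```

The correlation is near 1 even though nothing connects the two series; a shared time trend is all it takes.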
So, identifying these kinds of spurious correlation examples that people see in practice sometimes makes causality seem like something we can't know. We see curves like these and say, "Well, if these look correlated or causal and they're not, then what about the cases where you're claiming there's causation? How do we know those aren't spurious?" The field of causal inference is going to help us answer that. Another reason that causal inference can be confusing is personal anecdotes. A lot of people have strong beliefs about causal effects in their own lives, and they like to share these stories with us. They sound confident; they really believe they know about actual causes in their own lives. But it's not clear whether they're actually right. Somebody might tell you, "The last time I had this injury, I put this ointment on and it worked for me." But how do we know they're actually right? That's one question we have to answer. I'll give you another example of a very common situation: when somebody lives to an old age, we ask them, "What was your secret?" Here, Bill Smith lived to be 105 years old. One unique thing he does that most people don't is eat a turnip every day. Because he's one of the only people he knows who does that, and he's lived to an old age, he believes that it really helped with his longevity. But what do we actually know? We know two things: he has lived to be 105 years old, and he ate one turnip a day. We know those two pieces of information, but we don't actually know whether what he ate contributed to his lifespan.
And you'll notice I say contributed, because it also could be that his diet shortened his life. Even though he lived to be 105 years old, maybe he would have lived longer had he done other things. We don't know if it's causal at all, so we don't know if it increased his lifespan, decreased it, or had no impact. In particular, we don't know what would happen if other people adopted this habit. And that is really the implication of these kinds of anecdotes: not just "it worked for me," but "I'm recommending that you do it because it would work for you." We don't actually know if that would be the case. Another area that leads to a lot of skepticism and debate about causality is science reporting and headlines. A lot of times, science reporters are careful not to use any form of the word 'cause', but many of the words they do use are loaded and get interpreted causally. This is an actual headline I found in the last few years: 'Diet high in red meat linked to inflammatory bowel condition'. People who read that might think, "Okay, I shouldn't eat red meat." And you'll notice the loaded word here is 'linked'. It's not clear what 'linked' means. Are they saying it's a causal relationship, where if you eat less red meat you'll be less likely to have an inflammatory bowel condition? Or by 'linked' do they just mean something like the spurious correlation I showed a few slides ago, where the two happen to be related but aren't causally related? It's unclear from the headline what it really means; you would have to dive into the study and read how it was conducted before deciding whether this is a causal relationship. Another example is a headline that said 'Positive link between video games and academic performance, study suggests'.
So again, the word 'link' appears. This is very common with headlines: they use the word 'link' and it's not clear what it means. They do temper it a little by saying 'study suggests', but it's still not clear whether this is causal. People who play video games might happen to be students who would have stronger academic performance regardless of whether they played video games or not. Or it could be that something about playing video games helps with academic performance; maybe it increases your concentration, for example. From the headline, it's unclear. Here's another example: 'Prostate cancer risk soars by a quarter if men drink just one or two beers'. This certainly sounds causal, because when they say the risk 'soars', they're basically implying that the drinking affected the risk; that's what the word 'soars' suggests. But certainly they weren't randomizing men to drink more beer, so this must have been an observational study: men who consumed more alcohol seemed to have higher risk of prostate cancer. But if people changed their drinking behavior, would that affect prostate cancer risk? It's unclear from the headline whether that's the case, although it's implied. So again, we would need to look carefully at the methods involved to see whether we can draw causal conclusions. And finally, we have the clever headline 'Health racket: tennis reduces risk of death at any age, studies suggest'. It's the same situation, where 'reduces' is the operative word. Again, this does sound causal: 'if you play more tennis, you're less likely to die at a young age', for example. But whether it is causal is, to me, unclear.
One thing I should note about all of these headlines: because causality is unclear in general, for any headline like this you can easily say, "Well, this is just correlation; correlation does not equal causation." But you could also share the headline if you were excited about the results. So how skeptical people are about headlines like this often depends on their point of view. If you're a tennis player and you see this headline, you might share it on social media; you're excited, it feels good. Whereas if you don't have as active a lifestyle, maybe you don't believe tennis is good for you, and when you see a headline like this you might be very skeptical and say it's just correlational. So a lot of times, how skeptically people view headlines depends on their prior beliefs. We want to move away from that to a large degree; we really want to look at the evidence as it is. How was the study designed? What statistical methods did they use? What assumptions did they make? And then make a judgment about causality based strictly on the evidence and not on our prior beliefs. So, we've covered spurious correlation, anecdotes, and headlines. Now we can get into what I'll call reverse causality. Here we're thinking about situations where the relationship between two variables could be causal in either direction; if you think about a causal arrow, it could point either way. I'll give one example of this. Let's say we're interested in the relationship between urban green space and exercise. One hypothesis is that if you have more green space in a city, that would be related to exercise. But how? So I have a question here: are physically active people more likely to prioritize living near green space?
Okay, so if there's a nice park in a city, perhaps people who already exercise a lot would want to move there. Right? These people here say, "Because we like to exercise, we plan to move near this park." So it could be that a city with a lot of green space will attract people who exercise a lot. Alternatively, and this is probably the question that researchers interested in motivating people to exercise would care more about, they might ask a further question: does green space in urban environments cause people to exercise more? Perhaps for people already living in a city, how much they exercise would increase if you were to build more green space, if you had more parks. So these people here are saying, "If there was a park like this near where we live, we would exercise more." Right? So there's this relationship between exercise and green space: in a cross-sectional, observational kind of study, we might see that where there's more green space, there's more exercise. But here the causal arrow could go in either direction. The green space might cause people to exercise more, or it might attract people who exercise to live there, or some combination of the two. To sort that out, we would need to carefully examine the temporal relationships between these variables and make sure our data carry some kind of signal about whether a change in green space caused a change in exercise among people who already live there. These studies would have to be very carefully designed; otherwise, we're stuck wondering which direction the causal arrow goes. So, we've covered a lot of examples of why causality is confusing, and this is meant to motivate why we need a formal field of causal inference, and why, for example, you're probably interested in taking this course.
So the field of causal inference, or causal modeling, attempts to clear up a lot of this confusion. One way is to come up with formal definitions of causal effects: what do we even mean by a causal effect? Then, once we know what a causal effect means, and in particular which causal effect we're interested in for a particular research question or study, what assumptions do we need to identify causal effects from data? We're either designing a study ahead of time, where we know what kind of data we would like to collect, or we already have data and we want to know what assumptions would be necessary to use those data to identify causal effects. Researchers in causal inference have worked hard at identifying causal assumptions: what is it that we would need to assume to get at these causal effects? Related to that, in observational studies we're going to have to control for what are known as confounding variables: variables that affect our exposure or treatment of interest and also the outcome. And there are rules about which variables we actually need to control for. We're going to spend a fair amount of time in future videos talking about which variables you would need to control for to be able to estimate causal effects. Another important contribution of the field of causal inference is sensitivity analysis, which looks at how sensitive your results are to possible violations of your causal assumptions. We have to make causal assumptions to identify causal effects, but what if those assumptions are violated? Sensitivity analysis tries to quantify that. What if our assumptions are violated by a small amount: would that change our conclusions? What if they're violated by a large amount: how much would that change our conclusions?
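The idea of a confounder can be made concrete with a toy simulation (all names and numbers here are invented for illustration, not from any study in the course). A common cause C raises the chance of both the treatment A and the outcome Y, while A has no effect on Y at all. Comparing treated to untreated naively shows a large difference, but comparing within levels of C, that is, controlling for the confounder, makes the difference vanish.

```python
import random

random.seed(0)

# Toy confounding simulation. C is a confounder: it raises the probability
# of treatment A and of outcome Y. A has NO effect on Y, yet A and Y look
# associated until we stratify on C.
n = 100_000
data = []
for _ in range(n):
    c = random.random() < 0.5                    # confounder
    a = random.random() < (0.8 if c else 0.2)    # treatment depends on C
    y = random.random() < (0.7 if c else 0.3)    # outcome depends on C only
    data.append((c, a, y))

def mean_y(rows):
    return sum(y for _, _, y in rows) / len(rows)

# Crude (confounded) comparison: looks like A strongly affects Y.
treated = [r for r in data if r[1]]
control = [r for r in data if not r[1]]
print("crude difference:", round(mean_y(treated) - mean_y(control), 2))

# Stratify on C: within each level of C the difference is near zero.
for level in (True, False):
    t = [r for r in data if r[1] and r[0] == level]
    u = [r for r in data if not r[1] and r[0] == level]
    print(f"difference given C={level}:", round(mean_y(t) - mean_y(u), 2))
```

The crude difference is large (around 0.24 under these made-up probabilities) while both stratum-specific differences are essentially zero, which is exactly the pattern "control for the confounder" is meant to reveal.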
So sensitivity analysis is an important part of causal inference, an important contribution of the field, and something we will discuss later on. I just want to give a very brief history of the field of causal inference, and in doing so I am going to leave a lot of important contributions out. My apologies to the people I've left out, but this is a very broad overview. Statisticians have been thinking about causal modeling for a long time. Some of the key original contributions date back to the 1920s with Wright and Neyman, but it's really become its own area of formal statistical research since about the 1970s, and it's been growing ever since. There has been a lot of research in the field since the 1970s, and I'll provide a few highlights here. One of the most important developments has been the reintroduction of potential outcomes. One example of this is the Rubin causal model, from Rubin's classic paper in 1974. We say the reintroduction of potential outcomes because the idea was actually discussed back in the 1920s, in the papers I referred to above, but it was largely forgotten for a few decades. Since the 1970s, it has been a regular part of the vocabulary and the thinking about causal effects. Another important contribution has been causal diagrams, with key contributions by Robins, Greenland, and Pearl. We'll spend a fair amount of time in future videos talking about causal diagrams, but they're generally thought to be very helpful in getting us to write down what we believe is happening in the world around us, in particular with regard to the variables we're interested in studying. And once you write down these causal diagrams, theory has been developed about which variables you would then need to control for to identify causal effects.
We'll see that these causal diagrams are related to potential outcomes, and the link between the two will help us determine which statistical methods to use to actually estimate causal effects. When it comes to estimating causal effects, one major contribution has been propensity scores, first introduced by Rosenbaum and Rubin. Another major area of research in the last two to three decades has been what's known as time-dependent confounding. You can roughly think of this as situations where treatments or exposures vary over time: exposure at one time affects a lot of different variables at future times, and those variables affect treatment at future times, so you get these feedback loops. How, then, do you estimate the joint causal effect of treatment over time on outcomes when all of these variables are affecting each other? This was a very difficult problem, but a lot of progress has been made, and some of the key contributions were by Jamie Robins in what are known as his g-methods. From the 1980s onward, there has been a lot of work in that area. Another exciting area has to do with optimal dynamic treatment strategies; this is another very difficult research area where a lot of progress has been made in the 2000s. What we mean by dynamic treatment strategies is not just 'is treatment A better than treatment B?', but: for this patient with these particular characteristics, which treatment should I give them? What's the optimal treatment for somebody as a function of their own variables, some of which might change over time, and how do you identify an optimal strategy as a function of some important clinical variables? There has been a lot of progress in that area.
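As a rough sketch of the propensity-score idea (a toy example, not the specific procedure from the Rosenbaum and Rubin papers): estimate each person's probability of treatment given their covariates, then weight observations by the inverse of that probability so the treated and untreated groups become comparable. In the simulation below, with invented probabilities and a single binary confounder C, the treatment truly has no effect, and inverse-probability weighting correctly recovers an effect near zero.

```python
import random

random.seed(1)

# Toy inverse-probability-weighting sketch (illustrative numbers only).
# C affects both treatment A and outcome Y; A has no true effect on Y.
n = 100_000
data = []
for _ in range(n):
    c = random.random() < 0.5
    a = random.random() < (0.8 if c else 0.2)
    y = random.random() < (0.7 if c else 0.3)
    data.append((c, a, y))

# Step 1: estimate the propensity score e(C) = P(A=1 | C) from the data.
def propensity(level):
    rows = [r for r in data if r[0] == level]
    return sum(r[1] for r in rows) / len(rows)

e = {True: propensity(True), False: propensity(False)}

# Step 2: inverse-probability-weighted mean outcomes under treatment/control.
w_treated = sum(y / e[c] for c, a, y in data if a) / n
w_control = sum(y / (1 - e[c]) for c, a, y in data if not a) / n
print("IPW effect estimate:", round(w_treated - w_control, 2))  # near 0
```

A naive comparison of treated versus untreated here would show a sizable difference driven entirely by C; the weighting removes it, which is the core intuition behind propensity-score methods.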
And then finally, in the machine learning area, there's a general method called targeted learning, which has become popular recently. It's more of a machine learning type of approach, based on semiparametric theory and high-dimensional data, and I think it has had a lot of promising results. It was mostly developed by Mark van der Laan and his collaborators. So, going forward, we want to think about where we go from here. This video was meant to provide some background and motivation for the rest of the course. What we're going to do from here on out is focus on causal inference for observational studies and what are known as natural experiments. To a large degree we're going to ignore randomized trials, although randomized trials will be part of our causal reasoning. In randomized trials, you're actually randomizing the treatment or exposure; in observational studies, the treatment or exposure is just as it is in the real world, with no direct manipulation. Natural experiments we'll get into later, but they are situations in the real world where something seems like it's randomized even though we didn't formally randomize. As we dive deeper into causal modeling, it's important to remember a couple of things. One is that we will have to make some untestable assumptions, referred to as causal assumptions. By untestable, we mean something we can't really check from data to see if it's true. We're going to have to have some assumptions that are somewhat based on faith. But a lot of times, based on faith doesn't mean blind faith: we will be able to tell, to a large degree, whether the assumptions sound plausible or not. Still, because we can't test them, they are assumptions we'll have to trust are true, which is also why we carry out sensitivity analyses.
There was a classic paper by Cochran in 1972 on observational studies, and he reminded us that observational studies require a good deal of humility, because we can only claim to be groping toward the truth. So even though we're studying the field of causal inference and we believe we'll do a better job of getting at causality, we're not going to know for sure whether we're there. We need to have humility and always keep that perspective in mind.