[MUSIC] If you read articles in the scientific literature, you'll often see people report p-values when they report statistical tests. P-values are widely used, and it's important to understand what they mean. They're also widely criticized, because people often misinterpret p-values. So in this lecture, the goal is to understand what they mean and how to correctly interpret them. When we talk about p-values, the first question we should ask ourselves is why they are so popular in scientific articles. Well, there's a reason for this, and Benjamini expresses it quite nicely here. He says, "In some sense it offers a first line of defense against being fooled by randomness, separating the signal from the noise". So, this is what p-values allow you to do. When you interpret your data, you might be very likely to interpret the data in favor of the hypothesis that you have, even when the effect is only slightly in the right direction. The risk is that you're fooling yourself. You might be too likely to declare that something is going on, when you're actually looking at random variation in the data. So, p-values are one way to prevent you from fooling yourself. P-values tell you how surprising the data is, assuming that there is no effect. And we'll look at all these aspects in more detail: what surprising means, why p-values are statements about the data, and why they're built on the assumption that there is no effect. Now, some people say that p-values are more accurately explained as what you use if you don't know Bayesian statistics yet. In Bayesian statistics, people don't use p-values. And I still remember, when I was doing my own PhD, being confused about whether I should use p-values or Bayesian statistics. My understanding was, more or less, that there was some problem with using p-values, and that Bayesian statistics might be preferable, but since most people didn't use Bayesian statistics, it was probably fine to just continue using p-values. 
Now, I think it's fine to use p-values, but you should interpret them correctly. So, that's the goal in this lecture: to prevent this confusion in you, and to make sure that you use p-values correctly, if you decide to use them. Let's start with a practical example. Let's say you want to do a study where you examine the influence of calling while driving. Does being on the phone when you are in your car increase the risk of getting into an accident? You might design a study where half of the participants drive around the city while they're on the phone, and the other half of the participants drive around, but they're not on the phone. You want to see if there's a difference, maybe in the number of pedestrians that they hit while they're driving through the streets. Or maybe your ethics committee doesn't allow you to do this, and you're better off using a driving simulator to study it. Now, once you have collected your data and counted how many pedestrians get hit, either by people who are on the phone while they're driving or by people who are not, you can look at the difference between these two conditions. Now, this difference is never exactly zero. There are always some digits after the decimal point, so there is always some difference. So, let's say the mean difference you observe is 0.11. Now, how should you interpret this mean difference? There are two options. Option A: what you are looking at is probably just random noise; there's always some random noise in your data. Option B: this is probably a real difference, something that you should take seriously, and at least examine further in future studies. So, which of these two is true? Well, we can use the p-value to differentiate between these two options. From the data that we have, we can calculate means and standard deviations, and we know the sample size that we have. We can use these parameters to calculate a test statistic, and compare this test statistic against a distribution. 
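As a rough sketch of that last step, here is how summary statistics can be turned into a test statistic and a p-value. The standard deviations and sample sizes below are made up for illustration - the lecture only mentions the mean difference of 0.11 - and I use a simple z-test on the difference between two independent means, rather than whatever test the actual study would use.

```python
import math
from statistics import NormalDist

# Hypothetical summary statistics (only the 0.11 comes from the lecture):
mean_diff = 0.11                    # observed mean difference
sd_phone, sd_control = 0.40, 0.35   # assumed standard deviations
n_phone, n_control = 50, 50         # assumed sample sizes

# Standard error of the difference between two independent means.
se = math.sqrt(sd_phone**2 / n_phone + sd_control**2 / n_control)

# Test statistic: how many standard errors the observed difference
# lies from 0, the value expected under the null hypothesis.
z = mean_diff / se

# Two-sided p-value: the probability of a test statistic at least this
# extreme, assuming the null hypothesis (no difference) is true.
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up numbers the test statistic stays between the critical values, so the data would not be surprising under the null hypothesis.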
You can use many different types of distributions. If you examine precognition, you might want to use a paranormal distribution. But most often, people just use the normal distribution. So, this bell-shaped graph is something you might have seen before. And there's something you should note here: this distribution is centered on zero. And when we say that the p-value tells you how surprising the data is, assuming the null is true - the null hypothesis is true - this is what we mean. We look at a distribution centered on zero. Now, you can see that most of the data in this case - let's look at 95% of the data - will fall between two critical values. And you might have seen these critical values before: 1.96 and -1.96. These are the critical values if you use an alpha level of 0.05. If data falls between these two values, it's not surprising. Assuming that the null hypothesis is true, most of the data will fall between these two critical values. But sometimes, we might see a data point that's more extreme than this. And this is a surprising finding. Data is surprising whenever the mean difference, or the test statistic that is computed from this mean difference, is in one of the two tails of this distribution. So, whenever we find data that falls in these tails, it's surprising, and we might want to examine it further. It also means that the p-value is smaller than 0.05. The formal definition of a p-value is the probability of getting the observed, or more extreme, data, assuming the null hypothesis is true. Now, I highlighted the word data here. I think it's important to realize that we're talking about the probability of observing data. A p-value is the probability that you'll observe some data; it is not the probability of a theory. This is a very common misunderstanding. People often want to make a statement about the probability that the theory is true. 
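A small sketch of where those two numbers come from, using only Python's standard library and assuming nothing beyond a two-sided test on a standard normal distribution:

```python
from statistics import NormalDist

alpha = 0.05
# Critical values for a two-sided test at alpha = 0.05: the points
# that cut off 2.5% in each tail of the standard normal distribution.
upper = NormalDist().inv_cdf(1 - alpha / 2)
lower = NormalDist().inv_cdf(alpha / 2)
print(f"critical values: {lower:.2f} and {upper:.2f}")  # -1.96 and 1.96

def p_value(z):
    """Two-sided p-value for a z statistic, assuming the null is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A test statistic inside (-1.96, 1.96) is not surprising under the null;
# one in the tails corresponds to p < 0.05.
print(round(p_value(1.96), 3))  # exactly at the threshold
print(round(p_value(2.50), 3))  # in the tail: surprising under the null
```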
But when you calculate a p-value, all you can do is make a statement about the probability of the data. Now, if you make this mistake, you're in good company. Let's take a look at this example from quantum physics, where a physicist talks about the probability of observing a certain spin between quantum particles. So, this is a study where they measured the spin of two particles: one floating around somewhere in Delft, and another floating around somewhere in Amsterdam, in the Netherlands. These two particles spin together; they have some sort of relationship. And this relationship, based on the data, was statistically significant, with a p-value of 0.04. Now, a physicist is interviewed about this finding, and this physicist concludes, "In other words, there is a 96% probability they won the race". So, this person is making a mistake here, because with "won the race", this person means there's a 96% probability that the theory is correct. But this is a statement about a theory. It's not a statement about the data that you have observed. So it's comforting, maybe, that even a quantum physicist - someone you would expect to be really smart - also makes this misinterpretation of what a p-value means. After you have observed a p-value that's smaller than 0.05, for example, an effect is not 95% likely to be true. Think about precognition research. Let's say that I present one study to you where you find a statistically significant effect of precognition. After this, do you really think it's now 95% probable that precognition exists? Probably not. You cannot get the probability that the null hypothesis is true, given the data, from a p-value. If you look at the two statements below on the screen, you see that the probability of the data, or more extreme data, assuming the null hypothesis is true, is not the same as the probability of a hypothesis given some data that you have observed. These two probabilities can differ widely. 
If you want to know the probability that a theory is true, you need to use Bayesian statistics. Bayesian statistics is the only approach that will allow you to make statements about the probability that a theory is true. What happens if you do a study and your p-value is larger than 0.05? Well, first of course, you've spent a lot of time and effort collecting this data, and maybe you hoped to find a statistically significant effect. So the first thing you do is cry a little; you're a little bit depressed. That's okay. But after this, how should you interpret the data? Well, all that we know when the p-value is larger than 0.05 is that the data we have observed is not surprising. That's all. It doesn't mean that there is no true effect. There might very well be an effect, but you just didn't have enough participants in your study to detect it. Remember that you need large samples to statistically detect a small effect. So a p-value larger than 0.05 doesn't allow you to conclude that there is no effect. There might be a very small effect; you don't know. Personally, I try to think of a p-value larger than 0.05 as mu, which is a concept from Zen Buddhism. In Zen Buddhism, there is a famous saying that goes like this. A monk asked a Chinese Zen master, "Does a dog have a Buddha-nature or not?" You might expect a yes or no answer here, because that's also how the question is phrased - yes or no. But instead the Zen master answered, "mu", which basically means, "I'm un-asking the question". It's negating the question that's asked. Whenever you find a p-value that's larger than 0.05, you might feel the tendency to ask, "So, is there an effect or not?" But whenever the p-value is larger than 0.05, you can't answer this question. So you should just answer, "mu". So how do you use p-values correctly? The first thing to understand is that p-values can be used as a rule to guide behavior in the long run. 
You can calculate them for every single study, but they only work in the long run. Let's take a look at how. If you use the decision rule "whenever the p-value is smaller than the alpha level" - the alpha level is your Type I error rate, often set to 0.05 - you can act as if the data is not noise. Now this word, "act", is very important. It's independent of what you believe is true, but all that you know is: if you use this decision rule, in the long run you won't say that there is something, when there is nothing, more than 5% of the time. Alternatively, when the p-value is larger than the alpha level, you can remain uncertain, or act as if the data is just noise. So these are rules that you follow in the long run. When you act as if there is an effect whenever the p-value is smaller than 0.05, in the long run you won't be wrong more than 5% of the time. Now this is an interpretation of p-values as proposed by Neyman, and it's often used. Let's take the discovery of the Higgs boson as an example. If you remember, during the press conference about the Higgs boson, researchers were talking about whether the 5-sigma threshold was passed. And 5 sigma is used as a threshold to declare something a discovery in physics. Now 5 sigma is basically a p-value smaller than 0.0000003. So based on this idea, we can act as if the Higgs boson exists. Every now and then, of course, we'll be wrong. With such a strict threshold for an error, we'll only be wrong in one of many millions of parallel universes. So there's one parallel universe where people spent the time and effort to build a Large Hadron Collider to detect the Higgs boson, and they declared it was statistically significant, so it was there. But they were actually wrong. With such a strict threshold, this of course rarely happens, and we can be pretty confident that there is a Higgs boson and we didn't make a mistake. 
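The long-run behavior of this decision rule is easy to see in a simulation. This is my own sketch, not code from the lecture: it runs many studies in which the null hypothesis is true (one-sample z-tests on standard-normal data) and counts how often the rule "act as if there is an effect when p < alpha" fires anyway.

```python
import random
from statistics import NormalDist

random.seed(1)
alpha = 0.05
norm = NormalDist()

def one_null_study(n=30):
    """Simulate one study where the null is true: test the mean of n
    standard-normal observations against mu = 0."""
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / n**0.5)   # z = sample mean / standard error
    return 2 * (1 - norm.cdf(abs(z)))      # two-sided p-value

# Apply the decision rule over many studies: "act as if there is an
# effect" whenever p < alpha, even though nothing is going on.
studies = 20_000
false_alarms = sum(one_null_study() < alpha for _ in range(studies))
print(f"long-run Type I error rate: {false_alarms / studies:.3f}")
```

The observed error rate hovers around 0.05: in any single study the rule tells you nothing certain, but followed in the long run, it keeps false alarms below the alpha level.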
When you interpret p-values and you want to write something about what you've found, you should not write, "We found a p-value smaller than 0.05, so our theory..." Because if you do this, you're making a statement about a theory based on a p-value, and you shouldn't do this. The correct way to discuss a p-value smaller than 0.05 is to say, "We found a p-value smaller than 0.05, so our data..." You make a statement about the data, because that's what the p-value relates to. You might say something like, "So our data is in line with" - some idea that you want to test. Whenever you've found a non-significant result - a p-value larger than 0.05 - you enter what's known as a degenerative research line. You made a prediction, but it doesn't hold up, so you have something to explain. Now, one explanation might just be random variation. P-values vary, and even if you have examined a true effect, every now and then you'll observe a non-significant result. So you might just say, "Everything's fine, this happens. If I do another study that's exactly the same, you'll see that it will pan out, and my prediction will hold". Other times, you might need to say, "Well, the effect that I predicted might be smaller than I expected". So you do another, larger study, and then you show that the effect is really there. Nevertheless, whenever you find a non-significant result, there is something to think about. You have to explain it in some way. One way might be to say that, if you do a lot of studies, every now and then you will find a non-significant result - but then you need a lot of studies to support this. Other times you might say, "I have to do the study in a slightly different way", and you can use this change in the paradigm to develop a progressive research line. Remember that p-values vary, so always think meta-analytically about p-values. This is also recommended by the statisticians who talked about p-values from the very beginning. For example, this is a quote by Neyman and Pearson. 
Statistical tests should be used with "discretion and understanding, and not as instruments which themselves give the final verdict". So if you calculate a statistical test, that's only one thing that should go into your reasoning when you decide whether this is a true effect or not. Always think more about your study. The p-value might be a starting point, but you also want to look at effect sizes and other studies that have been done. Fisher similarly says that a single p-value is not enough to declare a discovery. He says, "A phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result". So we have to repeat the experiment multiple times. He also says, "No isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon". So he's saying that we should see a single significant p-value, maybe, as an invitation to explore the effect further, but it can never be enough to declare something a scientific fact. So we always need to do several studies, and p-values can guide us, in the long run, in which studies we might want to do. So, at the end of this lecture, let's take a look at the p-values that you can expect when there is a true effect, and the p-values that you can expect when there is no true effect. Now, I never really realized how p-values are distributed across studies when you do a lot of them, and I think it's very important to understand this for the correct interpretation of a p-value. So take a moment to think about this. What kind of p-values would you expect when there is a true effect? What kind of p-values would you expect when there is no effect? Let's take a look at what really happens. When there is a true effect, the p-value distribution depends on the statistical power. Let's take a look at a visualization of this. 
In this graph, you see the p-values for 100,000 simulated studies, where every study had 50% statistical power. This means that it's 50% probable that we'll observe a p-value smaller than 0.05. If we look at the p-value distribution, we indeed see that it's much more likely to observe small p-values than it is to observe high p-values. And if we look at the leftmost bar, we see that indeed 50,000 of the 100,000 simulated studies yield a p-value that falls between 0 and 0.05. Now, we might want to increase the statistical power a little bit, to 80%. You see that with higher power we have basically pushed more of the p-values below the significance threshold of 0.05: here, 80,000 of the 100,000 simulated studies yield a significant effect. If we increase the statistical power even more, to 95%, we see that most of the p-values that we'll observe, given that there is a true effect, fall below the significance level. So which p-values can you expect when there is no effect? I really never knew this myself. I thought that p-values might be distributed in a way that, if there's no effect, we'll see a lot of very high p-values. Or I thought that maybe they're distributed as sort of a normal distribution. But instead, it turns out that when there is no effect, p-values are uniformly distributed: every p-value is equally likely. And this also makes a lot of sense once you understand it. In this case, we have simulated 100,000 studies where there is no true effect. And you see that, no matter where you look in the distribution, low p-values or high p-values, they're all equally likely. Now this makes sense because a uniform distribution means that 5% of the p-values that we observe when there is no effect fall below the 0.05 threshold. So when there's no effect, we have a 5% probability of making a Type I error - of saying there is a significant effect when there's actually nothing going on. 
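You can reproduce both p-value distributions yourself with a small simulation. This sketch is my own, not the lecture's code: it uses one-sample z-tests, 10,000 studies instead of 100,000 to keep it fast, and an assumed true effect of half a standard deviation (which, with n = 20 per study, gives roughly 60% power rather than the exact power levels shown in the slides).

```python
import random
from statistics import NormalDist

random.seed(42)
norm = NormalDist()

def simulate_p_values(true_mean, n=20, studies=10_000):
    """Two-sided one-sample z-test p-values for `studies` simulated
    experiments. true_mean = 0 simulates the null; true_mean > 0 a
    real effect."""
    ps = []
    for _ in range(studies):
        m = sum(random.gauss(true_mean, 1) for _ in range(n)) / n
        z = m / (1 / n**0.5)                 # sample mean / standard error
        ps.append(2 * (1 - norm.cdf(abs(z))))
    return ps

null_ps = simulate_p_values(true_mean=0.0)
effect_ps = simulate_p_values(true_mean=0.5)  # assumed effect of d = 0.5

# Under the null, p-values are uniform: ~5% land below 0.05 ...
print(sum(p < 0.05 for p in null_ps) / len(null_ps))
# ... and ~10% below 0.10, whatever alpha level you pick.
print(sum(p < 0.10 for p in null_ps) / len(null_ps))
# With a true effect, small p-values pile up; this fraction is the power.
print(sum(p < 0.05 for p in effect_ps) / len(effect_ps))
```

The first two fractions come out near 0.05 and 0.10, illustrating the uniform distribution under the null; the third comes out near the study's power.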
So here you can see this small Type I error rate highlighted in red. This is what it means to make a Type I error. The reason that it's 5% is that the p-value distribution is uniform: if you were to increase your alpha level to 0.10, then 10% of the p-values would fall below 0.10. To conclude, it's important to understand how to correctly interpret p-values. Their use is often criticized because people incorrectly interpret what p-values mean, and I hope that after this lecture, you won't be one of them. [MUSIC]