Hi. I'm Tim NeCamp. I'm a PhD student in the Department of Statistics at the University of Michigan. I became interested in statistics because I think it harnesses the power to make better healthcare decisions, to potentially personalize education, and to help nonprofits and community organizations make better decisions. In this video, I'm going to talk about looking at associations with multivariate quantitative data. So first, where does multivariate quantitative data come from? Suppose we have a person, and for this person, we measure their age, but we also measure their BMI, their blood pressure, their cholesterol level. Data arising from this is multivariate quantitative data. It's multivariate because we measure more than one thing on each person or each subject, and it's quantitative because the things we measure take on numeric values. So, suppose we have this data. Suppose we collected data on maybe 15 people, or in the case of NHANES, where this data comes from, we've collected data on thousands of people. Looking at the raw numbers alone can't really help us understand the patterns that exist within our sample or the characteristics of the people within our sample. So, maybe we want another way to visualize it. One way to visualize this is with univariate histograms. With these histograms, we can get an understanding of maybe the median and the spread of age, and we can get an understanding of the median and the spread of the systolic blood pressure. But what if we are interested in the association between these two variables? For example, do older people tend to have higher or lower blood pressure? Right now, we don't have a way of connecting a person in the age histogram to a person in the systolic blood pressure histogram. A way to solve this is by making a scatter plot. A scatter plot graphs two quantitative variables together. Here, we have age on the x-axis, and the systolic blood pressure on the y-axis. Each point represents a single person. 
So, for example, this person here, who's maybe around the age of 26, has a systolic blood pressure of around 180. The scatter plot is advantageous because it tells us about the association between these two variables. How are the two variables related? In this scatter plot, we see that older people tend to have higher blood pressure. There are a few specific aspects of association that we are interested in. One aspect is the type of association. Here, you see three different types of associations. A linear association, where the pattern between the points forms a line. A quadratic association, where the pattern between the points is parabolic, or it goes up at the beginning but then goes back down later. And in the last case, you see no association. So, as the age is increasing, you don't really see the HDL cholesterol either increasing or decreasing, so that's no association. We're also interested in the direction of the association, which corresponds to the slope of the line in our association. For a positive slope, we call this a positive linear association, which means that as x increases, or in this case as age increases, the systolic blood pressure also increases. We also have a negative linear association, which means as x increases, y decreases. Or in this case, as a vehicle gets heavier, the miles per gallon actually decreases. In addition to type and direction, we're also interested in the strength of the association. So, here, in the first example, we see we have a weak linear association because the points are widely scattered around that line. In the second example, the scatter is much smaller, giving us a moderate linear association. Then, in the last example, looking at test score versus hours watching TV, we see that the scatter is very minimal, which is the strongest association we see here. Notice that the strength of the association does not depend on the sign. 
You can have strong positive and strong negative associations. In addition to describing the association qualitatively, there's a way to quantify both the strength and sign of a linear association. This is called the Pearson correlation. The sign of the Pearson correlation is exactly the sign of the association. The closer the Pearson correlation is to one or negative one, the stronger the association. In this first example, you'll see that we have a weak positive linear association, which means our correlation is positive but not that close to one. As the association gets stronger in the second example, our Pearson correlation increases, but it's still not that close to one. In the third example, because our association is negative, our Pearson correlation is now also negative, and since it is extremely strong, it's very close to negative one, with a value of negative 0.99. Then, in the last example, when we have basically no association, the Pearson correlation is very close to zero, negative 0.01. So, one caveat with all these associations that we've seen is that correlation does not imply causation. So, what do I mean by this? Though in our example we see that as age increases, the systolic blood pressure also increases, this does not necessarily mean that age is the reason or cause for why systolic blood pressure is increasing. Here's a hypothetical example for why not. Maybe all the people in the red box happened to have smoked. Then, maybe smoking tobacco is what is actually causing the high blood pressure, while age is actually not causing the high blood pressure. It just so happened that in our data, older people in our sample were smoking more tobacco, making it appear like age was the cause. Just because two things are associated, it doesn't mean that one causes the other. In addition to wondering about association, we might ask the question, is anyone in our dataset unusual? Such unusual points are called outliers. 
For multivariate quantitative data, a potential outlier is a point that strongly deviates from the pattern in the rest of the data. For example, here is the pattern in our data, and you see that some of these points are pretty far away from that pattern compared to the rest. These are potential outliers. One last thing I wanted to show is that in addition to having multivariate quantitative data, you might also have a categorical variable that you're interested in. For example, here, we recorded systolic blood pressure and age, but we also recorded gender. If I want to represent all three of these variables together, I can color the points based on the categorical variable. A couple of cool things we can note from this graph are that males tend to have higher blood pressure, especially at a younger age, and the increase in blood pressure as people get older is more prominent in females. The pink dots are much flatter than the black dots. So, what have we learned in this video? For multivariate quantitative data, we've learned how to use scatter plots for visualization, how to describe association with type, direction, and strength, how correlation is a way to numerically describe association, and how to identify potential outliers.
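As a rough illustration of how the Pearson correlation mentioned in this video could be computed, here is a minimal Python sketch. It uses the standard textbook formula (the sum of the products of deviations from the means, divided by the square root of the product of the summed squared deviations). The function name `pearson_r` and the small age and blood pressure lists are made up for this example; they are not from the video and are not NHANES data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations from the means (drives the sign of r)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable (scale r into [-1, 1])
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical, made-up values for illustration only (NOT NHANES data)
age = [25, 32, 41, 50, 58, 67]
systolic_bp = [112, 118, 121, 130, 128, 141]

r = pearson_r(age, systolic_bp)
print(f"r = {r:.2f}")
```

In this made-up sample, blood pressure tends to rise with age, so the result should be positive. Keep in mind that this number only summarizes linear association; a strong quadratic pattern can still give a correlation near zero, which is one reason to look at the scatter plot as well.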