To begin with, let me ask you something. Look at these two boys. What do you think? Which one is heavier? The left one or the right one? Yes, probably the left one. He's taller. Another question: which one do you think is older? Probably the left one again. There is no guarantee, of course, but on average children are taller when they are older. So in your head, you performed linear regression: you explained a continuous characteristic by another one. This is what researchers often do to understand the world around us: to simplify nature and quantify relationships. Linear regression might be the first and most fundamental method for this aim.

This video will explain to you how it works. After this video, you will know the concept of regression. You will also know how inference works in linear regression. Finally, you will know and understand the underlying assumptions of linear regression.

Imagine that our population of interest is defined as all children from the Netherlands between zero and five years old. In this population, we would like to predict weight in terms of height. Due to logistical constraints, we cannot measure all children from our population. As a solution, we draw a sample of children and measure their height and weight.

With simple linear regression, we try to predict or explain one variable. This variable is called the outcome or dependent variable; in our example, weight. This outcome is then explained by another variable, the independent or explanatory variable; in our example, height.

We can get a first view of the relation between the two variables by making a scatterplot. The method of linear regression captures this graphical relation: it provides a straight line to describe the association between the outcome and the explanatory variable. We call it the regression line. For choosing this line, the least squares method is often used.
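The least squares idea can be sketched in a few lines of code. The (height, weight) pairs below are made up for illustration, not the lecture's actual sample; the point is that the least-squares line has a smaller sum of squared residuals than any other candidate line.

```python
# Illustrative (hypothetical) sample: heights in cm, weights in kg.
heights = [70, 80, 90, 100, 110]
weights = [8.5, 10.0, 12.0, 14.5, 16.0]

def sse(b0, b1):
    """Sum of squared residuals for the line weight = b0 + b1 * height."""
    return sum((w - (b0 + b1 * h)) ** 2 for h, w in zip(heights, weights))

# Least-squares estimates for simple regression:
#   b1 = cov(x, y) / var(x),  b0 = mean(y) - b1 * mean(x)
n = len(heights)
mx = sum(heights) / n
my = sum(weights) / n
b1 = (sum((h - mx) * (w - my) for h, w in zip(heights, weights))
      / sum((h - mx) ** 2 for h in heights))
b0 = my - b1 * mx

# Any other line yields a larger sum of squared residuals:
assert sse(b0, b1) < sse(b0 + 1, b1)
assert sse(b0, b1) < sse(b0, b1 + 0.05)
```

Shifting the intercept or tilting the slope away from the least-squares solution always increases the squared deviations; that is exactly the sense in which the method "chooses" the line.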
This method finds the line that minimizes the squared distances between the observed points and the line itself. The deviations from the line are called residuals. For example, consider these two red observations. The regression line goes through the left red observation, so its residual is zero. However, for the right red observation the residual, its distance to the regression line, is not zero: it is 5.5 kilograms.

In our example, the slope of the resulting regression line is 0.39. This means that for each centimeter taller, children are on average 0.39 kilograms heavier. The slope of the regression line quantifies how much the average outcome increases or decreases when the explanatory variable increases by one unit.

In mathematical terms, the regression line corresponds to an equation: weight equals Beta zero plus Beta one times height. Beta zero and Beta one are called the regression coefficients. Beta zero is the intercept: it is the value of weight when height is zero. Beta one is the slope, or steepness, of the line. As in our example, if Beta one is positive, an increase in the explanatory variable leads to a mean increase in the outcome. In contrast, if the slope Beta one is negative, an increase in the explanatory variable leads to a decrease in the outcome. This is the case, for example, when calculating the regression line with remaining life years as the dependent variable and age as the independent variable.

The regression equation can be used for prediction. What is the expected weight for a child of one meter? Beta zero plus Beta one times 100: 18 kilograms.

All we have seen so far is based on our particular sample, but actually we would like to draw conclusions about the whole population of children between zero and five years old. How can we do that? We must first realize that there are two main sources of uncertainty when estimating our regression line. First, even if we observed all children in the country, there would still be variation in weight among children of the same height.
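The prediction step above can be sketched directly. The slope 0.39 comes from the lecture; the intercept of -21 is implied by the lecture's prediction (18 kilograms at 100 centimeters means Beta zero = 18 - 0.39 * 100 = -21), so treat it as a derived, illustrative value.

```python
# Lecture's slope; intercept implied by the prediction of 18 kg at 100 cm.
beta0 = -21.0
beta1 = 0.39

def predicted_weight(height_cm):
    """Expected weight in kg from the regression equation
    weight = beta0 + beta1 * height."""
    return beta0 + beta1 * height_cm

# Expected weight for a child of one meter (100 cm):
one_meter = predicted_weight(100)

# Each extra centimeter adds beta1 = 0.39 kg to the expected weight:
per_cm = predicted_weight(101) - predicted_weight(100)
```

Note that the intercept itself (the "weight at height zero") is an extrapolation far outside the observed heights and has no physical meaning here; only the line over the measured range is interpretable.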
A second ingredient which influences the level of uncertainty is sampling variation: we observed a particular sample of a particular size, not the whole population. Now, we will discuss how to overcome these two problems.

To understand how to draw inference from the regression line, you need to know the statistical model of linear regression. This model allows you to calculate standard errors, which measure the precision of our estimated regression coefficients. Let us start with our regression formula. The residuals are captured in the error term called e, which we assume to be normally distributed with mean zero and standard deviation Sigma. The formula to estimate Sigma is displayed below. What is important is that the residuals need to be normally distributed, not Y. This effectively means that we assume that for each particular value of height, weight is normally distributed. Also, the variance of weight is assumed to be the same for each value of height. Look at the plot.

Simple linear regression relies on some assumptions. As the name indicates, the main and most important assumption is that the relation between the two variables is linear. Violation of this assumption is very problematic: it makes no sense to fit a linear model to variables which are not linearly related. Take, for example, mortality rate and age. It is therefore very important to visualize the relation between the two variables in a scatterplot before starting our analysis. The second assumption is normality of the residuals. This can easily be assessed with a histogram. The third assumption is homoscedasticity, which can also be assessed graphically by plotting the residuals against each value of the predictor: the variation around zero should be similar across the whole plot.

In this lecture, we have seen how to use simple linear regression. We use it to model the relationship between two numerical variables. We have used the example of height and weight.
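The residual diagnostics can also be sketched in code. The data are again hypothetical; the estimate of Sigma uses the standard formula for simple regression, the square root of the sum of squared residuals divided by n - 2 (two degrees of freedom are used up by the two estimated coefficients).

```python
import math

# Hypothetical sample: heights in cm, weights in kg.
heights = [70, 80, 90, 100, 110]
weights = [8.5, 10.0, 12.0, 14.5, 16.0]
n = len(heights)

# Least-squares fit, as before.
mx = sum(heights) / n
my = sum(weights) / n
b1 = (sum((h - mx) * (w - my) for h, w in zip(heights, weights))
      / sum((h - mx) ** 2 for h in heights))
b0 = my - b1 * mx

# Residuals: observed minus fitted values.
residuals = [w - (b0 + b1 * h) for h, w in zip(heights, weights)]

# Estimate of Sigma: sqrt(sum of squared residuals / (n - 2)).
sigma_hat = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Least-squares residuals always sum to (numerically) zero by construction.
# To check normality, plot a histogram of `residuals`; to check
# homoscedasticity, plot `residuals` against `heights` and look for a
# constant spread around zero.
```

The checks described in the lecture are graphical; the code only prepares the residuals and the Sigma estimate that those plots and the standard errors are built from.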
In the next activity, you will learn how to perform simple linear regression in R. However, height is not the only characteristic which influences weight, right? The simple linear regression model can be extended to account for multiple explanatory variables. We will study it in the next lesson, multiple linear regression.