[SOUND] In the last lesson, we learned how to code categorical variables that had only two possibilities, like gender. What do we do when there are more than two choices? In this lesson, I will take you through what is needed to run the regression model when we have categorical variables with more than two possibilities.

Let's go back to an example we looked at in the last module, when we learned about simple linear regression. An advisor in the university would like to motivate her students to study hard and graduate with a high GPA. To stress the importance of GPA, she's wondering if she can show data on the relationship between a graduate's starting salary and the student's GPA. She has data on 100 recent graduates. While GPA had a really small p-value, and was thus highly significant, it only explained about 33.6% of the variation. The advisor was not happy to say that unknown factors are controlling almost 67% of the variation. So she decided to add another variable to her analysis that she believed also contributed to a graduate's salary, and that was the graduate's major. There are lots of majors in the university, so she decided for now to simplify her analysis by categorizing the students into four possibilities: Engineering, Business, Agriculture, and Liberal Arts.

Now the study has two explanatory variables, GPA and major. GPA is a numerical variable, and major is a categorical variable with four possibilities. So we first have to construct dummy variables for major, and only then can we run the analysis. Like before, we should be mindful of collinearity. To avoid collinearity, use one less dummy variable than there are possibilities. In the case of gender, there are only two possibilities, and we use only one dummy variable. In this case, there are four possibilities, so we will define three dummy variables. Engineering will be 1 for engineering majors and 0 otherwise.
Business will be 1 for business majors and 0 otherwise. Liberal Arts will be 1 for liberal arts majors and 0 otherwise. So if a student has 0 for all these variables, then the student has to be in the College of Agriculture. This is the major left out of the model. Again, while the numbers would look different based on which one is left out, the results of the analysis will be the same. So don't worry about which ones to code as 1 and keep in the model.

Here's the complete analysis of our data. It's too much to look at all at once, so let's review it piece by piece. Did our prediction model improve compared to the simple linear regression? Let's look at the adjusted R square. As you can see, 82.43% of the variation in the starting salary is explained by the regression model. This is definitely better than the 33.64% for the simple linear regression model, which only considered GPA. Now we can turn our attention to the individual variables that are in the model. For that, we will look at the p-value of each. All p-values are very small, so we conclude that all majors have a significant effect on starting salary.

Now let's move on to using the results for prediction. Just as before, the model is about the deviation from the category that was left out, which here is the Agriculture majors. Now we will explore what exactly this means. The coefficient for the Engineering major is 2830. This is a positive value, meaning that for students who have the same GPA, majoring in Engineering, as compared to Agriculture, adds about $2,830 to the person's salary. The same goes for Business. Compared to Agriculture majors, Business majors get a positive bump of about $1,304. The difference in salary between a Liberal Arts major and an Agriculture major is a negative value of 2,321, which means Liberal Arts majors earn about this much less than the students who major in Agriculture, holding GPA the same, of course. The intercept is where the values for all independent variables in the model are zero.
If everything is zero, then the student is not in Engineering, not in Business, and not in a Liberal Arts major; thus the student's major is Agriculture.

What is the difference in starting salary between an Engineering graduate and a Business graduate with the same GPA? It is the difference between the two coefficients, which means that an Engineering graduate on average will make $1,525.55 more than a Business graduate.

Now let's look at this one. What is the point prediction for the salary of a student with a GPA of 3.5 majoring in Liberal Arts? Take the intercept, add to that the coefficient of GPA times 3.5, then add the Liberal Arts coefficient to get $44,921.21. What is the point prediction for the salary of a student with a GPA of 3.5 majoring in Agriculture? Take the intercept and add to that the coefficient of GPA times 3.5. All other coefficients will be multiplied by zero, so the salary is $47,243.

Once again, you see that we're comparing the effects of many different variables at the same time. Regression allows us to know the collective impact, as well as the individual impact, that these variables have on the response variable. One of the challenges in regression is to select the right independent variables, ones which don't violate the model's assumptions. [SOUND]
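
The dummy coding and the point predictions walked through above can be sketched in a few lines of Python. This is only an illustration, not the advisor's actual analysis. The names `MAJOR_COEFS`, `dummy_code`, and `predict_salary` are made up for this sketch, and the major coefficients are the rounded values quoted in the lecture. The intercept and the GPA coefficient are not quoted individually, so the sketch takes their combined contribution (intercept plus GPA coefficient times GPA) as a single input, using the 47,243 figure given for a GPA of 3.5.

```python
# Rounded major coefficients as quoted in the lecture, in dollars,
# each measured relative to the left-out category (Agriculture).
MAJOR_COEFS = {"Engineering": 2830.0, "Business": 1304.0, "Liberal Arts": -2321.0}
BASELINE_MAJOR = "Agriculture"  # no dummy variable: all three dummies are 0


def dummy_code(major):
    """Return the three 0/1 dummy variables for one student's major."""
    if major != BASELINE_MAJOR and major not in MAJOR_COEFS:
        raise ValueError(f"unknown major: {major!r}")
    return {name: int(major == name) for name in MAJOR_COEFS}


def predict_salary(gpa_part, major):
    """Point prediction, where gpa_part = intercept + GPA coefficient * GPA."""
    dummies = dummy_code(major)
    return gpa_part + sum(MAJOR_COEFS[name] * d for name, d in dummies.items())


# From the lecture: intercept + GPA coefficient * 3.5 = 47,243
# (this is the Agriculture prediction, since all dummies are 0).
gpa_part_at_3_5 = 47243.0

print(dummy_code("Agriculture"))
print(predict_salary(gpa_part_at_3_5, "Agriculture"))
print(predict_salary(gpa_part_at_3_5, "Liberal Arts"))
```

Because the sketch uses the rounded Liberal Arts coefficient of -2,321, its Liberal Arts prediction comes out at 44,922 rather than the 44,921.21 reported from the unrounded regression output.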