0:15

In this session, you're going to look at regression analysis, a common technique used in many different contexts. Here we're going to talk about it from the point of view of cause-and-effect analysis in process improvement, and we'll work through an example later in this session.

But first, what is regression all about? The idea of regression is to generate an equation that describes the relationship between a y and one or more x's. You can have simple linear regression, which has one x, and you can have multiple linear regression, which has multiple x's: multiple independent variables having an effect on a dependent variable.

Regression is mainly used when you have continuous independent variables and a continuous dependent variable. But it can be used for other types of data as well. It can be used with a discrete dependent variable, and it can be used with discrete independent variables, though those call for different techniques. Building on what we'll be learning in this session, you could work with discrete independent variables: you can get regression to treat discrete independent variables as part of the regression equation. We're not going to look at that in this particular session, but there are ways to do it. So just know that regression is not restricted to the kind of example we're going to look at here.

So let's take a look at what we would see when interpreting the results of a regression model. There are four main things to look at when you get the output from a regression.

The first thing you look at is: is the model significant? Is the p-value for the model significant? If you remember (and we'll emphasize it some more later on), when we say p-value we're asking whether the significance value of the overall regression is less than the alpha value. If p is less than alpha, reject the null hypothesis. Here we look at the p-value for the F-statistic. What you'll see later on is that regression also includes an ANOVA table in its results, and that's where you'll find the p-value for the F-statistic. It's very similar to the ANOVA table you might remember from analysis of variance, because this will also be an ANOVA table. So that's the first thing: overall model significance based on the F-statistic and its p-value. That tells you whether the model is statistically significant or not.

Then you move on to the R-squared, which is something we can think about from a more practical perspective; we'll see what the R-squared means. It's also called goodness of fit.

The next thing to look at is the independent variable coefficients. If you have multiple independent variables, the overall model may be significant, and yet you might find that the individual independent variable coefficients are not all significant. There may be multiple independent variable coefficients, and you want to see whether each one of them is significant. For that we apply the same rule of p-values: look at the p-values and compare them to the alpha value that you have. To talk about that briefly, this uses a t-distribution. Overall model significance uses an F-distribution; the intercept and the coefficients are evaluated with a t-distribution.

And finally, you want to look at the t-statistic for each of the coefficients and its p-value, to see how significant they are. It comes back to the same idea: is each of the coefficients having a significant impact on the dependent variable?

Â 4:20

So what is a p-value? Just to revisit what you may have already seen earlier: if we get a p-value that's less than alpha, we reject the null hypothesis. A somewhat cheesy statement here: if the p-value is low, the null must go. It's a good way to remember how we use p-values. If the p-value is low, the null hypothesis must go: you reject the null hypothesis.

How is the p-value computed? We use the F-distribution for the overall regression equation. Sometimes it's computed using the z-distribution, sometimes using the observed t-value and the t-distribution; there are many different ways of computing the p-value. It comes back to the same rule, though.

What is R-squared? When we talk about regression, we refer to something called the R-squared. It's a very practical way of thinking about how good the regression equation is in a practical sense. So you come up with regression results, you come up with an equation, you come up with a statistically significant model. The next thing you look at is: is the R-squared high?

What do we mean by high? R-squared values go from 0 to 1, and what they're telling you is what percentage of the variation in the dependent variable can be explained by the independent variables in the equation. What percentage of the variation in y can be explained by the variation in all the different x's? You take the total explained variation and divide it by the total variation, and that gives you your R-squared. Because it's a ratio of the explained variation to the total variation, it goes between 0 and 1, the highest value being 1. More formally stated, it is the proportion of variation in the dependent variable that can be explained by the independent variables. So that's your R-squared.
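That ratio can be computed directly. Here is a minimal sketch in Python with made-up data (the numbers are illustrative, not from this session): fit a line by least squares, then divide the explained variation by the total variation.

```python
import numpy as np

# Made-up illustrative data: y roughly linear in x, plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)

# Fit y = b0 + b1*x by least squares
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

# R-squared = explained variation / total variation
ss_total = np.sum((y - y.mean()) ** 2)
ss_explained = np.sum((y_hat - y.mean()) ** 2)
r_squared = ss_explained / ss_total
print(round(r_squared, 3))
```

Because it is a ratio of explained variation to total variation, the value always lands between 0 and 1.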

What you'll also see in the output for any multiple linear regression model (in fact, you'll see it even for a simple linear regression model, but it's not as important there) is the adjusted R-squared. When you have multiple independent variables that you're putting into an equation, the adjusted R-squared value is something you want to focus on.

Why do you want to focus on that? Because you need to know what happens when you add more independent variables to a regression equation. When you add more x variables to a y-equals-function-of-x equation, the R-squared value, the proportion of variation explained by that equation, is always, always going to go up, if ever so slightly. What are the implications of this? Even if you throw nonsense independent variables into an equation, the R-squared value will always go up, even if ever so slightly. So that's telling us that by simply adding independent variables, even ones with no real relationship, we can get a higher R-squared value.

The adjusted R-squared gives you a truer sense of whether adding a variable is giving you significant bang for the buck, if we call it that. The adjusted R-squared adjusts for the number of predictors you have in the model. In any regression output, you'll find the adjusted R-squared is always adjusted down from the R-squared value. In a practical sense, it's accounting for the number of predictors you have, so it adjusts down from the raw R-squared value.

How is this helpful? It helps you decide whether you should be adding more x variables to the equation. Besides looking at the significance of those x variables, you also want to watch whether the adjusted R-squared is going up, because the adjusted R-squared can actually go down when you add more independent variables. What's useful is to keep an eye on the adjusted R-squared, compare it to the R-squared, and also watch, as you go from one model to the next, whether the adjusted R-squared goes down when you add more variables.
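The adjustment itself is a one-line formula: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). A small sketch (the numbers here are made up for illustration):

```python
def adjusted_r_squared(r2, n, k):
    """Penalize R-squared for the number of predictors k, given sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative: R^2 = 0.87 from a model with n = 36 observations, k = 2 predictors
print(round(adjusted_r_squared(0.87, 36, 2), 3))   # adjusted down from 0.87

# Adding a third, nearly useless predictor nudges R^2 up only slightly
# (0.87 -> 0.871), yet the adjusted value goes DOWN
print(round(adjusted_r_squared(0.871, 36, 3), 3))
```

The second call shows exactly the behavior described above: raw R-squared creeps up, adjusted R-squared drops, signaling the extra variable isn't earning its keep.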

All right. So we've talked about overall model significance, and we've talked about the R-squared value. The next thing to look at is the individual coefficients in the equation. We call these the betas, the b-coefficients, so you might have b1, b2, b3. What are we talking about there? For each one of these coefficients, the null hypothesis asks: is this independent variable significantly explaining the dependent variable? The null hypothesis says that b1 is 0; in mathematical terms, we're saying the slope is 0, so it has no impact on the y value. The alternative hypothesis is that it is significantly different from zero, so that including it in the model reflects a statistically significant effect.

So how would you test the statistical significance of the coefficients, the beta coefficients? Regression tests them based on the t-distribution, and the t-distribution here is based on degrees of freedom. What are the degrees of freedom we're talking about? You may remember degrees of freedom; we talked about them when we were talking about ANOVA. Here the degrees of freedom for the t-distribution are n - k - 1: n being the sample size (n in statistics is always the sample size), k being the number of independent variables, and we subtract one more for the intercept. So n - k - 1 degrees of freedom. Again, you don't necessarily have to know these technicalities when you're interpreting software output, but if you wanted the critical value for rejecting or retaining the null hypothesis for a particular coefficient, you could get it from your alpha value and your degrees of freedom, using a t-table or Excel.
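If you want that critical value without a t-table, it's a one-liner in Python with SciPy (the n and k here are made-up illustration values):

```python
from scipy import stats

n, k, alpha = 36, 2, 0.05        # sample size, number of predictors, alpha
df = n - k - 1                   # degrees of freedom for the coefficient tests
t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value
print(df, round(t_crit, 3))      # reject H0 for a coefficient if |t| > t_crit
```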

All right, next let's take a look at what you get from a regression in terms of an equation. What you get is a regression equation which gives you the direction and the size; those are the two things it gives you. It gives you a slope, and the slope will be plus or minus, and there will be a size associated with that slope: a b value associated with that coefficient. And then, as we saw on the previous slide, you also get a statistical significance value. You get a p-value associated with each coefficient, which tells you whether that coefficient is helpful in terms of predicting what you're getting as the y. So is x1, or x2, or x3, each a significant predictor of y?

The sign of each coefficient is going to be important, and the way you interpret a coefficient is that it represents the mean change in the response for one unit of change in the predictor: the mean change in y for one unit of change in that x value. That's the practical interpretation of that particular independent variable.

12:21

And then, finally, you will have an equation that you can use. You'll get an equation that says y equals a particular intercept plus the coefficient terms. Just to make up an example here, it could be y = 4 + 0.2a + 0.3b + 0.6c. What is that telling you? Now you have an equation: if you have the values for a, b, and c, you can plug them in and get a value for y.
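Plugging values into that made-up equation is simple arithmetic; a quick sketch (the inputs a = 10, b = 5, c = 2 are invented for illustration):

```python
def predict_y(a, b, c):
    # Made-up example equation from the lecture: y = 4 + 0.2a + 0.3b + 0.6c
    return 4 + 0.2 * a + 0.3 * b + 0.6 * c

# With a = 10, b = 5, c = 2: 4 + 2 + 1.5 + 1.2 = 8.7
print(predict_y(10, 5, 2))
```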

So let's take an example in order to practice this idea of interpreting an equation. Let's say you have a potato chip company that is analyzing factors that affect the percentage of broken potato chips in their bags. The dependent variable is the percentage of broken potato chips. One independent variable is the percentage of potato relative to other ingredients, and the other independent variable is the cooking temperature. So, do cooking temperature and the percentage of potato in the chips affect the percentage of broken potato chips in a bag? That's what we're looking at in this particular example.

Let's say the analysis was done on some data. You don't have the data right here, but you do have the results of the analysis, and let's say the equation was significant. Again, you don't have the F value, you don't have the ANOVA table, but let's say it was significant and you've gone past that. The first thing we looked at was the F-statistic for the whole equation: the model is significant. Then we looked at the R-squared. The R-squared is 67.2%, which tells you that 67.2% of the variation in broken chips is explained by these two variables.

Are each of these two variables significant? Let's take a look at that. We see the p-values for each of these variables. For percentage potato, the p-value is 0.001. For cooking temperature, the p-value is 0.02. What is this telling us? Let's say we were using an alpha value of 0.05. These two p-values are less than 0.05: each of the coefficients has a p-value less than 0.05, namely 0.001 and 0.02. So we can say that each of these two variables significantly affects broken potato chips.

The intercept, or the constant as it's shown in this particular table, has a p-value of 0.32. We don't really interpret the constant, or the intercept, in the case of a regression equation, so we leave it as it is. It's going to be part of the equation whenever we calculate broken potato chips, but we do not worry about its p-value being greater than 0.05.

So keeping that in mind, what are we seeing here? We can come up with a regression equation that says the percentage of broken potato chips in a packet is equal to 4.231, minus 0.044 times percentage potato, plus 0.023 times cooking temperature. That's the equation you have over here.

Now, what I'd like you to do is take this equation as an exercise. First, think about how you would interpret it, which we've already done on the previous slide. That should be easy to think about in terms of this specific example: how does potato content affect broken chips, and how does cooking temperature affect broken chips? And second, for a particular setting of 50% potato and a cooking temperature of 175 degrees Celsius, come up with an expected value of broken potato chips. So calculate, based on this equation, the expected percentage of broken potato chips. Go ahead and do the calculations, and then we'll come back and take a look at the solution.

Â 16:48

So what you would have seen here, applying everything we learned about the equation to this particular example, is the following. We can interpret the equation as saying that for each 1% increase in the amount of potato, the percentage of broken chips is going to decrease, because we had a negative sign for that particular slope; it's going to decrease by 0.044%. On the other hand, for each 1 degree Celsius increase in cooking temperature, the percentage of broken potato chips is expected to increase by 0.023%.

Now, you have to be cautious about the fact that you can use this equation only within the range of the data that was collected to build it. What I'm saying is that you cannot go beyond the range of the data on the basis of which this equation was developed. Because how did somebody come up with this equation in the first place? They modeled what they found based on data collected from the process. So if the temperature range in that data only went from, say, 100 to 300 degrees, you can't use this equation to predict what would happen at any temperature less than 100 or any temperature greater than 300. You don't want to go beyond the range. Similarly for the amount of potato: you don't want to go beyond the range where you actually collected the data.

Finally, to complete the question we had: we were asked to predict the percentage of broken chips for settings of 50% potato and a cooking temperature of 175 degrees. If you go through the calculation, you get an expected value of 6.056% broken chips. So that's the interpretation of what you find here.
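You can verify that 6.056% figure by coding the fitted equation directly:

```python
def broken_chips_pct(pct_potato, cook_temp_c):
    # Fitted equation from the example: 4.231 - 0.044*(% potato) + 0.023*(temp C)
    return 4.231 - 0.044 * pct_potato + 0.023 * cook_temp_c

estimate = broken_chips_pct(50, 175)   # 50% potato, 175 degrees Celsius
print(round(estimate, 3))
```

Working it through by hand: 4.231 − 2.2 + 4.025 = 6.056, matching the solution above.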

All right. Now let's take an example that we're going to solve using Excel. Let's take an example of data that you have in the form of an Excel spreadsheet. You're also going to see the data on the slides first, and we'll use this to practice using multiple linear regression for analysis.

This is a company that manufactures various types of sparkling lights. The manager is interested in getting a better understanding of overhead costs, so we have the data given to us. The data she has tracked is total overhead costs for the past 36 months; we have 36 months' worth of data. This is going to be the y variable: the y variable is total overhead costs. To help explain these, she has also collected data on two variables that she believes are related to the amount of work done at the factory, and hence to the overhead costs. These variables are machine hours, the number of machine hours used during the month, and production runs, the number of separate production runs during the month. These two variables represent how much work is being done in the factory, and what she believes is that they have an impact on the overhead costs being incurred at that factory.

All right, so we have an explanation of what each of these measures is. Let's go ahead and take this data, pull it into Excel, and use it to analyze, first, whether there is a relationship between these three things, between the one y variable and the two x variables, the two x variables being machine hours and production runs. And second, if there is a relationship, we want to see what that relationship is. So let's move to Excel and do this analysis first.

Â 20:58

So, here you see the data for the fireworks company problem. You have the Excel spreadsheet, which should be available to you. The data has three columns of variables: overhead costs is going to be our dependent variable, and machine hours and production runs are going to be our independent variables.

Let's go ahead and do the analysis. We go to Data, and we get to the add-in, which is Data Analysis. Excel lists the tools alphabetically, so let's find Regression; here's Regression. We hit OK, and it asks for the input ranges. Our y variable in this case is the overhead, so we go to the overhead column and highlight the whole column. Then we go to the x range. Depending on how many x variables you have, you're going to have multiple columns; in this case we have two columns, because we have two x variables.

This also gives you an indication of how you have to lay out your Excel spreadsheet to use it for regression: you want your x variables in consecutive columns, right next to each other. Excel doesn't give you the ability to skip over columns, so you need to put all the x variables in consecutive columns, which we already have here. In case you don't, you'll need to rearrange them first.

You also need to pay attention to checking Labels: the first row has the labels, so we want to make sure Excel knows that. You don't need to change the confidence level; it's set at 95%. We're simply using p-values as the way to interpret the results, although confidence intervals would give you exactly the same conclusions. A 95% confidence interval simply corresponds to an alpha value of 5%, but we don't need to worry about that at this point.

For the output, we're going to ask for it to be in a new worksheet ply. And we're not going to worry about the residuals and things like that, although these would be things you could check to see whether the assumptions of regression are being violated. We're not getting into that kind of advanced material in this course, so let's leave those unchecked, hit OK, and Excel gives us the results. Here you have the results for the regression, with the regression statistics, the ANOVA table, and the coefficients.

Â 23:42

So here's the data that you had, and you used it to analyze y as a function of these two x's. Here are the results we got, broken up across different slides. The first slide gives you the model significance and some of the information about the overall model. What do we see here? We see that the F-statistic in the ANOVA table is 107.03, which is a very high F value. And we also see that the significance value is really, really small: it's 3.75 × 10^-15, so many, many zeros before you get to a nonzero digit. So the model is significant.
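You can reproduce that tiny p-value from the reported F-statistic with SciPy's F-distribution, using the numbers from this output:

```python
from scipy import stats

F, k, n = 107.03, 2, 36                 # reported F, predictors, sample size
p_value = stats.f.sf(F, k, n - k - 1)   # right-tail area beyond F (dfn=2, dfd=33)
print(p_value)                          # effectively zero
```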

Next, we move on to see the R-squared. The R-squared we're getting here is 0.87. An R-squared value of 0.87 tells you that 87% of the variation in the overhead cost is being explained by the two x variables we had, which are the machine hours and the production runs. So those are the two x variables, and we looked at their effect on overhead costs: 87% of the variation is being explained. What you can also see here is that the adjusted R-squared, adjusted for the number of variables, is only slightly lower, at 0.86. That's an indication that we may well find both of these independent variables to be significant in affecting the y variable.

Â 26:03

So here you have the equation, the intercept, and the p-values for the intercept and for the two coefficients. The intercept has a p-value of 0.55; we said earlier that we're going to use the intercept but we're not going to interpret its p-value, even though it's not significant. We are going to interpret the other two p-values, which tell us that both of these x variables, the machine hours and the production runs, are significantly affecting the overhead costs. And we can get an equation from what we see here: overhead costs are equal to 3996.68, plus 43.54 times machine hours, plus 883.62 times production runs. That's the equation we can come up with based on this model.
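As with the potato chip example, the fitted equation can be used for prediction. A sketch with hypothetical inputs (1,000 machine hours and 30 production runs are made up for illustration, not taken from the data set):

```python
def overhead_cost(machine_hours, production_runs):
    # Fitted equation: 3996.68 + 43.54*(machine hours) + 883.62*(production runs)
    return 3996.68 + 43.54 * machine_hours + 883.62 * production_runs

# Hypothetical month: 1,000 machine hours and 30 production runs
print(round(overhead_cost(1000, 30), 2))
```

Remember the caution from the earlier example: only predict within the range of machine hours and production runs actually observed in the data.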

So here you see how you would interpret the results you get from the data we used for this calculation.

Finally, in closing: we looked at a very simple type of regression model, the most basic and most common type, one we can use in most situations. So what other kinds are out there? As we were discussing earlier, we can take discrete variables and throw them in as independent variables; we would have to do some kind of coding of those discrete variables before throwing them in. We could also do interaction analysis, similar to what you would do in a two-way analysis of variance. You can do an interaction analysis based on multiplicative terms: the effect of x on y can depend on the values of a different x. So the effect of x1 on y depends on the values of x2, and you can capture that by creating a multiplicative interaction term and adding it to the regression equation.

You can also look for nonlinear effects by adding squared terms and cubed terms; that's a quick way of checking for nonlinear effects. If the squared term is significant, or the cubed term is significant, what does that mean? You simply take an x value and square it. In our case, if we wanted to look at nonlinear effects of the number of production runs, we would have created another column, production runs squared, and added that in, if we had a hypothesis that there's going to be a nonlinear effect of production runs on overhead costs.

There are other regression models that you can use out there as well.