As promised, we now introduce the sigmoid function. It takes our original linear regression function and wraps it inside another function: as we see here on the right, 1 over (1 + e to the negative x), where x represents our original linear function, which we'll see on the next slide. This new function always takes values between zero and one, no matter the value of x, with x again representing that full linear equation. It also smooths out the effect of very high or very low values of x, so our algorithm is not skewed by extreme samples and it manages to find the obvious visual threshold. So instead of trying to fit y equals beta naught plus beta 1 x, as we do with linear regression, we try to fit y equals a function of beta naught plus beta 1 x, with that function being the one we just mentioned: 1 over (1 + e raised to the negative of our original linear function), which we see here is beta naught plus beta 1 x. The resulting algorithm is called logistic regression. Note that despite the name, this is not a regression algorithm. Regression generally means asking how much, but this is actually a classification algorithm, as in choosing which class; it's just unfortunate naming. Also notice that the output is always between 0 and 1, and the location where it crosses 0.5 is meaningful: it truly represents a 50/50 chance of either outcome given our model. So as opposed to the linear regression approach, which can take on any value, here we can only take on values between 0 and 1. And now we see how we can correctly classify all values to the left of our decision boundary, as well as those to the right of it.
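As a minimal sketch of the squashing behavior described above, here is the sigmoid written out directly, with z standing in for the full linear expression beta naught plus beta 1 x:

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# z stands in for the full linear function beta0 + beta1 * x.
print(sigmoid(0))    # 0.5, the natural decision threshold
print(sigmoid(10))   # very close to 1: extreme inputs get smoothed
print(sigmoid(-10))  # very close to 0
```

Note how inputs far from zero are flattened toward 0 or 1, which is exactly why extreme samples stop dominating the fit.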
Now I want to tie this back to the original linear regression problem, so that we have an intuitive sense of the model being learned. p of x here is the output of our logistic regression, and can be thought of as the probability of the sample being in one class versus the other. Now, we're not going to go too in depth on the algebra, but e to the negative z, with z here being beta naught plus beta 1 x, is equal to 1 over e to the z. So our denominator can be transformed into 1 plus 1 over e to the z, and if we multiply the top and bottom by e to the z, we end up with the equation that we have here on the right. Don't worry too much if you didn't follow along with the algebra; you can just trust that we end up with this equation on the right. We can then do some more algebra on our own to show that the odds ratio is just equal to e raised to the power of our linear function. The idea, as you see us walk through the steps, is that we're trying to isolate what the linear function actually does. So our linear prediction is no longer y itself but a function of y: rather than p of x, it's p of x over (1 minus p of x). And if we recall that p of x is a probability, we see that we have now turned our probability into an odds ratio. So if we started off with a 0.75 chance of getting a certain value, that is the same as saying we have three to one odds: with p of x equal to 0.75, 1 minus p of x would be 0.25, so we'd have 0.75 over 0.25, and e to the power of our linear function would equal those three to one odds, rather than our original logistic function being equal to 0.75. We can then take the log of both sides, and we see that our log odds is just a linear function of x, where our y here, which is now a function of our original y, can be seen as the log odds.
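The 0.75 example above can be checked numerically. This short sketch walks the probability through the odds ratio to the log odds, then applies the sigmoid to the log odds to recover the original probability:

```python
import math

p = 0.75                   # estimated probability of the class
odds = p / (1 - p)         # 0.75 / 0.25 = 3.0, i.e. three-to-one odds
log_odds = math.log(odds)  # this is what the linear part beta0 + beta1*x models

# Applying the sigmoid to the log odds recovers the original probability,
# confirming the two forms are equivalent.
recovered = 1.0 / (1.0 + math.exp(-log_odds))
print(odds, log_odds, recovered)  # 3.0, ~1.0986, 0.75
```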
So a unit increase or decrease in our x values will change our log odds in a linear fashion, according to whatever beta 1 we have learned. That's how you can use it for interpretive purposes. Now let's get back to our visual example to gain a deeper understanding of the boundary created by our logistic regression. With one feature, the boundary is just a point, corresponding to the predicted probability equaling 0.5. Here we have two labels, not churned versus churned, and only that one feature, the amount of usage. With two features, we'll be working with a straight line, and in general, as we move up to higher dimensions, that decision boundary will be a hyperplane. The idea is that it's always a linear function. Don't mind the little rerouting on the figure; it is meant to be linear. We just wanted to show how points would be classified using a straight line without running through any one of our examples. With our new decision boundary, we can now predict a new example. We see it falls somewhere around phone usage being, let's say, eight, and data usage being around 20, and we can predict this example according to where it falls relative to our decision boundary. We find that it gets whatever the blue label is, whether that's churned or not churned. We can also use this classifier, or more generally any binary classifier, in a multi-class classification scenario. We see here that we now have a third grouping: rather than churned and not churned, we are predicting between the labels not churned, canceled, and left for a competitor. One technique to accomplish this multi-class classification is a method called one versus all. So how does one versus all work? We take one class, say not churned, which is going to be our blue dots here, and declare that everything else is going to be our other class.
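In the one-feature case, the boundary point mentioned above falls out of the algebra directly: the predicted probability crosses 0.5 exactly where the linear part is zero. A small sketch, using hypothetical coefficients rather than the lecture's fitted values:

```python
import math

def p_of_x(x, beta0, beta1):
    """Predicted probability from a one-feature logistic regression."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Hypothetical coefficients (not fit to the lecture's churn data). The
# boundary sits where beta0 + beta1 * x = 0, i.e. at x = -beta0 / beta1,
# which is exactly where the predicted probability crosses 0.5.
beta0, beta1 = -4.0, 0.5
boundary = -beta0 / beta1
print(boundary, p_of_x(boundary, beta0, beta1))  # 8.0 0.5
```

Samples with usage above this cutoff get probability above 0.5 and land in one class; those below land in the other.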
Here, the other class is going to include both the canceled class and the left-for-a-competitor class, and we fit a logistic regression. We then have our logistic regression here on the left, which defines the decision boundary between most likely to have not churned versus all others. We then do it again for each of our other classes: with each class, we estimate a binary logistic regression of that class versus all other classes. So we see here, let's say, canceled versus not canceled, and just as we did for the purple labels, we can do the same for our red labels, the red labels being left for a competitor versus all else. So for each of our three different classes, we have fit a logistic regression of one versus the rest, and we end up with three logistic models putting out three probabilities, one for each class. The estimated category is the class with the highest estimated probability across those one-versus-all models, and we end up with these three separate decision boundaries, given the highest probability from each of our three separate logistic regression problems. There we see left for a competitor in blue, not churned in pink, and churned in the purple values here on the right. Now let's take a step back and see how we can fit a logistic regression model using sklearn. First we want to import our model: from sklearn.linear_model, we import LogisticRegression. We then instantiate the class, so we say LR, which is going to be our object, equals LogisticRegression, the class we imported just before, and we pass in these hyperparameters for regularization. So we specify that regularization to avoid overfitting, and we are specifically using the L2 norm, which we learned about earlier.
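The one-versus-all procedure can be sketched in a few lines. The coefficients below are made up for illustration, not fit to the lecture's churn data; the point is the mechanics of scoring each class against the rest and taking the highest probability:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Three binary "this class vs. everything else" models, each with
# hypothetical coefficients (intercept, phone-usage, data-usage).
classifiers = {
    "not churned":           (-3.0,  0.6,  0.1),
    "canceled":              ( 1.0, -0.4, -0.2),
    "left for a competitor": (-1.0, -0.1,  0.3),
}

def predict(phone_usage, data_usage):
    # Score every one-vs-all model; the final label is the class whose
    # binary model assigns the highest probability.
    probs = {label: sigmoid(b0 + b1 * phone_usage + b2 * data_usage)
             for label, (b0, b1, b2) in classifiers.items()}
    return max(probs, key=probs.get), probs

label, probs = predict(8, 20)
print(label, probs)
```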
The L2 norm is just going to be our coefficients squared, and C is our regularization constant, but here it's essentially the inverse of the lambda we learned before, so a higher C actually means less penalty. We then fit our model using our training set: we have X_train for our features and y_train for our labels, and we come up with the fit. We are then able to get our predicted values by passing X_test, our holdout set, into LR.predict once LR has been fit, and we come up with our predictions. Then, as we did with linear regression, we can call the coef_ attribute to view each of our coefficients. I do want to note at this point that if you want p-values for either logistic or linear regression coefficients, I would suggest also looking into the statsmodels package, which is better for statistical inference, but not quite as seamless as sklearn for general machine learning. Another thing is that sklearn comes with a nice cross-validation method which allows us to try several parameters very easily, and once it runs an exhaustive search on the CV split set, it refits a model with the best choice of parameters, similar to what we did with GridSearchCV. Now, logistic regression, and classification in general, has many applications. Other applications besides customer churn that we can use logistic regression for include: for customer spending, we can reframe the question as how likely a customer is to be a top 5% spender, using previous purchase data. For customer engagement, we can predict which customers are most likely to engage in the next six months. In e-commerce, we can predict which transactions are fraudulent, using customer characteristics such as location, IP address, etc. And within finance and risk evaluation, we can predict whether a loan will default or not.
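Putting the sklearn steps above together, here is a runnable sketch. Since the lecture's churn dataset isn't shown, a small synthetic dataset stands in for it:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (two features, binary label).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# penalty='l2' is the squared-coefficient regularizer; C is the inverse of
# the lambda strength, so a higher C means less penalty.
LR = LogisticRegression(penalty='l2', C=1.0)
LR.fit(X_train, y_train)

y_pred = LR.predict(X_test)       # predicted labels on the holdout set
print(LR.coef_)                   # one coefficient per feature
print(LR.score(X_test, y_test))   # holdout accuracy

# The built-in cross-validated variant searches a grid of C values and
# refits on the best one, similar in spirit to GridSearchCV.
LR_cv = LogisticRegressionCV(Cs=5, cv=3).fit(X_train, y_train)
print(LR_cv.C_)                   # best C found by cross-validation
```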
And we want to keep in mind this idea of interpretation versus prediction. In addition to just predicting labels, we may want to evaluate the importance of each factor in influencing our outcomes. Recall how we learned that the coefficients still have interpretive value: rather than a unit change in x increasing our outcome variable by beta units, as in linear regression, a unit change in x will increase the log odds by beta units, still giving us an idea of the influence of each feature on our outcome variable. So to recap: in this section, we introduced logistic regression as our first classification algorithm, outputting the probabilities of different classes. We then showed the differences between linear regression and logistic regression, as well as how the two tie together. And then we closed out by showing how we can actually use logistic regression to predict the class with our customer churn example, looking through how to do that with sklearn and discussing some other examples for which you may want to use logistic regression or other classifiers. In the next video, we'll discuss the important topic of choosing the correct classification error metrics, and I look forward to seeing you there.