So as we discussed, we're going to introduce another method of evaluating our classification models, namely the receiver operating characteristic curve, or ROC curve. An ROC curve plots the sensitivity, or recall (out of all of our actual positives, how many did we get correct?) on the y-axis, and the false positive rate, or one minus the specificity, on the x-axis. Specificity is our true negative rate, which we discussed in the last video, and one minus this value is the false positive rate. So if we identified all of our negatives correctly, then our false positive rate is zero, and for every actual negative we incorrectly predict as positive, we increase this false positive rate. The ROC curve looks at the predicted probabilities that the model outputs, which is a list of scores, not a list of classes. So rather than outputting one or zero, the model outputs a value such as 0.9, meaning it's pretty certain the value is one, or 0.1, meaning it's pretty certain the value should have been zero. It then plots the sensitivity and the false positive rate for various score thresholds, so here we can start considering thresholds other than the 0.5 that we've been using for logistic regression. We can think again about our trade-off. If our threshold for predicting positive is really low, such as 0.01, then we predict nearly everything as positive, so we will have a high true positive rate, at the top of our chart, but also a high false positive rate, to the right of our chart. If we pick a very high threshold, say 0.99, then we predict almost nothing as positive, so we will have a low false positive rate, to the left of our plot, but also a low true positive rate, at the bottom of our plot. Sweeping across these thresholds is what traces out the curve, and for a model that guesses at random it traces out that straight diagonal across the plot. So we're not interested here in the classes predicted, but in how meaningful the class probabilities output by the model actually are.
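The threshold trade-off above can be sketched by computing the true positive rate and false positive rate at a few thresholds by hand. This is a minimal sketch; the labels and scores here are made up, not from any model in the lesson:

```python
import numpy as np

# Made-up true labels and predicted probabilities for ten samples.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.7, 0.5])

def tpr_fpr(y_true, y_score, threshold):
    """True positive rate (sensitivity) and false positive rate at one threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Sweep from a high threshold to a low one: the point moves from the
# bottom left of the ROC plot (0, 0) toward the top right (1, 1).
for t in [0.99, 0.75, 0.5, 0.25, 0.01]:
    tpr, fpr = tpr_fpr(y_true, y_score, t)
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Each (FPR, TPR) pair is one point on the ROC curve; plotting them across all thresholds traces out the full curve.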
That diagonal of the plot represents the performance we can obtain by just randomly guessing. If we randomly guess, we can set them all to one or all to zero and end up at either corner; otherwise, we end up somewhere along that middle line. The lower right portion is what we want to avoid; models that end up there are doing worse than guessing, and this will almost never happen with our models. The top left is where we want to be, and the closer we are to that top left corner, the better the model, with all the way in the top left being a perfect model. That means we have a really high true positive rate and, because it's all the way to the left, a really low false positive rate. Now, the ROC area under the curve, often called ROC AUC, gives a measure of how well we are separating the two classes. A value of one is perfect classification, with the curve hugging that top left corner: a high true positive rate and a low false positive rate. As the curve moves further toward the diagonal, the value decreases, and when we get to 0.5, the model is essentially as good as picking at random. Similar to our F1 score, this is again a balanced metric. As opposed to accuracy, whose value can be inflated by a useless model that predicts all zeros on imbalanced data, ROC AUC will not fall into that same trap. The curve will always connect the bottom left corner (no false positives, but also no true positives, essentially a threshold that predicts all zeros) to the upper right corner (all the true positives, but also all the false positives, essentially a threshold that predicts all ones). In practice, it will almost always be a curve bowed up toward the top left, as we see here with the 0.9 or the 0.75. Similar to the ROC curve, we can plot the precision-recall values for various score thresholds. Now, this is an unbalanced metric, in the sense that it is sensitive to class imbalance, and it will mostly be a decreasing curve.
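As a sketch of how ROC AUC distinguishes an informative model from random guessing, we can score two sets of made-up predictions with scikit-learn. The label array and score distributions here are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)  # made-up binary labels

# Informative scores: positives tend to score higher than negatives.
good_scores = np.where(y_true == 1,
                       rng.uniform(0.4, 1.0, 1000),
                       rng.uniform(0.0, 0.6, 1000))

# Uninformative scores: unrelated to the label, so AUC should sit near 0.5.
random_scores = rng.uniform(0.0, 1.0, 1000)

print(roc_auc_score(y_true, good_scores))    # well above 0.5
print(roc_auc_score(y_true, random_scores))  # close to 0.5, the diagonal
```

Note that `roc_auc_score` takes the scores themselves, not hard 0/1 predictions; collapsing to classes first would throw away the ranking information the AUC measures.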
The precision-recall curve ends up at the right where recall is equal to one: we predicted all ones, so precision is just the number of actual positives over the total number of samples, since everything we predicted counts as a predicted positive. At the left, we start by predicting all zeros, so recall is zero because we haven't predicted any positives at all. But we can get very high precision almost immediately, which is why we see that jump in Model 1: once we have a threshold of, say, 0.9999, and we predicted that single very certain positive correctly, we have high precision. Here, the area under the curve will depend on how unbalanced our dataset is. So how do we choose the right approach? Which curve works best for choosing a classifier? The ROC curve will generally do better for data with balanced classes. The precision-recall curve will generally be better suited for data with imbalanced classes, similar to what we saw with our leukemia example. The right curve will depend on tying our results, so true positives versus false positives, to our business outcomes and the relative costs of false positives versus false negatives. For example, suppose we want to predict whether a customer is likely to churn so that we can initiate an intervention. The right prediction threshold should take into account that customer's value, the loss if the customer does churn, and the cost of the intervention. Since business outcomes, meaning the relative costs of false positives versus false negatives, will probably lead to a specific decision threshold for an application, results at that threshold may be more relevant than results across all thresholds, which is what we're getting with both the ROC and the precision-recall curves.
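The endpoints described above can be checked with scikit-learn's `precision_recall_curve` on a small made-up imbalanced dataset (a sketch, not tied to any model from the lesson):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up imbalanced data: only 2 positives out of 10 samples.
# One positive is scored very confidently (0.9), one very poorly (0.05).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.35, 0.9, 0.05, 0.5])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f}  recall={r:.2f}")
# At the lowest threshold (predict everything positive), recall is 1 and
# precision falls to the positive rate of the data, 2/10 here; at the
# highest threshold, the single confident positive gives precision 1.
```

Like `roc_auc_score`, this function takes scores rather than hard class predictions, since it sweeps the threshold for us.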
We may want to just look at precision, just recall, or just the F1 score. Here's an example where we extend this to multiple classes: a three-class classification problem. Again, that blue diagonal holds the samples the model predicted correctly. Accuracy, once we extend to three labels or three classes, is just the ratio of this diagonal to the total number of samples, very similar to what we saw with just two labels. Now, there's no direct generalization of ROC, precision-recall, and the remainder of the measures to multiple classes, but we can look at precision, recall, specificity, and so on for each class using a one-versus-all approach. We can look at the specificity of class 1 versus the rest, class 2 versus the rest, and so on. With that, when we're looking at multiclass classification, it will still be important to pick or define the right metric for the problem at hand. With multiple classes, we want to think about the cost of misclassifying one class as another: what's the cost of misclassifying class 1 as class 2, class 3 as class 1, and so on? We also look at our larger confusion matrix, which compares what we predicted versus the actual values. Seeing within the matrix, for example, how often we predicted class 1 when the actual class was class 2 will help us make these decisions as well. We can pull out these confusion matrices using Python. The error metrics are located in the appropriately named metrics module of scikit-learn. Here we see that from sklearn.metrics we import accuracy_score. Essentially every one of these metrics follows a similar syntax, where the inputs are the actual and predicted labels respectively, as we see here: the test labels, which are the actual values, versus what we actually predicted.
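The multiclass confusion matrix and the diagonal-based accuracy described above can be sketched as follows, using made-up labels for a three-class problem:

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels for a three-class problem.
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 3, 3, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows are the actual classes, columns are the predicted classes, so the
# diagonal holds the correct predictions. Accuracy is the diagonal sum
# over the total number of samples.
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

Off-diagonal entries tell us which misclassifications are happening; for example, the entry in row 3, column 1 counts the actual class 3 samples that were predicted as class 1.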
We're able to get an accuracy score there, and if we want to do the same thing with any of our other scores, we have precision_score available, as well as recall_score, f1_score, and roc_auc_score. We can also pull out the confusion matrix, as I promised on the last slide, along with the ROC curve, the precision-recall curve, and even more if you want to look deeper into the metrics module. Some of these take in the predicted classes, either 1 or 0, such as accuracy, F1, precision, and recall. Others take in the predicted probabilities, so we want to keep this in mind: as we discussed, roc_auc_score and the precision-recall curve both take in probabilities. If you're unsure, I would suggest looking at the documentation, and it will give you a clear picture of what you should be passing into each one of these different error metrics. Now, let's recap what we've learned in this section. We broke down the importance of separating out different error types, such as type 1 and type 2 errors, and looking at them in relation to both the actual values of our labels and the values that we predicted for each one of those labels. With that, we discussed many approaches to measuring classification outcomes beyond accuracy, such as precision, recall, the F1 score, the ROC curve, and the precision-recall curve. Then we learned that with these different ways of measuring error, the error metric and model we end up choosing will depend on whether our business problem should focus on overall accuracy, a low number of false positives, a high number of true positives, etc. With that, we close out our section on measuring error, and we'll jump into our notebook on logistic regression, as well as actually looking at these different classification error metrics in practice. Thank you.
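As a closing sketch of the two input conventions described above, here the class-label metrics receive hard 0/1 predictions while roc_auc_score receives the probabilities; all data is made up:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Made-up actual labels and predicted probabilities.
y_test = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.8, 0.45, 0.3, 0.9, 0.4, 0.7, 0.1])

# Hard class labels at the default 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

# These metrics take predicted classes...
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

# ...while roc_auc_score takes the predicted probabilities.
print(roc_auc_score(y_test, y_prob))
```

Note the consistent signature throughout: the actual labels first, then the predictions, just as in the accuracy_score example from the slides.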