Now let's discuss the concept of classification error metrics, which will be of utmost importance to keep in mind as we begin building machine learning models for classification. In this section, we're going to go over different types of errors to be aware of when we're working with classification problems, some approaches to measuring classification outcomes according to the different types of errors that can be made, and finally, how to ultimately use classification error metrics to choose between our different models. Throughout the last course, we covered extensively how models are selected by splitting the data and calculating the error on some held-out set. Let's now move on to how best to calculate that error. The choice of the right error metric depends heavily on the question and the data available. For example, assume we're trying to classify patients likely to get leukemia, and in our training data a large majority, 99 percent, of patients are healthy. Let's say we build a classifier and use accuracy as our error metric. If we use accuracy as our measure, a simple model could be built that always predicts healthy, and although this is a useless model, it would result in 99 percent accuracy. Thus we see the importance of understanding our data and choosing the appropriate metric: accuracy is often not the right metric for a binary classification problem. So when thinking about errors in classification, we often talk about a confusion matrix. The vertical axis of our confusion matrix contains rows that correspond to the ground truth, either positive or negative, and the horizontal axis, so each one of our columns, corresponds to what the model ends up predicting, again either positive or negative. The blue diagonal that we see contains the correctly predicted values, and the red anti-diagonal, running from the top-right corner to the bottom-left corner, contains the elements corresponding to the errors.
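The accuracy pitfall we just walked through can be sketched in a few lines of plain Python. The labels below are synthetic, made up purely to mirror the 99-percent-healthy example:

```python
# A minimal, pure-Python sketch of the accuracy pitfall on imbalanced data.
# The labels below are synthetic, for illustration only.
y_true = [0] * 99 + [1]   # 99 healthy patients (0), 1 leukemia patient (1)
y_pred = [0] * 100        # a "useless" model that always predicts healthy

# Accuracy = fraction of predictions that match the ground truth
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.99, despite never catching the leukemia case
```

Even though the model captures zero actual leukemia cases, it still scores 99 percent accuracy, which is exactly why accuracy alone can be deceiving on skewed data.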
In the bottom left is a false positive, which is also called a Type I error, and in the top right is a false negative, which is also sometimes called a Type II error. So let's walk through some of the measures we may use for quantifying error in classification, versus our classic accuracy measure. We can calculate accuracy as the sum of both kinds of correct predictions, the true positives and true negatives, the blue values that we have; the denominator is the total number of samples. So it's total correct predictions over all of our labels. This is probably the most common error metric, but it can be deceiving in situations like the one we just saw, where the population may be skewed. Next, we have recall, which you may also have heard called sensitivity: our ability to identify all the actual positive instances. Since we're trying to recall all the actual positive instances, recall measures the percentage of the actual positive class that is correctly predicted. So out of that top row of actual positives, how many did we predict correctly? In other words, this is the capture rate. With our example, what percentage of the true leukemia cases is our model capturing? Notice that you can easily achieve 100 percent recall by predicting everything to be positive: everyone has leukemia. Then, out of all our actual positives, because we predicted that everyone has leukemia, we got them all correct. To balance that, enter precision. With precision, we are identifying, out of all our positive predictions, that leftmost column of predictions, how many we got correct. When the model predicts leukemia, how often is it right? If you always predict leukemia, then your recall will be 100 percent, but a lot of your predictions will be wrong, so your precision will suffer a lot. So you'll have this trade-off.
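The three measures so far can be computed directly from the four confusion-matrix counts. The counts below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
# A sketch of accuracy, recall, and precision from confusion-matrix counts.
# These counts are made up for illustration.
tp, fn = 8, 2    # actual positives: 8 caught, 2 missed
fp, tn = 4, 86   # actual negatives: 4 false alarms, 86 correctly cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)  # all correct over all samples
recall = tp / (tp + fn)                     # of actual positives, fraction captured
precision = tp / (tp + fp)                  # of predicted positives, fraction correct

print(accuracy, recall, round(precision, 3))  # 0.94 0.8 0.667
```

Notice the trade-off already showing up in the denominators: recall divides by the actual-positive row (true positives plus false negatives), while precision divides by the predicted-positive column (true positives plus false positives).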
So notice you can also predict only the one really clear-cut case to be leukemia, predict everything else to be non-leukemia, and end up achieving 100 percent precision. In that case your recall would be very low: you only captured one of the true leukemia cases out of all of them. So there's a trade-off here between precision and recall. Next, we have specificity, where we're trying to avoid false alarms. Here, we're only looking at that bottom row of actual negative classes. Specificity is concerned with how correctly the actual negative class is predicted; in other words, it's the recall for class 0. Out of all the cases where the patient does not have leukemia, how often do we correctly identify that patient as not having leukemia? We can see how it would be much more important for our leukemia example to have high recall, identifying all the true positives correctly, than to do well on any of the other measures we've discussed. So putting these all together: we have accuracy, which measures how well we classified overall, but can be heavily thrown off by skewed data. We have precision, which looks at all our predicted positives; the denominator here is true positives plus false positives, so out of all our predicted positives, how many did we actually get correct? We have recall, where our denominator is all the actual positives, both true positives and false negatives (false negatives are predicted negative when they were actually positive), so we are asking, out of all our actual positives, which ones did we classify correctly? And then specificity, which is just recall for our negatives: out of all actual negatives, which ones did we predict correctly? The last important metric is the F1 score. The F1 score is two times the product of precision and recall over their sum; this is the harmonic mean of precision and recall. The F1 score is a nice metric because it uses both precision and recall, and it tries to capture the trade-off between them.
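Using the same hypothetical confusion-matrix counts as before, specificity and the F1 score can be sketched the same way:

```python
# A sketch of specificity and the F1 score from confusion-matrix counts.
# These counts are made up for illustration.
tp, fn = 8, 2    # actual positives: 8 caught, 2 missed
fp, tn = 4, 86   # actual negatives: 4 false alarms, 86 correctly cleared

specificity = tn / (tn + fp)  # "recall for the negative class"
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(specificity, 3), round(f1, 3))  # 0.956 0.727
```

Because F1 is a harmonic mean, it is dragged toward whichever of precision or recall is smaller, which is exactly the behavior that penalizes the predict-everything-positive corner case.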
Unlike plain accuracy, this trade-off means the F1 score is penalized heavily if either precision or recall is too low. So optimizing F1 will not allow for corner cases like predicting everything to be one. In the next video, we will continue our discussion of classification metrics with an introduction to another metric, the ROC curve, as well as showing how we can quickly make each of the values we just discussed readily available using scikit-learn.