Now, in general, I want to walk through the different steps you want to take when either oversampling or undersampling, and make sure to keep this ordering in mind. We want to first do our train-test split. It's best practice to do a stratified train-test split, to ensure that we keep the same class balance in both our train and test sets with regard to those unbalanced classes. We can then go ahead and either undersample or oversample our training set. Doing this second ensures that we don't end up with the same values in both our train and test sets. Think about it: if we were to oversample first and then do our split, we'd have duplicates from our minority class, and those duplicated samples could land in both our train and test sets. Then finally, once you have your training set and you've over- or undersampled it, you can fit your model using that balanced dataset in order to come up with a prediction.

Now, with unbalanced classes, the data often isn't easily separable, and we have to choose to make sacrifices to one class or the other. Consider the following data about an email campaign. Here we see slightly skewed data, with one yes for every two no's on the outcome variable, customer response. If we label the customer response of yes, they responded, as our minority class, which is the case here, we must take into account that by either upsampling the yes's or downsampling the no's, we are in danger of adding more weight to the features relating to the customer response being yes, as a proportion of the full dataset. Thus we become more likely to wrongly label a few majority class points, the majority class here being no, they did not respond, as yes's, as if they did respond. Our ability to catch all the minority class points will go up, but as a proportion of our actual predictions, we are more likely to have a given value predicted incorrectly. Thinking back to our error metrics, that would mean that our recall goes up, since we're able to capture more of our minority class, but our precision goes down, because we have more predictions that are incorrect.

Now, with this trade-off in mind, we want to talk about the pros and cons of both upsampling and downsampling, starting here with downsampling. Downsampling will add tremendous importance to our minority class, and will typically shoot up our recall but bring down our precision. So values like 0.8 recall and 0.15 precision are not uncommon when downsampling the majority class. Think here about the specific trade-off when we're downsampling: we're definitely going to be increasing the ability of our model to correctly predict that minority class, but that's at the cost of losing a lot of valuable data that could help us predict the majority class, since we're just eliminating rows from that majority class.

Now, upsampling mitigates some of this excessive weight on the minority class. Recall is still typically higher than precision, but that gap will be a little smaller. Values like 0.7 recall and 0.4 precision aren't uncommon. You see we're bringing down recall slightly, but bringing up our precision. When we're choosing the minority class as our positive class, as here, these are often considered good values for an unbalanced dataset. The way to think about it is that we are, again, increasing the power to predict our minority class, but this time without diminishing the data available to find the patterns of that majority class.
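Before moving on, here's a minimal sketch of the split-then-resample ordering described at the start of this section. It isn't taken from the course materials; the toy dataset from make_classification, the use of LogisticRegression, and the upsampling via sklearn.utils.resample are all just assumptions for illustration.

```python
# A minimal sketch (assumed setup, not the course's exact workflow) of the ordering:
# stratified split first, then oversample only the training set, then fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data, roughly 10% positives (purely illustrative).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Step 1: stratified train/test split keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Step 2: oversample the minority class in the *training* set only,
# so duplicated rows can never leak into the test set.
minority_mask = y_train == 1
X_min, y_min = X_train[minority_mask], y_train[minority_mask]
X_maj, y_maj = X_train[~minority_mask], y_train[~minority_mask]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)
X_train_bal = np.vstack([X_maj, X_min_up])
y_train_bal = np.concatenate([y_maj, y_min_up])

# Step 3: fit on the balanced training data, evaluate on the untouched test set.
model = LogisticRegression(max_iter=1000).fit(X_train_bal, y_train_bal)
print(classification_report(y_test, model.predict(X_test)))
```

Because the duplicated minority rows exist only in the training data, the test-set precision and recall remain honest estimates of real performance.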
The downside of upsampling, though, is that we are then fitting to duplicates of the same data in the minority class, and thus giving more weight to, and overfitting to, those repeated rows.

So how do we determine whether to upsample, downsample, or even resample? The cross-validation methods that we have used with our models are not limited to finding the right choices of model hyperparameters. They can also be applied, as I hope you can tell I'm hinting towards, to finding the best method for balancing our classes. For example, we can see the ROC curve for our sample of size 10, of size 13, or even of size 18, the size here being the number of rows for both the minority and majority class. This example does show the higher value being best for each one of our different sampling techniques, but obviously that is not necessarily going to be the case; we will have to run cross-validation in order to see which one performs best (see the sketch at the end of this section).

First things first: we know that every classifier, depending on the model or hyperparameters used, will ultimately produce a different model. Furthermore, every dataset produced by our various sampling methods will ultimately produce a different model as well. With this in mind, we know we can choose the best model using any criterion, whether that's accuracy (I wouldn't suggest accuracy) or another one of our error metrics, including AUC, the area under the curve. Remember, each model and each sampling method used will produce a different ROC curve, as we just saw on the last slide. Once we find a good model, then for the actual business application you can walk along its ROC curve and pick any threshold value, depending on the importance of your precision or recall values given your business objectives.

Now, just to recap, we discussed the issues that arise when class outcomes are unbalanced. Again, the majority of our models will be trying to reduce the overall error, optimizing just for getting the class correct, no matter which class it is, so models can end up optimized with a heavy skew toward that majority class. We then discussed some approaches to dealing with unbalanced data, along with the pros and cons of upsampling, downsampling, and resampling to get a balanced dataset. In the next video, we'll go into the more technical details of how we should actually go about upsampling, downsampling, and resampling as well. All right, I'll see you there.
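As a supplement to the cross-validation discussion above, here is a rough sketch of how you might compare sampling strategies by AUC. It assumes the imbalanced-learn package (not referenced in this video) is installed, reuses the hypothetical X_train and y_train from the earlier sketch, and again uses LogisticRegression purely as a stand-in classifier.

```python
# A rough sketch: compare sampling strategies with cross-validated AUC.
# Assumptions: imbalanced-learn is installed; X_train, y_train come from the
# earlier split-then-resample example.
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

candidates = {
    "no resampling": Pipeline([("clf", LogisticRegression(max_iter=1000))]),
    "oversample":    Pipeline([("sample", RandomOverSampler(random_state=0)),
                               ("clf", LogisticRegression(max_iter=1000))]),
    "undersample":   Pipeline([("sample", RandomUnderSampler(random_state=0)),
                               ("clf", LogisticRegression(max_iter=1000))]),
}

# The imblearn Pipeline applies the sampler only when fitting on the training
# folds, so each validation fold stays untouched and the AUC scores are
# comparable and leakage-free.
for name, pipe in candidates.items():
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

Whichever strategy scores best, you would then refit it on the full training set and walk along its ROC curve to choose a threshold that matches your precision and recall priorities.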