Now let's discuss the loss functions used when we're doing boosting. There are a few options that are used in practice. At each stage, or for each of our weak learners, we can determine a margin for each point in our data set. The margin is positive, to the right on the plot, for correctly classified points, and negative, to the left on the plot, for incorrectly classified points. The margin can be thought of as a signed distance from our decision boundary. As in the case of logistic versus linear regression, we can choose to penalize misclassified faraway points heavily or not. The loss function gives us that penalization and determines what type of boosting algorithm we're actually going to be using. So, the most frequently discussed loss function is the zero-one loss. This function returns one for incorrectly classified points and zero for correctly classified ones. But this is a theoretical loss function; it is not actually what is used in practice, due to the fact that it's not differentiable. A function such as this one, that is neither smooth nor convex, is difficult to optimize. Instead there are various loss functions for the different boosting algorithms that we use in practice. So, let's first see how the loss function looks for AdaBoost. AdaBoost, or adaptive boosting, was one of the first boosting algorithms used in practice. AdaBoost uses an exponential loss function. As can be seen here, very negative margins can strongly affect the loss: if a point is far from the decision boundary on the incorrect side, it contributes heavily to the overall error. This makes AdaBoost very sensitive to outliers. Now we'll also cover a second type of loss function associated with a boosting algorithm.
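To make the comparison concrete, here is a minimal sketch (using illustrative margin values, not from the lecture) of how the zero-one loss and AdaBoost's exponential loss behave as the margin varies:

```python
import numpy as np

# Margins: positive = correctly classified, negative = misclassified.
margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# Zero-one loss: 1 for misclassified points (negative margin), 0 otherwise.
# Not differentiable, so it is not optimized directly in practice.
zero_one = (margins < 0).astype(float)

# Exponential loss (AdaBoost): exp(-margin) grows rapidly for very
# negative margins, which is why AdaBoost is sensitive to outliers.
exp_loss = np.exp(-margins)

for m, z, e in zip(margins, zero_one, exp_loss):
    print(f"margin={m:5.1f}  zero-one={z:.0f}  exponential={e:.3f}")
```

Notice that a margin of -2 costs over seven times as much under the exponential loss as a margin of 0, while the zero-one loss treats every misclassified point identically.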
Scikit-Learn's gradient boosting classifier, another popular boosting algorithm used often in practice, uses a log likelihood loss function. The reduced value of the log likelihood loss for large negative margins, that is, for badly misclassified points, makes this version of boosting, gradient boosting, more robust to outliers than AdaBoost. Now let's compare the two ensemble methods that we've discussed so far: bagging and boosting. For boosting, which is what we just went over, we can use the entire data set to train each of our classifiers. For bagging, recall that we drew bootstrapped samples and trained each classifier on only one bootstrapped sample. Now, starting with bagging, the base learners, each one of those trees, are independent from one another, and usually they are not going to be stumps but rather full trees. Whereas for boosting, the base learners are going to be weak learners, and they are created successively, where each learner builds on top of the previous steps. Another difference: bagging only takes into account the data of the bootstrapped sample. Boosting, on the other hand, takes into account not only the current data, but also the residuals from previous models when building each successive learner. For bagging, all learners count equally in coming up with that final classification, so it's an equal vote. Whereas for boosting, at each iteration we try to correct the previous mistakes by weighting them more heavily, so we end up with different weights for each one of our weak learners. And then finally, with bagging, we don't have to worry about excess trees causing overfitting, whereas with boosting we do have to be aware of overfitting. So why is this the case? As we increase the number of trees, we keep trying to improve on the mistakes made by prior trees.
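The bagging-versus-boosting contrast can be sketched in scikit-learn as follows; the synthetic data from make_classification is just a stand-in for illustration, and the specific settings are assumptions, not part of the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative binary classification data.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: independent full trees (the default base learner is an
# unrestricted decision tree), each fit on its own bootstrapped
# sample, then combined by an equal vote.
bag = BaggingClassifier(n_estimators=100, random_state=0)
bag.fit(X_train, y_train)

# Boosting: weak learners (here, stumps via max_depth=1) built
# successively, each one correcting the errors of the previous ones,
# with the learning rate shrinking each learner's contribution.
boost = GradientBoostingClassifier(
    n_estimators=100, max_depth=1, learning_rate=0.1, random_state=0
)
boost.fit(X_train, y_train)

print("bagging test accuracy: ", bag.score(X_test, y_test))
print("boosting test accuracy:", boost.score(X_test, y_test))
```

The key structural difference shows up in the constructors: bagging grows deep, independent trees, while boosting chains together shallow, weighted ones.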
So at a certain point, we do risk overfitting, because we keep trying to improve and improve on the errors from past trees. With that in mind, we want to use cross validation to home in on the correct number of trees for our boosting models. Our learning rate also needs to be optimized in order to properly regularize the model. As I mentioned earlier, if you have a lower learning rate, you should probably use more trees, since the learning rate represents how much we correct the model at each step. So these two hyperparameters, the learning rate and the number of trees, are going to be related. Ideally, to bring down the amount of correction at each step, we should set that learning rate, also called the shrinkage since it shrinks the impact of each successive learner, to a value less than one. Another argument available for gradient boosting is subsample, and this is another parameter we can use to add randomness and reduce overfitting. By using a subsample, our base learners don't train on the entire data set. This allows for faster optimization, as well as a bit of regularization, since each learner will not perfectly fit the entire data set. Here we see that using the subsample alone does not seem to help our overall test error much, but combining it with a low learning rate, so that the model also does not overcorrect, seems to do well. Another parameter we can use is max_features, which determines how many features to consider when trying to find each split, so again it reduces the possible complexity of our model. And we see this does decently well in regards to improving our test set error. Now I do want to note that how well this hyperparameter tuning will do in practice will depend on the data set, and as usual, you probably want to use cross validation when deciding between each one of these different hyperparameter values. So that closes out this video.
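Putting those ideas together, here is a minimal sketch of tuning the hyperparameters just discussed with cross validation; the grid values and the synthetic data set are illustrative assumptions only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative data; in practice this would be your own data set.
X, y = make_classification(n_samples=300, random_state=0)

# The grid covers the hyperparameters from the lecture; the specific
# candidate values here are arbitrary examples.
param_grid = {
    "n_estimators": [50, 100],       # more trees risk overfitting
    "learning_rate": [0.05, 0.1],    # the "shrinkage": keep it below 1
    "subsample": [0.5, 1.0],         # <1 adds randomness/regularization
    "max_features": [None, "sqrt"],  # fewer features per split
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), param_grid, cv=3
)
search.fit(X, y)
print(search.best_params_)
```

Because learning_rate and n_estimators are related, searching over them jointly, rather than one at a time, is what lets cross validation find a compatible pair.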
In our next video, we'll do a short walkthrough of the actual syntax of both gradient boosting and AdaBoost in scikit-learn. All right, I'll see you there.