The goal in this video is to show how we can reduce variance further than we have so far with bagging alone. If our bagging produced n independent trees, each with variance sigma squared, the key word here being independent, then the bagged variance would be sigma squared divided by n. So the larger n is, that is, the more trees we use, the more we can reduce the overall variance, assuming the trees are independent. In reality, though, these trees are not independent: since we are sampling with replacement from the same training set, they are likely to be highly correlated. As we see with this equation, if the correlation is close to one, we end up with essentially no reduction in variance. That should make sense: if you keep using the same or very similar trees, you gain no new information by using the same decision tree over and over. You need to ensure that each of these decision trees is somewhat different from the others.

So what's our solution? We can simply introduce more randomness. We need to make sure those trees are significantly different from one another and thus decorrelated. To achieve this, we restrict the features each tree is allowed to be built from. Each tree will be built not just from a random subset of rows, but from a random subset of columns as well. By default, a classification model limits that subset of features, that subset of columns, to the square root of the total number of features available, while a regression model takes one-third of the total number of features. This forces different decisions in each tree, depending on which features are still available for that subset. The resulting algorithm is called a random forest. So a random forest is essentially the bagging we just learned, bootstrapping and aggregating, with not only the subset of rows being random, but the subset of features, or columns, being random as well.
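For reference, the equation mentioned above is presumably the standard result for averaging n identically distributed trees, each with variance sigma squared and pairwise correlation rho:

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} T_i\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
```

With rho equal to zero (independent trees), this reduces to sigma squared over n; with rho equal to one, it is just sigma squared, so adding trees gives no variance reduction at all, which is exactly the claim in the video.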
So that's our new type of subset: not just a subset of the rows, but a subset of the rows and the columns. Generally speaking, we need a few more trees, but with those extra trees we eventually get better out-of-sample accuracy than with simple bagging, as we see on this graph. So we know we can further reduce error using a random forest. But how many trees do we actually need? Similar to bagging, we can find the threshold where our out-of-bag error tends to plateau; at that point, any additional trees will no longer improve our results. That is how we come up with the number of trees: by testing different values and seeing where the error plateaus.

So how do we run random forests in practice? It is very similar to all the classification methods we learned before. We import the class: from sklearn.ensemble, we import the RandomForestClassifier. We then instantiate it, here setting the number of estimators, that is, the number of trees, equal to 50. We fit on our training set, generate predictions on our test set, and then, as before, we can use cross-validation to see which hyperparameter values give better out-of-sample fits. For the RandomForestClassifier, many of those hyperparameters will be similar to those of decision trees. If you want to do regression rather than classification, all you need to do is use the RandomForestRegressor rather than the RandomForestClassifier and ensure that your y variable is a continuous value rather than a set of classes.

Now, what about the cases where a random forest does not reduce the variance enough? In the cases where even random forests overfit, we can introduce even more randomness: we can also randomly select the actual splits in each of our decision trees. Recall that decision trees typically use a greedy search to find the best split.
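Before we move on, the random forest steps just described can be sketched in code. This is a minimal sketch: the toy dataset from make_classification stands in for the course data, and the specific hyperparameter grid is an illustrative assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-in for the course data (an assumption; any X, y pair works here)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Instantiate with 50 trees; max_features="sqrt" is the classification default
# (the square-root rule from the video), and oob_score=True tracks the
# out-of-bag accuracy used to judge when adding trees stops helping.
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                            oob_score=True, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(rf.oob_score_)  # out-of-bag accuracy estimate

# Cross-validate hyperparameters, many shared with single decision trees
param_grid = {"n_estimators": [25, 50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```

For regression, swap in RandomForestRegressor with a continuous y; note that scikit-learn's regressor default for max_features may differ from the one-third heuristic mentioned above, so check the documentation for your version.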
So if a certain feature is available in a given subset, that feature may always be chosen as the first split at the top of the decision tree. With this method, by contrast, we pick where to split, and which feature to split on, at random. These extremely randomized trees are, fittingly, called extra trees. The hope is that with enough random splits, we still have majority classes in each of the leaf nodes, and the aggregated vote will still be a good classifier, even if the individual components are a bit weaker.

Now we'll walk through the syntax for creating these extra trees, and as usual, it follows the same steps. First, we import the class containing the classification method: from sklearn.ensemble, we import the ExtraTreesClassifier. We instantiate the object, setting the number of estimators, fit it on our training set, and run our predictions on our test set. Then, again, we can tune those hyperparameters using cross-validation, and use the ExtraTreesRegressor for regression. It's all the same steps we saw before, here replacing what was a random forest, or what was bagging, with extra trees, and coming up with a more random classification method than either of those other two options.

Now, just to recap what we learned in this section: we introduced the concept that we may be able to improve our predictive power by combining models, which is what is called an ensemble-based method. We introduced bootstrap aggregation, or bagging, with the bootstrapping step being where we draw random subsets of the original training set to build our classifiers, and the aggregation step being where we combine those classifications using a majority vote.
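The extra trees walkthrough above follows the same pattern; here is a minimal sketch, again using a toy make_classification dataset as a stand-in for the course data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Toy data in place of the course dataset (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Extra trees: split thresholds are drawn at random rather than found greedily
et = ExtraTreesClassifier(n_estimators=50, random_state=0)
et.fit(X_train, y_train)
y_pred = et.predict(X_test)
print(et.score(X_test, y_test))  # out-of-sample accuracy
```

ExtraTreesRegressor follows the same pattern for continuous targets. One detail beyond the video: in scikit-learn, extra trees train each tree on the whole training set by default (bootstrap=False), relying on the random splits alone for diversity.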
Finally, we showed how we can evolve from bagging to more random, less overfit-prone versions of the model, such as the random forest, which uses a random subset of the features, and the extra trees classifier, which even chooses the splits of the decision tree at random. With that, we'll take what we learned and see it all in action in our notebook demo. All right, I'll see you there.