0:21

The one difference is that each time we split the data in a classification tree, we also bootstrap the variables.

In other words, only a subset of

the variables is considered at each potential split.

This makes for a diverse set of potential trees that can be built.

And so the idea is we grow a large number of trees.

And then we either vote or average those trees

in order to get the prediction for a new outcome.

The pros for this approach are that it's quite accurate.

And along with boosting, it's one of the most widely used and most accurate methods for prediction in competitions like Kaggle.

The cons are that it can be quite slow.

It has to build a large number of trees.

And it can be hard to interpret, in the sense that

you might have a large number of trees that are averaged together.

And those trees are built on bootstrapped samples, with bootstrapped variables at each node, which can be a little bit complicated to understand.

It can also lead to a little bit of overfitting

which can be complicated by the fact that it's very hard

to understand which trees are leading to that overfitting, and so

it's very important to use cross validation when building random forests.

1:27

Here's an example of how this works in practice.

So the idea is that you build a large number

of trees where each tree is based on a bootstrap sample.

So, for example, this tree is built on one random subsample of your data, this one is built on a separate random subsample, and this one on yet another random subsample of the data.

And then at each node we allow a different

subset of the variables to potentially contribute to the splits.

1:50

Then if we get a new observation, say this observation here, we run that observation through tree one, and it ends up at this leaf down here at the bottom of that tree.

And so it gets a particular prediction here.

Then we take that same observation, run it through the next tree, and it goes down to a slightly different leaf and gets a slightly different set of predictions here.

And finally we run it through the third tree, and we get yet another set of predictions.

Then what we do is we basically average those predictions together in order

to get the predictive probabilities of each class across all the different trees.
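
To make the voting and averaging concrete, here's a minimal sketch using the randomForest package and R's built-in iris data (which the example later in this lecture is based on); the object names are mine, not from the lecture. Setting predict.all = TRUE returns every individual tree's vote alongside the aggregated prediction:

```r
library(randomForest)
data(iris)

# Fit a small forest; the object names here are illustrative
set.seed(1234)
rf <- randomForest(Species ~ ., data = iris, ntree = 50)

# predict.all = TRUE returns every individual tree's prediction
# alongside the aggregated (majority-vote) prediction
preds <- predict(rf, newdata = iris[1, ], predict.all = TRUE)
preds$individual  # one vote per tree for this observation
preds$aggregate   # the majority vote across all 50 trees
```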

2:50

Here I'm telling the train function to fit Species as the outcome and to use all of the other variables as potential predictors.

I'm setting prox equals true.

You'll see why I'm doing that in a minute

because it produces a little bit of extra information that I can use later when exploring the model fit.

So here it tells me that I built the model, that it used bootstrap resampling, and that it tried a bunch of different values of the tuning parameter.

And the tuning parameter in particular is mtry, the number of variables randomly sampled as candidates at each split.
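
Here's a sketch of the fitting step being described, assuming the data have been split into training and testing sets with caret's createDataPartition (the names inTrain, training, testing, and modFit are illustrative):

```r
library(caret)
data(iris)

# Split the data into training and testing sets
set.seed(123)
inTrain  <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# Fit a random forest; prox = TRUE is passed through to randomForest
# so the proximity matrix is stored for later exploration
modFit <- train(Species ~ ., data = training, method = "rf", prox = TRUE)
modFit
```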

3:23

I can look at a specific tree in our final model fit using the getTree function.

So I apply getTree here to our final model and say that I want the second tree out.

And this is what the tree looks like.

So each of these rows corresponds to a particular split.

And so you can see what the left daughter of the split is, what the right daughter is, which variable we're splitting on, the value at which that variable is split, and what the prediction is going to be out of that particular split.
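
A sketch of that call, assuming the modFit object from the fitting step above; labelVar = TRUE is an optional extra that prints variable names rather than column indices:

```r
library(randomForest)

# Pull out the second tree from the fitted forest; labelVar = TRUE
# prints variable names instead of column indices
getTree(modFit$finalModel, k = 2, labelVar = TRUE)
```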

3:53

You can use the class center information as well to see where the center of the predictions for each class falls.

So what I've done here is look at two particular variables: the petal length and the petal width.

So I plotted petal width on the X axis and petal length on the Y axis.

I then get the class centers.

So these are going to be the centers for the predicted values.

So I'm going to send in the model fit, and I'm going to give

it this prox variable which we asked for in the previous fitting.

And I'm going to tell it we're looking at the training data set.

And so that gives us the class centers.
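
Here's roughly what that looks like in code, assuming the same modFit object; classCenter comes from the randomForest package, and columns 3 and 4 of the training set are the petal length and width:

```r
library(randomForest)

# Compute class "centers" for Petal.Length and Petal.Width
# (columns 3 and 4), using the stored proximity matrix
irisP <- classCenter(training[, c(3, 4)], training$Species,
                     modFit$finalModel$prox)
irisP <- as.data.frame(irisP)
irisP$Species <- rownames(irisP)
```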

We can then plot those class centers to see where they fall in the data.

So now I've created the centers data set, along with a species variable.

And what I'm going to do is plot petal width versus petal length.

And I'm going to color it by species in the training data.

That's what I did with this qplot command.

Then I'm going to add points on top of that for the petal width and petal length, again colored by species, but now using the irisP data set, which contains the class centers.
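
A sketch of that plotting code, assuming the training and irisP objects from above:

```r
library(ggplot2)

# Training data colored by species, with the class centers
# from irisP overlaid as large x's
p <- qplot(Petal.Width, Petal.Length, col = Species, data = training)
p + geom_point(aes(x = Petal.Width, y = Petal.Length, col = Species),
               size = 5, shape = 4, data = irisP)
```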

So what you can see is each dot here represents an observation.

And the x's show the class centers for each of the different predictions.

So you can see that each species has a predicted center for these two variables that's right in the middle of the cloud of points corresponding to that particular species.

You can then predict new values using the predict function. You pass it our model fit and the testing data set.

And here I'm also setting a variable, predRight, in the testing data set, which records whether we got the prediction right. In other words, whether our prediction is equal to the species in the testing data set.

I can then make a table of our predictions versus the true species to see how well we did.
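
A sketch of those two steps, assuming the modFit and testing objects from above (predRight is just a logical column name I'm adding for illustration):

```r
# Predict on the held-out testing set
pred <- predict(modFit, testing)

# Flag whether each prediction matched the true species
testing$predRight <- pred == testing$Species

# Confusion table of predictions versus the truth
table(pred, testing$Species)
```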

So I can see, for example, that I missed two values here with my random forest model. But overall it was highly accurate in the prediction.

I can then look and see which two values I missed.

And perhaps unsurprisingly, you can see that the two I missed, marked in red here, are the two that lie in between two separate classes.

So remember, there was one class up in this right corner and one class right here in the middle of this cloud, and the two points that lie right on the border were misclassified.
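
That exploratory plot might look something like this, assuming the testing data with the predRight column created above:

```r
# Color the testing data by whether the prediction was right
# to see where the misclassified points fall
qplot(Petal.Width, Petal.Length, colour = predRight, data = testing)
```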

So you can kind of use this to explore and see where your prediction is doing well and where it's doing poorly.

6:19

Random forests are usually one of the top-performing algorithms, along with boosting, in prediction contests.

They're often difficult to interpret because of the multiple trees that we're fitting, but they can be very accurate for a wide range of problems.

You can check out the rfcv function to make sure that cross validation

is being performed, but the train function in caret also handles that for you.
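
As a sketch, rfcv from the randomForest package cross-validates the forest while dropping predictors (the fold count here is just an example):

```r
library(randomForest)

# Cross-validated error of forests fit with progressively
# fewer predictors (5-fold CV); column 5 is the outcome, Species
set.seed(123)
cv <- rfcv(trainx = training[, -5], trainy = training$Species, cv.fold = 5)
cv$error.cv  # CV error rate at each number of variables
```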

For more information you can read about

random forests directly from the inventor here.

The Wikipedia page for random forests is also quite good, and The Elements of Statistical Learning covers it as well.