Hello, and welcome to the lesson on Decision Trees.
Decision trees are a powerful algorithm that can be
used for both classification and regression tasks.
One reason decision trees are popular is that they are simple to build,
and another is they're easy to visualize,
which means you can explain why the model made the predictions that it did.
Some machine learning algorithms are harder to do that with.
One example of an easy model to explain is linear regression.
You have the formula, so when you put in a value,
you know why you get the output you do.
Other algorithms might be harder to do that with such as k-nearest neighbors.
You actually have to understand the distribution of
data to be able to understand why it's making a prediction.
That's a more complex problem.
Decision trees are easier because you can literally see
why each decision was made as you traverse the tree.
By the end of this lesson, you should be able to understand why
decision trees are powerful and how to use them effectively,
how to create and use them in Python using the scikit-learn
library, and know how to apply them to both classification and regression tasks.
There are two activities: the first is a website that
gives a visual introduction to machine learning,
and the second is our course notebook.
Let me first quickly demonstrate the website.
The visual introduction to machine learning is a very powerful website.
I really like this because it shows the data
visually while also providing some text to give you background.
So you can see different ways to interact with the data,
how to split the data,
how to visualize it,
how to apply machine learning,
in particular, decision tree,
which involves chopping the data into subcategories or chunks.
This is useful in order to be able to build the tree,
and that's what this all talks about.
Before going any farther though,
I want to talk about our course notebook.
In this notebook, we're going to introduce the decision tree algorithm
and how it can be used for both classification and regression.
First, we're going to talk about the fundamental concepts that are
used to create a decision tree for classification purposes.
These include entropy and information gain.
Then we're going to show how a decision tree can be used to classify the Iris dataset.
We'll look at the decision surface,
and we'll see how that can be used to understand the impact of a hyperparameter.
Then we'll introduce a new concept called feature importance, which decision trees
provide to help you understand which features were
most important for building the tree.
Then we're going to look at how to visualize a tree,
and then we're going to introduce a new dataset,
the adult dataset, which is well suited to a classification task.
This is a bigger dataset than the Iris data, so it provides
a nicer, more real-world example of the application of decision trees.
Then we're going to introduce decision trees for
regression and also another new dataset called
the automobile fuel prediction (auto MPG) dataset, and show
how decision trees can be used to build a regression model for these data.
These last two datasets are going to be used in many other algorithms,
so you'll see the performance,
the relative performance of different algorithms on these data and be able to
understand better how these different algorithms work and how they compare.
First, we start with our standard setup code before jumping into the formalism.
When you're building a decision tree,
one of the first concepts to keep in mind is you have to split the data.
That's the fundamental concept in the decision tree.
You split the data into two categories,
and the tree then encapsulates that with nodes in the tree,
and the nodes keep being split until you reach a terminal node,
which is known as a leaf node.
The following figure shows this.
This is a figure I made.
We start with our big dataset represented by this black square.
We split it into two,
A and B, and you see these datasets here.
And then we split B into two more,
C and D. This is a tree.
We have our root node. We have two nodes here at this level.
This one is a leaf node. It's not split anymore.
This node is a non-leaf node.
It's split into two leaf nodes.
The data are color-coded,
such that these here in C go with C,
and the blue ones here in D go with D et cetera.
This is visually what a decision tree is,
but we have to figure out how do we know where to
split and how do we know when to stop splitting.
The key question of where to split
can be answered with several different concepts or techniques.
One is variance reduction.
You can choose to maximally reduce the variance along a given feature.
That's often used for regression problems.
The Gini impurity is another technique,
and it's designed to minimize misclassification particularly in a multiclass setting.
A multiclass would be where you're predicting Class 0,
Class 1, Class 2 et cetera.
The alternative is a binary class,
where you're saying true or false.
The other technique is information gain,
where you're trying to create the purest child nodes.
The idea here is you want to keep data that's near each other and similar together.
To do this, we have to introduce the concepts of entropy and information gain.
Entropy is simply a statistical measure of the information content in a dataset.
And this section here talks about what that means.
Formally, we take each probability,
multiply it by the logarithm of that probability, sum up, and negate the result. That gives us the entropy.
I find it easier sometimes to look at this in code,
so we demonstrate this in code.
So we can compute the entropy for a binary probability. We have the probability,
where one means, say, success and zero means failure, and you see the entropy.
The entropy is maximized at a probability of 0.5 in this particular example.
If we had a different case, it would be something different.
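To make that concrete, here is a minimal sketch of how you might compute the entropy of a binary outcome with NumPy. It is not the notebook's exact code, just an illustration of the formula described above.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a binary outcome with success probability p."""
    # Formally: H = -sum(p_i * log2(p_i)), where terms with p_i == 0 contribute 0.
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]              # avoid log2(0)
    return -np.sum(probs * np.log2(probs))

for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    print(f"p = {p:4.2f}  entropy = {binary_entropy(p):.3f}")
# The entropy peaks at 1 bit when p = 0.5 and falls toward 0 as p approaches 0 or 1.
```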
We can also compute entropy from data.
Here, we're going to load the tips data.
Pull out two categorical features: day and time.
Here, I just show those two categorical features.
And then we can actually look at these data in terms of a pivot table and ask,
"What are the total values?"
Here are the individual values for each one.
And then we can compute the entropy for
these in terms of the relative frequencies, and so we can do that.
And at the end, we can get what our node counts would
be for a particular split that we would make.
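As a rough sketch of that workflow (the notebook's actual pivot table and variable names may differ), you could compute entropies from the tips data like this:

```python
import numpy as np
import pandas as pd
import seaborn as sns

tips = sns.load_dataset('tips')           # the tips data
cats = tips[['day', 'time']]              # the two categorical features

# Counts for each day/time combination, similar to a pivot table of totals
print(pd.crosstab(cats['day'], cats['time']))

def entropy_from_counts(counts):
    """Entropy (bits) computed from counts via their relative frequencies."""
    probs = np.asarray(counts, dtype=float)
    probs = probs / probs.sum()
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

print('Entropy of day :', entropy_from_counts(tips['day'].value_counts()))
print('Entropy of time:', entropy_from_counts(tips['time'].value_counts()))
```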
Now, we don't split on entropy.
Instead, we split on information gain,
and information gain is a way of using entropy
to compute how much we gain by making a particular split,
and that's what this entire section here talks about.
We can compute the counts.
We can turn those into probabilities.
We can turn the probabilities into entropies.
And then we can say if we make a split at a certain value,
what's the information gain?
And as we change that split value,
we'll change the information gain,
and the idea is to maximize that.
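Here is one way that idea can be sketched in code; the helper names and the example split values are illustrative assumptions, not the notebook's exact implementation.

```python
import numpy as np

def entropy(labels):
    """Entropy (bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(feature, labels, split_value):
    """Parent entropy minus the weighted entropy of the two child nodes."""
    left, right = labels[feature <= split_value], labels[feature > split_value]
    n = len(labels)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted_child

# Hypothetical example: split the tips data on total_bill to predict time (Lunch/Dinner).
# gains = {s: information_gain(tips['total_bill'].values, tips['time'].values, s)
#          for s in range(5, 45, 5)}
# The best split is the candidate value with the maximum information gain.
```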
Now, what about Decision Tree Classification?
It's very similar to the
other types of machine learning model building that we've done in scikit-learn:
we have some hyperparameters.
Decision trees have a lot of hyperparameters, and some of them are listed here.
We're going to want to vary these to get different results.
First, we're going to apply it to the Iris dataset because we've seen the Iris dataset before,
so you should already have a pretty good feel for how it splits and how well you can do.
We use our helper functions,
we make our plot, we see our test and train data.
Then we create our classifier,
we use a random state for reproducibility,
we fit it and we score it.
And you can see, it's a pretty good score, about the same as we got with k-nearest neighbors.
We can then do a classification report,
and we can make a confusion matrix just like we've done before.
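For reference, a minimal sketch of that classification workflow looks like this; the notebook uses its own helper functions and split, and the random_state and test size here are arbitrary choices that happen to leave about 90 training points.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23)    # random_state for reproducibility

dtc = DecisionTreeClassifier(random_state=23)
dtc.fit(X_train, y_train)
print('Score:', dtc.score(X_test, y_test))

y_pred = dtc.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```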
One thing I want to introduce though is,
we can do feature importance here.
We can say, "What are the most important features?"
The way you do that is,
you just access the feature_importances_ attribute of our decision tree classifier.
So here, we zip these two together.
That simply combines them into a new data structure.
We then iterate through it pulling out the name
and the feature importance, and print them out.
That's all we're doing here. The result shows us that
the Petal Width and Petal Length have most of the feature importance.
In fact, together, those are over 96 percent of all feature importance.
That means we could just use those two features
and have captured most of the information in our data.
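A short sketch of that zip-and-print pattern, building on the classifier from the sketch above; the feature names are the standard Iris features in the order load_iris provides them.

```python
feature_names = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']

for name, importance in zip(feature_names, dtc.feature_importances_):
    print(f'{name:15s} {importance:.3f}')
# Petal Width and Petal Length should carry most of the importance.
```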
We can look at the decision surface of the tree.
This should be similar code to what you saw before with the k-nearest neighbors.
Here we split up our data,
you can see the decision surface of the tree.
There are nice, axis-aligned linear boundaries.
We can also visualize the tree,
and by doing this,
we can simply see the breakdown at each split.
So here's our root node.
This is actually using the Gini impurity, not the information gain.
We could have changed the tree to be built with the information gain instead.
We have 90 data values,
you can see the class counts,
and it tells you we're going to split on Petal Width.
So if you're less than the Petal Width threshold, you go over here,
and if you're greater than that value, you come over here.
This now is a leaf node.
We now split on Petal Width again.
Remember, most of the information was in Petal Width,
so we're going to split on that a lot.
Again, what's the split value?
How many data points do we have?
If you're true, you come over here.
If it were false, you come over here then we split on Petal Length.
Again, we split on Petal Length over here,
we end up with four new child nodes and then it stops.
And this shows you the values that you're getting out.
Remember, this is a three-class problem:
Setosa, Versicolor, and Virginica.
That's what these values are showing you:
that there are 0 in the first class,
31 in the second class,
and 1 in the third class in this particular leaf node.
So it shows you the purity, if you will.
You can see that most of these nodes are fairly pure,
although this one is not as pure.
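One way to produce a visualization like the one described above is scikit-learn's plot_tree function. This sketch assumes the dtc classifier and feature_names from the earlier sketches; the notebook itself may use a different plotting approach.

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(12, 8))
plot_tree(dtc,
          feature_names=feature_names,
          class_names=['Setosa', 'Versicolor', 'Virginica'],
          filled=True,          # shade nodes by class purity
          ax=ax)
plt.show()

# To build the splits from information gain instead of the default Gini impurity:
# DecisionTreeClassifier(criterion='entropy', random_state=23)
```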
So we can then look at the variation of our hyperparameters.
Again, we're going to vary one,
in this case, not the number of neighbors,
but the maximum depth of our tree.
It's either three, six or nine.
Three is less than what we had at first.
You can see there's just some basic splits through the data.
Six, we start to get a little more structure,
and then nine we have even more structure, right?
We've even cut out that one little point there.
But you see how the decision tree is cutting up the feature space,
and showing us the results here in the decision surface.
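The notebook draws the full decision surface with helper plotting functions; as a simpler stand-in, this sketch just varies the max_depth hyperparameter and reports the test score at each depth, using the Iris split from the earlier sketch.

```python
for depth in [3, 6, 9]:
    dtc_depth = DecisionTreeClassifier(max_depth=depth, random_state=23)
    dtc_depth.fit(X_train, y_train)
    print(f'max_depth={depth}: test score = {dtc_depth.score(X_test, y_test):.3f}')
```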
We can also now move to a new data set,
the adult income prediction task.
The idea here is, we have data from a census and we want to predict,
does somebody make above or below $50,000?
We have two code cells where we define our local data file name,
and then if it exists, we use it.
If not, we grab it from the website.
We then do some basic processing of this data set,
then we create our label,
and then we show some values here to make sure the label was created correctly.
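A sketch of that load-or-download pattern might look like the following; the filename, URL, and column names here are assumptions for illustration, not necessarily what the notebook uses.

```python
import os
import pandas as pd

local_file = 'adult.data'   # hypothetical local cache of the census data
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
        'income']

if os.path.exists(local_file):
    adult = pd.read_csv(local_file, names=cols, skipinitialspace=True)
else:
    adult = pd.read_csv(url, names=cols, skipinitialspace=True)
    adult.to_csv(local_file, index=False, header=False)

# Create the binary label: 1 if income is above $50,000, 0 otherwise
adult['label'] = (adult['income'] == '>50K').astype(int)
print(adult['label'].value_counts())
```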
And then, we actually compute something called the zero model performance.
This is an important idea that you should always keep in mind.
In this case, we have a binary classification task,
and most people are in the low salary category.
So what happens if, as our zero model,
I just say everybody is in the low salary class?
Well then, I would be right 75 percent of the time.
Now, that's a bad idea for a model
because you're never going to predict that somebody has a high salary,
and you want to make sure that you're actually measuring both categories.
But this gives you a sort of baseline:
if possible, you should be doing that well or better.
If you're not, you need to have a good reason why.
It may be that, you know what,
you do a good job of predicting the high salaries and
a bad job at the low salaries, or vice versa.
But it's always important to understand what that zero model is.
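In code, the zero-model baseline is just the majority-class fraction; here is a sketch using the label built in the previous sketch.

```python
# Zero model: always predict the majority (low salary) class
majority_fraction = adult['label'].value_counts(normalize=True).max()
print(f'Zero model accuracy: {majority_fraction:.2%}')
# Any real model should do at least this well, or we need a good reason why not.
```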
We can then go through and convert our categorical features into
appropriate numerical features.
And then we can actually combine those together into a features data set,
and then we can actually get to the classification.
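One common way to do that conversion (an assumption here; the notebook may encode the categories differently) is one-hot encoding with pandas get_dummies, then concatenating with the numeric columns:

```python
import pandas as pd

numeric = adult.select_dtypes(include='number').drop(columns=['label'])
categorical = pd.get_dummies(adult.select_dtypes(include='object').drop(columns=['income']))
features = pd.concat([numeric, categorical], axis=1)
print(features.shape)    # many columns, mostly from the one-hot encoded categories
```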
First step, we take our features and label,
turn them into training and test data,
and then we get some metrics.
We've created our decision tree classifier.
The only hyperparameter we specify is the random state for reproducibility.
We fit the model, and then we score the model.
And this is actually doing really well.
Look at this, it's 82 percent on recall.
But it's not even across the classes;
the recall for this class is lower, and we're going to have to worry about that.
What this is telling us is that it does a good job of predicting the majority class,
and a reasonable job in predicting the minority class.
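Putting those steps together as a sketch, continuing from the adult-data sketches above (random_state and test size are arbitrary), the per-class recall in the classification report is where that imbalance shows up:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    features, adult['label'], test_size=0.4, random_state=23)

adult_dtc = DecisionTreeClassifier(random_state=23)
adult_dtc.fit(X_train, y_train)
print('Score:', adult_dtc.score(X_test, y_test))
print(classification_report(y_test, adult_dtc.predict(X_test)))
```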
Now, the next thing to do is, how do we do regression?
Well, the simple answer is almost the exact same way we did classification.
What I'm going to do here is load in a new data set,
the auto MPG data.
We're going to, again,
define the local filename,
and then download it if the file doesn't already exist.
We extract the data, and then we can actually build a regression model.
In this case, there's one thing that I want to highlight here.
We're using the formula interface, which is
fundamentally provided by a library called Patsy.
This is actually what statsmodels uses.
But here I wanted to use it to prepare the data for scikit-learn.
So, this is what we do.
We say, let's relate the miles per gallon
feature to the cylinders feature which is categorical.
That's why we prefix it with C(),
the displacement feature, the weight feature,
the acceleration feature, the year categorical feature,
and the origin categorical feature.
So here's our input features,
you can see there's a lot of them.
Mostly because we have all of these categorical features.
You can try removing those and seeing how the performance changes.
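A sketch of that formula interface with Patsy follows; it uses seaborn's copy of the auto MPG data as a stand-in, and the column names in the formula are assumptions that may differ from the notebook's DataFrame.

```python
import patsy
import seaborn as sns

# Stand-in for the notebook's auto MPG DataFrame
auto = sns.load_dataset('mpg').dropna()

formula = ('mpg ~ C(cylinders) + displacement + weight + acceleration '
           '+ C(model_year) + C(origin)')

# dmatrices returns the label (mpg) and the expanded feature matrix, with each
# C(...) categorical term one-hot encoded into several columns.
y, X = patsy.dmatrices(formula, data=auto, return_type='dataframe')
print(X.columns.tolist())    # lots of columns, mostly from the categorical terms
```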
And then we import our regression algorithm,
we split our data into training and testing.
We create our regressor,
we fit our regressor, and we score our regressor.
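A sketch of those steps, continuing from the Patsy matrices above (random_state and test size arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(y), test_size=0.4, random_state=23)

dtr = DecisionTreeRegressor(random_state=23)
dtr.fit(X_train, y_train)
print('R^2:', dtr.score(X_test, y_test))   # score() returns R^2 here, not accuracy
```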
Now, when you look at this, it's 53.
You think, "Well, that's really bad."
But remember, this is a regression task,
and so the score is not the classification accuracy.
Instead, it's a measure of how well we're predicting the continuous feature.
So this particular cell computes several other metrics,
including the Pearson correlation coefficient, and shows those.
One good thing I like to look at is the mean absolute error,
which here is only about three.
If you're thinking about this in terms of predicting miles per gallon,
that means you're typically within three or
four miles per gallon of the true value, which isn't too bad.
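Those extra metrics can be computed with scipy and scikit-learn; a sketch using the regressor from the previous sketch:

```python
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error

y_pred = dtr.predict(X_test)
print('Mean absolute error:', mean_absolute_error(y_test, y_pred))   # in MPG units
print('Pearson correlation:', pearsonr(y_test, y_pred)[0])
```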
So with this notebook, I've introduced the idea of
decision trees and how to use them for both classification and regression tasks.
A lot's been presented, but hopefully you have
a better feeling of how to build a decision tree,
how to use a decision tree and why it's an important algorithm to understand.
If you have any questions, let us know. And good luck.