Tree algorithms such as ID3 and C4.5 were invented in the 80s and 90s. They are better than linear regression at certain types of problems, and they are very easy for humans to interpret. Finding the optimal split when building a tree is an NP-hard problem, so greedy algorithms are used to construct trees that are hopefully close to optimal. Decision trees create a piecewise linear decision surface, which is essentially what a layer of ReLUs gives you. But in DNNs, or deep neural networks, the ReLU layers combine to carve the space with many hyperplanes, producing a decision surface that can be much more powerful. But I am skipping ahead to why DNNs can be better than decision trees. Let's first talk about decision trees.

Decision trees are one of the most intuitive machine learning algorithms, and they can be used for both classification and regression. Imagine you have a data set and you want to determine how the data splits into different buckets. The first thing you should do is brainstorm some interesting questions to query the data set with. Let's walk through an example: the well-known problem of predicting who lived or died in the Titanic catastrophe. There were people aboard from all walks of life, with different backgrounds, different situations, et cetera. We want to see whether any of those possible features can partition the data in such a way that we can predict with high accuracy who lived. A first guess at a feature could be the sex of the passenger, so I could ask the question: is the sex male? I split the data, with males going into one bucket and everyone else going into another. 64 percent of the data went into the male bucket, leaving 36 percent in the other one. Let's continue along the male partition for now. Another question I could ask is which passenger class each passenger was in. With our partitioning, 14 percent of all passengers were male and in the lowest class, whereas 50 percent of all passengers were male and in the two higher classes. The same type of partitioning could also continue down the female branch of the tree.

Taking a step back, it is one thing for the tree-building algorithm to split sex into two branches, because there are only two possible values. But how did it decide to split passenger class with one class branching to the left and two classes branching to the right? In the simple classification and regression tree, or CART, algorithm, for instance, the algorithm tries to choose the feature and threshold pair that produces the purest subsets when split. For classification trees, a common metric is the Gini impurity, but entropy is also used. Once it has found a good split, it searches for another feature and threshold pair and splits that subset as well. This process continues recursively until either the maximum depth set for the tree has been reached or there are no more splits that reduce the impurity. For regression trees, mean squared error is a common splitting metric. Does the way it chooses to split the data into two subsets sound familiar? Each split is essentially a binary linear classifier: it finds a hyperplane that slices along one feature's dimension at some value, the chosen threshold, so as to minimize the number of members of a class that fall on the other class's side of the hyperplane. Recursively creating these hyperplanes in a tree is analogous to layers of linear classifier nodes in a neural network. Very interesting.
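To make that split-selection step concrete, here is a minimal sketch, assuming a NumPy feature matrix and integer class labels, of the exhaustive search for the feature and threshold pair that minimizes the weighted Gini impurity of the two resulting subsets. The function names and the toy data are my own illustration, not code from this lesson.

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Exhaustively try every (feature, threshold) pair and keep the one
    # whose two subsets have the lowest weighted Gini impurity.
    n_samples, n_features = X.shape
    best_feature, best_threshold, best_impurity = None, None, np.inf
    for feature in range(n_features):
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if weighted < best_impurity:
                best_feature, best_threshold, best_impurity = feature, threshold, weighted
    return best_feature, best_threshold, best_impurity

# Toy example: columns are [is_male, passenger_class], label 1 = survived.
X = np.array([[1, 3], [1, 1], [0, 3], [0, 1], [1, 2], [0, 2]])
y = np.array([0, 1, 1, 1, 0, 1])
print(best_split(X, y))  # picks the is_male column (feature 0) on this toy data
```

A full tree builder would apply best_split recursively to each resulting subset until a stopping condition, such as a maximum depth or no impurity-reducing split, is met.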
Now that we know how decision trees are built, let's continue building this tree a bit more. Perhaps there is an age threshold that will help split the data well for this classification problem. I could ask: is the age greater than 17 and a half years? Looking at the lowest-class branch under the male parent branch, now just 13 percent of the passengers were 18 or older, while only one percent were younger. Looking at the classes associated with each node, only this one on the male branch so far is classified as survived. We could extend the depth and/or choose different features to keep expanding the tree until every node contains only passengers who survived or only passengers who died. However, there is a problem with this: essentially, I am just memorizing my data and fitting the tree perfectly to it. In practice, we want the model to generalize to new data, and a model that has memorized the training set is probably not going to perform very well outside of it. There are methods to regularize the tree, such as setting a minimum number of samples per leaf node, a maximum number of leaf nodes, or a maximum number of features. You can also build the full tree and then prune unnecessary nodes. To really get the most out of trees, it is usually best to combine them into forests, which we'll talk about very soon.

In a decision classification tree, what does each decision or node consist of? The correct answer is: a linear classifier of one feature. Remember, at each node in the tree, the algorithm chooses a feature and threshold pair to split the data into two subsets, and continues this recursively. Many features eventually get split on, assuming you have set a maximum depth greater than one, but each individual node splits on only one feature. Therefore, "linear classifier of all features" is incorrect, because each node splits on only one feature at a time. "Mean squared error minimizer" and "Euclidean distance minimizer" are pretty much the same thing, and they are used in regression, not classification.
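Returning to those regularization settings, here is a minimal sketch of how they look in scikit-learn's DecisionTreeClassifier. The synthetic stand-in data and the specific hyperparameter values are assumptions chosen for illustration, not values from this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in for Titanic-style data: columns are [is_male, passenger_class, age].
# In real use you would load and preprocess the actual passenger records.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, 1000),   # is_male: 0 or 1
    rng.integers(1, 4, 1000),   # passenger class: 1, 2, or 3
    rng.uniform(1, 80, 1000),   # age in years
])
y = rng.integers(0, 2, 1000)    # placeholder labels: 1 = survived, 0 = died

tree = DecisionTreeClassifier(
    max_depth=3,          # cap the depth so the tree cannot memorize the training set
    min_samples_leaf=20,  # every leaf must cover at least 20 passengers
    max_leaf_nodes=8,     # hard limit on the total number of leaves
    max_features=None,    # consider all features at each split; restrict to regularize further
    criterion="gini",     # impurity measure used to score candidate splits
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```

Tightening min_samples_leaf or max_depth trades training accuracy for better generalization, and cost-complexity pruning (the ccp_alpha parameter) is one way to take the build-then-prune approach mentioned above.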