Hi guys, this lesson will be about gradient boosted decision trees. This algorithm is considered to be the best general-purpose algorithm for solving classification and regression problems. Of course, in some particular areas like computer vision or speech recognition, there are algorithms better suited to those specific problems. But in many real-world applications which you will face while working as a data engineer or data scientist, gradient boosted decision trees are the method of choice.

First of all, what is boosting? Boosting is a method for combining the outputs of many weak classifiers to produce a powerful ensemble. There are several variants of boosting algorithms: AdaBoost, BrownBoost, LogitBoost, and finally gradient boosting, which you will learn about during this lesson.

When you work with big training data sets, they have, on the one hand, a large number of training examples, and on the other hand, a large number of features describing the objects. In this situation, it is very natural to assume that with such a great amount of data you would like to train a really complex model, and hopefully this model will be accurate. There are two basic ways in machine learning to build complex models. The first way is to start with a complex model from the very beginning and fit its parameters; this is exactly how a neural network operates. The second way is to build a complex model iteratively, where each step requires training only a simple model. In the context of boosting, these simple models are called weak classifiers, or base classifiers. We will proceed the latter way.

Consider for simplicity a regression problem, where you have a training set consisting of pairs (xi, yi), where xi is a vector of real-valued features and yi are the labels, or targets. Our goal here is to find a function f(x) such that the squared loss on the test set will be as low as possible.
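To make the objective concrete, the squared loss mentioned above can be written in a couple of lines. This is a minimal sketch; the function name `squared_loss` is my own, not from the lesson:

```python
import numpy as np

def squared_loss(y_true, y_pred):
    """Mean squared error between targets y_i and predictions f(x_i)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)
```

Evaluating this quantity on held-out data is what "the error on the test set" means below.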
So you train the model using only the training data, but you expect that the error on the test data set will be low. How do we build such an f(x)? In boosting, your goal is to build the function f(x) iteratively. You suppose that this function f(x) is just a sum of simpler functions, hm(x), and in particular, you assume that each function hm(x) is a decision tree.

Okay, here is the algorithm for training gradient boosted trees for regression. The input is the training set, and M is the maximum number of iterations. In the first step, you calculate the average of all the targets yi; this is your initial approximation to the function f(x). Then you repeat steps 3 to 5 for M iterations. What do you do at step 3? You calculate the residuals. What are they? A residual is the difference between the target yi and the current prediction, f_{m-1}(xi). At each step, you are going to reduce these residuals by training a new decision tree that fits them as closely as possible. Informally speaking, you create an auxiliary training set consisting of pairs (xi, ŷi). It is equivalent to the original training set, but with the original labels yi replaced by the residuals. At step 4, you fit the decision tree hm(x) to this new target of residuals ŷi. And finally, at step 5, you add this decision tree to the composition, but multiplied by a constant, nu, which is called the regularization, or learning rate. This constant is recommended to be small, in particular less than 0.1.

Okay, if you are familiar with optimization theory, you may have noticed that gradient boosting is somewhat similar to gradient descent. If you want to minimize a function using gradient descent, you make a small step in the direction opposite to the gradient. The gradient of a function, by definition, is a vector which points in the direction of fastest increase.
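The iterative procedure above (steps 1 to 5) can be sketched in plain Python with NumPy. This is a teaching sketch under my own assumptions: the weak learner is a depth-1 regression tree (a stump) on a one-dimensional feature, and all function names (`fit_stump`, `fit_gbt`, and so on) are illustrative, not part of any library:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to residuals r on 1-D feature x.

    Returns (threshold, left_value, right_value) minimizing squared error.
    """
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    best = (xs[0] - 1.0, rs.mean(), rs.mean())
    best_err = np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = rs[:i], rs[i:]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_err = err
            best = ((xs[i - 1] + xs[i]) / 2.0, left.mean(), right.mean())
    return best

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_gbt(x, y, M=100, nu=0.1):
    f0 = float(np.mean(y))                 # step 1: average of the targets
    pred = np.full(len(y), f0)
    stumps = []
    for _ in range(M):                     # steps 3-5, repeated M times
        residuals = y - pred               # step 3: y_i - f_{m-1}(x_i)
        stump = fit_stump(x, residuals)    # step 4: fit h_m to the residuals
        pred = pred + nu * stump_predict(stump, x)  # step 5: add nu * h_m
        stumps.append(stump)
    return f0, stumps

def gbt_predict(f0, stumps, x, nu=0.1):
    # f(x) = f_0 + nu * sum_m h_m(x)
    return f0 + nu * sum(stump_predict(s, x) for s in stumps)
```

In practice you would use deeper trees on multi-dimensional features (for example via a library implementation), but the structure of the loop — residuals, fit, shrink, add — is the same.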
Since you want to minimize the function, you must move in the direction opposite to the gradient. To ensure convergence, you must make very small steps, so you multiply each gradient by a small constant, which is called the step size. This is very similar to what you do in gradient boosting, and gradient boosting can be viewed as a minimization in function space.

Okay, here is the summary of this lesson. You have learned what the gradient boosting algorithm is. Boosting is a method for combining the outputs of many weak classifiers or regressors to produce a powerful ensemble. Gradient boosting is a gradient descent minimization of the target function in function space. Gradient boosting with decision trees is considered to be the best algorithm for general-purpose classification or regression problems.
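As a closing illustration of the gradient descent analogy, here is a tiny sketch of gradient descent on a one-dimensional function. The function and all names here are my own illustration, not from the lesson:

```python
# Minimize f(w) = (w - 3)^2 by gradient descent.
# The gradient is f'(w) = 2 * (w - 3); we step in the opposite
# direction, scaled by a small step size (analogous to nu in boosting).

def gradient_descent(grad, w, step_size=0.1, n_steps=100):
    for _ in range(n_steps):
        w = w - step_size * grad(w)  # small step against the gradient
    return w

w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_min)  # converges close to the minimizer w = 3
```

In gradient boosting, the role of the small step is played by nu times the new tree, and the "direction" being followed lives in function space rather than parameter space.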