So what have we learned so far in regard to support vector machines? We know that they are a linear classifier. We also know that they won't return probabilities, but rather labels, either one or zero, churned versus not churned, and those labels are decided by which side of a certain decision boundary they fall on. That decision boundary was initially found by determining the hyperplane, or that line, that minimizes errors while also finding the widest margin between our two classes. That boundary is decided purely by the support vectors in our data, those points that lie on the margin that we saw earlier. We then discussed how we may actually be overfitting our data if we stick to this formulation. What we see here is the boundary that minimizes our misclassifications, minimizing our hinge loss, and that refers to the first term in our overall cost function that we see here on the left, which is just the cost for misclassification. This term is minimal given the boundary that we have here. However, for the second term in our overall cost function for support vector machines, the regularization term, there is an added cost to having a more complex decision boundary with higher coefficients. So we may be minimizing the first portion of our cost function, but at the same time the other portion, that regularization term, is quite high. So what do we do? Here we now have our less complex decision boundary, and for this one, because we now have some misclassifications, such as the one we see in pink to the left, the first term, the hinge loss that counts our misclassifications, is now a bit higher. But the regularization term should have been reduced by a greater degree, so that the overall cost function is minimized.
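For reference, one common way to write this overall cost function (this is the formulation scikit-learn's LinearSVC uses, with coefficient vector β, intercept β₀, and labels encoded as y_i ∈ {−1, +1} rather than one and zero) is:

```latex
\min_{\beta,\,\beta_0} \;
C \sum_{i=1}^{n} \max\!\bigl(0,\; 1 - y_i\,(\beta^\top x_i + \beta_0)\bigr)
\;+\; \tfrac{1}{2}\,\lVert \beta \rVert_2^2
```

The first term is the hinge loss for misclassification, the second is the regularization term, and since C multiplies only the misclassification term, a smaller C puts relatively more weight on regularization.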
We want to note that the regularization effect here can be tweaked using this value of C, where a smaller C means more regularization: the smaller the C, the more we penalize higher coefficients and the simpler the model. So let's take a closer look at this regularization term that we have here. The vector beta, made up of each one of our coefficients, is orthogonal to the hyperplane, to our decision boundary. Recall that these coefficients multiplied by their respective features and added up, or what we would call the dot product of these coefficients and their respective features, will ultimately determine which side of the decision boundary our label falls on, with values greater than zero for this dot product being one class and values less than zero being the other class. The dot product coming out to exactly zero represents where the decision boundary actually lies. So that orthogonal vector determines the value that we compute, with points further away producing larger positive or negative values depending on which direction that vector points. The higher those coefficients are, the larger the effect each feature has on determining the positive or negative class. Now let's discuss how you can actually implement support vector machines using scikit-learn in Python. The first thing that we're going to want to do is import the class containing our classification method: from sklearn.svm, we're going to import our model, LinearSVC. Here we're specifying that we're just using linear support vector machines. We have not touched on kernel support vector machines, which we'll cover in the next video. So focusing in on that linear support vector machine, the first thing that we're going to want to do is create an instance of our class.
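To make that dot-product picture concrete, here's a small sketch with made-up coefficients (purely for illustration, not fitted values) showing how the sign of β·x + β₀ picks the class:

```python
import numpy as np

# Hypothetical coefficients (beta) and intercept (beta_0) for illustration;
# beta is the vector orthogonal to the decision boundary.
beta = np.array([2.0, -1.0])
beta_0 = -1.0

def predict_side(x):
    """The sign of the dot product decides the class; exactly zero lies on the boundary."""
    score = np.dot(beta, x) + beta_0
    return 1 if score > 0 else 0

print(predict_side(np.array([2.0, 1.0])))  # score = 2*2 - 1*1 - 1 = 2 > 0, so class 1
print(predict_side(np.array([0.0, 1.0])))  # score = 0 - 1 - 1 = -2 < 0, so class 0
```

A point such as x = (1, 1) gives a score of exactly zero here, meaning it sits right on the decision boundary.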
So when we create our instance, we're also going to pass in our hyperparameters, which here are going to be our regularization terms, with the penalty equal to L2 and the C value here equal to 10, and recall that a lower C value means more regularization and a simpler model. We're then going to fit that instance on our training set, so we call LinSVC.fit and pass in our X_train and y_train, and once we've fit our model, we can use that fit to predict on our holdout set, on our X_test, as we've done with our other models. We can then tune our regularization parameters, that C and the penalty that we choose, using cross-validation; we can use the GridSearchCV that we learned about in earlier courses. And if you want to do regression rather than classification, you can just call LinearSVR rather than LinearSVC in order to perform regression and run all the same steps, except that your y_train would have to be a continuous value. So let's recap what we learned here in this section. We discussed the support vector machine approach to classification: how we use the support vectors from our different classes in order to find the decision boundary with the largest margin. We compared support vector machines with logistic regression: support vector machines predict actual labels versus the probabilities of logistic regression, and the cost function for support vector machines only penalizes points on the wrong side of (or too close to) our boundary, compared to all outcomes being penalized with logistic regression. Support vector machines depend only on those support vectors, and come up with the largest margin using those support vectors. Something that I'd like to note here is that this means support vector machines may be more sensitive to values that fall within our margin, but they will not be affected at all by large values that are classified correctly outside our margin.
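Putting those steps together, a minimal sketch might look like the following. The data here is synthetic just so the snippet runs end to end; in practice your X_train, y_train, and X_test would come from your own train/test split:

```python
from sklearn.svm import LinearSVC                    # import the class containing our classification method
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import make_classification

# Toy data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create an instance of our class, passing in our hyperparameters:
# the L2 penalty and C=10 (a lower C means more regularization).
LinSVC = LinearSVC(penalty='l2', C=10.0)

LinSVC.fit(X_train, y_train)        # fit on the training set
y_pred = LinSVC.predict(X_test)     # predict on the holdout set

# Tune the regularization parameter C with cross-validation via GridSearchCV.
grid = GridSearchCV(LinearSVC(), param_grid={'C': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```

For regression, you would swap LinearSVC for LinearSVR and use a continuous y_train, keeping the same fit/predict pattern.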
If we recall, the influence of such outliers was a major problem that logistic regression was meant to address, and support vector machines can be even stronger here, as they completely ignore outliers that are correctly classified beyond the margin; those points won't have any effect on our overall model. We also went over the support vector machine cost function, which was that hinge loss to minimize misclassifications, as well as the added regularization term. We discussed how regularization actually works in support vector machine models, and that we can tune the value of C, with a lower C meaning higher regularization and a simpler model. With that, that closes out our section on linear support vector machines. We are now going to expand on that and show how powerful support vector machines can be using something called kernels. All right, I'll see you in the next video.
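As a final illustration of why those correctly classified outliers have no effect, here is a sketch comparing the two losses directly, using the ±1 label convention, where the margin m = y·(β·x + β₀) is large and positive for a point far on the correct side:

```python
import math

def hinge_loss(margin):
    """SVM hinge loss: exactly zero once a point is correctly classified beyond the margin."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss: always positive, so every point contributes something."""
    return math.log(1.0 + math.exp(-margin))

# A correctly classified outlier far outside the margin (large positive margin):
print(hinge_loss(10.0))     # 0.0 -- the SVM ignores this point entirely
print(logistic_loss(10.0))  # tiny but nonzero, so it still nudges logistic regression

# A point on the wrong side of the boundary (negative margin) is penalized by both:
print(hinge_loss(-2.0))     # 3.0
```

This is exactly the contrast from the recap: the hinge loss only penalizes points on the wrong side of, or within, the margin, while logistic loss penalizes every point at least a little.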