Now, let's dive into the actual cost function that support vector machines are trying to minimize in order to optimize that solution, in order to optimize that margin. When we look at logistic regression in comparison, we see that its cost function smooths out and never quite reaches zero unless the prediction is exactly one, which is rarely the case, because with logistic regression we're predicting probabilities. So there's almost always some penalization, even if we get the label correct. Support vector machines, on the other hand, depend on the hinge loss, which will not penalize values outside of our margin, assuming we predicted them correctly, but will penalize values more and more heavily the further they fall on the wrong side of that margin.

Now, let's walk through this to make a little more sense of it. Starting with the point on our upper graph that sits outside the margin, this is a correct classification given our margin and does not add any error term to our cost function; as you see on the y-axis of the lower graph, the cost of that prediction is zero. Right at the margin, where our support vector sits, is where the hinge loss begins: we're not yet penalizing at the margin itself, but any value to the left of this dotted line will start to incur a penalty. So, as I said, we will actually incur some penalty for being inside the margin, even if we classify the point correctly. When we look at the lower graph, the x-axis is the value of the decision function: it equals zero right at the decision boundary and equals one at that margin, the dotted line to the right. As you saw on the x-axis, at a value of one the loss is zero. But as we move to the left, the loss grows, reaching one once we're at zero, exactly at the decision boundary. Moving further, we reach the opposite margin from where this point should be classified. That margin is one more unit away from our decision boundary at zero, so we incur a penalty of one more unit than at the decision boundary: the decision boundary at zero gave a loss of one, and moving one more unit to the left gives a penalty of two. The further we move away from the margin that identifies the boundary for the pink class, that rightmost dotted line, the higher our loss grows, in a linear fashion, as we see here with the hinge loss. So the further we move to the left on our top graph, the higher up we move, linearly, in terms of our hinge loss.
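To make those numbers concrete, here is a minimal sketch of the hinge loss described above. It is written in plain Python rather than taken from the course materials, and the function name and the +1/-1 label encoding are my own illustrative choices.

```python
def hinge_loss(y, decision_value):
    """Hinge loss for a single point.

    y is the true label encoded as +1 or -1, and decision_value is the
    signed output of the linear model (w . x + b). Points at or beyond
    their own margin (y * decision_value >= 1) incur zero loss; the loss
    then grows linearly as the point moves toward and past the boundary.
    """
    return max(0.0, 1.0 - y * decision_value)

# Reproducing the walkthrough above for a point whose true label is +1:
print(hinge_loss(+1, 2.0))   # safely outside the margin -> 0.0
print(hinge_loss(+1, 1.0))   # exactly at the margin     -> 0.0
print(hinge_loss(+1, 0.0))   # at the decision boundary  -> 1.0
print(hinge_loss(+1, -1.0))  # at the opposite margin    -> 2.0
```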
So now that we know the goal of support vector machines, which is to find that best boundary, let's look at some of the problems this ideal boundary may create. The boundary that maximizes the distance from the classes is pretty good, but what will happen when the data set is a bit messier than what we see here? Let's put a red sample very close to the blues, as you see on the bottom. What will now happen to our boundary? If we try to get that one red dot we just added classified correctly as well, and then maximize the distance from the classes as before, our boundary will look like what we see here, and that is probably not any better than our previous vertical boundary.

In fact, our previous vertical boundary is probably better, and the reason I say this is that the new boundary looks very overfit. It seems that toward the top, on that top right, it would classify new records as blue, even though points just to the left of that boundary should probably still be classified as red. Therefore, it's probably still best not to move our natural boundary for that one red record. Since we want to keep that initial boundary, we have to come up with a way of optimizing that still separates these two groups but allows for some points to be misclassified in the process. This is where regularization in support vector machines comes into play. It's similar to ridge, lasso, or elastic net in linear regression, and the way we achieve it mathematically is by adding a term to the cost function: on top of our original cost function, we also try to minimize the size of the coefficients, which keeps that separating line as simple as possible. With this new cost function, it becomes more costly to move our boundary to get that one record right than to simply accept it being misclassified. In the next video, we'll continue this discussion of regularization as well as begin to go through the proper syntax for fitting a linear SVM when you're using a scaler. I'll see you there.
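As a quick supplement to the regularization discussion above, here is a minimal sketch of that regularized objective in Python/NumPy. The function name, the +1/-1 label encoding, and the way the trade-off constant is written are my own illustrative choices rather than code from the course; in scikit-learn's linear SVM, a parameter named C plays this same trade-off role, with smaller C meaning stronger regularization.

```python
import numpy as np

def regularized_svm_cost(w, b, X, y, C=1.0):
    """Soft-margin SVM objective: hinge loss plus a penalty on the
    size of the coefficients, analogous to ridge regularization.

    X is an (n_samples, n_features) array, y holds +1/-1 labels, and
    w, b are the coefficients and intercept of the separating line.
    A small C keeps the coefficients small and tolerates a few
    misclassified points (the simple, near-vertical boundary), while a
    large C fights for every point and can bend the boundary around
    a single outlier like that lone red sample.
    """
    margins = y * (X @ w + b)                     # per-sample margin values
    hinge = np.maximum(0.0, 1.0 - margins).sum()  # total hinge loss
    penalty = 0.5 * np.dot(w, w)                  # keeps the line simple
    return penalty + C * hinge
```

The penalty term alone favors a wide, simple margin; the C-weighted hinge term decides how hard the optimizer works to classify individual points correctly, which is exactly the trade-off described above.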