Starting in the 1990s, the field of kernel methods was formed. Corinna Cortes, Director of Google Research, was one of the pioneers. This field of study enables interesting new classes of nonlinear models, most prominently nonlinear SVMs, or support vector machines, which are maximum margin classifiers that you may have heard of before. Fundamentally, at the core of an SVM is a kernel transformation of the inputs plus a hinge loss that pushes for maximum margins.

Earlier, we saw how logistic regression is used to create a decision boundary that maximizes the log likelihood of the classification probabilities. In the case of a linear decision boundary, logistic regression wants each point, with its associated class, to be as far from the hyperplane as possible, and it provides a probability that can be interpreted as prediction confidence. There are an infinite number of hyperplanes you can create between two linearly separable classes, such as the two hyperplanes shown as the dotted lines in the two figures here. In SVMs, we include two parallel hyperplanes on either side of the decision boundary hyperplane, where they intersect with the closest data point on each side of the hyperplane; those closest data points are the support vectors. The distance between the two support vectors is the margin. On the left, we have a vertical hyperplane that does indeed separate the two classes; however, the margin between the two support vectors is small. By choosing a different hyperplane, such as the one on the right, we get a much larger margin. The wider the margin, the more generalizable the decision boundary is, which should lead to better performance on new data. Therefore, SVM classifiers aim to maximize the margin between the two support vectors using a hinge loss function, compared with logistic regression's minimization of cross-entropy.

You might notice that I have only two classes, which makes this a binary classification problem. One class's label is given a value of one, and the other class's label is given a value of negative one. If there are more than two classes, then a one-vs-all approach should be taken, and we then choose the best out of the per-class binary classifications.

But what happens if the data is not linearly separable into the two classes? The good news is that we can apply a kernel transformation, which maps the data from our input vector space to a vector space whose features can now be linearly separated, as shown in the diagram. Just as before the rise of deep neural networks, a lot of time and work went into transforming the raw representation of the data into a feature vector through a highly tuned, user-created feature map. With kernel methods, however, the only user-defined item is the kernel: simply a similarity function between pairs of points in the raw representation of the data. A kernel transformation is similar to how an activation function in a neural network transforms the space by mapping its inputs through a function, where the number of neurons in the layer controls the dimension; so, if you have two inputs and three neurons, you are mapping the input 2D space to a 3D space. There are many types of kernels, the most basic being the linear kernel, the polynomial kernel, and the Gaussian radial basis function kernel. When our binary classifier uses a kernel, it typically computes a weighted sum of similarities.

So, when should an SVM be used instead of logistic regression? Kernelized SVMs tend to provide sparser solutions and thus have better scalability.
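To make this concrete, here is a minimal sketch, assuming scikit-learn is available, of the contrast just described: logistic regression with its linear boundary and cross-entropy loss versus an SVM with an RBF kernel and hinge loss, on data that is not linearly separable in the original space. The dataset and parameter values are illustrative choices, not part of the lecture.

```python
# A minimal sketch (assuming scikit-learn) contrasting a linear decision
# boundary with a kernelized SVM on data that is not linearly separable.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes in 2D.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)

# Logistic regression minimizes cross-entropy with a linear boundary.
linear_clf = LogisticRegression().fit(X, y)

# A kernelized SVM minimizes the hinge loss, max(0, 1 - y * f(x)) with labels
# coded as -1/+1, after an implicit RBF mapping to a higher-dimensional space
# where the classes become linearly separable.
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("logistic regression accuracy:", linear_clf.score(X, y))
print("RBF-kernel SVM accuracy:", rbf_svm.score(X, y))
```

On this kind of data the linear model hovers near chance while the kernelized SVM separates the rings almost perfectly, which is the point of the kernel transformation.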
SVMs perform better when there is a high number of dimensions and when the predictors nearly certainly predict the response.

We've seen how SVMs use kernels to map the inputs to a higher dimensional feature space. What in neural networks can also map to a higher dimensional vector space? The correct answer is: more neurons per layer. It is the number of neurons per layer that determines how many dimensions of vector space you are in. If I begin with three input features, I am in the R3 vector space. Even if I have a hundred layers, each with only three neurons, I will still be in the R3 vector space; I am only changing the basis. By contrast, when using a Gaussian RBF kernel with SVMs, the input space is mapped to infinite dimensions. The activation function changes the basis of the vector space but doesn't add or subtract dimensions; think of it as simply rotations, stretches, and squeezes. They may be nonlinear, but you remain in the same vector space as before. The loss function, the objective you are trying to minimize, is a scalar whose gradient is used to update the parameter weights of the model. That only changes how much you rotate, stretch, and squeeze, not the number of dimensions.
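As a rough illustration of that point, here is a small NumPy sketch. The layer count, layer widths, and random weights are made up for illustration; the only thing it demonstrates is that the width of a layer, not its depth or activation, sets the dimensionality of the representation.

```python
# Illustrative sketch: layer width, not depth, sets the representation's dimension.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))      # 5 examples with 3 input features: we start in R^3

# A hundred layers of 3 neurons each: tanh bends the space at every step,
# but the representation never leaves R^3.
h = x
for _ in range(100):
    W = rng.normal(size=(3, 3))
    h = np.tanh(h @ W)
print(h.shape)                   # (5, 3): still 3 dimensions

# A single layer with 7 neurons maps R^3 into R^7: the width changed the dimension.
W_wide = rng.normal(size=(3, 7))
h_wide = np.tanh(x @ W_wide)
print(h_wide.shape)              # (5, 7): now 7 dimensions
```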