We looked at the tips data set and said that we could use either the tip amount as the label or the sex of the customer as the label. In option one, we are treating the tip amount as the label and want to predict it, given the other features in the data set. Let's assume that you are using only one feature, just the total bill amount, to predict the tip. Because tip is a continuous number, this is a regression problem. In regression problems, the goal is to use mathematical functions of different combinations of features to predict the continuous value of our label. This is shown by the line, where multiplying a given total bill amount by the slope of the line gives us a continuous value for the tip amount. Perhaps the average tip rate is 18 percent of the total bill; then the slope of the line will be zero point one eight, and by multiplying the bill amount by zero point one eight, we'll get the predicted tip. This linear regression with only one feature generalizes to additional features. In that case, we have a multi-dimensional problem, but the concept is the same: the value of each feature for each example is multiplied by the gradient of a hyperplane, which is just the generalization of a line, to get a continuous value for the label. In regression problems, we want to minimize the error between our predicted continuous value and the label's continuous value, usually using mean squared error.

In option two, we are going to treat sex as our label and predict the sex of the customer using data from the tip and total bill. Of course, as you can see from the data, this is a bad idea. The data for men and women is not really separable, and we would get a terrible model if we did this. But trying to do this helps illustrate what happens when the thing you want to predict is categorical, not continuous. The values the sex column takes, at least in this data set, are discrete: male or female. Because sex is categorical, and we are using the sex column of the data set as our label, the problem is a classification problem. In classification problems, instead of trying to predict a continuous variable, we are trying to create a decision boundary that separates the different classes. So in this case, there are two classes of sex, female and male. A linear decision boundary will form a line, or a hyperplane in higher dimensions, with each class on either side. For example, we might say that if the tip amount is greater than zero point one eight times the total bill amount, then we predict that the person making the payment was male. This is shown by the red line. But that doesn't work very well for this data set. Men seem to have higher variability, while women tend to tip in a more narrow band. This is an example of a non-linear decision boundary, shown by the yellow ellipse in the graph. How do we know the red decision boundary is bad and the yellow decision boundary is better? In classification problems, we want to minimize the error, or misclassification, between our predicted class and the label's class. This is usually done using cross-entropy.

Even if we are predicting the tip amount, perhaps we don't need to know the exact tip amount. Instead, we want to determine whether the tip will be high, average, or low. We could define a high tip amount as greater than 25 percent, an average tip amount as between 15 and 25 percent, and a low tip amount as being below 15 percent. In other words, we could discretize the tip amount.
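To make the two options concrete, here is a minimal Python sketch; the bill and tip values are made up for illustration, and the 18 percent slope and the 15 and 25 percent thresholds are the ones mentioned above. It first treats the tip as a regression target scored with mean squared error, and then discretizes the same tips into low, average, and high classes.

```python
# Minimal sketch: regression vs. discretized classification on toy tip data.
# The bill/tip values below are invented for illustration only.
bills = [16.99, 10.34, 23.68, 50.81]
tips = [1.01, 1.66, 3.31, 10.00]

# Option one: regression. Predict tip = 0.18 * total bill, score with MSE.
predicted = [0.18 * b for b in bills]
mse = sum((p - t) ** 2 for p, t in zip(predicted, tips)) / len(tips)
print("Predicted tips:", predicted)
print("Mean squared error:", mse)

# Discretizing the label: bucket each tip by its percentage of the bill,
# turning the continuous regression target into a three-class target.
def tip_class(bill, tip):
    rate = tip / bill
    if rate > 0.25:
        return "high"
    elif rate >= 0.15:
        return "average"
    return "low"

print("Tip classes:", [tip_class(b, t) for b, t in zip(bills, tips)])
```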
And now, predicting the tip amount, or more appropriately the tip class, becomes a classification problem. In general, a raw continuous feature can be discretized into a categorical feature. Later in this specialization, we will talk about the reverse process: embedding a categorical feature into a continuous space. It really depends on the exact problem you're trying to solve and what works best. Machine learning is all about experimentation. Both of these problem types, regression and classification, can be thought of as prediction problems, in contrast to unsupervised problems, which are like description problems.

Now, where does all this data come from? This tips data set is what we call structured data, consisting of rows and columns, and a very common source of structured data for machine learning is your data warehouse. Unstructured data are things like pictures, audio, or video. Here, I'm showing you a natality data set, a public data set of medical information. It is a public data set in BigQuery, and you will use it later in the specialization. But for now, assume that this data set is in your data warehouse. Let's say we want to predict the gestation weeks of the baby; in other words, we want to predict when the baby is going to be born. You can do a SQL SELECT statement in BigQuery to create an ML data set. We will choose input features of the model, things like mother's age and the weight gain in pounds, and the label, gestation weeks. Because gestation weeks is a continuous number, this is a regression problem. Making predictions from structured data is very commonplace, and that is what we focus on in the first part of this specialization. Of course, this medical data set can be used to predict other things too. Perhaps we want to predict baby weight using the other attributes as our features. Baby weight can be an indicator of health. When a baby is predicted to have a low birth weight, the hospital will usually have equipment such as an incubator handy, so it can be important to be able to predict the baby's weight. The label here will be baby weight, and it's a continuous variable, stored as a floating-point number, which makes this a regression problem.

Is this data set a good candidate for linear regression and/or linear classification? The correct answer is both. Let's investigate why. Let's step back and look at the data set with both classes mixed. Without the different colors and shapes to aid us, the data appears to be one noisy line with a negative slope and a positive intercept. Since it appears quite linear, this will most likely be a good candidate for linear regression, where what we are trying to predict is the value for Y. Adding the different colors and shapes back in, it is much more evident that this data set is actually two linear series with some Gaussian noise added. The lines have slightly different slopes and different intercepts, and the noise has different standard deviations. I've plotted the lines here to show you that this is most definitely a linear data set by design, albeit a little noisy. This would be a good candidate for linear regression. Despite there being two distinct linear series, let's first look at the results of a one-dimensional linear regression, predicting Y from X, to start building an intuition; then we'll see if we can do better. The green line here is the fitted linear equation from linear regression.
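As a rough sketch of how a data set like this and the single-feature fit could be produced, here is some illustrative Python. The slopes, intercepts, and noise levels are made up and are not the values used to generate the plots in the lecture; the point is only that a least-squares line fit over both series lands somewhere between them.

```python
# Sketch: two linear series ("class A" and "class B") with Gaussian noise,
# then a single-feature linear regression fit over the combined data.
# Slopes, intercepts, and noise levels are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
is_class_a = rng.integers(0, 2, size=200).astype(bool)

# Class A: y = -1.0*x + 8 + small noise; Class B: y = -1.2*x + 5 + larger noise.
y = np.where(is_class_a,
             -1.0 * x + 8 + rng.normal(0, 0.5, size=200),
             -1.2 * x + 5 + rng.normal(0, 1.0, size=200))

# One-dimensional linear regression (least squares) on x alone.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"Fitted line: y = {slope:.2f} * x + {intercept:.2f}")
```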
Notice that it is far away from each individual class distribution, because class B pulls the line away from class A, and vice versa. It ends up approximately bisecting the space between the two distributions. This makes sense, since with regression we optimize our loss, mean squared error. So with an equal pull from each class, the regression should have the lowest mean squared error in between the two classes, approximately equidistant from their means. Since each class is a different linear series with different slopes and intercepts, we would actually get much better accuracy by performing a linear regression for each class, which should fit very closely to each of the lines plotted here. Even better, instead of performing a one-dimensional linear regression predicting the value of Y from one feature, X, we could perform a two-dimensional linear regression predicting Y from two features, X and the class of the point. The class feature could be a one if the point belongs to class A, and a zero if the point belongs to class B. Instead of a line, it would form a 2D hyperplane. Let's see how that would look.

Here are the results of the 2D linear regression. To predict our label Y, we used two features, X and class. As you can see, a 2D hyperplane has been formed between the two sets of data, which are now separated by the class dimension. I've also included the true lines for both class A and class B, as well as the 1D linear regression's line of best fit. The plane doesn't completely contain any of the lines, due to the noise in the data tilting the two slopes of the plane. Otherwise, with no noise, all three lines would lie perfectly on the plane. Also, we have kind of already answered the other portion of the quiz question about linear classification: because the linear regression line already does a really great job of separating the classes, this is a very good candidate for linear classification as well. But would it produce a decision boundary exactly on the 1D linear regression's line of best fit? Let's find out.

Plotted in yellow is the output of a one-dimensional linear classifier, logistic regression. Notice that it is very close to linear regression's green line, but not exactly on it. Why could this be? Remember, I mentioned that regression models usually use mean squared error as their loss function, whereas classification models tend to use cross-entropy. So, what is the difference between the two? Without going into too much detail just yet, there is a quadratic penalty for mean squared error, so it is essentially trying to minimize the Euclidean distance between the actual label and the predicted label. On the other hand, with classification's cross-entropy, the penalty is almost linear when the predicted probability is close to the actual label, but as it gets further away, it grows very rapidly, blowing up as the prediction approaches the opposite class of the label. Therefore, if you look closely at the plot, the most likely reason the classification decision boundary line has a slightly more negative slope is so that some of those noisy red points, red being the noisier distribution, fall on the other side of the decision boundary and lose their high error contribution. Since they are so close to the line, their error contribution would be small for linear regression, because not only is the error quadratic, but there is no preference to be on one side of the line or the other for regression, as long as the distance stays as small as possible.
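To see why the two losses pull the boundary differently, here is a small sketch comparing the squared-error penalty with the cross-entropy penalty for a single example whose true label is 1, as the predicted probability moves from confidently correct to confidently wrong. The exact numbers are just illustrative of the shapes described above.

```python
# Sketch: squared-error vs. cross-entropy penalty for a single example
# whose true label is 1, across a range of predicted probabilities.
import numpy as np

p = np.linspace(0.01, 0.99, 9)       # predicted probability of the true class
squared_error = (1.0 - p) ** 2       # quadratic penalty used in regression
cross_entropy = -np.log(p)           # penalty used in classification

for prob, se, ce in zip(p, squared_error, cross_entropy):
    print(f"p={prob:.2f}  squared error={se:.3f}  cross entropy={ce:.3f}")
# Near p = 1 both penalties are small; near p = 0 (predicting the opposite
# class) cross-entropy blows up while squared error stays bounded at 1.
```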
So, as you can see, this data set is a great fit for both linear regression and linear classification, unlike the tips data set we looked at earlier, which was only acceptable for linear regression and would need a non-linear decision boundary for classification.