In this section, we will cover the idea of splitting your data into separate train and test sets, and how that will allow you to cross-validate your models and get a better idea of how they'll actually perform out in the real world.

The learning goals for this section are to learn to split our data into training and testing samples. As we saw briefly in the notebook, it will be very important to have a holdout set to see how well our model will perform on unseen data. We'll then talk about cross-validation approaches and some ways to expand this notion of train and test so that you can train and test on multiple different sets. And then finally, we'll talk about the important relationship between model complexity and error, and how to find the appropriate balance between the two.

So let's say that what we see here is some historical training set that we have available to us. As we mentioned briefly earlier, we can actually fit our model perfectly to the data and get 100% accuracy. For example, our model could say that any movie that ran exactly 140 minutes made around $424 million, which is our first row, and every movie that ran exactly 129 minutes made around $409 million. Then, once we remove the outcome variable to see how well our model performed, we can predict gross revenue exactly, using a model that just matches them one to one. But we would probably do a very poor job of modeling movie revenue for unseen data, so it won't generalize well. And our problem is that we can never see how our model would perform on unseen data if we train on our entire data set.

So what would be a solution? We can split up our data set: we take one portion, call that our training data, and use it to actually learn the optimal parameters given our labeled data. Then we have a separate set that we hold out as if it were unseen data. We remove the labels, take the learned model, and see how well it would perform on this unseen data. And because this test data is actually pulled from our historical data set, we can check how well we performed given the model learned from the training data. This will help ensure that the model generalizes well to new situations.

I would like to highlight that you always want to make sure that your splits are independent from one another. Often, parts of your test data can end up leaking into the training data, so that it's not really unseen data, and we would actually be fitting to that test data. There's a term for this, data leakage, that you want to be aware of, so ensure that these sets are independent of one another.

So to summarize: the training data is used to fit the actual model and learn the parameters. Then, using this model, we predict results on our test data, predicting either the label or the numerical value. We compare that with the actual value, which we have since it's historical data, and then, depending on what our error metric is, measure the error to see how well we performed on unseen data.

So let's look at an actual scenario. Here we have split data into training and test sets, and we'll be performing linear regression. That is, we're going to fit a line that explains our training set, as we see here to the left, and we're going to perform regression on that training data and determine which parameters give the best fit to our training data.
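To make the memorization problem concrete, here is a minimal sketch of that one-to-one "model" as a simple lookup table. The two entries echo the movie example above; the function name and the unseen runtime are illustrative, not from the course material.

```python
# A deliberately naive "model" that memorizes the training data:
# a lookup table from runtime (minutes) to gross revenue ($ millions).
train_lookup = {140: 424, 129: 409}

def predict_revenue(runtime_minutes):
    # Perfect on every runtime it has already seen...
    return train_lookup.get(runtime_minutes)

print(predict_revenue(140))  # 424 -- 100% "accuracy" on the training set
print(predict_revenue(133))  # None -- no answer at all for unseen data
```

This is exactly why training error alone is misleading: the lookup table scores perfectly on the data it was built from and fails completely on anything new.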
The next thing we do is predict the value for each point in our test data set, using the parameters we previously determined from our training data set. Those predictions are all of the purple values that we see here on the right side. So how did these predictions do? We can measure that by the lines showing the distance from each actual value to its predicted value on the line, in order to determine how well we are doing on this unseen data.

So let's go through a breakdown of how we work with our training data as well as our test data. We start off with our training data, within which we have X_train and Y_train: our features are X_train and our outcome variable is Y_train. We fit a model to learn the parameters given our labeled data set using .fit, which is the normal practice with all of your sklearn models, as we have seen with linear regression and will continue to see throughout. That gives us a model defining the parameters for the relationship between X, our features, and our outcome variable.

Next we take our test data. We pass X_test into the model that has learned these parameters, and use it to predict what the outcome variable would be for X_test, as if there were no labels for it. That gives us a prediction for each individual value in our test set. Since this is again just historical data, we can then compare those predictions to the actual values for our X_test set, evaluate our error metric between Y_test and Y_predict, and come up with the test error for a given model, as if we were looking at data we had never seen before.

Now, let's go over the syntax needed to run train_test_split within Python. The first thing we want to do is import the train_test_split function from sklearn.model_selection. Then it's as simple as running train_test_split on our data. Here we are splitting the data and putting 30% into the test set: the argument test_size=0.3 specifies what proportion of your data you want to hold out for your test set. If you pass a value between zero and one, it's treated as a proportion; if you pass an integer instead, that exact number of rows is used as your test set. Also, here we just have data, and that's going to split into train and test. If we instead had X and Y, with X being our features and Y being our outcome variable, we would pass in train_test_split(X, Y, test_size=0.3), and the output would be four different values: X_train, X_test, Y_train, and Y_test. So you'd want four values on the left side of train_test_split, which we'll see as we get into the notebook.

Now, there are also other methods for splitting data. We have the shuffle split, which will allow you to come up with multiple different splits rather than just one split of train and test: you could have four different train-and-test splits, each one holding out a different random 30% of the data. There's also a stratified shuffle split, which will additionally ensure that there's no bias in your outcome variable; sketches of both splitting approaches follow below.
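To tie the whole workflow together, here is a minimal, self-contained sketch of the fit/predict/evaluate loop described above. The synthetic data from make_regression and the choice of mean squared error as the error metric are stand-ins for illustration; only train_test_split, the .fit/.predict pattern, and test_size=0.3 come from the discussion itself.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and outcome standing in for the movie data.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Hold out 30% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the model on the training data only, learning the parameters.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out rows and measure the test error.
y_predict = model.predict(X_test)
print(mean_squared_error(y_test, y_predict))
```

And here is a sketch of ShuffleSplit producing several different random 30% holdouts rather than a single fixed split, reusing X, y, and the imports from the block above:

```python
from sklearn.model_selection import ShuffleSplit

# Four different random 30% holdouts instead of one fixed split.
splitter = ShuffleSplit(n_splits=4, test_size=0.3, random_state=0)
for train_idx, test_idx in splitter.split(X):
    m = LinearRegression().fit(X[train_idx], y[train_idx])
    print(mean_squared_error(y[test_idx], m.predict(X[test_idx])))
```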
Now, what do I mean by avoiding bias in the outcome variable? Perhaps your data set is medical, with a label of 1 if a patient is diagnosed with cancer and 0 if they're not. 99% of the population, or more, probably, will not be diagnosed with cancer, whereas 1% will. We want to maintain that 99%-to-1% split in our train and test sets, and using the stratified shuffle split will ensure that we maintain it moving forward; a quick sketch of this follows at the end of the section. All right, with all that in mind, we're going to move to a notebook in order to see the details of how we can use train_test_split in practice. Thank you.
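As promised above, here is a minimal sketch of StratifiedShuffleSplit preserving a heavy class imbalance in each split. The 990-to-10 labels are toy data echoing the cancer example, and the feature matrix is random filler; neither comes from the course material.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data echoing the example: 990 negatives and 10 positives (99% to 1%).
X = np.random.default_rng(0).random((1000, 2))
y = np.array([0] * 990 + [1] * 10)

# Each split holds out 30% while preserving the 99-to-1 class ratio.
splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    # Every test fold keeps roughly the same 1% positive rate.
    print(f"positive rate in test fold: {y[test_idx].mean():.3f}")
```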