Now let's move on to question number 3. We now have subsets of our features according to their data types: ordinal, categorical without ordering, numerical, or binary. We're going to use those subsets to encode each group, and we know we need to encode them because you can't pass strings into our scikit-learn models. So we need to ensure that each feature is passed through as a numerical value. Then we're going to make sure to also scale each of our numerical values, because if we don't scale, as we've talked about during lecture, we won't be able to feed them properly into our k-nearest neighbors model without some type of bias or skew coming from the larger versus smaller numbers. Then finally, once we've pre-processed all our data, we're going to save that pre-processed DataFrame so that we can access it later without having to rerun each of these steps next time. So we're going to import the LabelBinarizer, the LabelEncoder, and the OrdinalEncoder. Here, we're just going to use the LabelBinarizer and the LabelEncoder. You see we instantiate those objects as lb and le. Something I'd like to note is that we're able to use the LabelEncoder for our ordinal values here because our ordinal values are already in alphabetical order. If they were not, we'd want to switch to using the OrdinalEncoder and go through the steps of specifying which value comes first, which comes second, and so on. So we instantiate each of those objects, and then for each of our ordinal variables, we replace that column in place with its encoded version: whatever the string was is replaced by an integer. So let's look back at what our ordinal variables were. We saw that our ordinal variables were contract, satisfaction, and month.
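As a rough sketch of this encoding step, assuming hypothetical column names and category strings (not the actual course data set), the LabelEncoder works for a column whose categories already sort correctly, while the OrdinalEncoder covers the case where they don't:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy frame standing in for the course data; values are illustrative.
df = pd.DataFrame({
    "contract": ["month-to-month", "one-year", "two-year", "month-to-month"],
    "satisfaction": ["low", "high", "medium", "low"],
})

# LabelEncoder assigns integer codes in sorted (alphabetical) order,
# which happens to be the correct ordinal order for contract.
le = LabelEncoder()
df["contract"] = le.fit_transform(df["contract"])

# satisfaction does NOT sort correctly alphabetically, so we spell out
# the category order explicitly with OrdinalEncoder (expects 2D input).
oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["satisfaction"] = oe.fit_transform(df[["satisfaction"]]).ravel()

print(df["contract"].tolist())      # [0, 1, 2, 0]
print(df["satisfaction"].tolist())  # [0.0, 2.0, 1.0, 0.0]
```

Either way, each string column is replaced in place by integer codes, which is exactly what the models downstream need.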
If we think back to our month variable, it had these values here, and by using pd.cut, we got five different bins. Now if we look at df.month after using the LabelEncoder, we see that each of those bins has been replaced with 0, 1, 2, 3, or 4, because there are five different bins. We then look at them as categories just to see how many unique values there are: as expected, month has five, satisfaction has five, and contract has three. Those were the month-to-month versus one-year versus two-year contracts, and we see the frequency for the top values as well. We're then going to take each of our binary variables, all of those that only had two options, and use lb, that LabelBinarizer, which will change each of them to zeros and ones. That's essentially one-hot encoding just for those that only have two values. Then finally, there may still be categorical variables from our full initial list that are neither ordinal nor binary, and we need to ensure that we one-hot encode those as well. So we're going to use pd.get_dummies, which we've used before, and we're going to specify the categorical columns, excluding the ordinal as well as the binary variables we've already handled. Then we set drop_first equal to True because, as we talked about much earlier, if we keep all of the dummy columns we have perfect multicollinearity, and for most models there's no purpose to that last feature. So if we start off with six unique categories, we'll end up with five columns, and we'll always know what that sixth column would have been from the other five. If all the other five are equal to zero, then we know the last one would have been equal to one. If any one of those five features is equal to one, then we know that sixth column would have been zero. There's no extra information there, so we just drop that first value.
We then call df.describe just to see each of our different columns, and we see their count, their mean, their standard deviation, and so on. What we notice is that they're on vastly different scales, right? The gigabytes per month is going to be around 20, the monthly payment is going to be around 64, and a lot of the others are going to be much lower than those two. As we talked about, we want to ensure that everything is on the same scale. So we're going to use the MinMaxScaler here: we import the MinMaxScaler and instantiate the object. Then we're going to transform not all of our variables, but just our ordinal and our numeric variables, those that are integers or those larger values; there's no point in applying a MinMaxScaler to values that are just zeros and ones, so we're excluding those. We replace each of those columns with its transformed version, and now when we run df.describe, we see that each of them is on a similar scale, with values between zero and one. They still have their different means and their different standard deviations, but all on a similar scale, so we'll no longer be skewed by those smaller or larger numbers. Then finally, we want to save all of our pre-processed data, so we set outputfile equal to the name of the file we want to output to. We take our transformed DataFrame and call .to_csv, and we don't want to save the index in that file. This will dump it into a CSV file that we can access later if we'd like. All right, that closes out question number 3. With that, we'll move into question number 4 in the next video and start actually using k-NN along with our train_test_split to see how well we can predict whether or not a customer churned given our data set.
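The scaling and save steps can be sketched like this, again with assumed column names and a toy frame rather than the course data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame: two numeric columns on very different scales, plus a
# binary column that should NOT be rescaled.
df = pd.DataFrame({
    "gb_mon": [5.0, 20.0, 35.0],
    "monthly_payment": [30.0, 64.0, 98.0],
    "churn": [0, 1, 0],
})

# Scale only the numeric/ordinal columns to the [0, 1] range;
# the 0/1 binary column is left untouched.
mm = MinMaxScaler()
num_cols = ["gb_mon", "monthly_payment"]
df[num_cols] = mm.fit_transform(df[num_cols])

print(df["gb_mon"].tolist())  # [0.0, 0.5, 1.0]

# Save the pre-processed frame; index=False keeps the row index
# out of the file so it reloads cleanly later.
outputfile = "preprocessed.csv"
df.to_csv(outputfile, index=False)
```

After this, every feature lives between zero and one, so no single column dominates the distance computations in k-NN.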