Welcome to our lab on logistic regression and error metrics. Here we're going to be using the Human Activity Recognition with Smartphones dataset, which was built by having a group of study participants walk around with smartphones with embedded inertial sensors, and seeing, given that data, whether we can tell if they are walking, walking upstairs, walking downstairs, sitting, standing, or laying. So we have a classification problem where we're trying to come up with one of these six category labels. The first thing that we want to do is import the necessary libraries as we've done before. We're also going to run os.chdir, which is short for change directory, and set the directory that we're working with to data. That's a subfolder within our current folder, and by changing the working directory we can easily access any files within that data folder. Then for question 1, we want to import the data and do the following: examine the different data types (there are many columns, so we want to see the value counts of the data types across those columns); determine if the floating point values need to be scaled; determine the breakdown of each activity, that is, how much of each activity we're seeing in our outcome variable; and then encode that activity label as an integer, because sklearn cannot take a string target, so we have to encode those categories as integers. So the first thing we do is run pd.read_csv and pass in our file path. We then look at our different data types, and recall that when we run data.dtypes we get the data type for each one of our columns. In order to see how many of each of the different data types come up, we just run value_counts the same way we would with any column. We see that we have 561 floats and one object, and that one object is going to be what we're trying to predict: the different strings for walking, walking upstairs, laying, and so on. We can see that here as we look at data.dtypes and look at the last five values, and we see that Activity is that one object column. We're then going to see that the data is scaled from a minimum of negative one to a maximum of one. So there's some type of scaling here, and the way that we're going to confirm it is by showing that, for every single feature column, the minimum is negative one and the maximum is positive one. So we first look at the minimum values. If we look at just the min, we get the min for every single column. Notice that we took every single row but did not include that last column; that's what the negative one in the slice does for us. When we run .min, we get the min for every single column as a pandas Series. Then, similar to what we just did for our data types, we can run value_counts, and we see that negative one is the only minimum value, so there are 561 columns with negative one as their minimum. We do the same for the maximum and see that all the maximum values have been scaled to one and all the minimum values to negative one, with different values in between. The next thing we want to do is look at the breakdown of each one of the activities. A sketch of these first steps is shown below.
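Here is a minimal sketch of those steps, assuming the CSV sits in the data subfolder; the file name below is a placeholder, so substitute the actual name from the course materials:

```python
import os
import pandas as pd
import numpy as np

# change the working directory to the "data" subfolder
os.chdir('data')

# hypothetical file name -- use the actual CSV name from the course materials
data = pd.read_csv('Human_Activity_Recognition_Using_Smartphones_Data.csv')

# 561 float64 feature columns plus one object column (the Activity label)
print(data.dtypes.value_counts())
print(data.dtypes.tail())

# every feature column should have a minimum of -1 and a maximum of +1;
# data.iloc[:, :-1] takes every row and every column except the last (Activity)
print(data.iloc[:, :-1].min().value_counts())
print(data.iloc[:, :-1].max().value_counts())

# breakdown of the outcome variable
print(data.Activity.value_counts())
```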
And we see here that our outcome variable has a fairly balanced set: each activity takes up a roughly equal proportion of the overall rows. When we see this, we start to think back to what we discussed in lecture, that different types of error metrics work better for different types of datasets, depending on whether they're balanced or unbalanced. Here we're working with a balanced dataset, so we want to start thinking about what the best error metric is given that balance, compared to the leukemia example we talked about in lecture, which was very unbalanced, with 99% being healthy and only 1% being unhealthy. Then, as we mentioned, we cannot pass a string into our sklearn model, which here is going to be logistic regression, so we have to encode the label as an integer. We're going to create a LabelEncoder object: first we import it from sklearn.preprocessing, then, as we've done with our other sklearn objects, we instantiate it. We then call fit_transform on our Activity column using that object, le, and set the result back to the Activity column. So our Activity column has now been changed, as we see here from the sample, to integers ranging from 0 to 5: those six categories we had before now take on the values 0, 1, 2, 3, 4, 5. Now for question 2, we want to calculate the correlations between the independent variables, create a histogram of the different correlation values, and then identify the pairs that are most correlated, whether positively or negatively, with one another. I'm going to break this down into steps and create some code above. We first define our feature columns, which are all of our columns up until the last one: that's data.columns, taking everything up to but not including the last. Then, in order to get our correlation matrix, all we have to do is select those feature columns from the data and call .corr(). That outputs a pandas DataFrame that is just a correlation matrix, so we'll look at that to see where we're at when we're getting started. That might take a second to run. And there we see it: the correlation between tBodyAcc-mean()-X and itself, which is why the correlation is one, and then the correlation between tBodyAcc-mean()-X and tBodyAcc-mean()-Y, which is 0.12. What we notice is that the correlation between x and y is the same as the correlation between y and x, so this whole bottom portion of our matrix is not giving us any new information, including the ones on the diagonal, because we know every single variable has a perfect correlation with itself. So we're eventually going to want to remove all those values. First we're going to use np.tril_indices_from in order to get all the indices for the lower triangle: that will give us the indices of the diagonal of ones and everything below it. We'll see what that looks like; let's actually do that in a new cell, because that last one took a second to run. So we create this new cell here and look at the tril_index value that it outputs. It's going to be two arrays: the row indices first and the column indices second, and you get each of the index pairs if you zip the two together. A sketch of these steps, down through getting the lower-triangle indices, is shown below.
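A sketch of the label encoding and the start of the correlation work, continuing from the previous cell and using the variable names from the walkthrough:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# encode the six string activity labels as the integers 0-5
le = LabelEncoder()
data['Activity'] = le.fit_transform(data['Activity'])
data['Activity'].sample(5)

# correlation matrix of the independent (feature) variables
feature_cols = data.columns[:-1]
corr_values = data[feature_cols].corr()

# row/column indices of the lower triangle, diagonal included --
# these entries are redundant (corr(x, y) == corr(y, x) and corr(x, x) == 1)
tril_index = np.tril_indices_from(corr_values)
```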
So (0, 0) if we look here, then (1, 0), then (1, 1), and so on; pairs like (1, 1) run along the diagonal, and the rest cover all the values below it. We want to replace all of those with null. A very efficient way to do it is to make sure we're working with a NumPy array, so we change our current pandas DataFrame into an array, call it corr_array, and then say: for these indices that we just defined, set them equal to np.nan. So we're just quickly setting them to null values. Once we have nulled out that whole bottom triangle, let's look at what it looks like; we'll set it back to a DataFrame to make it clear, and we see that we have nulled out all the values on the diagonal and below. So we set it back to a DataFrame with columns equal to our original columns and index equal to our original index. We're then going to stack that DataFrame, and when we stack, we're collapsing it down so there's only a single value column, with each of our original column pairs in the index. Let's step through this one piece at a time so we can take a closer look. So now we have this new DataFrame, which has the same name as the one we had before, and we call .stack and see what that looks like: we get our original index, then each one of the columns, and then finally the value for each one of the different correlations. This becomes clearer as we go through the other steps. Next we call .to_frame(), which changes it to a pandas DataFrame, and we see that we have the correlation between tBodyAcc-mean()-X and each one of these other features, all in a single DataFrame. The next thing we do is reset the index, so we call .reset_index, and now we clearly have the different pairs and then their correlation values. Then we rename each of these columns: level_0 becomes feature1, level_1 becomes feature2, and the 0 column we had here becomes correlation. So we end up with this new DataFrame, corr_values, and we also create one new column called abs_correlation, which is just the absolute value of the correlation, because all we care about is the magnitude, not whether it's positive or negative. I'm going to run this full cell; it takes about five seconds to finish, and then we can start to plot. So you see that that's done. We're going to plot out the different correlations to see the distribution of higher versus lower correlations between these columns. We set our axes equal to the histogram of abs_correlation, where corr_values is our DataFrame, abs_correlation is the column, and we're calling a histogram with 50 bins; then we just set our x label and our y label. We run this and see that most of the time there's essentially zero correlation, and then much less often, at the top end, correlations close to one. Otherwise there are maybe a couple of different modes here, but it's fairly uniform once it dips down from those lower values. Then we say, let's look at the most highly correlated pairs: we sort the values by correlation with ascending equal to False, as shown in the sketch below.
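Putting the rest of those steps together, again as a sketch that continues from the previous cell (it assumes the corr_values and tril_index names from above, and the .query filter is described next):

```python
import matplotlib.pyplot as plt

# null out the lower triangle (diagonal included) so each pair appears only once
corr_array = np.array(corr_values)
corr_array[tril_index] = np.nan

# back to a DataFrame, then stack into one long table of
# (feature1, feature2, correlation); stack() drops the NaN entries for us
corr_values = (pd.DataFrame(corr_array,
                            columns=corr_values.columns,
                            index=corr_values.index)
               .stack()
               .to_frame()
               .reset_index()
               .rename(columns={'level_0': 'feature1',
                                'level_1': 'feature2',
                                0: 'correlation'}))

# we only care about the magnitude of the correlation, not its sign
corr_values['abs_correlation'] = corr_values.correlation.abs()

# distribution of the absolute correlations
ax = corr_values.abs_correlation.hist(bins=50)
ax.set(xlabel='Absolute Correlation', ylabel='Frequency')
plt.show()

# the most highly correlated feature pairs
corr_values.sort_values('correlation', ascending=False).query('abs_correlation > 0.8')
```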
So we're going from top to bottom, and then we just call .query. That's a way to filter down our DataFrame: similar to the bracket-based filtering we've done before, we call .query and pass in a condition on a column name. We only want rows where the absolute correlation is greater than 0.8, and when we look at those values we see roughly 22,000 rows that meet the condition. If we look at the actual size of the original corr_values DataFrame, right here, that's out of about 157,000 values in total, because it holds every one of the pairwise cross-correlations, which is why there are so many possible values. So here we see, and we can explore this further, how highly correlated many of these pairs of features are. When we see features that are so highly correlated, we may want to do some type of feature engineering or feature selection, which we'll talk about further along as we go through this course. That closes out this section. In the next video, we'll continue in the notebook and discuss the train-test split. Thank you.