In this lesson, we'll discuss how to identify and handle missing values in R. Missing values in R are stored as NA, and you can easily check whether there are any NA values using the very useful is.na function. If you call is.na on a given column, you get a logical vector of TRUE and FALSE values, with TRUE indicating the presence of NA and FALSE indicating its absence. Let's quickly see an example. Let's say we create a small vector called sales, which has two missing values, at the third and fifth positions. If I call is.na on it, what I see is FALSE FALSE TRUE FALSE TRUE. The TRUE at the third and fifth positions indicates that there is a missing value at each of those positions. That was easy.

Now let's learn how to get the total number of missing values in a given vector or list. All we need to do is wrap is.na in sum. If I do that on the sales vector we just created, we get 2, which indicates that there are two missing values. That's not hard to understand, because R treats TRUE as 1 and FALSE as 0 in arithmetic, so sum simply counts the TRUE values. Very easy. You can use this to calculate the number of missing values in any column of your data frame. For example, if I want to know whether there are any missing values in the line item column of my data frame, I can use the same sum command. I see 0, indicating that there are no missing values there.

If you want to count the missing values in the whole data frame, you can apply is.na to the data frame itself and then wrap that in sum. If I do it for this subset of the data, I get 28,137 missing values, but that's not very helpful, because I don't know which columns these missing values are coming from. So if I want to know the number of missing values in each column of my data frame, I should use another function, colSums, rather than sum.
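The checks described so far can be sketched as follows. This is a minimal illustration, not the lesson's actual data: the values in sales and the small df below are made up, and only the column names CardholderName, Tax, and TotalDue come from the lesson.

```r
# Hypothetical sales vector with NA at positions 3 and 5 (values are made up)
sales <- c(120, 85, NA, 240, NA)

is.na(sales)        # FALSE FALSE TRUE FALSE TRUE
sum(is.na(sales))   # 2: TRUE counts as 1, FALSE as 0

# Hypothetical stand-in for the lesson's data frame df
df <- data.frame(
  CardholderName = c("A. Rao", NA, "B. Lee", "C. Diaz"),
  Tax            = c(1.5, NA, 2.0, NA),
  TotalDue       = c(21.5, NA, 30.0, NA),
  stringsAsFactors = FALSE
)

sum(is.na(df$Tax))   # missing values in a single column
sum(is.na(df))       # missing values in the whole data frame
colSums(is.na(df))   # missing values in each column separately
```

Note that the base R function is spelled colSums, with a capital S.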
What colSums does is compute a sum for each column separately. So let's do it for the data frame df, and as you can see, it tells me the number of missing values in each column. We can see that the missing values come mainly from three columns: CardholderName, Tax, and TotalDue.

Once we find missing values, there are various ways to handle them. We can fill them with the mean of the entire column, or we can fill them with the median of the entire column; the median is usually preferred when there are outliers in the data. There are also model-based approaches that try to use information from other variables; however, we are not going to cover those in this course. Finally, you can drop the rows that have missing values. Let's look at each of them one by one.

First, mean imputation. To replace the NA values with the mean of the column, we will use a useful function called replace_na from the tidyr package. I have loaded the tidyr package here, so let's see how replace_na works. It takes two arguments. The first is the list, vector, or column in which to look for missing values, and the second is the value to replace them with. So I tell R to go to the Tax column in the df data frame and replace each missing value with the mean of that column. The mean function here takes another argument, na.rm = TRUE. This is because if a column contains NA values, mean and sum return NA by default. When we say na.rm = TRUE, we are telling R to ignore the NA values while calculating the mean or sum. So I replace all the missing values in Tax with the mean of the column and assign the result back to Tax. Let's see how many missing values there are in the Tax column now: 0, as expected.

Another way to fill the missing values is median imputation. You would prefer the median when there are outliers in your data set.
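The mean imputation step described above can be sketched like this. It assumes the tidyr package is installed, and the small df is a made-up stand-in for the lesson's data frame.

```r
library(tidyr)  # provides replace_na()

# Hypothetical stand-in for the lesson's data frame df (values are made up)
df <- data.frame(
  Tax      = c(1.5, NA, 2.5, NA),
  TotalDue = c(20, NA, 30, NA)
)

mean(df$Tax)                 # NA, because the column contains NA values
mean(df$Tax, na.rm = TRUE)   # 2, computed from the non-missing values only

# Replace each NA in Tax with the mean of the observed Tax values
df$Tax <- replace_na(df$Tax, mean(df$Tax, na.rm = TRUE))

sum(is.na(df$Tax))           # 0 missing values remain in Tax
```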
So, as an example, I am going to replace the missing values in the TotalDue column with its median. It works the same way: you call replace_na, tell it which column to look for missing values in, and then tell it what to fill them with. I run this, and let's check how many missing values there are: zero, as expected.

Another approach to handling missing values is simply to get rid of the rows that have them. For example, from the data frame df I can remove all the rows where Tax is missing. The way it works is that instead of is.na I use its negation, !is.na, which tells R to go to the df data frame and give me back all the rows where is.na is FALSE in the Tax column. Now we have this new non-missing data frame, where the Tax column should not be missing anything, because all those rows have been removed. Let's quickly check whether this works. We see that there are no missing values in Tax, but you will also note that there are no missing values in TotalDue either, even though we only removed rows based on the missing values in Tax. That is because the rows that were missing Tax were the same rows that were missing TotalDue, so when we removed them on the basis of Tax, the missing TotalDue values went with them. However, we see that there are still missing values in the CardholderName column.

Here is a word of caution: before removing the missing values or filling them in, first assess their nature. Are they missing randomly, or is there a pattern in them? There are useful visualizations that help in assessing that.
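The median imputation and row removal steps from this lesson can be sketched as follows. The data values are made up, the names df_median and df_non_missing are assumptions chosen for illustration, and tidyr is assumed to be installed.

```r
library(tidyr)  # provides replace_na()

# Hypothetical stand-in for the lesson's data frame df (values are made up).
# TotalDue contains one large outlier (900), where the median is the safer fill.
df <- data.frame(
  CardholderName = c("A. Rao", NA, "B. Lee", "C. Diaz", "D. Kim"),
  Tax            = c(1.5, 2.0, NA, 2.5, NA),
  TotalDue       = c(21.5, 30.0, NA, 900.0, NA),
  stringsAsFactors = FALSE
)

# Median imputation, done on a copy so the original df stays untouched
df_median <- df
df_median$TotalDue <- replace_na(
  df_median$TotalDue,
  median(df_median$TotalDue, na.rm = TRUE)   # median of 21.5, 30, 900 is 30
)
sum(is.na(df_median$TotalDue))   # 0

# Row removal: keep only the rows where Tax is not missing
df_non_missing <- df[!is.na(df$Tax), ]
colSums(is.na(df_non_missing))
# Tax and TotalDue now show 0, but CardholderName still has one NA
```

In this made-up data the rows missing Tax are the same rows missing TotalDue, which mirrors the behaviour seen in the lesson: filtering on Tax clears TotalDue as well, while the missing CardholderName survives.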