So, how does H2O deal with missing data by default?
And the answer depends on the algorithm.
For the tree algorithms, GBM and Random Forest,
it will treat NAs, missing data, as just another value.
Whereas GLM and deep learning
come with a parameter, missing_values_handling.
This can take the value of either skip or mean imputation,
and I'll come on to those in a moment.
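To make that concrete, here's a minimal sketch in H2O's Python API; the file and column names are placeholders I've made up, but missing_values_handling and its two values are the real GLM parameter:

```python
# Default NA handling differs by algorithm; file and column names are hypothetical.
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
train = h2o.import_file("train.csv")  # assume some cells are NA

# GBM and Random Forest need nothing special: NAs are treated as another value.
# GLM (and deep learning) instead expose missing_values_handling:
glm = H2OGeneralizedLinearEstimator(
    missing_values_handling="MeanImputation"  # or "Skip"
)
glm.train(x=["x1", "x2"], y="y", training_frame=train)
```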
But if you can, and if you can do it intelligently,
then you should always try to repair the data yourself first,
but never mechanically, never unthinkingly.
If you're just going to automate your data repair,
you should stick with the H2O default behavior.
So, how do you repair data?
I like to divide it into two broad techniques:
either throw it out or make it up.
Now, for throwing it out, there are two options.
If we have a column with a lot of missing data, for instance birthdays,
and we've only been given birthdays for
20 percent of the people in the database,
maybe we should just give up on using birthday and throw away that whole column.
The other approach, skip, says: if a particular data sample,
a row in your database,
has any missing data, throw away the whole row.
So this is saying, for the people whose birthday we don't know,
don't use them to train the model;
pretend they don't exist.
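In H2O's Python API, both of these are one-liners. A minimal sketch, assuming an H2OFrame called df with a sparsely filled birthday column:

```python
# Sketch of the two "throw it out" options on a hypothetical H2OFrame `df`.

# Option 1: give up on a mostly empty column entirely.
df = df.drop("birthday")

# Option 2, skip: throw away every row that has any missing value.
df = df.na_omit()
```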
In some situations, those are going to be good approaches.
In others, bad approaches.
If you have a large database with a lot of
columns and different missing data in each column,
you can end up with no training data at all if you're using skip.
So, watch out for that. What about making it up?
There are lots of options here.
The simplest is to choose a default value for anyone whose value is missing.
This is the January 1st birthday that we saw before.
It's no good in that case;
in other cases, it can be good.
Another approach is to take the mean of
that column and give that value to everyone who's missing it.
You could also use the median,
which, with birthdays, gives us June 30th for everybody.
Again, a bad idea.
For other kinds of data, though, this approach can work well;
it's very common, and in fact
it's the same mean imputation we saw as H2O's default.
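As a sketch, here's what those options can look like on a hypothetical H2OFrame called df. The impute method is H2O's own; the ifelse line for a fixed default is just one way I'd assume you could write it:

```python
# Sketch of simple "make it up" options; the column names are hypothetical,
# and each line is an alternative, not a pipeline.

# A fixed default value for everyone who's missing:
df["birth_year"] = df["birth_year"].isna().ifelse(1970, df["birth_year"])

# Mean imputation (H2O modifies the column in place):
df.impute("price", method="mean")

# Median imputation:
df.impute("income", method="median")
```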
But if at all possible,
there is a better approach,
which is to use correlations between columns.
Now, with birthday, you're unlikely to find many correlations.
But I'm going to use the example of, say,
a product database where the columns are the suppliers and the rows,
your data samples, are the products.
So, say, for some product, a television,
you have a couple of suppliers who don't
give you a price for that television because they don't stock it.
You don't want to throw away that product from your database.
You don't want to get rid of that supplier from your database.
What you can do is learn a model,
train just another machine learning model,
a GLM or Random Forest,
to predict what the value would be if they did supply it.
For instance, you might find that that particular supplier is normally
about one percent more expensive than your other suppliers for televisions.
The model could learn that
and fill in a price.
And this is a fairly good estimate of what
that television would cost from that supplier if they stocked it.
And you can go ahead and use that data to build a good model.
In this particular example,
it's better than mean imputation,
skip, or any of the other approaches we've looked at.
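Here's a rough sketch of that idea, assuming a hypothetical H2OFrame called products, where supplier_a_price has gaps and the other supplier columns are complete:

```python
# Sketch: train a model on rows where supplier A's price is known,
# then predict it for the rows where it's missing. All names are hypothetical.
from h2o.estimators import H2ORandomForestEstimator

has_price = products["supplier_a_price"].isna() == 0
known = products[has_price, :]         # rows where supplier A quoted a price
missing = products[has_price == 0, :]  # rows where the price is NA

rf = H2ORandomForestEstimator(ntrees=50)
rf.train(
    x=["supplier_b_price", "supplier_c_price", "category"],
    y="supplier_a_price",
    training_frame=known,
)

# Plausible prices for the products supplier A doesn't stock;
# use these to fill the gaps before training your real model.
predicted = rf.predict(missing)
```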
Another approach outside of H2O,
if you like, is just to go and get more data.
If you don't know your customers' birthdays,
go and ask those customers what their birthday is, or go
and stalk them on social media and see when everybody is wishing them happy birthday.
Some approaches are more ethical than others.
Some approaches are more expensive than others.
It's a balancing game there.
In the next video, we're going to look at a practical example of repairing data.