[MUSIC]
In this video, we'll discuss a latent variable model for clustering.
So what is clustering?
Imagine that you own a bank, and you have a bunch of customers,
and each of them has some income and some debt.
So you can represent each of your customers as a point on a two-dimensional plane.
And from this data,
you want to decompose your customers into three different clusters.
Why?
Well, for example, you want to find people who spend money on cars and
offer them some promotions, like a car-related loan.
This can be useful for retail companies, banks, and similar businesses:
finding meaningful subsets of customers to work with.
And this is an unsupervised problem: we don't have any labels,
we just have the raw data points.
Usually clustering is done in a hard way: we assign each data point a color.
This data point is orange, so it belongs to the orange cluster and
this one is blue.
Sometimes, people do soft clustering.
So instead of assigning each data point to a particular cluster,
we assign each data point a probability distribution over clusters.
So for example, the orange points at the top of this picture are certainly orange:
they have almost 100% probability of belonging to the orange cluster
and almost 0 of belonging to the rest.
But the points on the border between orange and blue are not settled.
They have, for example, a 40% probability of belonging to the blue cluster,
60% of belonging to the orange cluster, and 0 to the green one.
And we don't know which cluster these points actually belong to.
So instead of just assigning each data point to a particular cluster, we assume that
each data point belongs to every cluster, but with different probabilities.
And to build a clustering method with this property,
we will treat everything probabilistically.
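To make soft assignments concrete, here is a minimal sketch using scikit-learn's GaussianMixture (the GMM model mentioned later in this video); the two-dimensional income/debt data here is synthetic and only for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2D "income vs. debt" customers: three blobs (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=3.0, size=(100, 2))
               for c in ([30, 5], [60, 20], [90, 10])])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

hard = gmm.predict(X)        # hard clustering: one cluster index per point
soft = gmm.predict_proba(X)  # soft clustering: a distribution over 3 clusters per point

print(hard[0], soft[0])      # e.g. 0 and something close to [0.99, 0.007, 0.003]
```

Points near a cluster boundary get split probabilities like [0.6, 0.4, 0.0], which is exactly the "not settled" behavior described above.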
Why would we want that?
Well, there are several reasons.
First of all, we may again want to handle missing data naturally.
Another reason we may want to consider clustering
in a probabilistic way is to tune hyperparameters.
Usually, when you want to tune hyperparameters, you make a plot like this.
You consider a bunch of different values for a hyperparameter, for
example the number of clusters.
On the previous image we had three clusters, but
we can try a different number, like 4 or 5.
And for each of these values of the number of
clusters, we train our clustering model,
which is called the Gaussian mixture model (GMM); we'll discuss it later in detail.
So we plot the training performance here as the blue line, and
here I'm plotting the log-likelihood, so the higher the better.
And we can see that whenever we increase the number of clusters,
the performance on the training set improves,
which is kind of the usual thing with hyperparameters.
The more clusters you have, the better the model thinks it is, but it's actually not.
For example, if you put one cluster on each data point, the training loss
will be optimal, but it's not a meaningful solution to the problem at all.
If you consider the validation performance of your model,
it first increases when you start to increase the number of clusters,
then it stagnates, and then it starts to decrease.
And this is the usual picture for tuning hyperparameters:
you train a bunch of models and
choose the one that performs best on the validation set.
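Here is a sketch of that tuning loop, under the same synthetic-data assumption as before; GaussianMixture.score returns the average log-likelihood per sample, so higher is better.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=3.0, size=(100, 2))
               for c in ([30, 5], [60, 20], [90, 10])])   # three "true" clusters
X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)

for k in (1, 2, 3, 5, 10, 20):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    # The training log-likelihood keeps improving with k; the validation
    # log-likelihood typically peaks around a reasonable number of clusters.
    print(f"k={k:2d}  train={gmm.score(X_train):8.3f}  val={gmm.score(X_val):8.3f}")
```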
So this was a probabilistic model for clustering, but
it turns out that you can't do this for hard-assignment clustering.
Well, at least it's not always clear how to do it.
If you train one of the popular hard clustering algorithms, K-means,
it will think that the more clusters you have, the better,
both on the training and on the validation loss.
So it doesn't give us any meaningful way to decide which number
of clusters we want from the performance on the validation set.
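To see the contrast, here is a similar sketch with scikit-learn's KMeans (again on synthetic data); KMeans.score is the negative within-cluster sum of squares, and it typically keeps "improving" with more clusters on both splits, so it can't pick the number of clusters for us.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=3.0, size=(100, 2))
               for c in ([30, 5], [60, 20], [90, 10])])
X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)

for k in (1, 2, 3, 5, 10, 20):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    # score() is minus the within-cluster sum of squares; it keeps growing
    # as k increases, so it gives no signal about the "right" k.
    print(f"k={k:2d}  train={km.score(X_train):10.1f}  val={km.score(X_val):10.1f}")
```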
The probabilistic way of dealing with clustering is also not ideal.
For example, here we're not sure whether we want 20 clusters, or
60, or 80, but it gives us at least something: some boundaries on
what a reasonable value of this hyperparameter is.
So this was the first reason why we may
want to consider a probabilistic approach to clustering.
And the second one is that we may want to build a generative model of our data.
So if we treat everything probabilistically,
we can sample new data points from our model of the data.
In the case of customers, this means sampling new points on the two-dimensional plane
that look like the points we had in the training set.
And if your points are, for example, images of celebrity faces,
then sampling new images from the same probability distribution
means generating images of fake celebrities from scratch.
And this is kind of a fun application of building a probabilistic model of the data.
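As a small illustration of the generative side (again with synthetic customers rather than celebrity faces, which need far more powerful models), a fitted GMM can draw brand-new points that resemble the training data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=3.0, size=(100, 2))
               for c in ([30, 5], [60, 20], [90, 10])])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# sample() returns new points and the cluster each point was drawn from.
X_new, clusters = gmm.sample(5)
print(X_new)
print(clusters)
```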
So to summarize, we will want to build a probabilistic model for clustering.
And this may help us in two ways.
First of all, it may allow us to tune hyperparameters.
And it may give us a generative model of the data.
So in the next video, we'll build a latent variable model for clustering.
[MUSIC]