0:06

In this session we're going to introduce the most popular clustering method, the K-Means clustering method.

So you may wonder, who first proposed this K-Means clustering method? Some people credit MacQueen, because he published a paper on the K-Means clustering algorithm in 1967. But others point out that Lloyd actually described it in an internal company report in 1957. Since that report was only circulated internally, most people did not even know about it, so he republished it in a journal in 1982. That's the reason some people say Lloyd came up with this algorithm first.

0:52

But no matter who was first, the common idea of this algorithm is that every cluster is represented by its center point, or as we say, its centroid.

The K-Means algorithm can essentially be outlined as follows. Given K, the number of clusters, we first select K points as the initial centroids. You can select them randomly, which is what the original K-Means does, or you may use a smarter way to select them, which we will introduce later.

Then we enter a repeat loop. Once we have the K centroids, we form K clusters by assigning each point to its closest centroid. After this assignment, we need to recompute the centroids, because the initial randomly selected centroids may not be very good ones. Recomputing a centroid means taking the mean point of each cluster.

Once you recompute the centroids, some object initially assigned to, say, centroid A may now be closer to centroid B, because all the centroids have moved. That means we need to go back and form the K clusters again by re-assigning each point to its now-closest centroid. Then we recompute the new centroids, and so on.
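The loop described above can be sketched in a few lines of Python. This is a minimal illustration for 2-D points, not the lecture's own code; the random initialization, the stopping rule, and the sample data are assumptions made just for the sketch.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """A minimal sketch of the K-Means loop for 2-D points."""
    rng = random.Random(seed)
    # Step 1: select K points as the initial centroids (randomly here).
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: form K clusters by assigning each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean point of its cluster.
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:
                new_centroids.append((sum(p[0] for p in c) / len(c),
                                      sum(p[1] for p in c) / len(c)))
            else:
                new_centroids.append(centroids[i])  # keep a centroid with no points
        # Repeat until the centroids stop moving, i.e. the assignment is stable.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of points, for illustration.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
centroids, clusters = kmeans(pts, k=2)
```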

2:45

We can use Manhattan distance (the L1 norm) or Euclidean distance (the L2 norm), but the original algorithm typically uses Euclidean distance.
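For a pair of sample points (chosen here purely for illustration), the two distance measures are computed as follows:

```python
# Two 2-D points chosen for illustration.
p, q = (1.0, 2.0), (4.0, 6.0)

# Manhattan distance (L1 norm): sum of absolute coordinate differences.
manhattan = abs(p[0] - q[0]) + abs(p[1] - q[1])                # 3 + 4 = 7
# Euclidean distance (L2 norm): square root of the sum of squared differences.
euclidean = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5   # sqrt(9 + 16) = 5
```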

3:00

Now let's look at an example of how we can execute this K-Means clustering algorithm. Suppose we have data points in 2-D space; initially, the black points are the data points. Then, based on the first line of the algorithm, we select K points as the initial centroids. Here K is two, so we select two initial centroids.

3:29

Once we select these initial centroids, we can form K clusters by assigning each point to its closest centroid. You can see that each black point is assigned either to the red centroid or to the blue centroid; that is the current assignment.

3:50

Once you assign the points to form two clusters, we need to recompute the centroids. You can see that once we recompute it, the red centroid moves down here.

4:06

The blue one moves as well. Once we recompute the centroids, we may need to go back and reassign each point to its closest centroid. In this case you can see that these blue points are reassigned to the red cluster, and this red one is reassigned to the blue cluster. Then we calculate the centroids again.

Okay?

So we repeat this loop until the clustering becomes stable, or until the change in assignments becomes really, really small.

Now, the first important thing we should know is that this is actually a pretty efficient algorithm for clustering. Suppose we have n objects, we want to find K clusters, and the algorithm becomes stable after t iterations. In each iteration, every object has to be compared against the K centroids to see which cluster it belongs to, so one iteration costs n times K. With t iterations in total, the computational complexity is O(t × K × n). Usually K and t, the number of clusters and the number of iterations, are far smaller than the number of objects. So even for a very large number of objects n, this complexity is essentially linear in n; that makes it an efficient algorithm.
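To get a feel for the O(t × K × n) cost, here is the operation count under some illustrative (assumed) values of n, K, and t:

```python
# Hypothetical values: n objects, K clusters, t iterations until stable.
n, K, t = 1_000_000, 10, 50

# Each iteration compares every object against all K centroids.
per_iteration = n * K
total = t * per_iteration  # the O(t * K * n) cost

# t * K = 500 is tiny compared with n, so the cost grows linearly with n.
```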

However, K-Means clustering often terminates at a local optimum.

6:04

So the initialization becomes important if you want to find high-quality clusters. We need a smart initialization, or we need to randomly initialize many times and try to find a good clustering.

6:33

There are many ways to automatically determine the best K; there are research papers and a lot of effort contributed to this. In practice, you can also run the algorithm over a range of values of K and then select the one that works best.
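One common way to compare a range of K values, sketched here under assumed sample data, is to run K-Means for each K and record the sum of squared errors (SSE); the SSE keeps dropping as K grows, and the "elbow" where it levels off suggests a good K. The helper below is an illustration, not the lecture's own code.

```python
import random

def kmeans_sse(points, k, iters=50, seed=0):
    """Run a basic K-Means on 2-D points and return the sum of squared errors."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p[0] - cents[i][0]) ** 2
                                            + (p[1] - cents[i][1]) ** 2)
            clusters[j].append(p)
        # Recompute each centroid as the mean of its cluster.
        new_cents = []
        for i, c in enumerate(clusters):
            if c:
                new_cents.append((sum(p[0] for p in c) / len(c),
                                  sum(p[1] for p in c) / len(c)))
            else:
                new_cents.append(cents[i])
        cents = new_cents
    # SSE: total squared distance from each point to its closest centroid.
    return sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in cents)
               for p in points)

# Three well-separated groups, so the "elbow" should appear around K = 3.
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9), (5, 0), (5, 1), (6, 0)]
sse_by_k = {k: kmeans_sse(pts, k) for k in range(1, 6)}
```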

Also, the K-Means method is quite sensitive to noisy data and outliers, so there are variations like the K-Medians or K-Medoids algorithms that try to overcome this noise and outlier problem.

7:07

Another thing is that K-Means only works for objects in a continuous, numerical space. How do we cluster categorical data? For that, people invented the K-Modes algorithm.

7:27

Another important thing we should know is that, because we use the sum of squared distances, K-Means is not good at finding clusters with non-convex or non-spherical shapes. What it finds is something closer to a ball.

7:46

So if we have clusters of non-convex shapes, we can use density-based clustering or the Kernel K-Means algorithm instead. We are going to introduce these algorithms later.

Because K-Means came quite early and is very popularly used, there have been many studies working out different variations of the K-Means method. For example, how to choose the initial centroids.

8:28

Another aspect is how to choose different representative prototypes for the clusters. That means instead of the mean point, we may use K-Medoids, K-Medians, or K-Modes. We are going to introduce these in the subsequent discussion.

8:48

Another direction is how to apply feature transformation techniques, so that we may be able to cluster things even if they are not of convex shape. For example, methods such as Weighted K-Means and Kernel K-Means have been developed. We are going to introduce the Kernel K-Means method.
