0:00

As you may remember from our course introduction, in October 2006,

Netflix announced a competition with a million-dollar prize for

an algorithm that could beat its existing Cinematch algorithm by at least 10%.

I'm happy to be speaking with Yehuda Koren,

a key member of the team that won the Netflix prize in 2009.

Dr. Koren is now staff research scientist at Google in Israel.

At the time of the Netflix prize, he was a researcher at AT&T Labs and then at Yahoo.

Yehuda, welcome to our course.

>> I'm glad to be here.

>> So let's start at the beginning.

Could you remind us about the nature of the Netflix Prize Challenge,

and what your team's first effort was at solving it?

>> Yes, this brings back some good memories.

So it happened in October 2006.

Netflix published a dataset with one hundred million ratings

that real customers, half a million customers,

gave to many thousands

of movies over the six preceding years.

And at the time, it was considered a huge dataset.

It was a game changer for many of us, and

it attracted a lot of attention, as probably thousands of researchers and

practitioners started playing with the data.

1:30

The goal that Netflix set was to improve the accuracy of the Cinematch

system by 10%.

And 10% was measured on an error metric known as RMSE, root mean squared error,

where they wanted us to predict the score for

each user and movie in the test set.

And the true rating is a star rating between one and five.

So imagine the true rating is four stars and I predict 3.5;

then the squared error is 0.5 squared.

This is averaged over all ratings, and

the requirement was that the RMSE would be around 0.85.

And this was very tough.
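For readers following along, the squared-error arithmetic Dr. Koren describes can be sketched as follows. This is a minimal illustration of the RMSE metric, not Netflix's actual evaluation code:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over paired lists of ratings."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# The example from the interview: the true rating is four stars,
# the prediction is 3.5, so the squared error is 0.5 squared = 0.25.
print(rmse([3.5], [4.0]))  # 0.5
```

The competition target was to drive this number, computed over the whole hidden test set, down to roughly 0.85.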

2:24

I think no one realized how difficult it would be to reduce the error to this level;

otherwise, people wouldn't have participated.

It took almost four years to achieve this level.

But as I say, I believe everyone was optimistic at the beginning

that we could close this in a few months.

And then, by the way, we were totally strangers to the field.

But I think that the nature of the dataset,

the fact that it deals with movies, which everyone

is familiar with, got us really excited.

3:05

And we started exploring the data.

What we did in the beginning was quite standard.

I'm coming from computer science.

I was working on graph algorithms, so I said, okay, here is a big graph.

This is not the best representation, of course, but

it's a graph connecting users with movies, so

I think this is naturally related to what's known as neighborhood methods.

So let's optimize them. What I wanted to do in the beginning

was to find longer-range interactions between nodes in the graph.

I ended up tuning the neighborhood methods so they optimized the RMSE metric.

So when we tried to predict a rating for

a movie from other ratings for similar movies [INAUDIBLE],

the weights we give to other movies would not be arbitrary,

but would be driven by some optimization method that

is supposed to be good at optimizing the RMSE.
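The idea of replacing arbitrary similarity weights with optimized ones can be sketched like this. It is a simplified illustration with hypothetical data, not the team's exact method, which solved a regularized problem per target movie:

```python
import numpy as np

def learn_interpolation_weights(R, target, neighbors):
    """Learn weights w so that R[:, target] is approximated by
    R[:, neighbors] @ w, by least squares over the users who rated
    the target and all the neighbor movies.
    R is a users x movies array with np.nan marking missing ratings."""
    cols = [target] + list(neighbors)
    mask = ~np.isnan(R[:, cols]).any(axis=1)   # users with all ratings present
    X = R[mask][:, list(neighbors)]
    y = R[mask, target]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict(R, user, target, neighbors, w):
    """Predict a user's rating of the target movie from their
    ratings of the neighbor movies."""
    return float(R[user, list(neighbors)] @ w)
```

The key point is that the weights come out of an optimization driven by the squared-error objective, rather than from a hand-picked similarity formula.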

I was working then at AT&T, as mentioned, mostly with Bob Bell.

4:26

Bob is a statistician who was working with other approaches.

He used PCA, which is very similar to singular value

decomposition and- >> PCA means Principal Component Analysis.

>> Yes, Principal Component Analysis. And you can see it as an early

version of matrix factorization, which became really popular in this competition.

And so, as you asked at the beginning,

matrix factorization came up as a big factor in analyzing and

cracking the Netflix Prize dataset.

5:08

And a key to using matrix factorization was to

address only the existing entries in the matrix.

So we have a matrix with 500,000 rows corresponding to users and

almost 20,000 columns corresponding to movies.

99% of the entries in the matrix are missing.

The usual principal component analysis or singular

value decomposition would address all entries, which is hard to scale.

And it is also not a good idea for

the specific metric that we were competing on.

And we learned this gradually, starting with PCA,

which was treating all entries in the matrix.

Essentially, it forms a movie-by-movie

matrix by multiplying the ratings matrix with its transpose.

So this PCA that Bob did got combined with the neighborhood methods I was working on.
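A minimal sketch of factorizing only the observed entries, in the style that became popular during the competition: stochastic gradient descent on squared error with L2 regularization. The hyperparameters here are illustrative, not the values the team used:

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.01,
              epochs=1000, seed=0):
    """Fit user factors P and item factors Q on observed ratings only.
    ratings: iterable of (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]            # error on this observed entry
            pu = P[u].copy()                 # keep old value for Q's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```

Missing entries never appear in the loop, which is what makes this workable on a matrix that is 99% empty.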

6:23

So another major component was understanding the data,

what we call the global effects, residing in the data.

Global effects are like the average score that the movie receives.

Some movies are just better than others.

And you can say with confidence that the good movie will get the high rating

without knowing anything about the rater.

[CROSSTALK] >> Some people just like movies more than

others.

>> Yes, exactly.

That's the other side of the coin.

And some people also are less critical than others and

tend to give higher ratings.

So knowing nothing about the movie, you can say that for

this rater, the rating is going to be high.

So this is just, I call it data cleaning, because you say,

okay, let's remove these effects, and try now to model the cleaner data.

So this was the big part of what we did in the beginning.

And by the way, later on, we stopped doing this,

because it was better to have these effects baked into the model.

So learning the neighborhood model or

the matrix factorization model also learns most of these biases.

But in the beginning, we didn't realize this and we were cleaning the data.

And cleaning the data is important, in general, I always advise it.
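The global effects Dr. Koren describes, movie averages first and then rater tendencies on the residual, can be sketched as a simple baseline predictor. This is an illustration of the idea, not the team's exact procedure:

```python
def baseline(ratings):
    """Global mean, then per-movie effects, then per-user effects on
    the residual, in the spirit of the 'global effects' above.
    ratings: list of (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_item = {}
    for _, i, r in ratings:
        by_item.setdefault(i, []).append(r - mu)
    b_item = {i: sum(v) / len(v) for i, v in by_item.items()}
    by_user = {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append(r - mu - b_item[i])
    b_user = {u: sum(v) / len(v) for u, v in by_user.items()}
    return mu, b_item, b_user

def predict_baseline(mu, b_item, b_user, user, item):
    """Predicted rating knowing only who rated and what was rated."""
    return mu + b_item.get(item, 0.0) + b_user.get(user, 0.0)
```

Subtracting `predict_baseline` from each rating is exactly the "cleaning" step: the model is then trained on the residuals.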

>> Great. >> So these were small steps.

>> Let's jump ahead for a minute.

>> Yeah. >> Because we could take you through all

the steps, but over time, a bunch of different teams that were

competing seemed to come to the same realization.

That the techniques they were using were making progress but

they weren't going to get them to this magic 10%.

And they seemed to figure out that the key to solving this

challenge was to find a way to merge together the different algorithms that

each had different strengths and weaknesses into some composite algorithm.

How did that insight come about and

what was sort of the general high level approach that pulled everything together?

>> Yes, I'd say this is probably the one thing that people remember besides matrix

factorization.

I think, yes, what people remember from this competition is

the insane number of predictors that we used, which is a shame.

[LAUGH] No production system should use so

many predictors. And so this happened gradually.

9:12

So, I mean the alternative would be to really isolate what is unique

in what Bob is doing and what's unique in what I'm doing and

then find the single predictor, combining the merits of both.

And this is very difficult, and

far less effective than combining predictors, so this was the reality.

9:33

And as you move on, you are generating more and more predictors.

I think after a few months, we had ten predictors. We thought that was a lot,

so now we needed to be systematic about how we combined them.

So, we used linear regression to find the best linear combination

of all predictors; some of them could get negative weights.

But I was fortunate to work with a statistician like Bob, who ensured

that we were not overfitting the training set too much.
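The blending step can be sketched as ordinary least squares over the predictors' outputs. This is illustrative only; in practice the team guarded against overfitting, for example by fitting the weights on held-out data:

```python
import numpy as np

def blend_weights(preds, actual):
    """Least-squares combination weights for several predictors.
    preds: (n_samples, n_predictors) array, one column per predictor;
    actual: the true ratings.  Weights may come out negative."""
    w, *_ = np.linalg.lstsq(preds, actual, rcond=None)
    return w

def blend(preds, w):
    """Combined prediction: weighted sum of the individual predictors."""
    return preds @ w
```

As more predictors accumulate, only this one small regression needs to change; each team member's predictor plugs in as one more column.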

10:12

As time went on, I mean we accumulated more and more predictors.

And then yes at some point, it became apparent that we are not looking for

the romantic key insight or key algorithm that wins this competition but

we are looking for a big combination of methods.

Because no single elegant method can beat a combination of

many methods that cancel out the noise inherent at each predictor.

12:09

People have good rating days and bad rating days.

And this is a very strong effect in the data.

And if you don't model it no matter how many models you blend,

you're not going to hit the required precision level.

And then there are some subtler effects like it's

very important to know how many items the user rated in a given day.

This is just something in the Netflix data; it might be an artifact.

12:38

Maybe it represents the fact that at some point,

Netflix asked users to rate many items; maybe some people tend to rate in batches.

But this has a profound effect on certain movies,

whether they were being rated alone or as part of a big batch.

So all this was key to the progress that we made slowly.

That's why it took several years,

because these effects were hard to find and unexpected to us.
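The "how many items did this user rate that day" signal is easy to extract as a feature. A sketch, assuming hypothetical (user, item, rating, date) tuples:

```python
from collections import Counter

def daily_counts(events):
    """Number of ratings each user gave on each date.
    events: iterable of (user, item, rating, date) tuples.
    The count for (user, date) can then be attached as a 'frequency'
    feature to every rating that user gave on that day."""
    return Counter((user, date) for user, _, _, date in events)
```

A model can then treat a rating given in a batch of fifty differently from a rating given on its own.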

13:55

And this was another surprise for us. It was integrated into

the model and was also critical in improving the precision.

>> And certainly one of the things this gave was a way for

somebody who had a hypothesis.

If somebody thinks, gee, movies rated on Sunday might get happier and

therefore higher ratings.

They could build a very simple predictor.

And if it had any value, then mixing it into this ensemble could show if it

had incremental value without having to solve the whole problem themselves.

If they could just find little things that added incremental value,

there was a framework to build that value.

>> Yes, yes, we sort of distributed the human computation this way.

I mean, we could split the work among seven people,

each of them working independently. And at the end,

14:49

there is a single blending algorithm. It was very convenient and effective.

It was over-used certainly.

>> But it was also sort of recursive in that, if you think about how recommender

systems work, instead of finding the perfect data, we find lots of data and

a way of combining it in the hopes of giving you a good prediction [LAUGH].

>> [LAUGH] That's a good point.

Yes, so this is crowdsourcing at two levels at least.

>> Yeah, absolutely.

So the folks in this course, our learners, have heard me talk a lot about the Netflix

prize, both about the fact that it was over-optimizing the wrong thing.

That error measures for things people rated is not actually the most

important business question for a company like Netflix.

But also, the fact that it brought significant machine learning research

effort into recommender systems.

From your perspective, looking back, how has this prize and challenge

changed the field both for better, and if you think it has at all for worse?

16:03

>> I mean, I did this rating prediction because I was motivated by the competition,

but maybe it matters a little bit to Netflix because they

show the actual predicted rating to people [INAUDIBLE].

But in all the scenarios I saw, it doesn't add much value, and

it's not even correlated with [LAUGH] the real metric.

So first of all, it's really ranking that matters, even if we talk about accuracy.

I mean you hinted that accuracy metrics in general are not the most important,

which might be true.

But I mean, for us, data people, that's what we know how to do.

So product decisions might be much more important but this is beyond us.

17:36

But then later I published a paper that [INAUDIBLE].

A simple dense SVD on the dense matrix is working much better for

the real ranking problem.

So yes, I think the [INAUDIBLE]

metric is misguided in several ways.

And the community, including me, was obsessed with it for several years.

But this was fine.

I think one year after the Netflix competition

people started publishing about other methods.

About learning to rank, about diversification metrics,

about all sorts of methods.

I think there is still a lot of work to do in finding the right metric.

And being obsessed with error metrics for

several years still brought a lot of attention to the field.

Many good researchers, tons of papers,

many sessions in conferences, and it spawned books.

It made the field popular, which is very good.

And in my eyes,

the competition was a big positive boost to the recommender systems field.

>> Well, it certainly grew this from a field where people would get together.

You'd have 150, 200 people.

And suddenly, it was 400, 500 people.

19:33

>> And I was a graph visualization researcher before.

It totally changed my research direction towards

data science, which I find fascinating.

That's what I have been doing since then.

And not only recommender problems, but

problems that involve modeling and analyzing big data.

That's what I am doing, and it goes back to this Netflix competition.

20:34

>> I'm sometimes frustrated about metrics; it's very hard to design a metric

that fully captures things like the bias towards popular items.

Sometimes it's good, sometimes bad.

So I'm still looking for people to come up with good

ways to measure the quality,

the user impact, of recommender systems.

21:43

>> And the challenge of saying, yeah, these are objectively better but

they still might not be better when the user gets them.

>> Yes. I see it again and again.

The offline metrics are misleading us every time.

>> Wonderful.

Thank you so much for joining us and

we will remind people that this has been an interview with Yehuda Koren at Google.

See you next time.

>> Thank you.