0:00

With all this work done, let's look at a bigger example.

It does not illustrate the scale challenge, which we'll come back to very briefly in a moment. But it is slightly bigger than four nodes: now we're talking about eight nodes, numbered one through eight, representing eight webpages interconnected by directed links, the hyperlinks.

If we just stare at this graph and ask which node is most important, we might say that by in-degree count, node four is most important: it has an in-degree of four, higher than any other node. But if you think along the lines of

Google PageRank, which implicitly assumes a certain navigation behavior,

you see that node four gives away all of its importance score to a single node downstream: node three. And node three does not give it back to four; it spreads it further to nodes two and eight. So this gives you pause: maybe node three's importance score by the PageRank calculation will be higher than node four's. And indeed, we'll see that's the case.

Intuitively, nodes two, three, and four should be highly ranked, and that is what we will see. And which nodes will have the least importance score? Those that are not pointed to by many other important nodes. We can see that node six and node seven will probably have low importance scores, and indeed, that's the intuition we'll confirm in a short while: six and seven will be ranked low, while two, three, and four will be ranked high.

To make this a little more efficient, I've written out the H matrix already, based on the topology of this given graph. We can then construct H hat, but H hat is the same as H, because there are no dangling nodes. Then we can construct the matrix G, with the randomization parameter (or rather, its complement) set to 85%: fifteen percent of the time, we'll be doing random hopping among the pages.

Once you write down that G matrix, you can start the iteration, pi transpose becoming pi transpose times G, from one iteration to the next.
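The steps just described can be sketched in a few lines of code. The link structure below is an illustrative guess consistent with the edges mentioned in the lecture (4→3, 3→2, 3→8, and node four having in-degree four), not necessarily the exact graph on the slide:

```python
import numpy as np

# Hypothetical 8-node link structure: {page: list of pages it links to}.
# Chosen to match the edges mentioned in the lecture; an assumption otherwise.
links = {1: [4], 2: [4], 3: [2, 8], 4: [3], 5: [4], 6: [4], 7: [5], 8: [3]}

n = 8
H = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        H[i - 1, j - 1] = 1.0 / len(outs)  # row-stochastic link matrix

# No dangling nodes here, so H_hat = H; build the Google matrix G
# with 85% link-following and 15% random hopping.
alpha = 0.85
G = alpha * H + (1 - alpha) * np.ones((n, n)) / n

# Power iteration: pi^T(t+1) = pi^T(t) G, starting from the uniform vector.
pi = np.full(n, 1.0 / n)
for _ in range(100):
    pi = pi @ G

print(np.round(pi, 3))  # importance scores; they sum to one
```

Because G is row-stochastic, the vector stays normalized at every step, and with this topology node three indeed ends up ranked above node four.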

This shows the initialization: every node starts at one-eighth, exactly evenly distributed. It actually doesn't matter what initialization is used; it would change the convergence speed a little, but not the end result. The first iteration is shown here.

3:29

And I've already normalized them, so what you see here are the importance scores, normalized and sorted in descending order. Node three has the highest importance score and is therefore ranked number one, followed by node two, then node four.

You can see that node three is by far the most important. 0.2 sounds like a small number, but remember, pi is now normalized so the entries all add up to one; a score of 0.2 across eight nodes is actually quite impressive. Nodes two and four are about the same. And nodes six and seven are by far the smallest, at 0.06 and 0.04. Then again, Google PageRank returns only the ranking; it does not show you the actual scores, so the scores are just an intermediate step.

4:22

And this confirms the intuition we stated before: node three is the most important, even though node four has a higher in-degree. That's because the PageRank calculation models navigation behavior: if a lot of people come to node four, but then all have to go out to node three, then node three receives one hundred percent of the importance score spread from node four.

5:27

In fact, when we generalize PageRank, the famous algorithm we just showed, a little bit (you'll see exactly what I mean by that in the advanced material part of the lecture), you will see a striking parallel between PageRank and DPC, distributed power control.

PageRank is used by Google for billions of searches each day. DPC is used by all 3G cellphones; it was invented by a number of companies, though eventually most of that work funneled through Qualcomm. So these are two vastly different industries. One was invented in the early 90s, the other in the mid to late 90s. But mathematically, they have a very interesting parallel.

On one side is a network of interfering nodes, a topology represented by a matrix G, where the Gij's are the direct and interference channel gains. On the other side is a graph of web pages, hyperlinked and directed. It turns out there's another matrix with the same symbol but a very different definition, the Google matrix, which helps you rank these webpages. The functionalities are vastly different, but the generalized version of PageRank we just saw has exactly the same mathematical formula as DPC. So if you are curious about that, please come to the advanced material part of the lecture.

Now, PageRank can be understood from quite a few different angles. We talked about the eigenvector angle, and we talked about the iterative computation angle, which models the hopping, or navigation, behavior of a user: a random walk on the graph.
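The random-walk angle can be made concrete with a Monte Carlo simulation: a "random surfer" follows an out-link 85% of the time and hops to a uniformly random page otherwise, and the long-run visit frequencies approximate the PageRank scores. The tiny 4-node graph below is an assumption for illustration:

```python
import random

# Illustrative 4-node graph (an assumption, not from the lecture slide).
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, alpha, steps = 4, 0.85, 200_000

random.seed(0)
visits = [0] * n
page = 0
for _ in range(steps):
    if random.random() < alpha and links[page]:
        page = random.choice(links[page])   # follow a random out-link
    else:
        page = random.randrange(n)          # random hop among all pages
    visits[page] += 1

freq = [v / steps for v in visits]
print([round(f, 3) for f in freq])  # empirical visit frequencies, sum to 1
```

Node 3, which no page links to, gets visited only via random hops and so ends up with the lowest frequency, mirroring the "not pointed to by important nodes" intuition from earlier.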

There's also an economic growth model angle, which is actually what leads to the equivalence with DPC. But the biggest challenge for Google in using PageRank is the challenge of scale: N is too big. Fortunately, these graphs are such that H is sparse.

7:50

And the other matrices in the construction are rank-one matrices. So when you add two rank-one matrices to a sparse matrix, you get a very well-structured matrix G, and that structure helps a lot in scaling up the computation.

Notice that this computation is done on a central server, whereas DPC is done in a distributed way; that's the word "distributed" in distributed power control. So there's a big contrast between the two. But then again, with DPC we're talking about tens of nodes, while here we're talking about millions of nodes relevant to a search, if not the entire set of 40 to 60 billion webpages out there.

There are quite a few tricks you can use to improve the scalability of this centralized computation. One is numerical linear algebra: there are many ways to decompose the matrix H so as to speed up the matrix multiplication.

There are also a few more tricks. For example, it's only the order that matters, so don't worry too much about converging on the exact values of the importance scores. Certain web pages can be grouped together, and the dangling nodes can be treated separately. You may also introduce a hierarchy: group certain clusters of nodes together, run an approximation at the cluster level, and then distribute the given importance score among the nodes within each cluster. In the homework problem you looked at this cluster-based hierarchical approximation.

So whether through speed-up methods or approximation tricks, there are many approaches that make this algorithm run fast for large N.

Now, finally, Google also plays a game with companies doing SEO, search engine optimization. As soon as Google became popular around '98, SEO popped up around '99-2000, and since then it's been over a decade of constant back-and-forth struggle between the two.

SEO, in a nutshell, tries to increase your webpage's ranking by doing things that leverage the PageRank algorithm. For example, an SEO company will say: here's a truly important webpage I've constructed; if you pay me, I will provide pointers to your webpages, even though your webpages may not deserve a high rank. So it basically tries to change the structure of the H matrix. But Google is not sitting idle; it's also reacting. Two recent reactions came in early 2011 and May 2012, when Google announced changes in the way they run the actual, detailed version of the PageRank method to counter SEO's artificial lift of certain web pages' importance scores.

10:57

So, there are a lot of interesting stories that we won't have time to go into. Let's quickly summarize what we have seen so far in this lecture.

The key concept is that we can view the collection of hyperlinked web pages as a network: a graph that we represent by matrices such as H, H hat, and G. There is also a functionality, navigation, that we're implicitly modeling through the way we construct these matrices. You can disagree with those models, but so far Google's search results have proven to be quite useful.

And this is a general theme: the connectivity pattern provides a hint about node importance. We'll come back to this in lecture four with three additional importance metrics suitable for other purposes.

So PageRank, this famous algorithm we just saw and constructed, uniquely defines and then efficiently computes a consistent set of importance scores based on these ideas. And one of the angles we just saw and emphasized is the dominant eigenvector angle: pi star transpose equals pi star transpose times the Google matrix G.
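That eigenvector characterization can be checked directly on a small example: the PageRank vector is the left eigenvector of G for eigenvalue one. A sketch on an assumed 3-node graph, using a numerical eigendecomposition:

```python
import numpy as np

# Illustrative 3-node row-stochastic link matrix H (an assumption).
alpha = 0.85
H = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
n = H.shape[0]
G = alpha * H + (1 - alpha) / n   # Google matrix, no dangling nodes here

# Left eigenvectors of G are right eigenvectors of G transpose.
vals, vecs = np.linalg.eig(G.T)
k = np.argmax(vals.real)          # dominant eigenvalue of a stochastic matrix is 1
pi = np.abs(vecs[:, k].real)
pi /= pi.sum()                    # normalize into a probability vector

print(np.round(pi, 3))            # pi* satisfies pi*^T = pi*^T G
```

The defining equation holds because G is row-stochastic, so its dominant eigenvalue is exactly one and the corresponding left eigenvector, normalized to sum to one, is the stationary importance-score vector.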