The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

En provenance du cours de Stanford University

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

548 notes

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

À partir de la leçon

Week 1

Two motivating applications; selected review; introduction to greedy algorithms; a scheduling application; Prim's MST algorithm.

- Tim RoughgardenProfessor

Computer Science

So what are greedy algorithms good for? Well, it turns out they're well suited

for a number of fundamental problems across different domains of computer

science. And to wet your appetite, for the many

examples that we're going to see, I want to begin by discussing the problem of

optimal caching. The punchline of the lecture is going to

be that a natural greedy algorithm in fact minimizes the number of cache misses

over all possible ways of managing a small fast cache.

So what is the caching problem? Well on the one hand, there's going to be

a big, but slow memory which you can think of as holding everything you might

be interested in. And then there's also going to be what we

call a cache and so this is a much smaller memory to which access is much

faster. So this situation comes up all the time

across different doings, computer science, architecture, operating systems,

networking. just to mention a couple of really

obvious examples. You could imagine, the small fast memory

being something like an L2 cache. And the big slow memory being main

memory. Or perhaps actually main memory is the

fast memory, and the big slow memory would be disc.

Now, your task, in the caching problem is to process what we're going to call a

sequence of page requests. So a page request just means that the

client wants to access something in memory and it's guaranteed to be in the

big slow memory. But if its not already in the small fast

memory then you got to bring it in, you got to put it in there for the subsequent

access. The algorithmic aspect of the problem

answers the picture when there is a cache miss or also known as a page fault.

That is when there is a request for some data which is not already in the cache.

When that is the case, you have to bring it into the cache.

The design question then is what do you evict from the cache in order to make

room for this new piece of data which you have to bring in.

So to illustrate the issues, let's look at extremely simple example.

A cache that just has four slots for pieces of data.

Let's assume that initially the cache is seated with the four pieces of data I'll

call a, b, c, and d. Now remember the input is a sequence of

page requests, so these are requests for different pieces of data.

Now when a new request comes in, you're basically sitting there crossing your

fingers that what's been requested is already in the cache.

So for example, if the first request comes in for the piece of data marked c,

you said good we're good to go it's already in the cache go ahead and access

it. Similarly if the next request is for d,

you don't have to do anything. Good times roll, and d just gets accessed

directly. The problem arises when something is

requested that's not in the cache. So let's say the next request is for the

data item e. Now remember, you have to bring e into

the cache and of course you have to evict one of these four pieces of data to make

room for it, and your algorithm has to decide which.

Can you get rid of a or b or c or d. For this example, let's assume that we

evict a to make room for e. Assume further that the next request that

comes in is for a new piece of data f. Again it's not in the cache so we have to

evict something to make room for it. Let's assume we get rid of b in order to

bring in f. And now an unhappy situation but

something that could certainly occur is we get a request for something that used

to be in the cache but which we have since evicted.

So for example if the next request is for a, then we're stuck.

It's going to be another page fault, we have to evict something to bring in a.

And similarly, if there's a b, again we're paying the price for evicting b to

make room for f in the past. So in this example with these particular

choices for page evictions, we incur four page faults.

Now the first two the e and the f there's nothing we could have done about it.

We were given a cache that initially did not have e and f and then e and f showed

up, well what are you going to do, you're going to miss no matter what.

However, two were caused by our unfortunate eviction decisions early on,

to evict a and b only to find them requested after eviction.

And with 20/20 hindsight, we can conclude we really should have evicted c and d,

not a and b, to make room for e and f. So the point of this example is to

illustrate first of all the caching problem, how it works.

You have this small fast memory, it can't contain everything at once and so you

have to sort of manage the cache and evict things, to make room for stuff as

it gets accessed. That's the first point of the example.

The second point is to illustrate there's really two types of cache misses or page

faults. There's the ones which you know, really

you can't do anything about. No matter what algorithm you use, you're

going to suffer those faults. But then, depending on the eviction

algorithm, you maybe able to avoid some of the cache misses that you would incur

with an inferior algorithm. Algorithm.

So, now the obvious question is, how well can we do?

What's the best algorithm? How do we minimize the number of cache

misses, suffering only the ones that are inevitable?

So this question was given a very elegant answer by Belady back in the 1960's.

And I'm going to state the answer as a theorem.

it's a theorem we're not going to prove, for reasons I'll discuss in a second.

but what the theorem says is that a natural greedy algorithm is an optimal

algorithm for the caching problem. That is, it minimizes the number of cache

misses over any way you might think about managing the cache.

And the natural greedy algorithm that is optimal, is called the furthest in the

future algorithm. So what is the furthest in the future

algorithm? Well, it's exactly what you think it

would be. It's basically what seems like a good

idea at the moment you have to perform a eviction from the cache.

Basically you want to put off judgment day.

You want to put off the regret of evicting this particular piece of data as

long as possible. When are you going to regret evicting a

piece of data? Well, it's when it gets requested next.

So if we have four things in the cache, you know, one is one is requested next,

one is requested in seven time steps and one is requested in you know, 70 time

steps, that's the one you want to evict now because it will take the longest

until you actually regret that eviction. So for example, in the example on the

previous slide, you can check that the furthest in the future algorithm would in

fact evict the ones you want to evict, a and b,

not the ones that we evicted in the example, c and d.

Now at this point, many of you are probably justifiably scratching your

heads. You're wondering, you know, why is this

useful. It doesn't seem like this is what we

wanted. The objection, to this result being that the furthest in the future

algorithm is clairvoyant. Its very definition assumes that you know the

future, it assumes that at the moment that you have to make an eviction you're

aware of when each of the pieces data in the cache will be requested next.

But if you think for a minute about the motivating applications for sudding the

ultimate caching problem, this assumption simply doesn't hold, you simply do not

know the future, you simply do not know when each of the pieces of data in your

cache will be requested next. So this algorithm is not defined, it is

unimplementable. Despite that, this is still an extremely

useful result to know. Why.

Well, two reasons. First of all, this unimplementable

algorithm can never the less, serve as a guide line for practical.

Implementable algorithms. For example it naturally motivates the

LRU or least recently used caching algorithm.

So what you do in the LRU algorithm is that instead of looking forward in the

future, which you can't do generally, you look in the past.

And you say, well, let me guess that whatever's been requested recently, will

be requested again soon. Whatever hasn't been request been

requested for a long time, will continue to not be requested for a long time.

So, that sets as a proxy for the piece of data that's going to be referenced the

furthest down the future, you look for the one that was most recently referenced

the furthest back in the past. So that's the LRU algorithm.

And as long as data exhibits what's called locality of reference,

meaning whatever's being requested a lot in the recent past is also going to be

what's requested in the near future. Then LRU is going to approximate furthest

in the future. And indeed, LRU is in many applications,

the gold standard amongst practical implementable caching algorithms.

The second reason this theorem is useful in practice is because it served as an

idealized benchmark. A hypothetical perfect scenario against

which you can compare your latest and greatest cashing hereistic.

So for example, maybe you have a caching application and you start by implementing

the LRU least recently used caching algorithm and then as a sanity check you

probably want to go back later once you have hindsight you look at the last few

days of traces of logs of page requests and you say, how well did we do.

Let's look at how well our caching algorithm LRU did.

And let's look at how well we would have done had we known the future.

And hopefully, you're just a few percent away.

And then you can conclude that, yes indeed, the data seem to have locality

reference. Yes indeed, LRU is doing almost as well

as if we know the future, and we can proceed.

On the other hand, if you go back through the last few days of logs, and you find

that your caching algorithm is doing much worse than furthest in the future, then

it's back to the drawing board with respect to your caching algorithm.

You should work harder, understand the data better, and come up with a smarter

heuristic. So for almost all of the greedy

algorithms that we cover in this course, I'm going to explain to you why they are

correct. I'm going to prove it rigorously.

this algorithm is an exception. I'm actually not going to prove this

theorem for you. the way it works is by what's called an

exchange argument. So again, you may not have seen any

examples, but you will soon. but the exchange argument to prove the

latter result, as far as I know it's pretty tricky, believe it or not.

Even though the algorithm is natural, you might think this result feels a little

self-evident. Try to prove it rigorously.

Not easy. Not easy at all.

Indeed if you look at a textbook and say operating systems or a field like that,

generally you'll see a description of this algorithm.

You'll see the claim that it's optimal but you won't find the proof.

some algorithms textbooks, for example Algorithm Design by Kleinberg and Tardos,

do include the proof of this theorem. Those of you that are interested, I

challenge you to prove it yourself without looking it up on the web or in a

textbook. I think if you try to prove it you'll

appreciate the subtleties that come up in understanding whether greedy algorithms

are correct or not. In the best case scenario, I would love.

Love to see, a simple proof of this theorem, something that I could explain

in a class like this, in say five minutes or less,

that would be amazing.

Coursera propose un accès universel à la meilleure formation au monde,
en partenariat avec des universités et des organisations du plus haut niveau, pour proposer des cours en ligne.