0:04

I just want to briefly discuss hash tables because they are closely

related to words and the birthday problem, but

we won't spend a lot of time in lecture on this; there's more information in the book.

So, we talked just briefly about the balls and urns model and,

of course, this is very heavily studied in classical combinatorics.

And so, we talked about,

well, what's the probability that no urn has more than one ball.

What's the probability that no urn's empty?

You might want to know how many empty urns there are.

How many urns have k balls and so forth.

These so-called occupancy problems are well studied,

dating back many many decades.

0:59

The probability, and we looked at this briefly when we talked about asymptotics,

the probability that a given value occurs k times in a random word of length N,

is (N choose k) (1/M)^k (1 - 1/M)^(N-k).

That is, you get your value k times, each with probability 1/M, and

the other N - k positions are not your value, with probability (1 - 1/M)^(N-k).

So, given N balls and M urns, that's the probability that a given urn has k balls.
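As a quick check (my own sketch, not from the lecture), that occupancy probability can be computed directly from the formula:

```python
from math import comb

def occupancy_prob(n, m, k):
    """Probability that a given urn holds exactly k of n balls
    thrown uniformly into m urns: C(n,k) (1/m)^k (1 - 1/m)^(n-k)."""
    return comb(n, k) * (1 / m) ** k * (1 - 1 / m) ** (n - k)

# With n = 10,000 balls and m = 1,000 urns (alpha = 10), the most
# likely number of balls in a given urn is near alpha
probs = [occupancy_prob(10000, 1000, k) for k in range(30)]
print(max(range(30), key=lambda k: probs[k]))
```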

1:32

Now, we looked at the Poisson approximation, which works for

the situation where N/M = alpha is fixed and k is a small constant:

e^(-alpha) alpha^k / k!.

That's the so-called Poisson approximation.

And we showed how to come up with that

approximation when we talked about asymptotics.
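A small sketch (mine, not from the lecture) comparing the exact binomial probability with the Poisson approximation e^(-alpha) alpha^k / k! at alpha = 10:

```python
from math import comb, exp, factorial

def binomial_prob(n, m, k):
    # exact probability that a given urn has k of n balls in m urns
    return comb(n, k) * (1 / m) ** k * (1 - 1 / m) ** (n - k)

def poisson_prob(alpha, k):
    # Poisson approximation: e^(-alpha) alpha^k / k!
    return exp(-alpha) * alpha ** k / factorial(k)

# alpha = N/M = 10 fixed, k a small constant: the two agree closely
for k in range(5, 16):
    b = binomial_prob(10000, 1000, k)
    p = poisson_prob(10.0, k)
    print(k, round(b, 5), round(p, 5))
```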

2:01

So, and this is a very good approximation.

For example, this plot is for alpha = 10,

that is, N = 10,000 and M = 1,000.

That's a plot of the binomial distribution.

2:44

So, an application is hashing algorithms.

People in computer science are very familiar with these algorithms;

I'll just briefly describe them for people

in math who maybe haven't seen them.

They're very well covered in books on algorithms, like our own book.

The idea is we want an efficient way to put key value pairs in a symbol table.

And we talked about this for binary search trees.

And you want to search the table for the pair corresponding to a given key.

So the hashing strategy is to come up with a function

that maps each key onto a random value between zero and M-1.

And then develop a collision strategy to figure out what to do when two

keys hash to the same value.

3:35

And the basic algorithms that we'll talk about: one's called separate

chaining, where we essentially represent each urn with a linked list.

Another one is called linear probing, where we just use an array and never let

two keys go in the same place; on a collision, we scan through the array to the right for an empty spot.

And we'll talk briefly about both of those algorithms.

And the model that we use is this idea about the hash function:

we assume what's called the uniform hashing assumption, that the hash function

maps each key to a random value between 0 and M - 1.

And it's been shown to be not so difficult, for typical types of

keys, to get hash functions that reasonably approximate this uniform assumption.
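For illustration, here's a minimal sketch of one common way to build such a hash function, modular hashing of string keys; the function name and the choice of base are my own, not from the lecture:

```python
def hash_string(key, m, r=31):
    """Modular hashing sketch: treat the characters of the key as
    digits of a number in base r and reduce mod m (the table size).
    For typical keys this reasonably approximates the uniform
    hashing assumption."""
    h = 0
    for c in key:
        h = (r * h + ord(c)) % m   # stay in 0..m-1 at every step
    return h

# every key maps to some value between 0 and m-1
print(hash_string("EXAMPLE", 97))
```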

4:28

So, that's the setup.

So hashing with separate chaining is really easy to implement, and this is just

a diagram from our algorithms book that shows a hash table of size 5.

The keys are letters,

and the hash values are supposed to be random values between 0 and 4.

So the sequence of hash values is just a random word.

And then we store the keys in the list indexed by the hash value,

so the lists are the urns, and the keys, with their associated values, are the balls.
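Here's a minimal sketch of separate chaining along those lines; the class and method names are my own, and Python's built-in hash stands in for the idealized random hash function:

```python
class SeparateChainingHashTable:
    """Sketch of hashing with separate chaining: M lists (the urns);
    each key-value pair (a ball) goes into the list indexed by the
    key's hash value."""

    def __init__(self, m=5):
        self.m = m
        self.chains = [[] for _ in range(m)]   # one list per urn

    def _hash(self, key):
        return hash(key) % self.m              # value in 0..m-1

    def put(self, key, value):
        chain = self.chains[self._hash(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                       # key present: replace
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def get(self, key):
        # a search scans every ball in the key's urn
        for k, v in self.chains[self._hash(key)]:
            if k == key:
                return v
        return None

table = SeparateChainingHashTable(5)
for i, letter in enumerate("SEARCHXMPL"):
    table.put(letter, i)
print(table.get("S"), table.get("X"))  # prints 0 6
```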

5:08

So, that's easy to implement and,

of course, we're going to be interested in how long these lists are.

So again, this is just the balls-into-urns model for hashing with separate chaining.

5:25

And I'm probably spending too much

time on animations, but [SOUND].

So, what we want to know is: when we search for

an item, we're going to have to go through all the balls in its urn,

so we want to know the average number of balls in each urn.

5:50

Well, that's obvious: there are N balls and M urns,

so the average is N/M balls in each urn.

And actually, that doesn't depend on anything.

It's not very helpful at all.

6:03

If all your balls fall in one urn,

still your average number of balls per urn is N/M.

Doesn't have to be random, even.

So that observation is not much help.

6:16

So, we know that the probability that a given urn has k balls,

that's the occupancy distribution.

Generally, we're going to try to keep the number of

urns large enough that the number of balls per urn is a small constant, like alpha.

So we know that.

But what the programmer wants to know is

something about the chance that the balls are distributed evenly.

So we have to do a little more math to provide that answer.

So here's what a programmer might say:

make sure that M is big enough that

N/M is less than some constant alpha.

And it turns out to be not difficult to do that.

7:01

Then the average number of probes for searches is going to be less than alpha;

I know that.

But what's the chance that some search will use way too many probes,

say, 5 alpha probes, under the uniform hashing assumption?

7:30

And the trick is to use Stirling's formula to bound k factorial.

If you do that, then we get something that converges very rapidly,

something like e^(-alpha) (e/5)^(5 alpha).

And for alpha = 10,

that's a number with about 15 zeros, on the order of 10^(-15).

So you can say to the programmer that if you take alpha = 10,

this is extremely, extremely unlikely to happen.

And that's the kind of information the programmer needs.
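A quick sketch of my own (not from the lecture) of this tail calculation under the Poisson approximation, computing the chance that a given urn gets at least 5 alpha balls for alpha = 10:

```python
from math import exp

def poisson_pmf(alpha, k):
    # e^(-alpha) * alpha^k / k!, built up iteratively so that large k
    # doesn't overflow intermediate values
    p = exp(-alpha)
    for i in range(1, k + 1):
        p *= alpha / i
    return p

def poisson_tail(alpha, k0, terms=200):
    # P(X >= k0) for X ~ Poisson(alpha): sum the pmf upward from k0;
    # the terms shrink fast, so a couple hundred of them suffice here
    p = poisson_pmf(alpha, k0)
    total = 0.0
    for k in range(k0, k0 + terms):
        total += p
        p *= alpha / (k + 1)
    return total

alpha = 10
# chance that a given urn gets at least 5 * alpha = 50 balls:
# astronomically small, consistent with the rough bound above
print(poisson_tail(alpha, 5 * alpha))
```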

So that's hashing with separate chaining,

that's a typical calculation using classical occupancy distribution in some

of the asymptotics that we did in lecture four.

8:36

And everything's fine until we get

a collision, where two balls land in the same urn.

We know from the birthday problem that that's going to happen relatively soon,

and when it happens we just skip to the right until we find an empty urn.

8:57

You don't want to do this as the table starts to get full, as

we will see. Still, we're going to want analysis that

tells us the average number of collisions when we use this algorithm.

And it's a relatively easy algorithm to implement as well,

found in standard algorithms books.

So that is the key question: what is the average

number of probes to find one of the keys,

if you use this algorithm, when you run your program?
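Here's a minimal sketch of linear probing along those lines; again the names are my own, and Python's built-in hash stands in for the idealized hash function:

```python
class LinearProbingHashTable:
    """Sketch of hashing with linear probing: keys go into an array;
    on a collision, scan to the right (wrapping around) until an
    empty slot is found."""

    def __init__(self, m=16):
        self.m = m
        self.keys = [None] * m
        self.vals = [None] * m

    def _hash(self, key):
        return hash(key) % self.m

    def put(self, key, value):
        i = self._hash(key)
        while self.keys[i] is not None:
            if self.keys[i] == key:        # already present: replace
                self.vals[i] = value
                return
            i = (i + 1) % self.m           # collision: probe right
        self.keys[i] = key
        self.vals[i] = value

    def get(self, key):
        # each loop iteration is one "probe" in the analysis
        i = self._hash(key)
        while self.keys[i] is not None:
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + 1) % self.m
        return None

table = LinearProbingHashTable(16)        # keep it well under full
for i, letter in enumerate("SEARCH"):
    table.put(letter, i)
print(table.get("R"), table.get("Z"))  # prints 3 None
```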

9:43

The answer involves the sum, for k > 0, of

(N/N) ((N - 1)/N) ... ((N - k + 1)/N).

Which is really,

well the proof is given in ten pages in the textbook.

It really was a landmark result.

People were thinking this was too difficult a problem to really address and

Knuth was able to show this result.

And if you let the table get full,

if you let N get to M, the answer is related to the Ramanujan Q-function.

And if you don't let the table get full,

which we don't, in practice,

and I say don't let it get more than half full,

then it's going to be about 1/(1 - alpha),

and at half full that's about two probes.
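Knuth's exact analysis is beyond a quick sketch, but the qualitative claim, that the cost stays a small constant as long as the table stays at most half full, is easy to check with a rough simulation (my own experiment, not the analysis from the book):

```python
import random

def avg_insert_probes(m, n, trials=200):
    # fill a size-m linear-probing table with n random hash values and
    # return the average number of probes per insertion
    total = 0
    for _ in range(trials):
        occupied = [False] * m
        for _ in range(n):
            i = random.randrange(m)        # uniform hashing assumption
            probes = 1
            while occupied[i]:             # collision: probe to the right
                i = (i + 1) % m
                probes += 1
            occupied[i] = True
            total += probes
    return total / (trials * n)

random.seed(1)
# filling to half full (alpha = 1/2): the average stays a small
# constant, consistent with the 1/(1 - alpha)-type behavior
print(round(avg_insert_probes(1000, 500), 2))
```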

11:00

And Knuth taught that good writers don't use footnotes in technical writing.

But he said he couldn't resist putting in this one footnote,

which said that when he formulated this derivation,

it was the first nontrivial algorithm he had ever analyzed satisfactorily,

and that it had a strong influence on the structure of these books.

And it really is where the analysis of algorithms began,

was when Knuth solved this problem.

11:45

It seemed reasonable

that we should be able to analyze linear probing with the symbolic method.

So, we couldn't resist putting in one footnote in our book, either.

And that one said that we don't know the answer to this exercise:

we couldn't figure out how to use the symbolic method to analyze linear probing.

It seemed like the answer was so simple, and

so similar to the birthday and coupon collector problems, that we should be able to get it.

And really, we put this in as a challenge to students and researchers.

Really, you know, we should be able to do this one.

And it took some years, but eventually, and you can read about this

in the second edition of the book, it turned out that this problem has very

deep connections to properties of combinatorial objects,

like random graphs,

the gambler's ruin problem,

path lengths in trees, and many other classic algorithms.

And it's explained by a distribution called the Airy law,

with a very fascinating amount of mathematics behind it.

And Knuth was one of the people who solved it,

and Philippe and some co-authors solved it as well.

Although, actually, we still don't quite exactly,

precisely know the answer to this exercise; we know quite a bit.

So there's actually still a footnote left in the second edition.

But linear probing is worth studying to see both the potential and

the limitations of what we know about analysis of algorithms at this point.

One of the foundations mentioned here is properties of mappings and

that's what we're going to talk about next.