Instead of hashing. And there's another surprising situation that arises in today's world. Java, for example, publishes its hash function, so if you're trying to provide a service over the web, an adversary can learn your hash function and send you data that causes huge performance problems, just by making all of that data hash to the same slot. That's definitely something to worry about: in the real world, you can nowadays find on the web particular sequences of keys that will cause particular services to crash. And again, that's much harder to do with something like a red-black tree, where we have performance guarantees. When you make an assumption and you're depending on it, you'd better be sure that it holds somehow. This is different from, say, quicksort, where our assumption was that we would create randomness and then depend on that randomness. With hashing, we're just hoping for randomness, and maybe that hope doesn't always hold. So that's certainly something to be aware of when using hashing in practice.
So here's a simple example of hashing in Java. It's pretty easy to find a family of strings that all have the same hash code: with just a little fooling around, or nowadays by just looking it up on the web, you can see that certain two-character keys have the same hash code, because if you do the math for the base-31 hash code, it tells you that answer. What that means is that, just as when working in binary, you can combine those keys in all possible ways and get 2^N strings of length 2N that all hash to the same value. So if somebody has implemented a service in Java that uses a symbol table with string keys, you can cause it to crash in this way. That's a little bit scary for systems designers, or at least reason for pause in using hashing.
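To make that concrete, here's a minimal sketch. The colliding pair "Aa" and "BB" is the well-known one for Java's base-31 String.hashCode(): both hash to 2112, so concatenations of those blocks all collide.

    public class Collisions {
        // "Aa" and "BB" collide under Java's base-31 String hash, so every
        // string built by concatenating n of those blocks gets the same
        // hash code: 2^n strings of length 2n, all landing in one bucket.
        public static void main(String[] args) {
            int n = 3;
            for (int bits = 0; bits < (1 << n); bits++) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < n; i++)
                    sb.append(((bits >> i) & 1) == 0 ? "Aa" : "BB");
                String s = sb.toString();
                System.out.println(s + " -> " + s.hashCode());
            }
        }
    }

Running it prints eight strings, from "AaAaAa" through "BBBBBB", that all share one hash code, which is exactly the kind of input an adversary would send.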
Now, hashing also has an extremely important application in today's Internet commerce: the concept of so-called one-way hash functions, which we use to get secure fingerprints for use on the web. There's been a lot of research done to develop functions that take keys as input and produce values that look random, in such a way that it's hard for someone else to find another key that collides with a given one. This technology is useful for storing passwords, digital fingerprints, and the like, but it's too expensive for use in a symbol table.
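As an illustration, here's a minimal sketch using Java's standard java.security.MessageDigest API, with SHA-256 standing in as one such one-way function (the password string is, of course, made up):

    import java.security.MessageDigest;

    public class OneWay {
        public static void main(String[] args) throws Exception {
            // A one-way hash: computing the digest is cheap, but finding
            // another input that collides with it is believed infeasible.
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest("correct horse battery staple".getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest)
                hex.append(String.format("%02x", b));
            System.out.println(hex);  // 64 hex digits: a 256-bit fingerprint
        }
    }

A digest like this takes far more arithmetic than the few operations of a symbol-table hash, which is the sense in which it's too expensive there.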
So the bottom line on separate chaining versus linear probing as collision-resolution methods: there are a number of considerations to take into account. Separate chaining is really easy to implement, for both insert and delete, and its performance degrades gracefully; clustering is maybe less of a problem if you have a bad hash function. Linear probing tends to make better use of space, and it'll perform better for huge tables, where caching is involved. Classic algorithms problems for people to think about are what to do about delete in these two situations and exactly how to resize; those are all at the level of exercises in the context of the kinds of algorithms that we've seen (one standard approach to delete under linear probing is sketched below).
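Here's that delete sketch, in the style of a linear-probing symbol table. It assumes the usual fields: parallel keys[]/vals[] arrays of length m, key count n, and hash(), put(), and contains() methods. The trick is to reinsert the rest of the cluster so no search is cut short by the new gap.

    // Delete under linear probing: empty the slot, then take out and
    // reinsert every key in the cluster that follows, so that no probe
    // sequence is broken by the new gap.
    public void delete(Key key) {
        if (!contains(key)) return;
        int i = hash(key);
        while (!key.equals(keys[i]))
            i = (i + 1) % m;                 // find the key's slot
        keys[i] = null;
        vals[i] = null;
        i = (i + 1) % m;
        while (keys[i] != null) {            // rehash the rest of the cluster
            Key k = keys[i];
            Value v = vals[i];
            keys[i] = null;
            vals[i] = null;
            n--;
            put(k, v);                       // reinsertion finds the right slot
            i = (i + 1) % m;
        }
        n--;
    }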
And as I mentioned, there have been many, many improved versions of hashing studied. I mentioned the two-probe, or double-hashing, version. Another way to use two hash functions is to hash to two positions and put the key in the shorter of the two chains. In that case, the expected length of the longest chain will be lg lg N, which is quite an improvement: you get constant expected time and lg lg N worst case.
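A sketch of that two-choice insertion for separate chaining (the hash1/hash2 helpers and the chain bookkeeping here are assumptions for illustration, not from the lecture):

    // "Two-choice" chaining: hash to two candidate chains and insert
    // into whichever is currently shorter. With two choices, the longest
    // chain drops to about lg lg N in expectation.
    public void put(Key key, Value val) {
        int i = hash1(key);
        int j = hash2(key);
        int target = (chainLength[i] <= chainLength[j]) ? i : j;
        chains[target].add(new Node(key, val));
        chainLength[target]++;
        // Note: search must now look in both chains[hash1(key)]
        // and chains[hash2(key)].
    }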
Double hashing is the variant of linear probing where you skip a variable amount each time, not one. That pretty much wipes out clustering, but it's more difficult to implement delete for that one.
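The probe sequence under double hashing looks something like this sketch (hash1/hash2 are assumed helpers returning nonnegative values; keeping the table size m prime guarantees every skip is relatively prime to m, so the probes visit every slot):

    // Double hashing: the i-th probe for a key starts at its home slot
    // and advances by a key-dependent skip, rather than always by 1.
    private int probe(Key key, int i) {
        int home = hash1(key);                // first hash: home slot in [0, m)
        int skip = 1 + hash2(key) % (m - 1);  // second hash: skip in [1, m-1]
        return (home + i * skip) % m;
    }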
A relatively new method called cuckoo hashing is another open-addressing variant where we hash a key to two positions and insert the key into either one; if both are occupied, we reinsert the displaced key into its alternative position. Each key can go in one of two places, and that one actually gives constant worst-case time for search. That's another variation on the theme.
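Search under cuckoo hashing might look like this sketch (two tables, each with its own assumed hash function; the point is that a key can only live in one of two slots, which is where the constant worst-case search comes from):

    // Cuckoo hashing search: a key can only be at keys1[hash1(key)] or
    // keys2[hash2(key)], so search probes at most two slots.
    public Value get(Key key) {
        int i = hash1(key);
        if (key.equals(keys1[i])) return vals1[i];
        int j = hash2(key);
        if (key.equals(keys2[j])) return vals2[j];
        return null;   // not in the table
    }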
And all of these things allow us to make better use of memory, letting the table become nearly full. It would have been a very exciting thing to be a researcher in the 1950s, when people cared so much about memory; nowadays a little extra memory is not something that people care about so much, and most people just go with the easy algorithm, except for really performance-critical applications.
What about hash tables versus balanced search trees? Well, hash tables are really simple to code, usually, if you don't have to write the hash function yourself. And if there's no order in the keys at all, then you don't have the compareTo() needed to implement balanced search trees, so you have to use hashing. It'll also probably be faster for simple keys: a few arithmetic operations to do the hash, versus lg N compares for the balanced tree. And there's some better system support in Java for strings: the cached hash code means that you don't even have to compute the hash if your symbol table for strings is in an inner loop. On the other hand, balanced search trees have a much stronger performance guarantee: you don't have to assume anything, and it's going to be less than lg N compares. They've also got support for all the ordered symbol-table operations, and compareTo() is a pretty easy and natural function to implement, so they're more flexible and more broadly useful. And actually, the Java system and other systems include both, so that programmers can make use of either one in different situations, as in the sketch below.
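For example, using the two symbol-table implementations in java.util (HashMap is the hashing one, TreeMap is the red-black-tree one):

    import java.util.HashMap;
    import java.util.TreeMap;

    public class BothTables {
        public static void main(String[] args) {
            // HashMap: hashing; constant-time operations on average.
            HashMap<String, Integer> hash = new HashMap<>();
            hash.put("alice", 1);
            hash.put("bob", 2);
            System.out.println(hash.get("bob"));     // 2

            // TreeMap: red-black tree; lg N guaranteed, plus ordered ops.
            TreeMap<String, Integer> tree = new TreeMap<>();
            tree.put("alice", 1);
            tree.put("bob", 2);
            System.out.println(tree.firstKey());     // "alice": an ordered operation
        }
    }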
That's our context for hashing algorithms.