0:04

Next we're going to look at Ternary search tries, which is another data

Â structure that new data structure, that responds to the problem of too much

Â memory space used by our way tries. this is a very easy to describe and

Â implement data structure. And it came out of the same paper that

Â Jon Bentley and I wrote in the 1990s when we developed the three way Ternary

Â quicksort. the idea is now we're going to actually

Â store characters in nodes in values. but, we're not going to store whole keys

Â in there, just characters. but we're not going to use this idea of a

Â big array, where a nominal link is an implicit value of a character, actually

Â going to store the characters. but what we're going to do is, make each

Â node have three children not two, like in a binary search tree.

Â And not r, like in a Nr way trie, but exactly three and the three children are

Â for the trie corresponding to smaller keys.

Â The trie corresponding to trees that start with the same, character in the tri

Â corresponding to larger trees. And just given this, this des,

Â description and experience with many of the data structures that now, we've

Â talked about already in this course. many of you could go off and implement

Â this data structure, and we'll see how simple the implementations are.

Â 2:38

now, but for the left and right links those are for keys that start with some

Â letter that appears before S on the left, and after S on the right.

Â and then say, on the left, the node for B is, if you go down this link, it's all

Â the keys that start with B. Otherwise the B is just to divide the

Â ones that are less, from the ones that are bigger, and so forth.

Â and again, if you [COUGH] search for a key by say, look search for SHE, so

Â that's going down middle links and matching keys.

Â Then some nodes will have values associated with them, that's the value

Â associated with the last letter, last character in the search key.

Â So, that's what the data structure looks like.

Â and I'd begin to describe the search algorithm, it's quite simple given the

Â definition of the data structure. Again, we follow links corresponding to

Â each character of a key. If we have a character that's less than

Â the character in the node, we go left, if we have one that's greater, we can go

Â right. and if it's equal we move to the next

Â character, and take the middle link. so this is an example of a search for a

Â SEA. Start with S, that's a match, we take the

Â middle link. Now the next character is E that's a

Â mismatch and E is less than 8, so we go to the left.

Â 4:16

now we find a note that has E, and we didn't update our pointer into the key

Â character, because we didn't find a match.

Â So, we're still looking at the E, and now, now we can match that E, so we go

Â down the middle link. Now we're looking at the A in the search

Â key, and that's less than L, so we go to the left.

Â now we're still looking at the A in the search key, and that's a match and it's

Â the last character in key, so, we return, the value associated with the last

Â character in the key. So, it's the same basic algorithm, that

Â we used for trie's except, we just have a different way to decide whether we match

Â the character. We explicitly store characters in nodes.

Â We follow middle link on a match, and update the pointer in our key character.

Â Otherwise, we, we follow the left or right link, because we know that the key

Â is going to be found in the left or right sub-tree.

Â And each node has exactly three links. So here's a example of search in a TST,

Â again this is the example I just talked about in a dynamic form, form.

Â So, if S is the first character in the key, matches the S in the first node of

Â tree, so we take the middle link, and move on to the second character in the

Â key which is e. Compare that against h, it's not a match

Â in fact it's less, so we go left. So, now we're still looking at e and c,

Â and it's a match with the character in the node that we're currently processing.

Â So, we take the middle, and now we move to the next character in the key which is

Â a. And now we take a compared to l, and a is

Â less, so we go left, we're still looking at the a, didn't updated it, 'cuz it's

Â not a match, and now it is a match. and it's the last character in the key,

Â so we look at the value, and that's the value we return.

Â what about unsuccessful search? well, let's see how that works.

Â So, for shelter, it starts with an S, so we take the middle link, second, and move

Â to the second character. Second character's an h, and that's a

Â match. So, we take the middle link, and move to

Â the next character. Third character's an e and third

Â character, and that, that's also a match. So again, we take the middle link, and

Â move to the next character. the fourth character's an l, that's also

Â a match. So, again, we take the middle link, and

Â move to the next character. Now the next character is a t, which is

Â not a match, so we want to move to the right but moving to the right takes us to

Â a null link. So that means that shelter is not in the

Â TST and, and so we return null. Again, in every, in every step what we do

Â is completely well-defined. and any search for a key that in the TST,

Â it's going to return it's associated value.

Â Any search for any other key, is either going to, go out on a null link, or end

Â on a node that involves a mismatch. so, with that basic idea, let's take a

Â more detailed look in the demo, of how the tree gets constructed.

Â and following through this demo, will give you a pretty good feeling for how

Â these trees are constructed. So, we're going to associate the value 0

Â with a string, SHE. so, again every time we move to a new

Â letter, we have to create a new node, until we get to the end of the key.

Â And at which case, we put the, associated the value with the node that contains the

Â last character in the key. so, that's the starting point.

Â key is a sequence of characters down the middle links that [COUGH] goes to from

Â the root to some node. And the value is in the node that

Â corresponds to the last character. so, how do we put a new one in this tree

Â so the in this Ternary search trie. Our first character is S, so we go down

Â the middle. Our next character is not a match so

Â [COUGH] we go to the left, and it's a null link but we have a node that we want

Â to put there. That's the one corresponding to the

Â second letter, in sell's so we do that, and then we make nodes with middle links,

Â going until we get to the last character in sell's, and we associate that with the

Â value 1. So, what about SEA, let's see how that

Â one works. so, first character is S, so I take the

Â middle link, second character is e which is less than h and not a match so we move

Â to the left and continue looking for the e.

Â now this node we find the e, so we take a middle link, and move to the next node

Â [COUGH] next character which is a, a is less than l.

Â it's not a match so we go to the left, that's a null linl, and that's where we

Â put our a, and we associate it with the value 2.

Â OK, here's sh, shells so we have a match at s, go to the middle link, and go to

Â the next letter, match h middle link next letter.

Â and then we go down and add the three nodes corresponding to the last three

Â letters and put a three at the node corresponding to the last letter.

Â OK? now B Y we look at S not a match it's

Â less, so we're going to go to the left, that's a null link.

Â That's where we put our first character, and then we go down the middle link to

Â add the node for the last character,and then ah,associate the value for there.

Â and then B is a similar thing on the right.

Â And then go for the t and h and e, and that gets our share with the value 5.

Â 10:37

SEA again, it's associative, so we do the search, so S is a match.

Â So we go down the middle link, and move to the next letter.

Â E is less than h so we go to left, and keep looking for e.

Â Now we find e is a match, so we go down the middle link, and move to the next

Â letter. that's a, which is less than l, so we go

Â to the left to look for the a, and there we find it.

Â and we're [COUGH] building an associate symbol table, so we update and overwrite

Â the old value with the new value, 6. S, H, O, R, E.

Â we match the S, match the h, now we're on O, that does not match e, so we go to the

Â right, and keep looking for O. Now, we don't find it, so we create a new

Â node for O, R, E, and then put the value 7 in the values associated with the last

Â character in that key. so, that's the construction of a Ternary

Â search tree containing eight keys. Let's take a look at a slightly bigger

Â Ternary search trie. so here's our example with three letter

Â words. now for that example, we didn't show all

Â the null links, but there's lots of null links, 26 null links in each leaf for

Â this 26 way trie. And there's other null links higher in

Â the tree, and actually counting up there's over a thousand null links in

Â this tiny 26 way trie. But in a TST, it's got about the same

Â number of nodes it's actually got more nodes, because it's got one for each

Â character. And the other ones correspond to links,

Â but not that many more nodes. But the key thing is, it has many fewer

Â null links, because there's only three links per node.

Â And this effect is obviously going to be much more dramatic when R is much bigger.

Â like a 256 way trie or even a unicode 65,000 way trie.

Â that's the key benefit of TSTs, that and they're much more space-efficient, then

Â our way trie's for large R. so, let's take a look at the

Â implementation in Java, and then we'll look at the code, and then we'll talk

Â some more about performance. oh, the representation in Java is very

Â straightforward, as I said many of you could code this up, just from the

Â description. what does a node have to have?

Â It's going to have a value, it's going to have a character.

Â it's going to have references to left, middle, and right TSTs.

Â and that's what it's built from. so whereas the standard trie

Â representation has a ray of links with characters implicit.

Â TST has explicit characters with only three links per node.

Â so that's TST representation and then given the representation the code is

Â follows almost immediately. with our standard setup if we're supposed

Â to, now we use, we call a recursive routine, giving the root as first

Â argument. And give the tree returned going to give

Â a reference that back to the root. so our recursive routine takes a node and

Â returns a node and for most of the time, it, it doesn't do much.

Â But when it can do node, is created this this code is effective, because any link

Â we go down, we replace with a link that we got back, our standard setup.

Â so, the recursive port routine if we're that character d, we pull out the d

Â character. if once we've done that, if our node is

Â null, we create a new node and give it that character.

Â 15:25

otherwise, we go either down the left or right link under the same circumstances,

Â without moving to the next character in the key.

Â so that's extremely straightforward code for put in Ternary search trie's.

Â and again, get is even simpler. it's a similar set up that we've used

Â several times before. Use a recursive routine that returns a

Â node, and we pull our value out of that node.

Â we don't have to cast in this case, cause we're not running arrays, and our nodes

Â just has the value type in it. and if we get a null node back, yea, we

Â return null that its not there, the recursive routine will pull out the

Â character if its less than our if our search character is less we go to the

Â left. And if its greater we go to the right,

Â and if its equal we and we're not at the end of the key, we go down to the middle

Â and we move to the end of the key. If its equal and we are at the end of the

Â key, we return the node. And even that node, is going to contain a

Â value or it's not, but that's what we'e return.

Â so, extremely straightforward implementation of both get and put for

Â Ternary search drives. so this, adds this line to our table,

Â where, the amount of space taken for Ternary search trie's is just three

Â lengths you know, plus the value character in each node.

Â so similar to what red-black BST and hashes would use, so no space

Â disadvantage. the determining the cost of search hits

Â and search miss, these values are given for random keys, and they're the subject

Â of deep analysis. but it's not too many.

Â Le, Let's, let's put it that way. And our experimental results bear that

Â out. So, for, for a search miss in a Ternary

Â search trie, you don't have to look at the whole key.

Â Usually, you look at maybe fewer characters than than the whole key, and

Â actually these results are, are for typical, or random.

Â but actually TSD's really shine, in the case where data is not random.

Â they very well adapt to the data. like text written in English language, is

Â not really language really random. And you can see that Ternary search trees

Â actually out perform both hashing and red black BSTs for our benchmark, which is

Â deduping these big text files. [COUGH] now you, you can go ahead and try

Â to get worst case guarantees. By using rotations to balance the the the

Â internal binary search tree's corresponding to each character in, in

Â TSTs. although that's probably normally not

Â worth the trouble. but the bottom line is that TST is at

Â least it's as fast as hashing for string keys and it's it's space efficient for

Â sure. And it's pretty easy to speed it up.

Â the idea is to speed it up, that works extremely well in practice is to, use our

Â way of branching, at the be, at the top of the TST.

Â so rather than fool around with the first, couple of characters, simply do a

Â big branching at the root. You can do r way branching at the root of

Â 26, but in practice, it's pr, proven it works fine to do, let's say, r squared

Â way branching at the root. and immediately take care of the first

Â two characters, which is the common case, and then get down to small TSTs.

Â and there's variations of this, that you might imagine, but this is a really

Â simple one that almost always works. you have to deal especially with one and

Â two letter words. and we'll leave that for exercise.

Â but with that change now we get a, an algorithm that's is definitely faster

Â than hashing in red-black BSTs for our benchmarks.

Â and that'll turn out to be the case, for many applications.

Â So, just to summarize what are the trade offs between TSTs and hashing?

Â for hashing you have to examine the entire key, in order to compute the hash

Â function. we talked about examples, where people

Â try to skip doing that, and wind up with performance problems.

Â Not much difference between a hit and a miss in a hashing, and there's a lot of,

Â of reliance and maybe difficulty of implementation ah,depending on the hash

Â function. and also, hashing is really only good for

Â search and insert. the other kinds of operations that we'd

Â like to perform in symbol table when there's order in the keys is not

Â supported by hashing. with TSTs, it only works for strings or

Â if you have keys that are numbers, then you can rework them into strings, or

Â string like things. And that's that's a restriction but does

Â cover a huge number of applications. it, it only examines the key characters

Â that it needs to, and in particular, a search miss might involve only a few

Â characters. So, in typical applications you'll get

Â sub-linear performance in TSTs. You can get searches done without looking

Â at the whole key. the other thing is there's other, the

Â ordered symbol table operations you can, you know easily support with TSTs.

Â just by extending, what we did for binary search tree.

Â and even more important there's other interesting operations that we can add

Â for string keys. that are very useful, and we'll see

Â applications, and implementations of these, in just a minute.

Â so the bottom line is that this data, is a relatively new data structure.

Â but, its better than the classic ones for important applications.

Â Its faster than hashing particular for search misses.

Â and its got more flexibility even than red black BSTs, and that's what we will

Â talk about soon, but that's an introduction to TSTs.

Â