Next we're going to talk about binary search trees. And we saw that binary trees are a classic structure in combinatorics, and they're enumerated by the Catalan numbers. But they're also a classic structure in computer science. It's a fundamental data structure that is used for all sorts of things, but one of the most important is for implementing symbol tables, where we have lots of information, each piece of information is associated with a key, and we want to insert things into the symbols table, and we want to retrieve them by key. You can find much more information about symbol tables in our Algorithms book or in the books that are associated with the book. But I'll describe what binary search trees are because that's one of the most important algorithms in computer science and it uses a classical data structure, the binary tree. So we work with nodes, and those are things that we can write programs to manipulate. Each node has a key, and the keys have values that are so called comparable values, that means we can take two of them and know whether one's less, equal, or greater than the other, that's all. Now usually in the implementations we do nowadays, we assume the keys to be all distinct. We can do things with equal keys, but both for the analysis and for the implementation, it's better to use distinct keys. Now, in the binary tree representation, nodes have two subtrees, left subtrees and right subtrees, which are empty or might be whole trees. And if there's keys in the left subtree, all those keys are smaller than v. And if there's keys in the right subtree, they're all larger than v. That's what a binary search tree is. Now [COUGH] in Java, again, without going into detail of Java code, for those of you who don't know it, you might have to read the Algorithms book a bit to really follow all of this. And those of you who do know it, but either way, it's very elegant and simple, small amount of code to implement nodes and binary trees in Java. So we build a data type that holds this associated information, so that we can manipulate it and keep it together and write programs that manipulate, you'll see what those look like in just a minute. And the only operation that we need to perform is to be able to create a new node, with a given key and value. And that's what this code does, is defines that data type, a node's got four fields, it's got a key and a value, and it's got a reference to the two subtrees. The left subtree has smaller keys, and the right subtree has larger keys. And that's how we represent it. And for programming language aficionados, we usually use generic types for keys, which means that we can hold any type of data with that. And again, you can read about these details in the Algorithms book. So that's what the computer representation of a binary search tree looks like. We have nodes that have keys and values. And then each node points to another node, which is another binary search tree, the one on the left with smaller keys and the one on the right with larger keys. And it's a recursive structure. And eventually, we get to places where the left subtree is empty, or the right subtree is empty, in which case, those links would be null. So that's the structure. And now, what I'm going to describe is programs for manipulating that structure for inserting new keys and for searching for keys. The easiest is search, and let's look at the search implementation. This is a recursive program that operates on that recursive structure. There's some programming language jargon at the top that has to do with the generics, and otherwise, let's look at what it does. So this is the data type for manipulating binary trees, what is the binary tree? It's a root node, so that's what the first line of code says. Then there's the definition of nodes, and then this is the method for retrieving the value associated with a given key. I'll do it line by line. We have a variable x that we set to be the root of the tree. As long as that node, x, is not null, then we're going to compare the key at x, that's x.key, with the key that we were given to search for. That comparison can have three outcomes, either less than 0 or greater than 0 or equal 0. If it's less, the key we're looking for is less than x's key, then we'll go to the left. Go to the left is x = x.left. That's recursively moved down the tree to the left subtree that contains smaller nodes than x's, and we have a smaller key, so that's where we want to go. If it's greater, we go to the right. And if it's equal, we're done, we return x's value. That's the implementation of search for binary search trees. So in this example, you can see to search, say, for M, we start at S. M is less than S, so we go left. It's greater than E, so we go right. And then we're successful to see M. If we're searching for something that's not in the tree, say, we're searching for Q, we go left. Again, Q's less than S. It's bigger than E, we go right. It's bigger than M, we go right, but now we're on a null link. And essentially, that says that if Q were in the tree, it would have to be here, but there's nothing in the tree here, so it's not here. And in that case, down to the last statement in the get implementation, we return null. So, it's a very simply expressed algorithm. This is in a modern programming language. It's also easy to program these and even assemble a machine line, which people my age have done that even. In any programming environment now, you can implement binary trees pretty easily with a small amount of code. So that's unsuccessful. Now what about insert? So this is the code for insert out of the algorithm floor, and essentially, what it does is, when we do an unsuccessful search, and we come to a null link, so, say, you insert Q, we get to the same place. But then what it does is create a new node for Q. That's first line, x equals null, return new node. And then attaches it there, and so that's in the recursive calls, rather than just x = x.left, we say x.left = recursive call, and that's what does the attachment of the new node. This is a little bit tricky, but it's very concise recursive code. And then now we've implemented both insert and search. And so that's an important algorithm that we want to analyze the performance of this algorithm. It's actually got good performance, and it's widely used. Now the key fact that we have to worry about in doing this analysis is that the shape of a binary search tree depends on the order of insertion of the keys. So, the best thing that could happen would be the keys are inserted in such a way that the tree is perfectly balanced. That's like the tree that's ascribed Merge Sort that we looked at in the first lecture. The height of that tree is log N, and in that case, all searches are going to be guaranteed to take less than log N, key comparisons say. A typical case is more like the one that I showed for the example, where there's some nodes on either side for most subtrees, but it's not perfectly balanced. That's the one that we want to analyze, and we'll get to that in a minute. And there's also the worst case, say, the keys come in in order. If your keys come in in order, then it's not very good performance because the average cost of searching for a key in the tree is going to be about N/2, and that's going to be a performance problem. If you think of a huge symbol table, and nowadays most of them are huge, let's say a million or a billion keys, this one over here is going to get to any key with 20 or 30 searches. This one could require half a million or half billion. It's a very important performance difference that's going to make the difference between being able to address a problem and not address it at all. And where does the average fall in between? That's where the analysis of algorithms comes in. So it's reasonable to analyze binary search tree structures under the assumption that the keys are inserted in random order. Now that's a starting point, certainly, you've got situations where clients do not insert them in random order. And it's not like Quick Sort, where we can randomize because the insertions might come from some completely external force, and we don't have any of those source, and we don't have any way to rearrange their orders. So it's not a randomized algorithm that we can guarantee performance probabilistically, the way we can with Quick Sort. But still, it's a reasonable model that actually can be validated in lots of practical situations. So it's a good starting point for analysis of a tree algorithm. And when you do that, if you have, this is BST built from 80 keys inserted in random order. You can see that the trees you get, they're different, but unlike what we saw for random binary trees with each tree taken equally likely, these are kind of balanced. Well sometimes, you get a little imbalance. So that's typical random binary search trees, and again, our goal is to characterize this analytically and explain why this is different from what we get with random binary trees. So we're going to analyze them both. Now, the key thing is that when you're analyzing binary search trees, the shape of the tree is a property of permutations, not trees. Our input model is a random permutation. That's N things in arbitrary order, N factorial possibilities. Binary search trees, where we're talking about a random tree, we're talking about one out of a cataly number of possibilities, which is way less. So that's really important to remember that we're talking about property of permutations, not tree. So you can have, and this gives all possibilities for 1, 2, 3, and 4 nodes. And the one that's blown up is just one illustration that shows that two different permutations can lead to the same tree shape. So this case, 1 comes in, and 3 goes to the right, and then 2 goes to the left of 3, and then 4 goes to the right of 3. But 4 and 2 could come in the other order, and you'd still get the same tree. And again, on the other hand, the tree where the keys come in all in order, 1, 2, 3, 4, that's the only permutation that gives that shape. The ones that are more balanced is more permutations that I give rise to them. Now that's an important observation to try to understand why we get more balance with binary search trees than with binary trees. So that's an observation, and we're going to validate that. Now in order to get it done, we really want to think about the question, if you have a tree shape, how many permutations are going to lead to that tree shape? So that means if you would take a permutation and insert it into initially empty binary search tree, how many of them are going to give that shape? Well, we sort of just did that. The answer to that is there's two different ones. The 2 has to be first, that's the one at the root. And then either you can have the 1 go to the left and the 3 go to the right, or the 3 go to the left and the 1 go to the right, and those are the only ones. Let's look at a bigger tree. How many permutations map to that one? Well, that's a more complicated problem. So well, one thing you know is that the root has to be 4. So you've got at least the first element of permutation has to be 4. And then you have to have 1, 2, and 3 on the left and 5 and 6 on the right. Well 5 and 6 on the right, there's only one order that gives that shape, 5 has to go first and then 6. And on the left, you have the two possibilities. Either the 2, 1, 3, but the other thing is that those two possibilities with the 1, 2, 3 and the 5 and 6, they can be intermixed any number of ways, so the answer is 20. You've got 5 keys, that's the stuff that goes on the left and the right, and you can intermix them 5 choose 2 ways. And then you've got the two possibilities for the left, and in this case, the one possibility for the right. And those are the 20 permutations that lead to that tree shape. And you can check that, validate that if you don't quite believe or understand that map. So that's a bigger example. And then, from that, it's not hard to come to the general case, how many permeations map to a general binary tree like that? Well, there's some number of nodes on the left subtree, say the left subtree's t sub l, then size of t sub l. Then the root has to be that number + 1, and then the right subtree has its number of nodes. And then by the same argument that we just gave, if you define P sub t to be the number of permutations that map to a tree t, then you have a recursive formula. P sub t = t sub l + t sub R choose t sub l, so number of different ways to intermix them, the smaller elements in the larger elements. And then for all the ways of intermixing them, you can still put them in any one of the P sub t sub l orders or any one of P sub t sub R orders. You have to multiply all those together to get the number of permutations that map to a general binary tree. So that's an interesting observation, and actually, you can apply that root formula recursively and just read off from the tree the number of shapes that mapped to it. But we're going to use that exact formula a little bit later on. So and the other thing to observe from this formula is that that binomial coefficient is going to be much, much larger When the two subtrees are nearly balanced, that's the Gaussian distribution. So it's much larger at the middle, and we verified that analytically. If it's within square root of that, it's going to be much, much larger. And if t sub l is small, or t sub r is small, it's going to be exponentially smaller. So the number of permutations that map to an unbalanced tree is way, way smaller than the number that map to a balanced tree, and that's why, that's what we observe in practice. So in summary, so far, I've described two different binary tree models that they're both fundamental. Binary, random binary trees from the catalan is the classic combinatorial structure. Binary search trees is a classic Computer Science structure. In the one model, in the BST model that we use in simple table implementations, balance shapes are much more likely and actually one way to think of it is, what's the probability that the root is of rank K, for given K, 1 to N? Well, in the BST model, it's built from a random permutation. So, it's the first element in the permutation that goes at the root, so the probability than any particular value is at the root is just 1/N. It's the same for all. That doesn't seem like it balances, but if you compare it. And so, anyway, that sort of, balances. And second of all, compare that probability with what you get for the other case. So the Catalan model, on the other hand, each tree shape is equally likely. And this is the probability that the root is of rank k in a random binary tree. You can't have any tree on the left and any tree on the right out of all possible trees. That's called a Catalan distribution. And that gives rise to these much more unbalanced trees to fundamentally different models. I want to emphasize that, because we go into the analysis, and we get into the math without the picture, it's easy to lose track of the fact that we're talking about something that's totally totally different. In fact, here's a plot of that cattle land distribution. That the root is rank k in a relatively it was binary tree with N nodes. And again, this is normalized the way that we usually do this distributional parts, so that we can see how it converges to a curve and look at the curve. Most of the weight is often at the edges, extremely close to the edges, in fact. So as N gets larger, most of the probability that the root is a rank and number two is because exponentially small. And the probability that the root is the smallest or the largest actually is a quarter. So it's actually a pretty good chance that it's going to be unbalanced. And just, by the way, there's no magic in plotting a distribution like this. And people who are comfortable with programming, I encourage you to go ahead and try to develop plots like that. It's easy to write inefficient programs. But if you think about it, and this one isn't even all that efficient, but anyway, this is the problem that I used to produce that plot. So as I mentioned, the probability that at least one of the two subtrees is empty is about a half, so that's why it's going to be unbalanced. It's like flipping a coin and put in an empty or putting in a random one, it's going to lead to much, much different shape than we get for the other case. Now, again, just to full disclosure of the code that I'm using to make these pictures for people that are comfortable with programming, the code that I used to generate random binary trees. And it's all very similar to the code that I talked about for binary search trees. The only difference is that I ask, add coordinates to get the height and the width, and there's little calculation in terms of where to place nodes. And I used our programming model from the algorithms book site to do the drawings. But what I want to talk about now is the generate method that is supposed to generate a new tree at a certain depth. And what that node does is, basically it's a recursive program that computes the internal rank of the root according the appropriate probability distribution, and then recursively generates trees on the left and on the right. And in this case, to make the trees look good, I actually included the invisible external nodes in the rank field. And the point of this is it show there are a lot of similarities, and I use this same code to generate both kinds of trees. The other difference is for random BST, I uniformly choose the value of that rank. And for a random binary tree, I went to that Catalan distribution to use it. So with a relatively small amount of code, I've been generating these tree diagrams. And then, again, full disclosure, and I'm not going to go through that. This is the code using our standard drawing model from the algorithm's book site that scales and draws the binary tree, as I talked about, a few minutes ago. And, for those of you who enjoy programming, it's an interesting exercise, to implement the centered bi level method that I did for other random Catalan trees, and to draw general trees and forests and other things. Each one of those tasks in intriguing programming exercise for people who enjoy programming. So that's an introduction to binary search trees, and next, we'll get into the analysis.