Welcome to part one of our probability review. The first time that we need these concepts in the course is for those of you who want to understand the analysis of Quicksort, that is, why it runs in big O of n log n time on average. These topics will also come up a couple of other times in the course, for example when we study a randomized algorithm for the minimum cut problem in graphs, and also when we try to understand the performance of hashing. Here are the topics we're going to cover. We'll start at the beginning with sample spaces, and then we'll discuss events and their probabilities. We'll talk about random variables, which are real-valued functions on a sample space. We'll talk about expectation, which is basically the average value of a random variable. And we'll identify and prove a very important property, called linearity of expectation, which will come up over and over again in our analyses of randomized processes. So those are the topics for part one. Then we'll conclude the video with one example tying these concepts together: load balancing. This video is by no means the only source you can turn to to learn about these concepts. A couple of other sources I recommend are the online lecture notes by Eric Lehman and Tom Leighton, and there's also a Wikibook on discrete probability, which you could check out. And I want to be clear: this is really not meant to be a course or a tutorial on probability concepts, it's really only meant to be a refresher. So I'm going to go at a reasonably fast pace, and it's going to be a pretty cursory presentation. If you want a more thorough review, I would check out one of these other sources, or your favorite book on discrete probability. Along those same lines, I'm thinking that many of you have seen some of this material before. Don't feel compelled to watch this video straight through from beginning to end; feel free to just dip in and review the concepts that you need a refresher on.
So, let's start at the beginning with sample spaces. What is a sample space? Well, we're analyzing random processes, so any number of things could happen, and the sample space is just the collection of all of the things that could happen. This is basically the universe in which we're going to discuss probabilities and average values. I'll often use the notation big omega (Ω) to describe the sample space. One thing we've got going for us in the design of algorithms is that typically we can take Ω to be a finite set. So we're dealing only with discrete probability, which is a very happy thing, because that's much more elementary than more general probability. In addition to defining the outcomes, everything that could possibly happen, we need to define the probability of each individual outcome. Of course, the probability of each outcome should be at least zero, that is, non-negative. And there's also the obvious constraint that the sum of the probabilities should be one: exactly one thing is going to happen. Now I realize this is a super abstract concept, and the next few definitions are also a little abstract, so throughout them I'm going to use two really simple, really concrete examples to illustrate what these concepts mean. The first example is just that you take two six-sided dice and you roll them. Of course, the sample space is just the 36 different outcomes you could have of these two dice. And assuming that each of these two dice is well crafted, we expect each of these 36 outcomes to be equally likely, occurring with probability 1/36. The second running example I'm going to use is more directly related to algorithms, and it's motivated by the Quicksort algorithm. Recall that we're studying the implementation of Quicksort that chooses a pivot uniformly at random in every recursive call.
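To make the definition concrete, here is a minimal Python sketch (my own illustration, not from the lecture) of the two-dice sample space: a map from each of the 36 outcomes to its probability, using exact fractions so the constraints are easy to check.

```python
from fractions import Fraction

# Sample space for rolling two six-sided dice: all 36 ordered pairs,
# each assigned probability 1/36.
sample_space = {(d1, d2): Fraction(1, 36)
                for d1 in range(1, 7)
                for d2 in range(1, 7)}

assert len(sample_space) == 36
assert all(p >= 0 for p in sample_space.values())  # non-negative probabilities
assert sum(sample_space.values()) == 1             # probabilities sum to one
```

The two asserted constraints are exactly the two requirements from the definition: every probability is non-negative, and the probabilities sum to one.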
So, let's just focus on the very first, outermost call of Quicksort and think about the random choice of the pivot just in that call. Then the sample space, all of the different things that could happen, is just the n different choices for a pivot, assuming the array has length n. So we can represent the sample space simply as the integers 1, 2, all the way up to n, corresponding to the array index of the randomly chosen pivot. And again, by the very construction of our code, each of these outcomes is equally likely, with probability 1/n. Now let's talk about events. An event is nothing more than a subset of all of the things that could happen, that is, a subset of the sample space Ω. The probability of an event is exactly what you think it would be: it's just the sum of the probabilities of all of the outcomes contained in that event. Right, so an event is just a bunch of stuff that might happen; we know the probability of each individual thing that can happen, and we add them up to get the probability of the event. The next two quizzes are meant to give you some practice with these concepts, and in particular, they'll ask you to compute the probability of events in our two running examples. The first quiz uses our first running example, where we think about two dice and we have our 36 possible outcomes. Consider the subset of outcomes in which the sum of the two dice equals 7. What is the probability of that event? The correct answer is the third one: the probability is 1/6. Why is that? Well, first let's be more precise about what this event is. What are the outcomes in which the sum of the dice is equal to 7? There are exactly six such outcomes: (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1). Each of the 36 outcomes is equally likely, with probability 1/36. So we have six members of the set, each with probability 1/36, and therefore the probability of the event is 1/6.
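The quiz answer can be checked mechanically. Here's a short Python sketch of my own (not from the lecture): build the event as a subset of the sample space, then sum the probabilities of its outcomes, exactly as the definition says.

```python
from fractions import Fraction

# Two fair dice: 36 equally likely outcomes.
sample_space = {(d1, d2): Fraction(1, 36)
                for d1 in range(1, 7) for d2 in range(1, 7)}

# The event "the sum of the two dice equals 7" is just a subset of outcomes.
event = {outcome for outcome in sample_space if sum(outcome) == 7}

# Its probability is the sum of the probabilities of those outcomes.
prob = sum(sample_space[o] for o in event)

assert len(event) == 6              # (1,6), (2,5), (3,4), (4,3), (5,2), (6,1)
assert prob == Fraction(1, 6)
```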
Let's move on to the next quiz, which considers our second running example, namely the randomly chosen pivot in the outermost call to Quicksort on an input array of length n. Recall that in Quicksort, when you choose a pivot, you then partition the array around the pivot, and this splits the input array into two subarrays: a left one, with the elements less than the pivot, and a right one, with those bigger than the pivot. The more balanced the split into these two subproblems, the better. Ideally, we'd like a 50-50 split. What this quiz asks you is: what fraction of pivots, that is, what's the probability that a randomly chosen pivot, will give you a reasonably good split, meaning both of the subproblems have size at least 25%? That is, you get a 25-75 split or better. What's the probability that your randomly chosen pivot satisfies that property? The correct answer to this quiz is again the third option: there's a 50% probability that you get a 25-75 split or better. To see why, let's again be precise about the event that we're talking about; then we'll compute its probability. So when does a pivot give you a 25-75 split or better? For concreteness, suppose the array contains just the integers between 1 and 100. Now, what's the property we want? We want both of the two subarrays to have at least 25% of the elements, so that neither one has more than 75% of the elements. Well, if we choose an element that's 26 or bigger in value, then the left subproblem will have at least 25 elements, namely the numbers 1 through 25. And if we choose an element that's at most 75, then the right subarray is going to have at least 25 elements, namely the numbers 76 to 100. So anything between 26 and 75, inclusive, is going to give us a 25-75 split. More generally, any pivot from the middle 50% of the quantiles is going to give us the desired split.
So we do badly if we get something within the first quarter, we do badly if we get something within the last quarter, and anything in the middle works. More formally, we can say that the event S that we're analyzing is, among the possible pivot choices, the set of those that are not in the first quarter and not in the last quarter. Now the cardinality of this set, the number of pivots in it, is essentially half of the overall number of pivot choices; I'm ignoring fractions here for simplicity. The probability of this event is its cardinality times the probability of each individual outcome, and since we choose the pivot uniformly at random, each one has probability 1/n. So you get (n/2)/n, or 1/2. Now that we've explored the concept of events in our two running examples, we see that the probability that the sum of two dice equals 7 is 1/6, a useful fact to know if you're ever playing craps. And we know that a pivot gives us a 25-75 split or better in randomized Quicksort with 50% probability, a useful fact if you want to develop intuition for why Quicksort is, in fact, quick. That's events. Let's move on to random variables. A random variable is basically some statistic measuring what happens in the random outcome. Formally, if we want to define it, it's a real-valued function defined on the sample space Ω: given an outcome, given a realization of the randomness, it gives you back a number. The random variable that we most often care about in algorithm design is the running time of a randomized algorithm. That's the case, for example, with the Quicksort algorithm. Notice that it is, in fact, a random variable: if we know the state of the world, if we know the outcome of all of the coin flips that our code is going to make, then there's just some running time of our algorithm. So, in that sense, it's a random variable. Given the outcomes of the coin flips, out pops a number.
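Here's a small Python sketch of my own (not from the lecture) checking this computation for the concrete array of integers 1 through 100: count the pivots landing in the middle 50% of the quantiles and divide by n.

```python
from fractions import Fraction

n = 100                       # the concrete array 1..100 from the lecture
pivots = range(1, n + 1)      # each pivot value chosen with probability 1/n

# A pivot gives a 25-75 split or better exactly when it lies in the middle
# 50% of the quantiles: strictly above the first quarter and at most 3n/4.
good = [p for p in pivots if n // 4 < p <= 3 * n // 4]

prob_good = Fraction(len(good), n)
assert len(good) == 50
assert prob_good == Fraction(1, 2)
```

For n = 100 there are no rounding issues, so the answer comes out to exactly 1/2, matching the quiz.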
The running time, say, in milliseconds, of the algorithm. Here are a couple more modest examples of random variables in our two running examples. If we're rolling two dice, one very simple random variable takes as input the outcome, the result of the two dice, and spits out their sum. That's certainly a random variable: on any given outcome, it takes on some integer value between 2, at the minimum, and 12, at the maximum. Our second running example is the randomly chosen pivot made by the outermost call to Quicksort. Let's think about the random variable which is the size, meaning the subarray length, passed to the first recursive call. Equivalently, this random variable is the number of elements of the input array smaller than the randomly chosen pivot. This is a random variable that takes on some integer value between 0, at the smallest (that's if we happen to pick the minimum of the array as the pivot), and n-1, at the largest (that's if we happen to pick the maximum element as the pivot). Next, let's talk about the expectation of a random variable. This is really nothing more than the average. Of course, when you take the average of some statistic, you want to do it weighted by the probability of its various values; let's just make that precise real quick. Consider some random variable X. The expectation, also called the expected value, is written with a capital E and square brackets around the random variable: E[X]. Again, in English, the expectation is just the average value, naturally weighted by the probability of the various possible outcomes. Or, more mathematically: we sum over everything that could happen, so let i denote one possible outcome; we look at the value of the random variable when that outcome occurs, and we weight it by the probability of that outcome occurring.
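The definition translates directly into code. Here's a minimal Python sketch of my own (not from the lecture): the expectation is the sum, over outcomes i, of X(i) weighted by p(i), illustrated with a single fair die.

```python
from fractions import Fraction

def expectation(X, probs):
    """E[X] = sum over outcomes i of X(i) * p(i).

    X maps each outcome to a number; probs maps each outcome to its
    probability p(i)."""
    return sum(X(i) * p for i, p in probs.items())

# A single fair die: outcomes 1..6, each with probability 1/6.
one_die = {i: Fraction(1, 6) for i in range(1, 7)}

# Its average value is (1+2+...+6)/6 = 21/6 = 3.5.
assert expectation(lambda i: i, one_die) == Fraction(7, 2)
```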
The next two quizzes ask you to compute the expectation of the two random variables that we identified on the previous slide. The first quiz is about two dice and the random variable which is the sum of the values of those two dice. What is the average of that random variable? What is its expectation? The answer to this question is the second option: the average value is 7. There are a bunch of different ways to see that. In my opinion, the best way to compute this is using linearity of expectation, which is the next concept we're going to cover. If you wanted to, you could just compute this by brute force, by which I mean you could iterate over all 36 possible outcomes, look at the value of the two dice in each, and evaluate the sum we had in the definition on the last slide. A slightly sneakier way to do it, if you don't know linearity of expectation, would be to pair up the various outcomes. So it's equally likely that the sum of the two dice is 2 or 12; it's equally likely to be 3 or 11, 4 or 10, and so on. Each way of pairing up these values of the two dice sums to 14, so when you average, you get 7. But, again, the right way to do this is linearity of expectation, which we'll cover next. The second quiz covers the second random variable we identified. Now we're back to Quicksort and the random pivot chosen in the outermost call. The question is: how big, on average, in expectation, is the subarray passed to the first recursive call? Equivalently, on average, how many elements are going to be less than the randomly chosen pivot? The correct answer to this quiz is the third option. In fact, it's actually the quantity (n-1)/2, not n/2, but basically half the elements. Again, there's sort of a sneaky way to see this if you want, which is that, clearly, the two recursive calls are symmetric: the expected size of the left recursive call is going to be the same as the expected size of the right recursive call.
The two recursive calls always comprise n-1 of the elements, and because they're symmetric, you expect half in each, so (n-1)/2 in each. Though for this problem, I think it's perfectly fine just to compute this using the definition of expectation. If we let X denote the random variable that we care about, the subarray size, then we can compute directly by summing over all of the possible outcomes, that is, all of the possible choices of the pivot. With probability 1/n, we choose the minimum as the pivot, resulting in 0 elements being passed to the first recursive call. With probability 1/n, we pick the second smallest element, resulting in 1 element being passed to the first recursive call. With probability 1/n, we pick the third smallest, giving us a subarray size of 2, and so on. With probability 1/n, we pick the maximum element, giving us a subarray size of n-1. If you compute this sum out, you get, as expected, (n-1)/2. Expectation is the last definition that I'm going to give you in this part one of the probability review. Next is our fifth and final concept for this video, which is linearity of expectation. That's not a definition; that's more of a theorem. What is linearity of expectation? It's a very simple property of random variables that's super, super important; it comes up all the time when we analyze randomized algorithms, and random processes more generally. It's the following, very simple claim, which I'll sometimes denote just by LIN EXP for short. Suppose you've got a bunch of random variables defined on the same sample space. Then, if you want the expected value of the sum of these random variables, it doesn't matter whether you take the sum first and then take the expectation, or take expectations first and then sum. That is, the expected value of a sum of random variables is equal to the sum of the expectations of the individual random variables.
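The sum just described can be evaluated directly. Here's a Python sketch of my own (not from the lecture): with probability 1/n the pivot is the (k+1)st smallest element, giving a left subarray of size k, for k = 0, 1, ..., n-1.

```python
from fractions import Fraction

def expected_left_size(n):
    """E[X] for the left-subarray size, straight from the definition:
    sum over k = 0..n-1 of (1/n) * k."""
    return sum(Fraction(1, n) * k for k in range(n))

# The sum works out to (n-1)/2 for every n.
for n in (2, 10, 101):
    assert expected_left_size(n) == Fraction(n - 1, 2)
```

The closed form falls out of the arithmetic series 0 + 1 + ... + (n-1) = n(n-1)/2, divided by n.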
One of the reasons linearity of expectation is so ubiquitously useful is because it always works, no matter what these random variables are. In particular, it holds even when the random variables are not independent. Now, I haven't defined independent random variables yet; that will come in part two of the probability review. But hopefully you have an intuitive sense of what independence means: things are independent if knowing something about one of the random variables doesn't influence what you expect from the other random variable. Now, I realize that the first time you see linearity of expectation, it's a little hard to appreciate. First of all, as far as applications go, we'll see plenty throughout this course; in pretty much every single application of probability that we'll see, the analysis will involve linearity of expectation. But it may be hard to appreciate why this is not a tautology. Just symbolically, it may look like it has to be true. To point out that there is content here: if I replaced the sums by products, then this equation would in general be false if the random variables are not independent. So the same thing is not true about products; it's really about sums. Let me give you a trivial illustration of linearity of expectation, and point out how easily it allows us to evaluate the sum of two dice. In our first running example, let's introduce the random variables X1 and X2 for the results of the first and second die, respectively. Now, computing the expected value of a single die is easy: there are only six outcomes to enumerate over; contrast that with the 36 outcomes to enumerate over when we evaluated the sum of the two dice. So the average value of a single die, you won't be surprised to hear, is 3.5; it ranges over the integers between 1 and 6 uniformly, so 3.5 on average. And now, using linearity of expectation, the expected sum of two dice is simply double the average value of a single one.
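Here's a Python sketch of my own (not from the lecture) verifying linearity on the two-dice example: computing E[X1 + X2] by brute force over all 36 outcomes agrees with E[X1] + E[X2], which needs only the average value of a single die.

```python
from fractions import Fraction

# Sample space of two fair dice.
two_dice = {(d1, d2): Fraction(1, 36)
            for d1 in range(1, 7) for d2 in range(1, 7)}

def E(X):
    """Expectation of X over the two-dice sample space."""
    return sum(X(o) * p for o, p in two_dice.items())

X1 = lambda o: o[0]   # result of the first die
X2 = lambda o: o[1]   # result of the second die

# Linearity of expectation: E[X1 + X2] == E[X1] + E[X2] == 3.5 + 3.5 == 7.
assert E(lambda o: X1(o) + X2(o)) == E(X1) + E(X2) == 7
```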
On the next slide I'm going to prove this property, prove linearity of expectation, but frankly the proof is pretty trivial, so if you don't care about the proof, that's fine; you can skip it without loss. I'm including it just for completeness. And I've got to say, I don't know of another mathematical statement which is simultaneously so trivial to prove and so unbelievably useful. It's really something remarkable, linearity of expectation. So how does the proof go? Well, honestly, we just write out the sums using the definition of expectation, then we reverse the order of the sums, and we're done. Let me start with the right-hand side of the equation, the sum of the expectations of the random variables. Now let's apply the definition of expectation: each expectation is just a weighted average over the possible outcomes. Then, instead of summing first over the random variable j and then over the realized outcome i, I'm going to do it in reverse order: I'll sum first over the outcome i and then over the random variable j. Now, the probability of outcome i doesn't depend on j, so we can yank the p(i) outside of the inner sum. But now what have we got? Inside the parentheses we simply have the value of the sum of the Xj's on the outcome i, and then we're just averaging the sum of the Xj's with respect to the probabilities, the p(i)'s. So this is just the definition of the expectation of the sum of the random variables. That's it: linearity of expectation is really just a reversal of the double sums. Now, for those of you who are rusty on these kinds of manipulations, I just want to point out that there's nothing complicated at all about this reversal of the double sum. If you want a really pedestrian way to think about what's happening, just imagine that we take these summands, these terms Xj(i) times p(i).
And we just write them out in a grid, where the columns are indexed by the random variable j and the rows are indexed by the outcome i, and in a given cell of this grid, we write the summand Xj(i) times p(i). So if you get lost in the notation with these double sums, the point is that you can just interpret each of them in terms of this grid. Both of these double sums are nothing more than the sum of the values in all of the cells of this grid. One order of summation says you first take row sums and then add those up; that's the first summation. In the second summation, you first take column sums and then add those up. But of course it doesn't matter: either way, you get the sum of everything in the grid. Okay, so there are no tricks up my sleeve when I reverse these sums; it's a totally elementary, trivial thing. So again: linearity of expectation, trivial to prove, incredibly useful. Don't forget it. I want to conclude this video with one final example in order to tie together all of the concepts that we've just learned, or just reviewed, and that's going to be an example about load balancing, assigning processes to servers. This is in fact quite important for the analysis of hashing that we're going to see toward the end of the course as well. But for now, let's just think about the following simple problem. For some integer n, you have n computer processes that have to be assigned to n servers in some way. Now, you're feeling very lazy, so you're just going to take each of these processes and assign it to a totally random server, with each server equally likely to get a given process. And the question I want to study is: does this laziness cost you, at least on average? If you look at a given server, what's its expected load? So let's proceed to the solution, the answer to this question.
Before you start talking about expectations, you have to be clear about the sample space and the probabilities of the various outcomes. Remember, the sample space Ω just denotes everything that could possibly happen. So what are we doing? For each process, we're assigning it to a random server, so all of the things that can happen are all of the different assignments of these n processes to these n servers. And if you think about it, there are n raised to the n possible outcomes, because you have n choices for each of the n processes. Moreover, because each process is assigned to one of the servers uniformly at random, each of these n^n assignments is equally likely, with probability 1 over n^n. Now that we have a sample space, we're in a position to define a random variable, and we already know which random variable we care about: the load of a given server. All of the servers are exactly the same, so we can just focus on one server, say the first server, and let Y be the number of processes assigned to it. And if you go back to the problem statement, what we're asked to compute is the expected value of Y, the expected number of processes assigned to a server. Now of course, in principle, we could go to the definition of expectation and compute by brute force the sum, over all possible outcomes, of the value of Y, suitably weighted. Unfortunately, there are n^n different outcomes, and that's a lot. So what could we do other than this brute-force computation? Well, recall our example of linearity of expectation and the sum of two dice. We observed that instead of computing the expectation of the sum by enumerating over all 36 outcomes, it was much better to focus on a single die, compute its expectation, and then conclude with linearity of expectation. So we'll do the same thing here: instead of focusing on the sum Y, we'll focus on constituent parts of Y, namely whether or not a single process gets assigned to the first server.
And then we'll finish up using linearity of expectation. More precisely, for a given process j, let's define Xj to be 1 if and only if the jth process gets assigned to the first server, and 0 otherwise. Zero-one random variables like Xj are often called indicator random variables, because they, in effect, indicate whether or not a certain event occurs; in this case, whether or not the jth process gets assigned to the first server. Why did I make this definition? Well, observe that the total number of processes that get assigned to the first server is simply the sum, for j equal 1 to n, of the Xj's: Xj says whether or not a given process, the jth process, is on the first server, and the total number is just the sum of these over all j. Now, the benefit of this maneuver is that we only have to compute the expectation of an extremely simple indicator random variable Xj. This is like the win that we got when we were summing up two dice: instead of having to compute the expected value of the sum, we just had to focus on the expectation of a single die, and that was really easy. Similarly, here the expectation of a single Xj is really easy. Specifically, let's write it out just using the definition of expectation. The expected value of Xj is: well, let's group together all the outcomes in which it takes on the value 0. The contribution to the expectation is zero for all of those outcomes. And then there are the rest of the outcomes, where Xj takes on the value 1, and in those cases it contributes 1 to the expectation. Now, obviously, we get some happy cancellation happening here with the zero part, and all we have to worry about is the probability that Xj takes on the value 1. Okay, what was Xj again? How did we define it? Remember, it's 1 exactly when the jth process gets assigned to the first server. How are processes assigned?
Well, remember that the proposed solution assigns each process to each of the n servers with equal probability. So the probability that the jth process is assigned to the first server is 1/n. This leaves us with just the sum, for j equal 1 to n, of 1/n; that is, we sum 1/n with itself n times, which of course is equal to 1. So at the end of the day, what we find is that the expected number of processes assigned to a given server, say the first server, is just 1. So at least if we only care about averages, we lose very little from this trivial process of randomly spraying the processes across the servers: on average, any given server has just one process on it. This is characteristic of the role that randomization plays in algorithm design, and in computer science more generally. Often we can get away with really simple heuristics just by making random choices. Of course, Quicksort is one example of that, where we get an extremely practical, prevalently used sorting algorithm just by choosing random pivots in every recursive call.
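For small n, the n^n-outcome brute-force computation that the indicator-variable argument lets us avoid is actually feasible, so we can sanity-check the answer. Here's a Python sketch of my own (not from the lecture): enumerate every equally likely assignment and average the load on the first server.

```python
from fractions import Fraction
from itertools import product

def expected_load(n):
    """Brute-force E[Y]: average, over all n^n equally likely assignments
    of n processes to n servers, of the number of processes on server 0."""
    total = Fraction(0)
    for assignment in product(range(n), repeat=n):  # one server per process
        total += assignment.count(0)                # load on server 0
    return total / n**n

# Matches the indicator-variable analysis: the expected load is exactly 1.
for n in (2, 3, 4):
    assert expected_load(n) == 1
```

This only works for tiny n (the loop runs n^n times), which is exactly why the linearity-of-expectation argument is the right tool in general.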