0:16

Now, it is important to define uncertainty, or entropy, rigorously, because in information theory, the main thing we're concerned about is quantifying changes in uncertainty after receiving a message.

So maybe in a card game, you're interested in how much your uncertainty about what's in another player's hand changes after they play a certain card.

Or maybe you're interested in how much your uncertainty about who your secret admirer is changes after you read one of their notes.

Or maybe, in a more neuroscience context, you're interested in seeing how much your uncertainty about who's speaking to you changes after you hear them say a few words.

Now, in order to discuss this with math, we need to quantify this uncertainty.

You may have heard of other things, like standard deviation or variance, referred to as some kind of uncertainty, and in certain contexts those are very good descriptions of uncertainty.

But today, we are going to think about a new kind of uncertainty, and that uncertainty is called entropy.

So, basically, the entropy of a probability distribution tells us how uncertain we are about what state that distribution will produce.

So let's look at some examples.

To start, let's just think about discrete distributions.

Let's draw two different distributions and ask: which one has a higher uncertainty?

Which one has a higher entropy?

2:08

Remember, we can think of probability distributions as random object generators that, when you push a button, spit out a random object.

So, let's say this is distribution machine one, and this distribution can spit out four different objects: (0,0), (0,1), (1,0), and (1,1).

And let's say that the chance that this distribution spits out (0,0) is the same as the chance that it spits out (0,1), or (1,0), or (1,1).

So, in other words, the probability of any one of these is 0.25.

So, let's draw that out.

This axis is our probability, and this axis is what we'll call the state.

We'll say that the object the distribution spits out currently is its state.

So we have four different states: (0,0), (0,1), (1,0), and (1,1).

And the probability for each state is 0.25, so that's our first distribution.

Our second distribution is a little bit simpler.

Our second distribution spits out either (0,0) or (1,1).

So it only has two different possible states.

So if we plot those, it looks like this.

And, again, let's say that the states are equally probable.

If they're equally probable, the probability of each one is 0.5.

So, the question I ask you is: if you had to play a game where you bet on the object that comes out, and you get a dollar if you get it right and nothing if you don't,

4:07

which distribution would you choose to play the game with?

Hopefully you picked distribution number two.

With distribution number two, if you choose randomly there is a 50% chance you'll get it right.

So on average you'll make 50 cents every time you play the game.

If you had chosen distribution number one, there are simply more choices to choose from.

So the chance that you get it right is only 0.25.

So on average you'll make a quarter every time you play the game.

4:40

Since, on average, if you were to play the game with distribution one you would be less certain about what object comes out, we say that distribution one has a higher entropy.

Since, if you played the game with distribution two, you would be more certain about what state was going to come out, we say that distribution two has a lower entropy.

5:07

So, what exactly was it about distribution one that gave it a higher entropy?

Well, distribution one had more options; it had more possible states than distribution two.

So if we write out the number of states: how many states did distribution one have?

It had four.

And distribution two had two states.

5:32

So, in a way,

Â the uncertainty is just the number possible states the distribution has.

Â And, in fact, that's all entropy does.

Â All that entropy does is count the number of states, and

Â we write entropy as H, in information theory.

Â However, the entropy does not just count the number of states directly,

Â it actually takes the log of the number of states.

Â So let's write H = log(# of states).

Â And we usually use log base 2.

Â So what's the entropy of the first distribution?

Â Log 2 of 4 is just 2.

Â And what's the entropy of the second distribution?

Â Log 2 of 2 is just one.
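To make this concrete in code, here is a small sketch (my own illustration, not part of the lecture; the function name is made up) for the equally probable case:

```python
import math

# For a distribution whose states are all equally probable,
# the entropy is just log base 2 of the number of states.
def uniform_entropy(num_states):
    return math.log2(num_states)

h1 = uniform_entropy(4)  # distribution one: (0,0), (0,1), (1,0), (1,1)
h2 = uniform_entropy(2)  # distribution two: (0,0), (1,1)
print(h1, h2)  # 2.0 1.0
```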

6:32

So the second distribution has a lower entropy; it has a smaller number of available states.

So here is our definition: the entropy, H, of a random variable, let's say X.

7:19

So every time you push your random number generator button you get a color out.

And we'll call the color X.

So your options are red, green, blue, orange, yellow, or violet.

And they're all equally probable, with a probability of one-sixth.

How many states are there?

Well, there are only six states.

That's not too hard to fill out.

What is the entropy of this distribution?

7:50

That's just the log of the number of states.

Log base 2 of 6 is about 2.6.

What if we have a distribution over a three-dimensional vector?

So x1, x2, x3.

And let's put a constraint that says that in this vector

8:13

you are always going to have two zeros and one one.

So maybe x1 is one, and x2 and x3 are zero.

And let's say that there is an equal probability that the one is in any one of those places.

So our three possible states are (1,0,0), (0,1,0), and (0,0,1).

And the probability of each one of those is one-third.

How many states are there?

One, two, three: three states.

What's the entropy? Log base 2 of 3, which is about 1.6.

What if our distribution is over three-dimensional vectors, where each element has a 50% chance of being a one and a 50% chance of being a zero?

So the probability that xi, any arbitrary element, is 0 is 0.5.

And the probability that any arbitrary element is 1 is 0.5.

How many states do we have?

There are eight possible states.

You can go ahead and count those up.

The entropy of a distribution over eight equally probable states is log2(8), which is 3.

What if we had the same deal, but with a ten-dimensional vector?

9:49

Now the number of states is 2 to the 10.

So log2 of 2 to the 10 is just 10.

What if it's a 100-dimensional vector?

Now there are 2 to the 100 possible states, so the entropy of that is 100.
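This is easy to check numerically; a quick sketch (mine, not the lecture's): enumerating the states works for small vectors, while the logarithm handles huge counts directly:

```python
import itertools
import math

# A d-dimensional vector of independent fair bits has 2**d equally
# probable states, so its entropy is log2(2**d) = d bits.
def bit_vector_entropy(d):
    states = list(itertools.product([0, 1], repeat=d))  # enumerate all states
    return math.log2(len(states))

print(bit_vector_entropy(3))   # 3.0
print(bit_vector_entropy(10))  # 10.0

# For d = 100 we can't enumerate 2**100 states, but the log is still easy:
print(math.log2(2 ** 100))     # 100.0
```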

And maybe now you can see why we use the logarithm for entropy rather than just the number of states.

If you wanted to, you could have defined entropy as just the number of states, but that would have made the math a little harder, because you would be dealing with these really big numbers all the time.

Using the logarithm makes it so you don't have to deal with numbers like 2 to the 100; you can deal with the smaller number, just 100.

Okay, hopefully that is not too confusing.

10:45

The more possible states there are, the larger your entropy, and the larger your uncertainty about what state your distribution will spit out.

But what if the states are not all equally probable?

11:21

And the probability of any of the other options was 0.01.

In that case, the game suddenly becomes a lot more certain, and it's very easy for you to just bet on (0,0) every time and make quite a bit of money.

Therefore, when the states do not have equal probabilities, the entropy should decrease.

You should be less uncertain about the distribution.

We need a way to generalize the definition of entropy to take into account situations where not all states are equally probable.

So let's say we have some distribution; say there are three different states, a, b, and c, and this is the probability.

Maybe a has a probability of 0.5, b has a probability of 0.3, and c has a probability of 0.2.

The entropy of this distribution should be less than the entropy of the distribution where the probability of a, the probability of b, and the probability of c were all one-third.

So we need to generalize the formula H(X) = log(# of states).

12:58

And here's the answer. The full definition of entropy, the most general definition, is minus the sum, over all of the states i, of the probability of each state times the log of that probability:

H(X) = -Σi P(x=xi) log2 P(x=xi).

13:25

This is the general definition of entropy.
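In code, the general definition might look like this (a sketch of my own, not from the lecture, using the convention that a zero-probability term contributes nothing):

```python
import math

# General entropy: H = -sum_i p_i * log2(p_i).
# Terms with p_i = 0 are skipped, following the convention 0 * log(0) = 0.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0, four equal states
print(entropy([0.5, 0.3, 0.2]))           # about 1.49, less than log2(3)
```

The second call is the a, b, c example above: its entropy is indeed below the log2(3) of about 1.58 that three equally probable states would give.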

Now, just to make sure we're all on the same page, let's remind ourselves that when we calculate the entropy, the entropy is a function of something, right?

13:43

When we calculate the entropy, the input to the function is some probability distribution.

So the input is not just a single number, a single state.

The input is an entire distribution over states.

The output, however, is just a single number that characterizes how uncertain that probability distribution is.

14:15

So remember, it's important to think about what the inputs to functions are and what the outputs of functions are.

However, even though the input is the entire probability distribution, we usually just write H(X), where that random variable X stands in for the whole probability distribution over x.

14:49

It might feel a little off.

If you're interested in seeing the full derivation, you can look in the supplementary materials on the course web page.

But for now, I just want to talk through a bit of the intuition about what's going on.

First of all, since this is a generalization, it should work in the specific case in which all of our probabilities were equal.

15:15

So let's just check and make sure.

This is the special case that we started with.

So suppose all probabilities are equal: say we have N different states, and all of the probabilities are equal.

15:31

That means that P(x=xi) is just 1/N.

So we can plug in 1/N for P(x=xi) in our equation for entropy.

That gives us H(X) = -Σi (1/N) log2(1/N).

We can bring the 1/N outside and the minus inside, and -log2(1/N) is just equal to log2(N).

So when we do that sum, we have (1/N)(N log2(N)), which is just equal to log2(N).
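We can also verify this collapse numerically; here's a quick check (my own sketch) that the general formula agrees with log2(N) for uniform distributions:

```python
import math

def entropy(probs):
    # General definition, skipping zero-probability terms
    return -sum(p * math.log2(p) for p in probs if p > 0)

# For N equally probable states the general formula reduces to log2(N).
for n in (2, 4, 8, 100):
    uniform = [1.0 / n] * n
    assert math.isclose(entropy(uniform), math.log2(n))
print("uniform case matches log2(N)")
```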

16:41

So let's just do this with an example.

Let's say we have 4 states.

If they're equally probable, then the entropy is equal to log2(4), which is just 2.

What if the probability of the first state is one-quarter, the probability of the second state is one-quarter, the probability of the third state is 0, and the probability of the fourth state is one-half?

What does the entropy equal?

17:25

Well, let's plug it into our formula.

So we take the minus, making everything negative, and now we're doing the sum:

H = -(1/4 log2(1/4) + 1/4 log2(1/4) + 0 log2(0) + 1/2 log2(1/2)).

(By convention, 0 log2(0) is taken to be 0.)

17:57

So the 0 term goes away, and we're left with -(1/2 log2(1/4) + 1/2 log2(1/2)), which is equal to 1/2 log2(4) + 1/2 log2(2).

Log2(4) is 2, so one-half of 2 is 1.

And log2(2) is 1, so one-half of 1 is one-half.

So that equals three-halves, and it is in fact less than the case where we had equally probable states.

So it's starting to feel like this definition might be okay after all.
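Running the same numbers through the general formula (again, my own sketch, not lecture code) confirms the three-halves result:

```python
import math

def entropy(probs):
    # The zero-probability state is skipped (0 * log 0 is taken as 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

h = entropy([0.25, 0.25, 0.0, 0.5])
print(h)  # 1.5, less than the 2 bits of four equally probable states
```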

18:55

And to start this one, we are going to note that if you have some random variable X, the expected value of a function of X is equal to a weighted sum of all of the possible values that function can take on for different x's, weighted by the probability that those x's are sampled from our distribution.

So from that perspective, we can say that the entropy

20:01

So we can take its expected value, just like we would for any other function such as xi squared, or xi itself, which is what we were doing when we took the mean of the distribution.

21:08

What this means is that states with a very small probability, maybe over there, have a very high -log2(Pi).

And states with a very large probability have a very low -log2(Pi).

So one way to think of -log2(Pi) is as the amount of information you would get if your random number generator spat out the state xi.

And if it spits out a very probable state, well, you figure that was pretty likely to happen anyway.

So it's nothing new; you don't get much information out of it.

If it spits out something really improbable, something over here, then you get a lot of information out of that.

If you get a really improbable text message, such as "your house is on fire," that gives you a lot of information.

22:04

One other simple way to think about this is to pretend we have a distribution over colors again.

So, the probability that the color is red is 0.6.

The probability that the color is blue is 0.3, and the probability that the color is green is 0.1.

One way to think of this distribution is as a bag of marbles, where the marbles are labeled one through ten.

22:41

And let's say that marbles one through six are red, marbles seven through nine are blue, and marble ten is green.

You know the drill.

Put all the marbles in the bag, reach your hand in, and pull a random marble out.

23:05

If the marble you draw is red, then that doesn't give you very much information.

That only tells you that the marble is one of marbles one through six.

However, if the draw is green, which is an improbable state, then you know exactly which marble you've pulled out.

So this isn't a really crazy way to think about things.

So -log2(Pi) tells you the amount of information you would get about your system having seen the state xi.

And because this is probabilistic, we just take an average over this amount of information.

And that is the entropy of a probability distribution.
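Putting the two ideas together for the color example (my own sketch): compute the surprise -log2(p) of each color, then average it, weighted by probability:

```python
import math

colors = {"red": 0.6, "blue": 0.3, "green": 0.1}

# Surprise (information content) of seeing a state with probability p is
# -log2(p): probable states carry little information, improbable ones a lot.
surprise = {c: -math.log2(p) for c, p in colors.items()}
# red: about 0.74 bits, blue: about 1.74 bits, green: about 3.32 bits

# The entropy is the probability-weighted average of the surprises.
h = sum(p * surprise[c] for c, p in colors.items())
print(round(h, 3))  # about 1.295 bits
```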

23:47

So again, this is just a fancy mathematical way to talk about uncertainty.

And it will come up very often in information theory.

One last note is that we only talked about discrete distributions in this video.

So the entropy of a discrete distribution is minus the sum, over all possible states, of the probability of the state times the log of that state's probability.

24:42

And in the continuous case, it's actually a probability density.

So remember, you have to take the integral, the area under that probability density, to get an actual probability between 0 and 1.

So in the continuous case, going from sums to integrals isn't too tricky.

We just take minus the integral of the probability density of our random variable, times the log of that probability density, integrated with respect to x: H(X) = -∫ p(x) log2 p(x) dx.
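As a rough numerical sketch of that integral (mine, using a simple midpoint Riemann sum, not anything from the lecture), a uniform density of 1/2 on [0, 2] comes out to 1 bit:

```python
import math

# Differential entropy h = -integral of p(x) * log2(p(x)) dx,
# approximated with a midpoint Riemann sum over [a, b].
def differential_entropy(p, a, b, n=100_000):
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx  # midpoint of the i-th slice
        px = p(x)
        if px > 0:              # skip regions of zero density
            total += px * math.log2(px) * dx
    return -total

# Uniform density on [0, 2]: p(x) = 1/2 everywhere, so h = 1 bit.
h = differential_entropy(lambda x: 0.5, 0.0, 2.0)
print(round(h, 6))  # 1.0
```

One caveat of the continuous case: unlike discrete entropy, differential entropy can come out negative, for example for a uniform density on an interval narrower than 1.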

So there you have it.

That's the basics of entropy, a way to quantify uncertainty.

When we get to information theory, we're going to be primarily concerned with how much a distribution changes after we receive a message.

So maybe the distribution starts out with all four states being equally probable; then you receive a message, and suddenly the first state becomes much more probable and the probability of the rest of the states decreases.

And the way we talk about how the distribution has changed is by calculating the change in entropy: how much our uncertainty has decreased because of the message.

Okay, that's it for now.

See you next time.
