0:00

Hi, my name is Brian Caffo, and this is Mathematical Biostatistics Bootcamp, Lecture Six, on likelihood. In this lecture, we're going to define what a likelihood is, which is a mathematical construct that is used to relate data to a population. We're going to talk about how we interpret likelihoods and talk about plotting them. Then we'll talk about maximum likelihood, which is a way of using likelihoods to create estimates, and then we'll talk about likelihood ratios and how to interpret them.

Likelihoods arise from a probability distribution, and a probability distribution is what we're going to use to connect our data to a population.

So the idea behind this (and a lot, but not all, of statistics follows this rubric) is to assume that the data come from a family of distributions, and those distributions are indexed by an unknown parameter that represents a useful summary of the distribution. To give you an example, imagine that you assume your data come from a normal distribution, a so-called Gaussian distribution, a bell-shaped curve. To completely characterize a bell-shaped curve, all you need is its mean and its variance. So the probability distribution, the Gaussian distribution or the bell-shaped curve, has two unknown parameters: the mean and the variance. And then the goal is to use the data to infer the mean and the variance. The idea is that the mean and variance from the Gaussian distribution are unknown population parameters, because the Gaussian distribution is our model for the population. And the data, or sample statistics, are what we are going to use to estimate the unknown parameters.

So the nice part about this approach, compared with quite a few other directions in statistics, is that with the population model, estimators like the sample mean and the sample variance actually have estimands. The sample mean is actually estimating something. It's not just a statement about the data; it's an estimate of the population. That's what we're going to be talking about today: a particular way of approaching estimation and summarizing the evidence in the data, when you assume a probability distribution, using likelihood.

Likelihood is a mathematical function with a particular definition. It's just the joint density of the data, evaluated as a function of the parameters with the data fixed, and we'll go through an example.
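Written out (a small sketch of that definition, with f denoting the assumed joint density):

```latex
\mathcal{L}(\theta \mid x) \;=\; f(x;\theta), \qquad \text{viewed as a function of } \theta \text{ with the data } x \text{ held fixed}
```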

Before we go through our example, I want to talk about what it is that likelihoods are attempting to accomplish, and how we might interpret them. I'm going to put forward a particular theory of how likelihoods can be interpreted and how they can be used, and I should stipulate that maybe not everyone agrees with this theory. The theory I'm going to put forward is that ratios of likelihood values measure the relative evidence for one value of an unknown parameter relative to another. So if you evaluate the likelihood at one specific value of the parameter, you get a number; evaluate it at another value, and you get a different number. If the ratio of those two values is bigger than one, it supports the hypothesized value of the parameter in the numerator; if it's less than one, it supports the hypothesized value of the parameter in the denominator. This is a somewhat controversial interpretation of likelihoods, but it's the one I'm going to put forward.

The second point is similarly controversial, though there is a mathematically correct proof that at least motivates it (it doesn't actually prove it). The statement is that, given a statistical model, so a probability model, and observed data, there is a theorem called the likelihood principle that says all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood. Now, the likelihood principle has a mathematically correct proof, but not everyone agrees on its applicability and its interpretation. Nonetheless, I'm going to put this forward as the way that, in this class, we're going to interpret likelihoods: once you collect the data, if you assume a statistical model, then the likelihood contains all of the relevant information.

It's interesting that this point two has very far-reaching consequences for the field of statistics if you believe it. Things like P-values, much of hypothesis testing, and other staples of statistics become questionable if you take point two as being true. For today's lecture, we're going to take it as being true, and we'll talk a little bit about some of the controversy associated with it.

Probably more practical is point three, which says (and we already know this, but let's state it in terms of likelihood) that when we have a bunch of independent data points Xi, the joint density is the product of the individual densities. Equivalently, since the likelihood is nothing other than the density evaluated as a function of the parameter, it's also true that the likelihoods multiply. Independence makes things multiply: it makes the joint density multiply, and it makes the likelihood multiply. I've summarized that here in the statement that the likelihood of the parameter given all of the Xs is simply the product of the individual likelihoods.

The last point I'd like to make on this slide is that these interpretations of likelihoods, especially points one and two, assume that you actually have the statistical model specified correctly, and of course we don't really ever know the statistical model. If we assume that our data are Gaussian, that's an assumption; it's not generally something we know. Maybe in some rare cases, like radioactive decay, there is some physical theory that suggests the data are Poisson, for example. But in most cases, we don't actually know that the statistical family is a correct representation of the mechanism that would generate the data if we were to draw from the population. So I think the way people still rationalize using likelihood-based inference in these cases is to say: given that we assume this is the statistical model, we will adhere to the use of the likelihood to summarize the evidence in the data.

Let's go through a specific example.

It's one of the more important examples, and it's very illustrative, so let's do it. Consider just flipping a coin, but let's say it's an oddly shaped coin; maybe it's a little bent or something like that. So you don't actually know the probability of a head. Let's label the probability of a head as theta. Then recall that the mass function for an individual coin flip is theta to the x, times one minus theta to the one minus x, where theta has to be between zero and one. If X is zero it's a tail, and if X is one it's a head. So if we flip the coin and the result is a head, then the likelihood is simply the mass function with one plugged in, right? In this case we get theta to the one, times one minus theta to the one minus one, which works out to be theta. So the likelihood function is the line theta, where theta takes values between zero and one.

And if you accept the laws of likelihood, the likelihood principle, and the interpretation of likelihood that I outlined on the previous page, then consider two hypotheses: the hypothesis that the coin's true success probability is 50 percent, .5, versus the hypothesis that the coin's true success probability is .25. In light of the data, right, the one head that we flipped and obtained, the question is: what is the relative evidence supporting the hypothesis that the coin is fair, .5, over the hypothesis that the coin is unfair with the specific success probability of .25? We would take the likelihood ratio, which is .5 divided by .25, which works out to be two.

So if you accept our interpretation of likelihoods, this says there is twice as much evidence supporting the hypothesis that theta equals .5 as the hypothesis that theta equals .25. That is the idea behind using likelihoods for the analysis of data. Now let's just extend this example.

Suppose we flip our coin from the previous example, but instead of flipping it just once we get the sequence one, zero, one, one. I have kind of a funny notation here: I'm going to write script L for the likelihood, and L is a function of theta, but it depends on the data that we actually observed, one, zero, one, one. We're assuming our coin flips are independent, so what happens with the likelihood? You take the product. Here I have the first coin flip, theta to the one times one minus theta to the one minus one; here I have the second coin flip, theta to the zero times one minus theta to the one minus zero; and so on. Taking the product of all of those, you get theta cubed times one minus theta raised to the first power. That's the likelihood for this particular configuration of ones and zeroes from four coin flips. Notice, however, that the order of the 1s and the 0s doesn't matter. Regardless of the order, as long as we got three heads and one tail, the likelihood was going to be equivalent. It was going to give you theta to the three, times one minus theta to the one.
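Sketching that product numerically (the names here are mine, for illustration), multiplying the four per-flip likelihoods for the sequence 1, 0, 1, 1 gives exactly theta cubed times one minus theta, and reordering the flips changes nothing:

```python
# Independence: the joint likelihood is the product of the per-flip likelihoods
def likelihood(theta, flips):
    result = 1.0
    for x in flips:
        result *= theta ** x * (1 - theta) ** (1 - x)
    return result

theta = 0.3  # any value in (0, 1) would do for this check
assert abs(likelihood(theta, [1, 0, 1, 1]) - theta ** 3 * (1 - theta)) < 1e-12

# The order doesn't matter: only the number of heads does
assert abs(likelihood(theta, [1, 1, 1, 0]) - likelihood(theta, [1, 0, 1, 1])) < 1e-12
```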

So that is a property of likelihoods. It's illustrating that, if you have a coin, the particular configuration of zeros and ones doesn't matter. All of the relevant information about the parameter is contained in the fact that we got a specific number of heads and a specific number of tails; it doesn't depend on the order whatsoever. And in this case, because we know how many coin flips we have, all we need to know is the specific number of heads. So instead of writing the likelihood of theta depending on 1, 0, 1, 1, we might write it as the likelihood of theta depending on getting one tail and three heads, because it's the same thing; the order is irrelevant. This, by the way, raises the idea of so-called sufficiency. In this case, the number of heads in the total coin flips is sufficient for making inferences about theta. You don't actually need to know the data; all you need to know is the total number of heads and the number of coin flips. So that total number of heads, conditioning on the fact that we know the total number of coin flips, is called a sufficient statistic. It's saying that there's a reduction in the data: to make inferences about the parameter, you only need to know a summary of it, a function of it. And in this case the function we need to know is the sum, the total number of heads.

So let's do a likelihood calculation again. Take the likelihood supporting the hypothesis that the coin is fair, theta equal to .5, and divide it by the likelihood assuming that the coin is unfair, specifically with a 25 percent chance of heads, and we get the ratio 5.33. In other words, there's over five times as much evidence supporting the hypothesis that theta is .5 over the hypothesis that theta is .25. Now, relative values of likelihoods measure evidence. That's useful, but we're not particularly interested in, say, .25 specifically. I mean, .5 is kind of interesting because it means the coin is fair, but for most other points, we're not interested in .25 any more than we're interested in .24, and so on. So we'd like a way to consider likelihood ratios at all values of the parameter theta.
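One way to sketch that (a hypothetical grid evaluation, not something from the slides) is to evaluate theta cubed times one minus theta over a grid of theta values and scale the curve so its maximum is one; the ratio of any two scaled values is still the likelihood ratio:

```python
# Normalized likelihood curve for 3 heads and 1 tail
thetas = [i / 1000 for i in range(1001)]
lik = [t ** 3 * (1 - t) for t in thetas]
peak = max(lik)
normalized = [v / peak for v in lik]

# Normalizing doesn't change ratios: .5 versus .25 still gives about 5.33
i_half, i_quarter = thetas.index(0.5), thetas.index(0.25)
print(normalized[i_half] / normalized[i_quarter])
```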

And this is just a likelihood plot, which simply plots theta against the likelihood value. Remember that likelihoods are really interpreted in terms of relative evidence. It's the fact that the ratio of the likelihood at .5 to the likelihood at .25 is over five that says we have over five times as much evidence. So the absolute scale actually doesn't matter. Constants that don't depend on theta don't matter in the likelihood, right? Because when you take the ratio, if there's a constant that doesn't depend on theta in both the numerator and the denominator, it'll just cancel out. The likelihood and its interpretation should be invariant to constants that are not a function of the parameter. Because of that, the raw absolute value of the likelihood isn't altogether that informative, so we need to pick a rule for normalizing it. So why don't we just divide it by its maximum value, so that its height is one?

That seems to be a pretty reasonable rule, and it helps with interpretation, I think. And again, I just want to reiterate this last point. If you're going to buy into this likelihood paradigm of interpreting likelihoods, everyone agrees that likelihoods measure relative evidence rather than absolute evidence. So dividing the curve by its maximum value, or any value, doesn't change its interpretation. It's actually an interesting question, I might add, whether someone could create an absolute measure of evidence in statistics; I'm not aware of any. So we'll have to stick to relative measures for right now.

Â 14:13

So here on the next page is a likelihood plot. We have theta on the horizontal axis, and we have the likelihood value, with a maximum of one, on the vertical axis. In this case, this is exactly the likelihood for the four coin flips that we saw. The peak value is one, and as the likelihood goes down, those values of theta are worse and worse supported. Now, it's kind of interesting, right? The value at which we divided the likelihood, the peak, is the best supported point given the data. That's kind of interesting, because that point has the highest likelihood value, so no matter what you divide it by, you're always going to get a likelihood ratio bigger than one. So that point seems kind of special, and in fact we give it a name: we call it the maximum likelihood point. Maximum likelihood turns out to be a very useful technique, and in fact, you might not know this, but the vast majority of statistical estimators either are maximum likelihood estimators or are very close to them.

The way you would interpret this plot, for example, is: take any two points. Say you take .4 and get the height at .4, and you take .6 and get the height at .6; the ratio between those values is the relative evidence. And because we divided by the maximum, every value that we look at gives the relative evidence for that specific value of theta when compared to the point that is best supported by the data, the maximum likelihood point. So here, the value for a fair coin, .5, actually has a normalized likelihood value of about .5. Which means that if you were to divide the likelihood at the fair-coin value by the likelihood at the maximum likelihood value, you would get a ratio of about .5. Equivalently, that gives about twice as much relative evidence supporting the point with the maximum likelihood, which turns out to be .75, relative to .5.

So we might draw a horizontal line; let's say we drew a horizontal line at one eighth (I think that's where this top line is). What does that mean? Every point that falls between the endpoints of this line is such that there's no other point that's more than eight times better supported.

So take the point where the curve meets this line. That's exactly one eighth. What does that mean? That point is exactly eight times worse supported, given the data, than .75, which is the maximum likelihood value. And take any point in the interval that falls between the ends of this line: you can't find another point that's more than eight times better supported. Take for example 0.4. It has a likelihood value of about 0.3 or whatever, so the ratio of the maximum to it is less than eight, and its ratio with every other point is even smaller than that. So you're not going to be able to find, for the point 0.4, another point anywhere on this curve that's more than eight times better supported than it. That's the idea behind drawing a line at, say, one eighth: if you draw such a line, the collection of parameter values where the likelihood curve sits above that horizontal line, between the points where the line meets the curve, are well supported. And of course, as you draw the line higher and higher, fewer points stay in the interval, to the point where, if we draw it high enough, only the maximum likelihood value survives the threshold.

So, just to reiterate some of the points we made on the previous slide: the value of theta where the curve reaches its maximum is the maximum likelihood estimate, and if we want to write it out mathematically, the MLE is the argument maximum, over theta, of the likelihood, having plugged in the data X. A nice interpretation of the MLE is that it's the value of the parameter that would make the data that we observed most probable. So in this case we have three heads and one tail, and the question is: what success probability of the coin could we pick that would make the data we observed most probable? That's a nice interpretation of the MLE as well. Well, it turns out, and I think I've alluded to this, because I kept saying that the MLE in the previous example was .75: how did I get that? There were three heads out of four flips, so that's a proportion of heads of .75.

It turns out that if you have independent, identically distributed coin flips, then the MLE for theta is always the proportion of heads that you get. And I think if anyone were to give a single point estimate for the success probability of that coin, they would all give the proportion of heads. So, to be honest, the fact that maximum likelihood yields that is not so much an endorsement of using the proportion of heads as an estimator; it's more that it motivates the use of the MLE in more complicated settings where we don't already have great intuition as to what the logical estimator should be. That's kind of the benefit of the MLE: in a lot of the cases where we have a really good idea of what the right estimator should be, the MLE returns estimators that exactly match our intuition. And that gives us some hope that it would be a useful thing to do in the settings where we don't know what the best estimator is.

In addition to that, it's fair to say that tomes of theory have been developed in support of MLEs as, for example, you let the number of data points go to infinity. So let's actually just prove this fact: if you have a binomial, that is, n Bernoulli coin flips, the maximum likelihood estimator is the proportion of heads. Let n be the number of trials, and let x be the number of heads.

Remember that in this case the likelihood is theta to the x, times one minus theta to the n minus x: theta to the number of heads, one minus theta to the number of tails. Then, if we want to find the argument maximum of this function, it turns out it's definitely easier to maximize the log likelihood. This is almost a general principle in statistics: when you have a bunch of independent things and you want to maximize a likelihood, you're better off maximizing the log likelihood. If you maximize the log of the function, you've maximized the function, because log is an increasing monotonic function. And in addition, having a bunch of independent things means that you've multiplied a bunch of things to get the joint density or mass function. When you multiply things, things get raised to powers and so on, and these are all kind of complicated to work with; addition is much easier to work with, so the log converts products into sums, and that's really quite useful. The exponent x is no longer a power on the log scale, and you get x log theta plus n minus x times log one minus theta, which is a much easier function to work with.

In this case you can do it either way; it's no problem. But one of the reasons the log helps in general is that it takes care of the annoying products that you get from independence, from multiplying a bunch of densities or mass functions together.

If we take the derivative, we get x over theta, minus n minus x over one minus theta. If we want to solve for the critical point, we set this equal to zero. I'm not going to churn through the calculations, but if you set it equal to zero and bring the two terms to either side, you get x times one minus theta equal to n minus x times theta, and it's pretty clear that theta equal to x over n solves that equation: plug in x over n and you get a valid equality. So you can churn through the calculations and get to the fact that theta equal to x over n solves it. In other words, the value of theta that makes the observed data most likely in IID Bernoulli trials is the proportion of heads. And below, I checked the second derivative condition to make sure the log likelihood is concave, so this is a maximum. Technically this doesn't handle the case where you got all failures or all successes, but maybe just do those cases on your own.
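Here's a sketch of checking that claim numerically with a grid search (the function name is mine; using the raw likelihood rather than its log so the endpoints 0 and 1 cause no trouble):

```python
# Numerically verify that the Bernoulli/binomial MLE is the proportion x / n
def mle_grid(x, n, steps=1000):
    best_theta, best_lik = 0.0, -1.0
    for i in range(steps + 1):
        theta = i / steps
        lik = theta ** x * (1 - theta) ** (n - x)  # likelihood, not log, so 0 and 1 are fine
        if lik > best_lik:
            best_theta, best_lik = theta, lik
    return best_theta

print(mle_grid(3, 4))  # 0.75: three heads in four flips
```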

Â 22:56

So what constitutes strong evidence? If we're going to treat the likelihood as our arbiter of evidence and likelihood ratios as measures of evidence, we would like to build up some intuition. A friend and faculty member here taught me this idea: why don't we just use coin flipping as the mechanism for building up our intuition for strength of evidence? So imagine an experiment where a person is considering three possible hypotheses about a coin flip. The coin has tails on both sides, in other words theta equals zero; the coin is fair, theta equals .5; versus the coin has heads on both sides, theta equals one. So here we have hypothesis one, hypothesis two, and hypothesis three, and I have a table of the possible outcomes. Let's suppose I flip the coin and it's a head. I've done this experiment; unfortunately it's difficult to do in this setting, but I've done it in class, and you just have to take my word for it. On one coin flip, pretty much no one is willing to ditch the hypothesis that the coin is fair.

So, on one coin flip, suppose you get a head, right? The probability of a head given the first hypothesis, that the coin has tails on both sides, is zero. The probability of a head given the hypothesis that the coin is fair is .5, and the probability of a head given hypothesis three, that the coin is two-headed, is one. So the likelihood ratio of hypothesis one to hypothesis two is zero, and the likelihood ratio of hypothesis three relative to hypothesis two is two. And of course, this is exactly what we would hope to happen, right? A two-tailed coin can't produce heads, so if we get one head, the likelihood ratio supporting the two-tailed hypothesis should be zero. And there is twice as much evidence supporting the hypothesis that the coin is two-headed as the hypothesis that the coin is fair, given a single coin flip that is a head.

It's clear that two is not terribly strong evidence. If you flip the coin once, something is going to happen, so a 50 percent probability of getting a head is not that compelling. So let's suppose we have two heads in a row. Now I'm going to quit talking about hypothesis one, because you can't get two heads in a row under hypothesis one. Here I've outlined all the different possibilities, head-head, head-tail, tail-head, and tail-tail, and I give the likelihood ratio for all of them. In this case it's .25 for two consecutive heads if the coin is fair, and it's a 100 percent probability of getting two heads if the coin is two-headed, so the likelihood ratio is now four: four times as much evidence supporting the hypothesis that the coin is two-headed as the hypothesis that the coin is fair, if you get two consecutive heads. Now let's suppose we get three consecutive heads. The probability of getting three heads if the coin is fair is .125, and the probability of getting three consecutive heads if the coin is two-headed is 100 percent. You get a likelihood ratio of eight, and in this case that means there's eight times as much evidence supporting the hypothesis that the coin is two-headed relative to the hypothesis that the coin is fair.

So let me tell you what happens when I do this in class. I have a two-headed coin and I play this game. And people are willing to keep considering the hypothesis that the coin is fair, I guess because, most of the time, people aren't aware that two-headed coins are easy to buy. Around three heads, a substantial fraction of the class has started to believe that the coin is two-headed. Four consecutive heads, for which the likelihood ratio would be sixteen of course; five consecutive heads, for which it would be 32; and so on. By four consecutive heads, the vast majority of the class believes that it's two-headed, and by five consecutive heads, basically 100 percent of the class agrees that it's two-headed. I've also done games where I have a fair coin and an unfair coin. I show the class that one of them's fair and one of them's unfair, and jumble them up in my hand, so that they don't know which one I'm flipping and they know I'm not trying to trick them. Well, I am trying to trick them, just not in an obvious way. Actually, you can kind of tell by the weight which one's fair and which one's not, so I always grab the unfair one.

These create sort of useful benchmarks, right? The idea is to use coin flipping, an easy experiment that we can understand, to build up context for what likelihood ratios mean. So eight is sort of moderate evidence; it's like getting three consecutive heads on a coin, right? Sixteen is strong evidence, like getting four consecutive heads as evidence against the coin being fair, and 32 is quite strong evidence. Admittedly, the coin is just used for context. But these benchmarks are no more arbitrary, say, than the existing thresholds used for P-values, where people just arbitrarily pick five percent as their cutoff for type one error rates, if you're aware of that sort of thing.

At any rate, this is why, for example, I draw lines on likelihood plots at the value of one eighth: that way, parameter values above the one-eighth reference line are such that no other point is more than eight times better supported, given the data.

That's the end of the technical component of this lecture. I wanted to spend a little bit of time just talking about the consequences of adopting this style of analysis.

Pretty much every major paradigm in statistics, Bayesianism, frequentism, this likelihood paradigm, agrees that if you assume a probability model and act as if it's true, then the likelihood ratio is a central component of the theory. If you take enough mathematical statistics, you'll see this. The particular paradigm that I'm discussing today then goes beyond this relatively benign use of likelihood ratios that occurs in the other areas.

What I'm talking about today is not just that the likelihood ratio is useful, but that likelihood ratios measure relative evidence, and that given a statistical model and observed data, all of the relevant information is contained in the likelihood. This has far-reaching consequences for the field of statistics. If you go beyond just saying likelihoods are useful to saying that not only are they useful but they have these properties, then it changes quite a bit of statistics. For example, much of statistics is devoted to things like hypothesis testing and P-values and other variants of statistics where the interpretation involves potentially fictitious repetitions of an experiment. If you've ever heard of a confidence interval, its interpretation is quite confusing, but it's something along the lines of: if you were to use this technique over and over again, you would obtain intervals that contain the thing they were trying to estimate, say, 95 percent of the time.

Well, if you adopt this strong variant of interpreting likelihoods, then that interpretation can't be valid, because it involves potentially fictitious repetitions of the experiment which do not depend on the likelihood for the data at hand; so it cannot possibly be useful, or it cannot carry any additional evidence. Some of the things that get disputed if you adopt this paradigm are P-values, hypothesis testing, and multiplicity corrections; those are the big ones that come to the top of my head. This is very disputed, because in many ways these techniques seem central to the idea of statistics. So I really just wanted, at this point, to introduce people to these concepts and state the consequences of this theory.

I think, for the purposes of this class, what I would hope you know after this lecture is what the likelihood is. You would know that, regardless of what kind of paradigm of statistics you're in, higher likelihoods generally refer to better supported values of the parameter. And I would hope that you understand the principle of maximum likelihood. Thank you for listening. This was Mathematical Biostatistics Bootcamp, Lecture Six, and I look forward to seeing you for the next lecture.
