
Hi my name is Brian Caffo, and this is Mathematical Biostatistics Bootcamp

Lecture six on Likelihood. In this lecture, we're going to define

what a likelihood is, which is a mathematical construct that is used to

relate data to a population. We are going to talk about how we

interpret likelihoods, talk about plotting them.

And then talk about maximum likelihood, which is a way of using likelihoods to

create estimates. And then we'll talk about likelihood

ratios and how to interpret them. Likelihoods arise from a probability

distribution. And a probability distribution is what

we're going to use to connect our data to a population.

So the idea behind this (and a lot, but not all, of statistics follows this rubric) is

to assume that the data come from a family of distributions.

And those distributions are indexed by an unknown parameter that represents a useful

summary of the distribution. To give you an example, imagine if you

assume that your data come from a normal distribution, a so-called Gaussian

distribution, so a bell shaped curve. To completely characterize a bell shaped

curve, all you need is its mean and its variance.

So the probability distribution, the Gaussian distribution or the bell shaped

curve has two unknown parameters, the mean and the variance.

And then the goal is to use the data to infer the mean and the variance.

And the idea is that the mean and variance from the Gaussian distribution are unknown

population parameters because the Gaussian distribution is our model for the

population. And the data, and the sample statistics computed from it, are what

we are going to use to estimate the unknown parameters.

So the nice part about this approach, compared to quite a few other directions

in statistics, is that the sample mean and the sample variance are

estimators; with the population model, you actually have estimands.

The sample mean is actually estimating something.

It's not just a statement about the data; it's an estimate of the population, and

that's what we're going to be talking about today. We're going to talk about a

particular way of approaching estimation and summarizing evidence in the data, when

you assume a probability distribution, using likelihood.

Likelihood is a mathematical function that has a particular definition.

And it's just the joint density of the data, evaluated as a function of the

parameters with the data fixed, and we'll go through an example.

Before we go through our example, I want to talk about what it is that likelihoods

are attempting to accomplish, and how we might interpret them.

So I'm going to put forward a particular theory of how likelihoods can be

interpreted and how they can be used and I guess I should stipulate that maybe not

everyone agrees with this theory. But the theory I'm going to put forward is

that, ratios of likelihood values measure relative evidence of one value of an

unknown parameter relative to another. So if you evaluate the likelihood at one

specific value of the parameter you get a number, and if you evaluate it at

another value you get a different number; then you take the ratio of the two.

If that ratio is bigger than one, it supports the hypothesized value of the

parameter in the numerator. If it's less than one, it supports the

hypothesized value of the parameter in the denominator.

So this is a somewhat controversial interpretation of likelihoods, but it's

the one I'm going to put forward. The second point is similarly

controversial, though there is a mathematically correct proof that at least

motivates it, even if it doesn't strictly prove it. The statement I'm making

is that, given a statistical model, so given a probability model and observed data,

there is a theorem called the likelihood principle that says all of the relevant

information contained in the data regarding the unknown parameter is

contained in the likelihood. Now, the likelihood principle has a

mathematically correct proof, but not everyone technically agrees on its

applicability and its interpretation. Nonetheless, I'm going to put this forward

as the way that, in this class, we're going to interpret likelihoods, so that once you

collect the data, if you assume a statistical model, then the likelihood is

going to contain all of the relevant information.

It's interesting that point two has very far-reaching consequences for the

field of statistics if you believe it. Things like P values, and much of

hypothesis testing, and other staples of statistics become questionable if you take

point two as being true. So, you know for today's lecture, we're

going to take it as being true. And we'll talk a little bit about, maybe,

some of the controversy associated with it.

Probably more important, from a practical point of view, is point three, which

says, and we already know this, but let's state it in terms of likelihood:

when we have a bunch of independent data points Xi, the joint density is

the product of the individual densities. So, equivalently, since the

likelihood is nothing other than the density evaluated as a function of

the parameter, it's also true that the likelihoods multiply; independence

makes things multiply. It makes the joint density multiply, and it

makes the likelihood multiply. I've summarized that here in the

statement that the likelihood of the parameter given all of the Xs is

simply the product of the individual likelihoods.
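As a quick sketch of this multiplication property (the function names here are just for illustration, not from the lecture): for independent Bernoulli observations, multiplying the individual likelihood contributions gives the joint likelihood.

```python
import numpy as np

def bernoulli_likelihood(theta, x):
    # Likelihood contribution of a single 0/1 observation:
    # theta^x * (1 - theta)^(1 - x), viewed as a function of theta.
    return theta**x * (1 - theta)**(1 - x)

def joint_likelihood(theta, data):
    # Independence: the joint likelihood is the product of the
    # individual likelihood contributions.
    return np.prod([bernoulli_likelihood(theta, x) for x in data])

data = [1, 0, 1, 1]
theta = 0.6
# The product collapses to theta^(#heads) * (1 - theta)^(#tails).
print(joint_likelihood(theta, data))  # equals 0.6**3 * 0.4
```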

The last point I'd like to make on this slide is that these interpretations of

likelihoods, especially points one and two, come with a caveat. One negative

aspect of them is that you have to actually have the statistical model

specified correctly, and of course we never really know the statistical model.

If we assume that our data is Gaussian,

that's an assumption. It's not generally something we know.

Maybe in some rare cases, like radioactive decay, there is some physical

theory that suggests that the data is Poisson, for

example. But in most cases, we don't actually know

that the statistical family is a correct representation of the mechanism that would

generate data, if we were to draw from the population.

So, I think the way in which people still rationalize using likelihood based

inference in these cases is that they say, well given that we assume this is the

statistical model, then we will adhere to the use of the likelihood to summarize the

evidence in the data. Let's go through a specific example.

One of the more important examples, and it's very illustrative, so let's do it.

Consider just flipping a coin, but let's say it's an oddly shaped

coin. Maybe it's a little bent or something like

that. So you don't actually know what's the

probability of a head. Let's label that probability of a head as

theta. And then recall that the mass function for

an individual coin flip is theta to the x, times one minus theta to the one minus x.

Here in this case the theta has to be between zero and one.

So if X is zero it's a tail, and if X is one, it's a head.

So if we flip the coin and the result is a head, then the likelihood is simply the

mass function with the one plugged in, right?

So in this case we get theta to the one, times one minus theta to the one minus one, which

works out to be theta. So the likelihood function is the line

theta, where theta takes values between zero and one.

And if you accept the laws of likelihood, the likelihood principle, and the

interpretation of likelihood that I outlined on the previous page, then

this says to consider two hypotheses. The hypothesis that the coin's true

success probability is 50%, .5, versus the hypothesis that the coin's true success

probability is .25, in the light of the data, right?

The one head that we flipped and obtained, the question is what is the relative

evidence supporting the hypothesis that the coin is fair, .5 to the coin is unfair

with the specific success probability of .25, we would take the likelihood ratio

which is then .5 divided by .25, which works out to be two.
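As a minimal sketch of this calculation (the helper name is illustrative, not from the lecture):

```python
def likelihood(theta, x):
    # Bernoulli mass function theta^x * (1 - theta)^(1 - x),
    # read as a function of theta with the data x held fixed.
    return theta**x * (1 - theta)**(1 - x)

# One flip resulting in a head (x = 1): the likelihood is just theta,
# so comparing theta = .5 to theta = .25 gives .5 / .25.
ratio = likelihood(0.5, 1) / likelihood(0.25, 1)
print(ratio)  # 2.0
```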

So if you accept our interpretation of likelihoods, this would say there is twice

as much evidence supporting the hypothesis that theta equals .5 to the hypothesis

that theta equals .25. So that is the idea behind using

likelihoods for the analysis of data. Now let's just extend this example.

So, suppose we flip our coin from the previous example but instead of flipping

it just once we get the sequence one, zero, one, one.

I have kind of a funny notation here. I am going to write script L as the

likelihood and L is a function of theta. But it depends on the data that we

actually observe, one, zero, one, one and so we're assuming our coin flips are

independent. And so what happens with likelihoods like this? You take the

product. So, here I have the first coin flip: theta

to the one, one minus theta to the one minus one.

Here I have the second coin flip theta to the zero, one minus theta to the one minus

zero, and so on. So, I take the product of all of those and

you get theta cubed times one minus theta raised to the first power.

And that's the likelihood for this particular configuration of ones and

zeroes from four coin flips. Notice, however: does the order of the 1s

and the 0s matter? Regardless of the order, as long as we got

three heads and one tail, the likelihood was going to be equivalent.

It was going to give you theta to the three, one minus theta to the one.

So, that is a property of likelihoods. It's illustrating that, if you have a

coin, the particular configurations of zeros and ones doesn't matter.

All of the relevant information about the parameter is contained only in the

fact we got a specific number of heads and a specific number of tails.

It doesn't depend on the order whatsoever. And in this case, because we know how many

coin flips we have, all we need to know is the specific number of heads.

So instead of writing likelihood of theta depending on 1,0,1,1, we might write it as

likelihood of theta depending on getting one tail and three heads because it's the

same thing. The order is irrelevant.

This, by the way, raises the idea of so-called sufficiency. In this case,

the number of heads out of the total coin flips is sufficient for making inferences

about theta. You don't need to know the data actually,

all you need to know is the total number of heads and the number of coin flips.

So often, that total number of heads, you know, conditioning on the fact that we

know the total number of coin flips, is called a sufficient statistic.

It's saying that there's a reduction of the data: to make inferences about the

parameter, you only need to know a summary of it, a function of it.

And in this case a function that we need to know is the sum, total number of heads.
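A small sketch of this order-invariance (assuming Bernoulli data; the names are illustrative): every ordering of three heads and one tail yields the same likelihood value, so only the count of heads matters.

```python
from itertools import permutations

def joint_likelihood(theta, data):
    # Product of Bernoulli contributions theta^x * (1 - theta)^(1 - x).
    out = 1.0
    for x in data:
        out *= theta**x * (1 - theta)**(1 - x)
    return out

# All distinct orderings of three heads and one tail.
orderings = set(permutations((1, 0, 1, 1)))
values = {round(joint_likelihood(0.3, d), 12) for d in orderings}
# The set collapses to a single value: 0.3**3 * 0.7.
print(values)
```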

So let's do a likelihood calculation again: let's take the likelihood supporting

that the coin is fair, that theta is .5, divided by the likelihood assuming that

the coin is unfair specifically with a 25 percent chance of heads and we get the

ratio of 5.33. So in other words, there's over five times

as much evidence supporting the hypothesis that theta is .5 over the

hypothesis that theta is .25. Now, I've said that relative values of likelihoods

measure evidence. Well, that's useful, but we're not

particularly interested in, say, .25. I mean, .5 is kind of interesting because the

coin is fair. But for most of the other points,

we're not interested in .25 any more than we're interested in

.24, and so on. So we'd like a way to consider likelihood

ratios for all values of the parameter theta.
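As a sketch of this idea (illustrative names, assuming the four flips above), we can evaluate the likelihood on a grid of theta values, with the 5.33 ratio recovered as one special case:

```python
import numpy as np

def likelihood(theta, heads=3, tails=1):
    # Likelihood for three heads and one tail out of four flips:
    # theta^heads * (1 - theta)^tails.
    return theta**heads * (1 - theta)**tails

# The specific comparison from the slide: theta = .5 versus theta = .25.
ratio = likelihood(0.5) / likelihood(0.25)
print(ratio)  # 5.333..., i.e. 16/3

# More generally, evaluate the likelihood on a grid of theta values;
# any two values of theta can then be compared by a ratio of curve heights.
grid = np.linspace(0.01, 0.99, 99)
curve = likelihood(grid)
```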

And this is simply a likelihood plot, which plots theta against the likelihood

value. And remember that likelihoods are really

interpreted in terms of relative evidence. So it's the fact that the ratio of the

likelihood of .5 to the likelihood of .25 is about five that's saying we have five

times as much evidence. So it actually doesn't matter.

Constants that don't depend on theta don't matter in the likelihood, right?

Because when you take the ratio, if there's a constant that doesn't depend on

theta. If it's in the numerator and the

denominator it'll just cancel out. The likelihood and its interpretation

should be invariant to constants that are not a function of the parameter.

So because of that, the raw absolute value of the likelihood isn't altogether that

informative, so we need to pick a rule for kind of normalizing it; so why don't

we just divide it by its maximum value so that its height is one.

And that seems to be a pretty reasonable rule, and it helps with interpretations, I

think. And again I just want to reiterate this

last point. Because likelihoods you know, if you're

going to buy into this sort of likelihood paradigm of interpreting likelihoods,

everyone agrees that they measure relative evidence rather than absolute evidence.

So, you know, dividing the curve by its maximum value, or any value,

doesn't change its interpretation. It's actually an interesting question I

might add: to try and think of whether someone could create an absolute measure of

evidence in statistics. I'm not aware of any.

But, so we'll have to stick to relative measures for right now.
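A sketch of that normalization rule (illustrative names): divide the likelihood curve by its maximum so the peak height is one, which leaves all ratios, and hence the evidential interpretation, unchanged.

```python
import numpy as np

def likelihood(theta, heads=3, tails=1):
    # Likelihood for the four flips: three heads and one tail.
    return theta**heads * (1 - theta)**tails

grid = np.linspace(0.001, 0.999, 999)
curve = likelihood(grid)
normalized = curve / curve.max()   # peak height is now one

# Ratios between any two theta values are unaffected by the rescaling.
i, j = 400, 600   # grid points near theta = 0.4 and theta = 0.6
print(curve[i] / curve[j], normalized[i] / normalized[j])  # same ratio
```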


So, here on the next page is a likelihood plot.

We have theta on the horizontal axis, and we have the likelihood value with the

maximum of one on the vertical axis. In this case, this is exactly the

likelihood for the four coin flips that we saw.

So the peak value is one and as the likelihood goes down, those values of

theta are worse and worse supported. Now, it's kind of interesting that the

peak value, right? The likelihood value that we divided the

likelihood by, is the best supported point given the data.

So that's kind of interesting, right? Because that point has the highest

likelihood value, so no matter what you divide it by, you're always going to get a

likelihood ratio bigger than one. So that point seems kind of special, and

in fact, we give it a name, we call it the maximum likelihood point.

And maximum likelihood turns out to be a very useful technique, and in

fact, you might not know this, but the vast majority of statistical estimators

either are maximum likelihood estimators or are very close to them.

So the way you would interpret this plot, for example, is: take any two points. Say

you take point four, and you get a height at point four; then take, say,

point six, and you get a height at point six. The ratio between those heights is the

relative evidence. And then because we divided by the

maximum, every value that we look at gives the relative evidence of that specific

value of theta when compared to the point that is best supported by the data, the

maximum likelihood point. So the fact that here, the value of a fair

coin .5, well, it actually has, surprisingly, a likelihood value of about

.5. Which means that the hypothesis, if you

were to divide the likelihood at the maximum likelihood value by the likelihood

at the fair coin value, you get a ratio of about .5.

And that gives the relative evidence supporting the point for which you have

the maximum likelihood, which turns out to be .75 relative to .5.

So we might draw a horizontal line and let's say we drew a horizontal line at one

eighth. I think that's what this top line is at,

it's at one eighth. What does that mean?

Every point that falls between the end points of this line is such that there's

no other point that's more than eight times better supported.
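Numerically, such a one-eighth interval can be sketched by evaluating the normalized likelihood on a grid and keeping the theta values above the reference line (illustrative code, assuming the three-heads, one-tail data):

```python
import numpy as np

def normalized_likelihood(theta, heads=3, tails=1):
    # Likelihood for three heads and one tail, divided by its maximum,
    # which is attained at the MLE, heads / (heads + tails) = 0.75.
    mle = heads / (heads + tails)
    return (theta**heads * (1 - theta)**tails) / (mle**heads * (1 - mle)**tails)

grid = np.linspace(0.001, 0.999, 999)
supported = grid[normalized_likelihood(grid) >= 1 / 8]
# Endpoints of the interval: for theta inside it, no other value is more
# than eight times better supported by the data.
print(supported.min(), supported.max())
```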

So take this point where the curve meets this line.

That's exactly one eighth. What does that mean?

That point is exactly eight times worse supported, given the data, than .75, which

is the maximum likelihood value. And take any point in this interval that

falls between the ends of this line, and you can't find another point that's more

than eight times better supported. Take for example 0.4, it has a likelihood

value of about 0.3 on the normalized plot, which is above one eighth, so the maximum is less

than eight times better supported than it. Since nothing is better supported

than the maximum, you are not going to be able to find, for point four,

another point anywhere on this curve that's more than eight times better

supported than it. So, that's the idea behind drawing a line

at, say, one eighth or wherever: if you draw such a line, the collection of parameter values that

lie between the points where that horizontal line meets the

likelihood curve are well supported. And of course, as you draw the line and

go up and up, fewer points stay in the interval, to the point where, if we draw

it high enough, you only have the maximum likelihood value surviving the

threshold. So, just to reiterate some of the points

we made on the previous slide: the value of theta where the curve reaches its maximum

is the maximum likelihood estimate, and if we want to write it out mathematically,

the MLE is the argument maximum over theta of the likelihood, having plugged in

the data X. And a nice interpretation of the MLE is

it's the value of the parameter that would make the data that we observed most

probable. So in this case we have three heads and

one tail. And the question is, what's the success

probability of the coin that we could pick that would make the data that we observe

most probable? And that's a nice interpretation of the

MLE as well. Well it turns out, and I think I've eluded

this because I kept saying that the likelihood MLE in the previous example was

.75, how did I get that? Well, there was three heads out of four

tails, so that's a proportion of heads of .75.

Well, it turns out that if you have independent identically distributed coin

flips, then the MLE for theta is always the proportion of heads that you get.

And again, I think if anyone were to give a single point estimate for the success

probability of that coin, they would all give the proportion of heads.

So, to be honest, the fact that maximum likelihood yields that is not so much a

booster for using the proportion of heads as an estimator; it's more that it

motivates the use of the MLE, perhaps, in more complicated settings where we don't have

great intuition already as to what the kind of logical estimator should be.

That's kind of the benefit of the MLE: in a lot of the cases where we have a

really good idea of what the right estimator should be, the MLE returns

estimators that exactly match our intuition.

And that gives us some hope that it would be a useful thing to do in these settings

where we don't know what the best estimator is.

Then, in addition to that, there have been, I think it's fair to say, tomes of theory

developed in support of MLEs as, let's say for example, the

number of data points goes to infinity. So let's actually just prove this fact:

if you have a binomial, or n Bernoulli coin flips, the maximum

likelihood estimator is the proportion of heads. So let n be the number

of trials, and let x be the number of heads.

And remember that in this case the likelihood is theta to the x, times one minus

theta to the n minus x. So, theta to the number of heads, times one

minus theta to the number of tails. And then, if we want to find the argument

maximum of this function, it turns out it's definitely easier to maximize the log

likelihood. And this is almost a general principle in

statistics: when you have a bunch of independent things and you want to

maximize a likelihood, you're better off maximizing the log likelihood.

You know because if you maximize the log of the function, you've maximized the

function because log is an increasing monotonic function.

And then in addition to that, the fact that you have a bunch of independent

things means that you've multiplied a bunch of things to get the joint density

or mass function. So if you multiply things, things get

raised to powers and so on, and these are all kind of complicated things to work

with; addition is much easier to work with, so log kind of converts products

into sums, and that's really quite useful. So x, which was a power, is no

longer a power on the log scale, and you get x log theta plus (n minus x) log of

one minus theta, which is a much easier function to work with.

But you know in this case you can do it either way.

It's no problem. But one of the reasons it helps in general

is that it takes care of the annoying products that you get from independence in

multiplying a bunch of densities or mass functions together.

If we take the derivative, we get X over

theta, minus N minus X over one minus theta. And if we want to solve for the

critical point, we'd set this equal to zero, and

I'm not going to churn through the calculations.

If you actually set it equal to zero and just bring the two terms to either side,

it's pretty clear that theta equal to X over N solves that equation.

You have X times one minus theta equal to N minus X times

theta. It's pretty clear that if you plug in X over N,

you're going to get a valid equality there.

But you can churn through the calculations and get to the fact that the solution is theta equals x

over n. So in other words, the value of theta that

makes the observed data most likely in IID Bernoulli trials is the proportion of

heads. Oh, and below, I checked the second

derivative condition to make sure the likelihood is concave.

So, you know, this technically doesn't handle if you got all failures or all

successes, but maybe just do those cases on your own.
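To double-check the algebra, a quick numerical sketch (illustrative names): maximize the log likelihood on a grid and compare to the closed-form answer x over n.

```python
import numpy as np

def log_likelihood(theta, x, n):
    # Bernoulli log likelihood: x*log(theta) + (n - x)*log(1 - theta).
    return x * np.log(theta) + (n - x) * np.log(1 - theta)

x, n = 3, 4   # three heads out of four flips
grid = np.linspace(0.001, 0.999, 9999)
numeric_mle = grid[np.argmax(log_likelihood(grid, x, n))]
print(numeric_mle, x / n)  # the grid maximizer sits at (about) 0.75
```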


So what constitutes strong evidence? If you're going to treat the likelihood as

our arbiter of evidence and likelihood ratios as measures of evidence, we would

like to maybe build up some intuition. And, you know, a friend and faculty member

here taught me this idea: why don't we just use this kind of coin flipping as the

mechanism for building up our intuition about strength of evidence.

So imagine an experiment where, a person is considering three possible hypotheses

with coin flipping. The coin has tails on both sides, in other

words theta equals zero. The coin is fair theta equals .5 versus

the coin has heads on both sides, theta equals one.

And so here we have hypothesis one, hypothesis two and hypothesis three and

now I have this table for the possible outcomes.

So let's suppose I flip the coin and it's a head.

And I've done this experiment, unfortunately it's difficult to do this

experiment in this setting. But I've done it in class, and you just

have to take my word for it. On one coin flip, pretty much no one is

willing to ditch the hypothesis that the coin is fair.

So, in one coin flip, suppose you get a head, right?

The probability of a head, given the first hypothesis, that the coin has tails on

both sides is zero. The probability of a head, given that the

hypothesis that the coin is fair is .5, and the probability of a head given the

hypothesis three that the coin is two headed, then that's one.

So the likelihood ratio of hypothesis one to hypothesis two is zero, and the

likelihood ratio for hypothesis three relative to hypothesis two is two.

And of course, this is exactly what we would hope to happen right?

If the coin is two tailed, it can't produce heads, so if we get one head,

there should be a likelihood ratio of zero supporting the two tailed hypothesis.

Okay, and two: there is twice as much evidence supporting the hypothesis that the

coin is two headed over the hypothesis that the coin is fair, given a single coin flip that is a head.

So, it's clear that two is not terribly strong evidence either

way. If you flip the coin once, something is

going to happen. So that 50 percent probability of getting a

head is not that compelling. So let's suppose we have two heads in a

row, okay. Now I'm going to quit talking about

hypothesis one, because you can't get two heads in a row under hypothesis one.

But here I've outlined all the different possibilities, head-head, head-tail,

tail-head, and tail-tail, and I give the likelihood ratio for all of them.

In this case the probability is .25 for two consecutive heads

if the coin is fair, and it's 100 percent for two heads if the coin is two

headed, so the likelihood ratio is now four: four times as much evidence

supporting the hypothesis that the coin is two headed over the hypothesis that the coin

is fair if you get two consecutive heads. Now let's suppose we get three consecutive

heads and then the probability of getting three heads if the coin is fair is .125.

Probability of getting three consecutive heads if the coins is two headed is 100%.

You get a likelihood ratio of eight, and in this case that means that there's

eight times as much evidence supporting the hypothesis that the coin is two headed

relative to the hypothesis that the coin is fair.

And, so let me tell you what happens when I do this in a class.

I have a two headed coin and I play this game.

And people are willing to keep considering the hypothesis that the coin is fair.

Because, I guess, most of the time, people aren't aware that two headed

coins are easy to buy. So, around three heads, a substantial

fraction of the class has started to believe that the coin is two headed now.

With four consecutive heads, the likelihood ratio would be sixteen, of

course. With five consecutive heads, it would be 32, and

so on. By four consecutive heads, the vast

majority of the class believes that it's two headed.

And then by five consecutive heads, basically 100 percent of the class always

agrees that it's two headed. And I've done games where I have a fair

coin and an unfair coin. And I show the class that one of them's

fair, and one of 'em's unfair. I jumble them up in my hand, so they

don't know which one I'm flipping, so that they know I'm not trying to trick them.

Well, I am trying to trick them, just not in an obvious way.

Actually, you can kind of tell by the

weight which one's fair and which one's not, so I always grab the unfair one.

These create sort of useful benchmarks, right?

So, eight, you know, is sort of moderate evidence.

So the idea behind this is to use this coin flipping, and the easy experiment

that we can understand to build up context for what likelihood ratios mean.

So eight, sort of moderate evidence, it's sort of like getting three consecutive

heads on a coin, right? Sixteen is strong evidence; it's

like getting four consecutive heads as evidence against the coin being fair,

and then 32 is quite strong evidence.
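The benchmark ratios above follow a simple pattern: n consecutive heads give a likelihood ratio of 2 to the n for the two-headed hypothesis (theta = 1) over the fair hypothesis (theta = .5). A one-loop sketch:

```python
# Probability of n consecutive heads: 1 under the two-headed coin,
# 0.5**n under the fair coin, so each extra head doubles the evidence.
for n in range(1, 6):
    lr = 1.0 / 0.5**n   # likelihood ratio = 2**n
    print(n, lr)        # 1 2.0, 2 4.0, 3 8.0, 4 16.0, 5 32.0
```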

And, you know, admittedly, the coin is just used for context.

But these are no more arbitrary, say, than the existing thresholds

used on p values, where people just arbitrarily pick five percent as their cut

off for type one error rates, if you are aware of this sort of thing.

So at any rate, this is why, for example, I draw lines on likelihood plots at the value of one

eighth, and so that way parameter values above the one eighth reference line are

such that no other point is more than eight times better supported given the

data. That's the end of the kind of technical

component of this lecture, I wanted to spend a little bit of time just talking

about the consequences of kind of adopting this style of analysis.

So, pretty much every major paradigm in statistics, Bayesianism, frequentism,

likelihood, this likelihood paradigm, pretty much every paradigm in statistics

agrees that if you assume a probability model and act as if it's true, then the

likelihood ratio is a central component of the theory.

If you take enough mathematical statistics, you'll see this.

The particular paradigm that I'm discussing today then goes beyond this

relatively benign use of likelihood ratios that occur in the other areas.

What I'm talking about today, right, is not just that the likelihood ratio is

useful, but that likelihood ratios measure relative evidence, and that, given a

statistical model and observed data, all of the relevant information is contained in

the likelihood. And this has kind of far reaching

consequences to the field of statistics. If you go beyond just saying likelihoods

are useful, to saying that not only are they useful but that they have these properties,

then it changes quite a bit of statistics. So, for example, much of statistics is

devoted to things like hypothesis testing and P values and other variants of

statistics where the interpretation of the statistic involves potentially fictitious

repetitions of an experiment. So, for example if you've ever heard of a

confidence interval, the interpretation of a confidence interval is quite confusing,

but it's something along the lines of if you were to use this technique over and

over again you would obtain these intervals that contain the things they

were trying to estimate say 95 percent of the time.

Well, if you kind of adopt this strong variant of interpreting likelihoods, then

that suggests that that interpretation can't be valid, because it involves

potentially fictitious repetitions of the experiment which do not depend on the

likelihood for the data at hand. So it cannot possibly be useful, or it

cannot carry any additional evidence. So some of the things that get disputed if

you adopt this paradigm are p values, hypothesis testing, multiplicity corrections,

and these are the big ones that come to the top of my head. This is very disputed

because in many ways, these techniques seem very central to the idea of

statistics. So, I really just wanted at this point to

introduce people to these concepts, and state the consequences of this theory.

I think for the purposes of this class, what I would hope you would know after

this lecture is what the likelihood is. You would know that regardless of what

kind of paradigm of statistics you're in, higher likelihoods generally refer to

better supported values of the parameter. And I would hope that you understand about

the principle of maximum likelihood. Thank you for listening.

This was Mathematical Biostatistics Bootcamp Lecture six and I look forward to

seeing you for the next lecture.