So, that's the binomial distribution. Let's talk about the most famous and probably the most handy of all
distributions, the so-called normal, or Gaussian, distribution. The term Gaussian comes from the great
mathematician Gauss. And it's kind of interesting to note that
Gauss didn't invent the normal distribution.
The invention of the normal distribution is kind of a debated topic.
For example, Bernoulli had used something not unlike the Gaussian distribution in a
probabilistic inequality, without formalizing it as a density.
If you're interested in this, the book by Stephen Stigler on the history of
Statistics actually has a nice summary of exactly where and when and who came up
with the Gaussian distribution. But it's clear that Gauss was instrumental
in the early development and use of the Gaussian distribution.
So, a random variable is said to follow a normal or Gaussian distribution with
parameters mu and sigma squared if the density looks like this: two pi sigma squared,
all raised to the minus one-half, times e to the minus (x minus mu) squared over two sigma
squared. And so, this density looks like a bell
and it's centered at mu. And sigma squared sort of controls how
flat or peaked it is. And it turns out that mu is
exactly the mean of this distribution and sigma squared is exactly the variance of this
distribution. So, you only need two parameters, a
shift parameter and a scale parameter, to characterize a normal distribution.
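Just as an aside that's not on the slide, you can check the formula numerically in R; dnorm() is R's built-in normal density, and the by-hand line below is the formula just described (the particular mu, sigma squared, and x values are made up for illustration).

    # evaluate the normal density by hand and compare with R's dnorm()
    mu <- 1; sigma2 <- 4; x <- 2.5            # illustrative values, not from the lecture
    (2 * pi * sigma2)^(-1/2) * exp(-(x - mu)^2 / (2 * sigma2))
    dnorm(x, mean = mu, sd = sqrt(sigma2))    # should agree with the line above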
So, we might write x, then a little squiggle (a tilde), then N(mu, sigma squared), as just
sort of shorthand for saying that a random variable follows a normal
distribution with mean mu and variance sigma squared.
And, in fact, one instance of the normal distribution is sort of the root
instance from which all the others are derived, and that's the one with mu equal to zero
and sigma equal to one. And so, we will call that the standard
normal distribution. It's centered at zero and its variance is
one and so all other normal distributions are simple shifts and rescaling of the
standard normal distribution. But then again, you could pick a different
root, maybe mu equal five and sigma equal two, but it wouldn't be quite as
convenient. You could still get every other
distribution from that one by shifting and scaling appropriately, but it wouldn't be
as convenient. This is the most convenient way to define
a sort of root of the normal distribution.
The standard normal density is so common that we often reserve a Greek letter
for it. So, the lower case phi we usually use for
the standard normal density, and the upper case Phi we use for the standard normal
distribution function. And standard normal random variables are
often labeled with a z, and you do sometimes hear introductory statistics
textbooks and so on refer to them as z-variables or z-scores, or
something like that, and that's because this notation has become so common.
Here's the normal distribution. It looks like a bell.
That's how it gets its name, the bell-shaped curve.
And here, I've drawn reference lines at one, two, and three
standard deviations, with the negative values lying below the mean
and the positive values above.
Now, because this is a standard normal
distribution, the one represents one standard deviation away from the mean.
Here, the mean is zero, so one is one standard deviation away from
the mean, two is two standard deviations away, and three is three
standard deviations away. Instead of thinking of these numbers as
just z values, if we think of them as one, two, and three standard deviations
from the mean in the units of the original data, then it doesn't matter whether we're
talking about a standard normal or a
nonstandard normal. They all are going to follow the same
rules. So, about 68 percent of the distribution
is going to lie within one standard deviation, about 95 percent is going to
lie within two standard deviations, i.e., between -two and +two.
And almost all of the distribution, about 99 percent of it, is going to lie
within three standard deviations. We can get from a nonstandard normal to a
standard normal very easily. So, if x is normal with mean mu and
variance sigma squared, then z equal to x minus mu over sigma is, in fact, standard
normal. Now, you could at least, given the
information from this class, check immediately that z has the right mean and
variance. So, if you take the expected value of z,
you get the expected value of x minus mu divided by sigma.
You can pull the sigma out, and then you have the expected value of x minus mu, which
is just zero, because that's the expected value of x minus the expected value of mu.
And mu is not random, so its expected value is just mu, and mu is defined as the expected value of x, so
the difference is just zero. Then, the same thing with the variance.
If you take the variance of z, you get the variance of x minus mu divided by sigma,
right? So, if we pull the sigma out of the
variance, it comes out as a sigma squared, and we have the variance of x minus mu, divided by sigma squared.
And we learned a rule with variances that if we shift a random variable by a
constant, say, in this case, subtracting out mu, it doesn't change the variance at
all. So, we get the variance of x divided by
sigma squared. The variance of x is sigma squared so we
get sigma squared divided by sigma squared, which is one.
So, at the bare minimum, we can check that z has mean zero and variance one.
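As a rough sketch of that check by simulation (the seed and the particular mu and sigma are my own, not the lecture's):

    # simulate a nonstandard normal, standardize it, and check the mean and variance
    set.seed(1)
    mu <- 10; sigma <- 3
    x <- rnorm(10000, mean = mu, sd = sigma)
    z <- (x - mu) / sigma
    mean(z)   # should be close to 0
    var(z)    # should be close to 1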
By the way, there was nothing intrinsic to the normal distribution that we used
in that calculation, right? So, we've also just learned an interesting
fact: take any random variable, subtract off its population
mean, and divide by its standard deviation, and the result is a random variable that
has mean zero and variance one. In this case, in addition, if x happens to
be normal, then z also happens to be normal.
Similarly, we can just take this equation where z equals x minus mu over sigma,
multiply by sigma, and then add mu, and get that
if we take a standard normal, say z, scale it by sigma, and then shift it by
mu, we wind up with a nonstandard normal.
You know, the top calculation starts with a nonstandard normal and converts it into a
standard normal, and the bottom equation starts with a standard normal and
converts it into a nonstandard normal. Another interesting fact is that the
nonstandard normal density can just be obtained as plugging into the standard
normal density. So, if you take the standard normal
density phi and instead of just plugging in z to it, say, you plug in x minus mu
over sigma, and then divide the whole thing by sigma, then that is exactly the
nonstandard normal density. And this is a way to generate new densities,
which is just kind of an interesting aside. Here, mu is a shift parameter.
So, all mu does is shift the distribution to the left or the right,
just like whenever you subtract a constant from the argument of a mathematical
function, it just moves the function to the left
or the right. And then, sigma is a scale factor.
And so, basically, whenever you take a kernel density, some density, I guess it
works for any density but it makes the most sense with a density that has mean zero
and variance one. And then you create a new family by
plugging in x minus mu over sigma and dividing the density by sigma, and you
wind up with a new family of densities that now have mean mu and variance sigma
squared. So, this is kind of an interesting way of
taking a root density with mean zero and variance one and creating a whole
family of densities that have mean mu and variance sigma squared; these are usually
called location-scale families. At any rate, in this case we're only interested
in the normal distribution, and this formula right here is exactly how you
can go from the standard normal density to a nonstandard normal
density by plugging into its formula.
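Here's a minimal sketch of that identity in R (the particular mu, sigma, and x values are just made up): the nonstandard density equals the standard density evaluated at (x minus mu) over sigma, divided by sigma.

    mu <- 2; sigma <- 1.5; x <- 3.7
    dnorm(x, mean = mu, sd = sigma)     # nonstandard normal density
    dnorm((x - mu) / sigma) / sigma     # standard density plugged in and rescaled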
Let's just talk about some basic facts about the normal distribution that you should memorize.
So, about 68, 95, and 99 percent of the normal density lies within one, two, and
three standard deviations of the mean, respectively, and it's symmetric about mu.
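A quick check of those three numbers in R (this is my own verification, not part of the slide):

    # probability within 1, 2, and 3 standard deviations of the mean
    pnorm(1:3) - pnorm(-(1:3))   # roughly 0.68, 0.95, 0.997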
So, for example, take one standard deviation.
About 34 percent, one half of 68 percent, lies within one standard deviation above the
mean, and about 34 percent lies within one standard deviation below the
mean. So, each of these numbers splits equally above versus below the
mean. And then, there are certain quantiles of the
normal distribution that are kind of common to have memorized.
So, -1.28, -1.645, -1.96, and -2.33 are the tenth, fifth, 2.5th, and
first percentiles of the standard normal distribution.
And then again, by symmetry, so, if we just flip it around, right?
So if -1.28 is the tenth percentile, then 1.28 has to be the 90th percentile.
So, by symmetry, 1.28, 1.645, 1.96, 2.33 are the 90th, 95th, 97.5th and 99th
percentile of the standard normal distribution.
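Again as a quick check, not from the slides, qnorm() in R returns these standard normal quantiles:

    qnorm(c(0.10, 0.05, 0.025, 0.01))   # about -1.28, -1.645, -1.96, -2.33
    qnorm(c(0.90, 0.95, 0.975, 0.99))   # the same numbers with positive signs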
One in specific I want to point out that you really need to memorize is 1.96.
The reason it's useful is that it's the point such that, if you take -1.96 and +1.96,
the probability of lying outside of that range, below -1.96 or above +1.96,
is five percent. That is, 2.5 percent lies below it and 2.5 percent
lies above it, so that's five percent. So, the probability of lying between
-1.96 and +1.96 is 95 percent. And so, at any rate, it's used to do
things like create confidence intervals and these other entities that are very useful
in statistics, and people have kind of stuck with 95 percent as a reasonable
benchmark for confidence intervals. And five percent is a reasonable cutoff
for a statistical test, and if you're doing a two-sided test, you need to account for
both sides, and so you use 1.96. And then, the other fact is that 1.96 is close
enough to two that we often just round up. So, a lot of times with things like confidence
intervals, you might hear people talk about just adding and subtracting two
standard errors. They're getting that two from this 1.96
right here. So anyway, that one in specific you should
memorize, but you should probably just memorize all of them.
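A tiny sanity check of the 1.96 fact (my own aside, not the lecture's):

    pnorm(1.96) - pnorm(-1.96)   # about 0.95
    qnorm(0.975)                 # about 1.96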
Let's go through some simple examples. So, we'll go through two and you should
just be able to do lots of these, after I go through two.
So, let's take an example. What's the 95th percentile of a normal
distribution with mean mu and variance sigma squared?
So, recall, what do we want to solve for if we want a percentile?
Well, we want the point x naught such that the probability that a random variable x from
that distribution is less than or equal to x naught turns out to be 95 percent
or 0.95. Okay.
And so, you know, it's kind of hard to work with nonstandard normals, so consider the
probability of x being less than or equal to x naught, which we want to be 0.95.
Well, why don't we subtract out mu from both sides of this equation and divide by
sigma from both sides of this equation? And on the left-hand side of this
inequality, x minus mu over sigma, well, that's just a, a z random variable, now, a
standard normal random variable. So, the probability that x is less than or
equal to x naught is the same as the probability that a standard normal is less
than or equal to x naught minus mu over sigma, and we want that to be 0.95.
Well, if you go back to my previous slide, the 95th percentile of the standard
normal is 1.645. So, we just need this number, x naught minus mu
over sigma, to be equal to 1.645 to make this equation work.
And so, let's just set it equal to 1.645,
and then solve for x naught, so we get x naught equals mu plus sigma times 1.645.
So now, you know, you could ask lots of questions with specific values of mu and
sigma. But you'll wind up with the same exact
calculation. And here, in fact, you know, we used 1.645
because we wanted the 95th percentile. But, in general, x naught is going to be
equal to mu plus sigma times z naught, where z naught is the appropriate standard normal
quantile that you want. And then, you can just get them very
easily. You know, the other thing I would mention
too is you should be able to do these calculations more than anything just so
you've kind of internalized what quantiles from distributions are and how to sort of
go back and forth between standard and nonstandard normals and the kind of ideas
of location scale densities and that sort of thing.
In reality and practice, you know, it's pretty easy to get these quantiles because
for example, in R you would just type qnorm(0.95) and give it a mean and a
standard deviation. Or, if you did qnorm(0.95)
without a mean and a standard deviation, it'll return 1.645 and you can do the remainder
of the calculation yourself, but even that's a little bit obnoxious, so you can
just plug in a mu and a sigma. So, these calculations aren't so necessary
from a practical point of view; even very rudimentary calculators will give you
nonstandard normal quantiles.
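To make that concrete, here's a minimal sketch in R; the mu and sigma values are made up for illustration, and both lines give the same 95th percentile.

    mu <- 100; sigma <- 15               # hypothetical nonstandard normal
    mu + sigma * qnorm(0.95)             # 95th percentile by hand: mu plus sigma times 1.645
    qnorm(0.95, mean = mu, sd = sigma)   # the same answer straight from qnorm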
The hope is that you'll kind of understand, you know, the probability
manipulations. You'll understand, you know, what a
quantile means. You'll understand, you know, what the
goals of these problems are. And you'll understand sort of how to go
backwards between the standard and nonstandard normal.
That's kind of what we're going for here. It's kind of clear, I think everyone
agrees that you can very easily just look these things up without having to
bother with any of these calculations. Let's go with another easy calculation.
What's the probability that a normal mu, sigma squared random variable is more than two
standard deviations above the mean? So, in other words, we want to know the
probability that x is greater than mu plus two sigma.
Well, again, do the same trick where we subtract off mu from both sides and divide by sigma,
and we get that the answer is the probability that a standard
normal is bigger than two. And that's about 2.5 percent.
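In R, that tail probability is just (my own check, not the slide's):

    1 - pnorm(2)                   # about 0.023; the 2.5 percent quoted above treats 2 as roughly 1.96
    pnorm(2, lower.tail = FALSE)   # same thing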
And, so you can see the kind of rule here. If you want to know the probability that a
random variable is bigger than any specific number, or smaller than any
specific number, or between any two numbers, you instead take those numbers and
convert them into standard deviations from the mean, right?
And that can, of course, be fractional. It could be 1.12 standard deviations from
the mean or whatever. And the way you do that is by subtracting
off mu and dividing by sigma, and then you convert the problem to a standard
normal calculation. So, if you wanted to know what's the
probability that a random variable is bigger than, let's say, 3.1, just to
pick a random complicated-sounding number.
Let's suppose you're talking about the height of a kid and you want to
know the probability of being taller than 3.1 feet.
What you would need is the population mean mu and the standard deviation sigma, take
3.1, subtract off mu, divide by sigma. Now, you've just converted that quantity
3.1, which is in feet, right, to standard deviation units.
And then, you can just do the remainder of the calculation using the standard
normal.
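As a sketch with made-up numbers (the lecture doesn't give a mu or sigma for the height example), suppose the population mean were 3.0 feet with a standard deviation of 0.25 feet:

    mu <- 3.0; sigma <- 0.25        # hypothetical values for kids' heights, in feet
    z0 <- (3.1 - mu) / sigma        # convert 3.1 feet into standard deviation units
    pnorm(z0, lower.tail = FALSE)   # probability of being taller than 3.1 feet
    pnorm(3.1, mean = mu, sd = sigma, lower.tail = FALSE)   # same thing without standardizing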
So, I would hope that you could kind of familiarize yourself with these calculations.
And I recognize that, in a sense, they're kind of ridiculous to do because you can
get them from the computer so quickly. And we'll give you the R code that you
need to do these calculations very quickly on the computer.
But I think it's actually worth doing them by hand, just to get used to working
with densities and to get used to what these calculations refer to.
So, let me just catalog some properties of the normal distribution; a lot is known
about the normal distribution. And so, I'll outline some of the simpler
stuff; some of the latter points we probably won't get to in
this class, but I thought I'd at least mention them.
So, at any rate, the normal distribution is symmetric and it's peaked about its
mean, which means that the population mean associated with this normal distribution,
the median, and the mode are all equal right at that peak.
A constant times a normally distributed random variable is also normally
distributed. And you can tell me what happens to the
mean and the variance: if, say, x is a normal random variable
and I tell you that a times x is normal, what are the resulting
mean and variance? It turns out that sums of normally
distributed random variables are again normally distributed.
And this is true regardless of the dependence structure of the data,
so long as the random variables are jointly normally distributed.
It's important that they are jointly normally distributed.
They could be independent or they could be dependent, but they need to be
jointly normally distributed. Then sums, or any linear function, of the
normal random variables turn out to be normally distributed.
And again, you can calculate the mean and the variance.
Sample means of normally distributed random variables are again normally
distributed. Again, this is true regardless of
whether or not they're jointly normal and possibly dependent, or whether they're simply a
bunch of independent normal random variables; this is true of sample means.
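A small simulation sketch of those first few facts (my own illustration, with made-up constants):

    set.seed(2)
    x <- rnorm(10000, mean = 1, sd = 2)   # normal with mean 1, variance 4
    y <- rnorm(10000, mean = 3, sd = 1)   # independent normal with mean 3, variance 1
    a <- 5
    mean(a * x); var(a * x)   # roughly a * mu = 5 and a^2 * sigma^2 = 100
    mean(x + y); var(x + y)   # roughly 1 + 3 = 4 and 4 + 1 = 5 in this independent case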
However, let me just jump to point seven. It also turns out that if you have
independent identically distributed observations, properly normalized sample
means, their distribution will look like a Gaussian distribution, not entirely but
pretty much regardless of the underlying distribution that the data comes from.
So, take as an example, if you roll a die and look at what the distribution of a die
roll looks like, it doesn't look very Gaussian; it looks like a uniform
distribution on the numbers one to six. Now, take a die, roll it ten times, take
the average, and then repeat that process over and over again and think about what's
the distribution of this average of die rolls.
Well, it turns out it'll look quite Gaussian.
It'll look very normal. At any rate, that's the rule: averages of
random variables, properly normalized, with some conditions that we're probably
going to gloss over, will limit to a normal distribution.
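Here's a rough sketch of that die-rolling experiment in R (the number of repetitions is arbitrary):

    set.seed(3)
    one.roll   <- sample(1:6, 10000, replace = TRUE)   # a single die roll: uniform, not Gaussian
    mean.of.10 <- replicate(10000, mean(sample(1:6, 10, replace = TRUE)))
    hist(one.roll)     # flat
    hist(mean.of.10)   # bell shaped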
And that's how the normal distribution became the sort of Swiss army knife of
distributions: pretty much anything you can relate back to a mean of
independent things tends to look normal-ish in distribution.
And mathematically, formally, if they're independently and identically distributed
and you normalize the mean in the correct way, then in the limit you get exactly
the standard normal distribution. That is an incredibly useful result, an
incredibly useful result. It's a very historically important result
called the central limit theorem. So, let's see, back to point five.
If you take a standard normal and square it, you wind up with something that's
called a chi-squared distribution; you might have heard of that before.
And if you take a standard or a nonstandard normally distributed random
variable and exponentiate it, take e^x, where x is normal, then you wind up with
something that's log-normal. Log-normal is kind of a bit of a pain in
the butt in terms of its name. A log-normal means: take the log of a
log-normal and it becomes normal. It doesn't mean the log of a normal random
variable. It's a little annoying fact, right?
And you can't log a normal random variable, by the way, because there's a
nonzero probability that it's negative and you can't take the log of a negative
number. The name makes it sound like a log normal
is the log of a normal. It's not.
Log-normal means take the log of it and then it's normal.
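A short sketch of those two transformations (again just my own illustration, not from the slides):

    set.seed(4)
    z <- rnorm(10000)
    mean(z^2)                                  # close to 1, the mean of a chi-squared with one degree of freedom
    y <- exp(rnorm(10000, mean = 1, sd = 2))   # log-normal: e to a normal
    mean(log(y)); sd(log(y))                   # taking the log recovers roughly a normal with mean 1, sd 2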
Okay. Let's talk about ML properties associated
with normal random variables. If you have a bunch of IID normal mu, sigma
squared random variables, let's assume you know the variance.
So, let's ignore the variance for the moment.
Then, the likelihood associated with mu is written right here.
You just take the product of the likelihoods for each of the individual
observations. And so, you wind up with a product of terms of the form two pi sigma
squared to the minus one-half, times e to the minus (xi minus mu) squared over two sigma squared.
If you move that product into the exponent, you get e to the minus summation, i equals
one to n, of (xi minus mu) squared over two sigma squared.
Remember, we're assuming that the variance is known.
So, the two pi sigma squared to the minus N over two, that you would have gotten, we
can just throw that out, right? Because remember, the likelihood doesn't
care about factors of proportionality that don't depend on mu.
In this case, that's because mu is the parameter we're interested in.
By the way, this little symbol right here is the proportional-to symbol.
It means I've dropped out terms and kept something proportional to the likelihood.
I dropped out things that are not related to mu.
And I'll try and use that symbol carefully where it's contextually obvious what I
mean, what variable I'm considering important.
Okay, so, let's just expand out this square in the exponent and you get minus summation xi squared
over two sigma squared, plus mu summation xi over sigma squared, minus n mu squared
over two sigma squared. Now, this first term, negative summation
xi squared over two sigma squared, again, that doesn't depend on mu.
So, we can just throw it out, right? It's e to that power times e to the
latter two terms, so that first part is a multiplicative
factor that we can just chuck. Then the other thing here is that it's a little
annoying to write summation xi. Why don't we write that as n x bar, right?
Because if you take x bar, the sample average, and multiply it by n, you get the
sum. Okay, so the likelihood works out to
be e to the quantity mu n x bar over sigma squared minus n mu squared over two sigma squared.
So, that's the likelihood. Let's ask ourselves what's the ML estimate
of mu when sigma squared is known. Well, as we almost always do, the
likelihood is kind of annoying to work with, so why don't we work with the log
likelihood? We take the log of the expression from the previous page,
and we get mu n x bar over sigma squared minus n mu squared over two sigma squared.
If you differentiate this with respect to mu and set it to zero, you wind up with an
equation which is clearly solved by x bar equal to mu, and so what it tells us is
that x bar is the ML estimate of mu. So, if your data is normally distributed,
your estimate of the population mean is the sample mean.
That makes a lot of sense. We would hope that the result would kind
of work out that way. But also notice that, because this calculation
didn't depend on sigma, this is also the ML estimate when sigma is unknown.
It's not just the ML estimate when sigma is known.
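As a hedged numerical check that's not from the slides, you can maximize the normal log likelihood in mu directly and see that it lands on the sample mean; optimize() is R's one-dimensional optimizer, and the data and the assumed sigma here are made up.

    set.seed(5)
    x <- rnorm(50, mean = 2, sd = 3)
    loglik <- function(mu, sigma = 3) sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
    optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum   # numerical ML estimate of mu
    mean(x)                                                           # matches the sample mean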
So, we know what our ML estimate of mu is. Let me just tell you what the ML estimate
for sigma squared is. The ML estimate for sigma squared works
out to be the summation of (xi minus x bar) squared, divided by n.
And you might recognize this as the sample variance, but instead of our standard
trick of dividing by n minus one, we're now dividing by n. It's a little
frustrating that there's this kind of mixed message: the maximum likelihood
estimate for sigma squared is the so-called biased estimate of the variance,
rather than the unbiased one where you divide by n minus one.
Now, notice that as n increases, this is irrelevant, right?
The factor that distinguishes the two estimates is (n minus one) over n, and that factor goes to
one as n gets larger and larger.
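A quick sketch of the two variance estimates side by side (my own illustration with simulated data):

    set.seed(6)
    x <- rnorm(20, mean = 0, sd = 2)
    var(x)                                   # unbiased estimate, divides by n - 1
    mean((x - mean(x))^2)                    # maximum likelihood estimate, divides by n
    var(x) * (length(x) - 1) / length(x)     # same as the ML estimate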
So, I've had several colleagues tell me that they would actually just prefer this estimate, the maximum likelihood
estimate. And their argument is something along the
lines of: well, the n minus one estimate is unbiased, but this one has a lower
variance. And what they mean is this: this is the biased
version of the sample variance. It's a function only of random variables,
so it itself is a random variable, and as a random variable it has a mean and a
variance. The fact that its mean is not exactly
sigma squared means that it's biased. But it also has a variance, and its variance is
slightly smaller than the variance of the unbiased version of the sample variance.
And so, this is an example of something that pops up all the time in statistics: you can
trade bias versus variance. In this case, one variance estimate is
slightly biased but has a lower variance.
The other is unbiased, but the variance estimate itself has a larger variance.
It's very frequent in statistics that you have this kind of trade-off:
as you increase the bias, you tend to decrease the variance and vice
versa. So, the other thing I wanted to mention
was here, we've kind of separated out inference for mu and inference for sigma.
If you wanted to do kind of full likelihood inference then you have exactly
a bivariate likelihood, a likelihood that depends on mu and sigma.
And it's a little bit difficult to visualize, but it is just a surface,
right? Where you have mu on one axis, sigma on
another axis, and the likelihood on the vertical axis, then it would just be a
likelihood surface instead of a likelihood function.
And, it's a little bit hard to visualize these kind of 3D looking things.
So, there are methods for getting rid of sigma and looking at just the likelihood
associated with mu, and getting rid of mu and looking at the likelihood just for
sigma, and later on we'll discuss methods for doing that.
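Just to visualize what that surface would look like, here's a minimal sketch that's not from the lecture (the data and the grid ranges are made up): it evaluates the joint log likelihood on a grid of mu and sigma values and draws contours.

    set.seed(7)
    x <- rnorm(30, mean = 1, sd = 2)
    mu.grid    <- seq(0, 2, length = 50)
    sigma.grid <- seq(1, 3, length = 50)
    loglik  <- function(mu, sigma) sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
    surface <- outer(mu.grid, sigma.grid, Vectorize(loglik))
    contour(mu.grid, sigma.grid, surface, xlab = "mu", ylab = "sigma")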
But for the time being, it's not terribly important.
What I would hope you would remember is that if you assume that your data is
normally distributed, then, you know, we gave you the likelihood for mu if you assume
sigma is known. We calculated that the ML estimate of
mu was, in fact, x bar, and that the ML estimate of sigma squared was, you know,
pretty much the sample variance. You know, off by a little bit from the
standard sample variance, but pretty much the sample variance.
And then, you know, the ML estimate of sigma, not sigma squared but sigma itself, is
just the square root of our ML estimate for sigma squared.
Well, that's the end of our whirlwind tour of probably the two most important
distributions. There are some other ones that we'll cover
later. Next lecture, we're going to travel to a
place called Asymptopia. And everything's much nicer in Asymptopia,
and so I think you'll quite like it there.