Hello. This lesson introduces distributions, both empirical and theoretical, which provide concise representations of a data set. Often you will be given a data set and you may do some basic statistical analysis with descriptive statistics. You may visualize the data set and use, for instance, box plots, scatter plots, or even histograms to try to understand how the data are distributed. But usually you're also going to want to look at theoretical distributions to ask: is there some sort of physical basis for what I see in this data? We can use these theoretical distributions to gain insight into modeling and interpreting a data set, based on what we know about different distributions such as a Poisson distribution or a normal distribution.

In this lesson, you're going to look at a visual website from Seeing Theory that explores random variables and both continuous and discrete distributions, and you're also going to look at the Introduction to Distributions notebook.

So first, the distributions site. It starts by talking about a random variable and goes through how to play with random variables. So, for instance, you can enter values and submit, and you can select cells, and you will generate different types of random variables and see how the probability space can generate a distribution. You can also play with discrete and continuous distributions. So, for instance, under discrete, you can look at Bernoulli, binomial, etc. Or if we click continuous, you can look at uniform, normal, exponential, etc.

Lastly, the central limit theorem. This is a very important concept that says that even if we have a distribution that is not normally distributed, if we average together enough samples, the averages will generally be approximately normally distributed. And that's what this particular part of the website shows. So I encourage you to play with these different ideas on this website, where you can build some deeper physical intuition by seeing them visually demonstrated.

Now, these same concepts are also demonstrated in the Introduction to Distributions notebook. We look at theoretical distributions that can be discrete or continuous. First, we'll look at a uniform distribution, where we have probability uniformly spread between two endpoints, and we see how to do this in both a discrete and a continuous case. The code here actually demonstrates these by making plots. So on the left, we have a discrete uniform distribution, and on the right, we have a continuous one. Notice that because it's continuous, we actually bin the data, and that's shown by this soft gray line. We also show our continuous and frozen distributions.

Now, one thing to keep in mind is that we're making this plot using the SciPy library, scipy.stats, to get these functions. The way it works in SciPy is we create the distribution, and we can effectively have a frozen distribution where we specify the parameters. So here we say we want a discrete distribution of uniformly distributed integers, where the probability is uniform between low and high. Thus, whenever we call this, it is now a distribution that is predefined with these parameters. That saves us time and makes it easier to compute things. So, for instance, we have UDRV here, and we can actually compute things from it as we go through this particular code cell. And that's what we do here: we pass this frozen distribution into some functions we defined, and we sample from that distribution inside those functions to create these plots.
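
To make the frozen-distribution idea concrete, here is a minimal sketch using scipy.stats. The endpoints, the sample size, and the variable names are assumptions for illustration; the lecture does not state the values the notebook uses.

```python
import numpy as np
from scipy import stats

# Hypothetical endpoints, not taken from the notebook.
# stats.randint includes `low` and excludes `high`, so the values are 1..6.
low, high = 1, 7

# Freeze a discrete uniform distribution over the integers low..high-1.
udrv = stats.randint(low, high)

# Freeze a continuous uniform distribution on [loc, loc + scale].
ucrv = stats.uniform(loc=low, scale=high - low)

# The frozen objects remember their parameters, so later calls need none.
x = np.arange(low, high)
pmf = udrv.pmf(x)    # probability of each integer value
pdf = ucrv.pdf(3.5)  # constant density 1 / (high - low) inside the interval
sample = udrv.rvs(size=1000, random_state=0)  # random draws

print(pmf)
print(pdf)
print(sample.min(), sample.max())
```

Because the parameters are fixed once, the same frozen object can be handed to any plotting or sampling helper without passing low and high around again.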

The rest of the notebook looks at some other distributions, like the Poisson. We briefly mention other discrete distributions before moving on to the Gaussian distribution. Here we see different versions of the Gaussian. We also talk about some other continuous distributions. I want to just focus on the plots themselves so you can see these other distributions demonstrated here. All in all, I believe we look at eight different distributions.

Here are the plots themselves, so there's the power law. Note that this uses logarithmic scaling. These are very interesting distributions. A lot of times you see things in queuing theory that may follow a power law. So the number of calls that come into a call center or the time it takes for shipping, things like that, may follow a power law. A related one is the exponential distribution. You may have heard of the Pareto distribution. And the Cauchy is a very interesting one: it looks like a normal distribution, but there's a lot more probability out in the tails, so it is broader than a normal distribution.
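
As a rough illustration of how much heavier the Cauchy tails are, the sketch below compares upper-tail probabilities of a standard normal and a standard Cauchy. The choice of standard parameters and of the comparison points is an assumption, not something taken from the notebook.

```python
from scipy import stats

# Standard (loc=0, scale=1) versions of both distributions; an assumed choice.
normal = stats.norm(loc=0, scale=1)
cauchy = stats.cauchy(loc=0, scale=1)

# sf(x) is the probability of a value greater than x (the upper tail).
for x in (1, 3, 5):
    print(x, normal.sf(x), cauchy.sf(x))

# How many times more likely a value above 5 is under the Cauchy.
tail_ratio = cauchy.sf(5) / normal.sf(5)
print(tail_ratio)
```

The two curves look similar near the center, but the ratio grows very quickly as you move outward.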

Next, we look at random sampling: given a distribution, how to actually draw samples from it. So here's data we've drawn from a model, which is shown in red, and in blue we see the actual data that we've drawn. We can do this with other distributions as well. So here's an exponential distribution shown in red, and in blue, soft blue, is our actual data. Now this is kind of hard to see because it's so strongly peaked, so we can change the vertical axis to be logarithmic, and then you can see that it's just a straight line.
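
The sketch below draws a sample from an exponential model, bins it, and checks numerically that the logarithm of the model density is a straight line. The scale, sample size, and binning are assumptions chosen for illustration, and the plotting itself is left out.

```python
import numpy as np
from scipy import stats

# Hypothetical exponential model with mean 2.0.
model = stats.expon(scale=2.0)
data = model.rvs(size=10_000, random_state=42)

# Bin the sample into a normalized histogram (the "blue" data).
counts, edges = np.histogram(data, bins=30, range=(0, 12), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# On a logarithmic vertical axis the model (the "red" curve) is linear:
# log f(x) = -log(scale) - x / scale.
log_pdf = model.logpdf(centers)
slope, intercept = np.polyfit(centers, log_pdf, 1)

print(slope, intercept)
print(np.max(np.abs(counts - model.pdf(centers))))
```

The fitted slope is the negative inverse of the scale, which is why the strongly peaked curve turns into a straight line once the axis is logarithmic.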

We also look at some alternative distribution forms, including the CDF, the percent point function, and the survival function. The scipy.stats module provides methods to calculate these very easily. So here we show a Gaussian PDF, and for that same distribution we show the CDF, the percent point function, the survival function, and the inverse survival function, and talk a little bit about why these are important.

Fundamentally, the idea of the CDF is that for any value of our variable, it gives the probability of being at or below that value. Going the other way, we can pick a probability and read off the value of the variable at which the CDF reaches it, which is what the percent point function computes. So this tells us quickly: there's your median, right, at a probability of 0.5. And there is your 75th percentile, at 0.75.
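
A minimal sketch of reading the median and the 75th percentile from a Gaussian with scipy.stats follows. A standard normal is assumed here; the notebook's parameters are not given in the lecture.

```python
from scipy import stats

# Assumed standard Gaussian (mean 0, standard deviation 1).
gauss = stats.norm(loc=0, scale=1)

# CDF: probability of a value at or below x.
p_below_zero = gauss.cdf(0.0)

# Percent point function: the inverse of the CDF.
median = gauss.ppf(0.50)
q75 = gauss.ppf(0.75)

# Feeding the percentile back into the CDF recovers the probability.
round_trip = gauss.cdf(q75)

print(p_below_zero, median, q75, round_trip)
```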

The other functions behave similarly. The survival function, for instance, tells you the probability that you've lasted at least this long; it is one minus the CDF. So if you think about this in terms of time, you can read off, for any time T, the probability of surviving from zero past time T, and the time at which the curve crosses 0.5 is the time that half of the population survives. This is important in manufacturing, for instance, where you want to understand how long a given piece of equipment is going to survive if it's operating at a certain rate.
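
Here is a small sketch of the survival function and its inverse for an equipment-lifetime example. The exponential lifetime model and the 1000-hour mean are purely hypothetical; real equipment may follow a different distribution.

```python
import math
from scipy import stats

# Hypothetical lifetime model: exponential with a mean of 1000 hours.
lifetime = stats.expon(scale=1000.0)

t = 500.0
p_survive = lifetime.sf(t)   # probability of lasting longer than t hours
p_failed = lifetime.cdf(t)   # probability of failing by t hours

# Inverse survival function: the time that half of the units survive.
t_half = lifetime.isf(0.5)

print(p_survive, p_failed, t_half)
print(math.exp(-t / 1000.0), 1000.0 * math.log(2.0))
```

The survival and cumulative probabilities always sum to one, so either curve can be used to answer the same question.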

Other things that we're going to look at include the central limit theorem, which you saw on the visual website. The notebook demonstrates the central limit theorem by sampling coin flips. We take 10 coins, and if we only flip them once, the results are distributed all over the place. Here, we're basically averaging over all the flips we do. With 10 coins and 10 flips, it starts to look a little more like a Gaussian, and more so with 100 flips. And by the time we've done a thousand flips of 10 coins, it looks very close to a Gaussian. And this is the central limit theorem in action.
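
The following sketch is one way to see the same effect numerically. It is not the notebook's code: it assumes a fair coin, repeats each experiment many times (a hypothetical 20,000 trials rather than 10 coins), and compares the spread of the averaged flips with what a Gaussian would predict.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 20_000  # hypothetical number of repeated experiments

results = {}
for n_flips in (1, 10, 100, 1000):
    # Number of heads in n_flips tosses of a fair coin, for every trial.
    heads = rng.binomial(n_flips, 0.5, size=n_trials)
    means = heads / n_flips

    # The central limit theorem predicts a standard deviation of
    # 0.5 / sqrt(n_flips) for the average of the flips.
    predicted_std = 0.5 / np.sqrt(n_flips)

    # Fraction of trials within one predicted standard deviation of the
    # mean, computed on the head counts; about 0.68 for a Gaussian.
    within_1sd = np.mean(np.abs(heads - n_flips / 2) <= 0.5 * np.sqrt(n_flips))

    results[n_flips] = {
        "mean": means.mean(),
        "std": means.std(),
        "predicted_std": predicted_std,
        "within_1sd": within_1sd,
    }
    print(n_flips, results[n_flips])
```

With a single flip every outcome sits exactly one standard deviation from the mean, which is nothing like a Gaussian; as the number of flips grows, the fraction settles near the Gaussian value.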

We also talk about QQ plots and fitting distributions. The idea is: can we actually take a data set and infer whether a given distribution would be a good approximation to it? If so, we can then fit a theoretical model to our data and see how well it does. So here we generate data; the model is this dot-dash line and the fit is this red line, and you can see they're very close. If you just looked at this data, you might have wondered what kind of distribution it is, and yet we've recovered the underlying distribution very accurately.
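
As a sketch of the fitting step, the code below generates data from a known Gaussian, fits a Gaussian back to it, and computes the quantities behind a QQ plot with scipy.stats.probplot. The true parameters and sample size are assumptions, and the notebook may use a different distribution or fitting routine.

```python
import numpy as np
from scipy import stats

# Hypothetical "true" model used to generate the data.
true_model = stats.norm(loc=3.0, scale=1.5)
data = true_model.rvs(size=2000, random_state=1)

# Maximum likelihood fit of a Gaussian to the data.
mu_hat, sigma_hat = stats.norm.fit(data)

# QQ plot ingredients: theoretical quantiles (osm), ordered data (osr),
# and a least-squares line through them with correlation coefficient r.
(osm, osr), (slope, intercept, r) = stats.probplot(data, dist="norm")

print(mu_hat, sigma_hat)
print(slope, intercept, r)
```

A correlation coefficient close to one means the ordered data fall nearly on a straight line against the Gaussian quantiles, which is the visual signature of a good fit in a QQ plot.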

So I hope this has given you a better feel for theoretical distributions and how we can use them both in calculating probabilities and in interpreting data. If you have any questions, let us know in the course forums. And good luck.
