Learn how probability, math, and statistics can be used to help baseball, football and basketball teams improve, player and lineup selection as well as in game strategy.

Loading...

En provenance du cours de University of Houston System

Math behind Moneyball

36 notes

Learn how probability, math, and statistics can be used to help baseball, football and basketball teams improve, player and lineup selection as well as in game strategy.

À partir de la leçon

Module 3

You will learn how Monte Carlo simulation works and how it can be used to evaluate a baseball team’s offense and the famous DEFLATEGATE controversy.

- Professor Wayne WinstonVisiting Professor

Bauer College of Business

Okay, in this video we're going to teach you an introduction to Monte Carlo

simulation.

Some of you may have seen this before, but what is Monte Carlo simulation?

It's playing out a complex situation many times

To estimate a probability or

a range of outcomes that might occur.

And there might be an analytic formula for the situation you're looking at, but

usually that will require advanced math, which we don't want to get into.

But basically it's pretty easy to make Excel play out a situation many,

many times and estimate the probabilities.

So we're going to start with a couple examples of Monte Carlo simulation one

from sports, one from classical probability.

And the name Monte Carlo, you know the casino in Monaco.

But I believe it was a Polish mathematician physicist,

I'm not sure I spelled it right.

Who worked on the atom bomb during World War Two.

And the uncertainty in the atom bomb was basically with the atom bomb actually work

with that, I guess, the neutrons hitting each other, keep hitting each other enough

to have a critical mass or a reaction that would bring you an atom bomb.

So everything in World War II needed a code name, and

I believe they gave the code name Monte Carlo to the simulations of the atom bomb.

That basically, during war time, during World War II, and the name stuck.

Okay, so let's start with a really simple situation that's a simple game

of baseball.

So we have an inning.

So there's three outs in an inning.

And every hitter hits a home run or strikes out.

So, on average, how many runs would you score that day?

And then of course, you want to generalize this to, where you can have a real

Major league team batting order like the Astros or the Red Sox or the Yankees.

And figure out given their batting order, how many runs will they score in a game

and that's sort of how these table baseball games were built.

They're sort of simulations where you throw dice to see if a guy hits

a home run.

I remember I made one of these up in seventh grade.

If Willie Mays threw a seven with three dice he might hit a home run.

Okay, Sandy Koufax could stop that home run if you roll a certain

number on Sandy Koufax.

That's again one of the ways I learned a lot about math,

just by picking up that game in the seventh grade.

Okay, so let's try and solve this problem by using Monte Carlo simulation.

So the trick is to first set up a spreadsheet that plays out the situation

once, and the key to modelling uncertainty in Excel there's a function called RAND,

R-A-N-D, left parenth, right parenth.

It's equally likely, when you type in RAND, with an equal to sign,

to return any number between zero and one.

So in other words, the chance that the RAND function would be for

instance less than or equal to 0.6 would be 60%.

Between let's say 0.1 and 0.8 would be 70%.

Because that's 70% of the distance between zero and one.

And in the next video, we'll learn another function that's helpful with Monte Carlo

simulation itself, called the RANDBETWEEN function.

We'll also need data tables, which we discussed when we did,

went over the Pythagorean Theorem.

Okay, so let's try and play out this situation.

What's the difficulty?

We don't know how many batters will bat an inning or

first three batter that could strike out the innings.

So we could go strike out, home run, strike out, home run, strikeout, home run.

That's six batters.

We might even have 50 or 100 batters.

But let's, we've got to cap it somewhere.

So let's suppose there's 50 batters.

So we're going to go batters one through 50.

Now an easy way to do one through 50 is go home, fill,

series, and if I say columns I can go one through 50.

This will be helpful when I want to go one through 5,000 in a couple.

Okay, so I have the batter number and so I could have a random number for

each batter, and that will determine whether it's a home run or out.

So I'm just going to say less than or equal to 0.5 is a home run.

That will happen 50% of the time.

Greater than 0.5 now.

So that will make it 50/50 home run or out.

And there's nothing sacred about 50/50.

I could say 60% chance of an out, except making that for homer problem.

Okay, so I need a random number for each batter, and

I know we may not need 50 batters that finish the inning, but that's okay.

We can play it out longer than we need and

then pick off how many runs we scored in the end.

So, I do RAND, no, you hit the F9 key, I'm hitting F9 key.

If you hit the F9 key, the RAND recalculates, and

that's really the key to doing the monte Carlo simulation with Excel.

So if I double click and copy that down, I have a random number for each batter.

And then the result is either a home run or an out.

So, if that random number is less than 0.5, then I want to make it a home run.

And now we can use IF statements, which we haven't really needed to.

I can say IF that random number, but IF statements are so

important to what we're going to do throughout the class.

If the random number is less than or equal to 0.5, I want to put a home run in there,

I need to put that in quotes, because it's text.

Otherwise, I'll put the word out, it doesn't matter if it's a strikeout or not.

Okay, so then I can copy that down.

So, the first batter made an out.

Any time you see greater or equal to 0.5, it's an out.

Less than 0.5, it's a run.

Okay.

So now I want to keep track of how many runs I have after each batter.

So the trick is we can have the number of runs at the beginning,

and then after the batter, the number of runs after.

And we could also have the number of outs at the beginning, and the ending number.

Actually, we don't have to make it this complicated.

But the key to simulation is just writing down what you want to keep track of and

then writing formulas that keep track of it,

based on the uncertainties that results.

Okay, so, the start runs is obviously zero.

Now at the end of runs, it's if this is a home run, if that equals home run, give

yourself, the start runs plus one, otherwise you keep the start runs.

In other words, if it's a home run, you've got more runs, one more run.

And I could just copy that down.

Double click the left nodes, seek every time okay.

The start runs I have wrong there.

The start runs would equal the end runs from the previous batter.

Okay, so say I have a home run.

Start with no runs, I got one run.

I got an out, still have one run.

An out, I still have one run.

Home run, now I've got two runs.

And outs, it's very straightforward to do number of outs.

I start with no outs.

Okay, now basically, how many outs would I have at the end?

Well, after one batter, you just would take the number

of outs plus runs must equal the number of batters.

That's the simple, easy way to do it, but really do it another way if you wanted to,

but then outs would equal batters minus runs.

So the end outs would equal the batters we have minus the number of runs.

And the start runs would equal the start outs would equal the end outs.

I think this should work here.

I'll copy that down.

Okay, so now by Double-clicking copy these down here.

Okay, so let's say nine batters, I've got two runs and seven outs.

Okay after eight batters, two runs and six outs.

There's two home runs.

So, now the game is probably not going to go 50 batters,

because you probably see around 25 out.

So, when does this game end?

Well, basically suppose the end outs is three.

In this case, the end outs is three right there.

Okay, if the end outs is three, the number of runs I got was right here.

So how do we figure out how many runs we scored?

Well, we could figure out the row in which the batter that ended the inning.

That's the first time you see a three in the outs column.

And we could do that with the match function we learned earlier.

And then I could use the index function to figure out how many runs.

In other words, here, it took five batters to end the inning.

So I would say, match three in the end outs column, I'd get a five, and

then with index I could go down and find the number of runs, and

that would play this out once.

But I want to play this out 1,000 times, 5,000 times let's say.

Don't worry about how many times,

that's beyond the scope of what we want to do.

Okay, so the number of batters, okay, I would say match,

3 in the outs column.

I assume there is a small chance the inning wouldn't end, but

I'm not going to worry about that.

Remember you need a common zero up there.

Okay, so there were five batters and the inning ended.

That makes sense because there were two home runs.

And then the number of runs, well,

actually I don't even need the index option.

It's from what we said, runs equals batters minus outs, but

I'll do it two ways.

So do batters and there'll be three outs at that point.

See there there were six batters and we got three runs.

But with the index function, just to show you I could do it,

I would say index, and I'd go down this column,

And then, index function, you have to say how many rows you go down and

that's the number of batters.

And then it's going to be the first column.

And see, that's always the same thing.

See, these two will always be the same, doesn't really matter which one I look at.

You see there I got, one home run and three outs.

There I got, my dude batted well here,

sometimes I should get, here I had good nine batters, and I'd scored six runs.

Okay, because I want home run and the inning, and end right there.

So the question is, how to play this out?

We'll use this cell, it doesn't really matter.

Want to play it out, let's say 5,000 times and average the number of runs scored.

And the answer will be close to three by the way.

You can prove that mathematically exactly the answer is three.

It's in I believe chapter four of a mathletics book.

But we're going to use a one way data table to play the out 5,000 times.

So we list the numbers one through 5,000.

So 5,000s fairly arbitrary, but that's a lot of files, right?

Again it's a pretty slow at 10,000.

So I do home, build, series, and

I'll say one through 5,000 in columns.

So I have the numbers one through 5,000.

So remember how a one-way data table works?

You go up one row and over one column, you put the output cell.

The output cell is the number of runs.

So I want to play that cell out 5,000 times, and

that's like playing out this inning 5,000 times.

What I really want to do in a couple of videos is show you my simplified baseball

money for our simulation, you put in team statistics and it plays out an inning and

you can try and figure out how many runs that team is going to score.

So if I want to play this out over and over, here's the trick.

I make this a one-way data table.

Remember, a one-way data table, There's no row input, so.

So what's the column input stuff?

Use any blank cell.

Because what'll happen if you make the column input cell a blank cell,

it'll put a one in a blank cell, and then basically recalculate the formula.

When you put a one in a blank cell, all the RANDS recalculate.

Put a two in a blank cell, it will recalculate the formula,

which is number of runs, etcetera.

So every time it goes down the data table, putting one through 5,000 in a blank cell,

reviewing all the RANDS, and figure out the run score.

Okay, so let's do this.

We go here, select that range, we go Data, What if Analysis,

then we'll average the number of runs, and we should get about three.

Data Table.

No row input cell, column input cell, blank cell.

Okay, now your numbers maybe all the same, let's explain that.

Under Formulas, Calculation Options, there are three choices.

Automatic, Automatic except for Tables, or Manual.

Now Automatic is your usual option and everything recalculates,

but that can make things slow on a big spreadsheet.

Automatic except for

data tables, you need to hit F9 to make your data table recalculate.

Manual is good for a big spreadsheet, and then you'd have hit F9 to make the whole

spreadsheet recalculate, but sometimes big spreadsheets are very slow to recalculate.

Okay, so, I hit F9.

Okay, there we go.

So this first time we scored six runs, then seven, what's the average runs?

It should be close to three, I think.

So I'll use the average function.

I got 2.99.

I run it again by hitting F9, 2.99, 3.01.

So even though I don't get an exact answer, I get,

without knowing a lot of math,

It's pretty close approximation to what's the right answer, which is three runs.

You can read the Matt Planeck's book for the derivation.

Okay, well that's the end of this example on money policy inflation.

I think the next video we'll do a simple football example where we know

the probability we would gain a certain number of yards on each play.

And we want to know if we ran four downs,

what's the chance we would get a first down?

Okay, so we'll see you in the next video.

Coursera propose un accès universel à la meilleure formation au monde,
en partenariat avec des universités et des organisations du plus haut niveau, pour proposer des cours en ligne.