Of the 60 spam emails, 35 contain the word free.
Of the rest,
only three contain the word free.
If an email contains the word free, what is the probability that it is spam?
So what we want to do first is to organize this information into a probability tree.
We're going to start by dividing our population, our inbox in this case
is our population, into two, based on whether the email is spam or not spam.
So we have 60 emails that are spam, and 40 emails
that are not spam.
Now that we've done this branching, we can actually further
branch out from these and list how many of the spam
emails have the word free in them and how many of
them do not, and likewise for the non-spam emails.
Of the 60 spam emails, 35 have the word free in them, and
the remaining 25 do not. And of the not spam emails, only three
of them have the word free in them, and 37 do not.
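As a quick check on the tree we just built, here is a minimal sketch in Python that stores the four counts and adds them back up. The dictionary layout and the names tree, branch_totals, and inbox_size are just for illustration and are not part of the original problem.

```python
# Counts read off the probability tree described above.
# The nested-dictionary layout is an illustrative choice, not a required format.
tree = {
    "spam": {"free": 35, "no free": 25},
    "not spam": {"free": 3, "no free": 37},
}

# Each first-level branch total is the sum of its second-level branches.
branch_totals = {label: sum(leaves.values()) for label, leaves in tree.items()}

# The whole inbox is the sum of the two first-level branches.
inbox_size = sum(branch_totals.values())

print(branch_totals)
print(inbox_size)
```

The branch totals should come back as 60 and 40, matching the first split of the tree.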
Now that we have organized the information that we're given
into a probability tree, what we want to do next
is to go back to the question and try to
figure out what it is exactly that we're being asked for.
The question is, if an email contains the word
free, what is the probability that it is spam?
So we know that the email contains the word free, so that's
going to be our given, and we're asked for the probability that it's spam.
So we can denote this as probability of spam
given that the word free is in the email.
Since we're saying that we know the word free is in the
email, we're basically saying we can ignore the emails without it.
So first what we want to do is figure out how
many emails in total have the word free in them.
35 of them come from the spam folder and
three of them come from the not spam folder, for a total of 38. Of these,
only 35 of them are of interest to us because those are the spam emails.
So 35 out of 38 gives us roughly 92%.
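The same arithmetic can be sketched in a few lines of Python. The variable names here are made up for illustration, and Fraction is used only to keep the ratio exact before converting it to a decimal.

```python
from fractions import Fraction

# Counts of emails containing the word "free", taken from the tree.
spam_with_free = 35
not_spam_with_free = 3

# Condition on "free": only these emails remain under consideration.
total_with_free = spam_with_free + not_spam_with_free

# Of the emails with "free", the share that are spam.
p_spam_given_free = Fraction(spam_with_free, total_with_free)

print(total_with_free)
print(p_spam_given_free)
print(round(float(p_spam_given_free), 2))
```

Notice that the 25 and the 37 never appear: once we condition on the word free, the emails without it drop out of the calculation.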
Here we've implicitly made use of Bayes' theorem.
What we have in the numerator is our joint probability, spam
and free, and what we have in the denominator is the marginal
probability of what we're conditioning on, the free.
Except instead of working with probabilities in this
case, to make things simple we've worked with counts.
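Written out with probabilities instead of counts, the same calculation looks roughly like this. It assumes an inbox of 100 emails, which is the 60 spam plus the 40 not spam from the tree, so each count is divided by 100 and the 100s cancel.

```latex
P(\text{spam} \mid \text{free})
  = \frac{P(\text{spam and free})}{P(\text{free})}
  = \frac{35/100}{38/100}
  = \frac{35}{38}
  \approx 0.92
```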
So what we're going to do next is actually
move on to a situation where we're working with probabilities
from the get go, and we don't know the
sample size or the population size that we're dealing with.