0:50

we can try to predict how popular different movies are going to be.

So, this is just a summary of the data that's up on the course website.

We've got a sample of movies that were produced from 2001 through 2005,

with a lot of information available about those movies.

Now, if we look at the information that's available, features such as

what genre is it, which studio produced it, what's the movie rating?

Is it based on an adaptation of a graphic novel or a novel?

Is it based on some other media?

Some of these are yes/no answers.

Other variables might have multiple options available,

not just the two options.

These are all categorical outcomes.

That's the common thread here.

1:37

We've also got a block of financial measures.

So, how much revenue was brought in?

What was the production budget?

What was the marketing budget?

These are all quantitative variables.

So, the nature of the variable, that's going to

inform the type of analysis that we can apply to it.

All right.

So some of the ways that we might start looking at the categorical data, and

we'll go through some of these.

I'll demonstrate them, I would encourage you to spend some time working with

the Excel file to make sure that you're comfortable, not only generating these

different reports, but also understanding the trade-offs associated with them.

Frequency tables are going to report numerical values to us, as are

contingency tables, or cross-tabs.

We might look at pie charts, bar charts,

column charts as ways of visualizing some of this output.

All right, so if we wanted to put together a frequency table.

And I'll jump over to the Excel file, so

that we can see what we're working with in a second.

This just gives you a snapshot of the number of movies being produced by each

studio in a particular year, that's in the middle column.

So, if we look at, of these 199 movies,

we can see how they're distributed across the different studios.

Notice that they're ranked in descending order.

So, Universal had the most movies out in this year.

Followed by 20th Century Fox, Warner Brothers, and so forth.

As we go further down this list, the other category,

lumping all of those smaller studios that had fewer than five films

produced in that year, makes up a total of 44 of the 199 films.

And then the column on the side puts that on a percentage basis.

So they're saying Universal produced 9.05% of the films,

20th Century Fox produced 8.54%.

Descending percentage until we get to that final row where we've lumped the others

together.

Those percentages add up to 100%.
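As a rough sketch of how a frequency table like this could be built outside of Excel, the Python below counts films per studio, ranks them in descending order, lumps small studios into an "Other" row, and adds the percentage column. The studio labels, the `films` list, the `frequency_table` helper, and the `MIN_FILMS` cutoff are all made up for illustration; they are not the course data.

```python
from collections import Counter

# Hypothetical studio label for each film (one label per film); not the course data.
films = ["Studio A", "Studio B", "Studio A", "Studio C", "Studio A",
         "Studio B", "Studio D", "Studio A", "Studio B", "Studio E"]

MIN_FILMS = 2  # studios below this count go into "Other" (the lecture uses five)


def frequency_table(labels, min_count):
    """Return rows of (category, count, percent), largest first, 'Other' last."""
    counts = Counter(labels)
    total = sum(counts.values())
    # most_common() already ranks categories in descending order of count.
    kept = [(name, n) for name, n in counts.most_common() if n >= min_count]
    other = total - sum(n for _, n in kept)
    rows = kept + ([("Other", other)] if other else [])
    return [(name, n, round(100 * n / total, 2)) for name, n in rows]


table = frequency_table(films, MIN_FILMS)
for name, n, pct in table:
    print(f"{name:<10}{n:>4}{pct:>8.2f}%")
```

Because every film carries exactly one studio label, the counts in the table add back up to the number of films, and the percentage column totals 100%.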

Now the way that we've reported it here,

we're reporting for each studio what percentage of films were produced.

You might also want to produce a cumulative column.

So Universal produced 9% of the films, 20th Century Fox produced 8.5%.

So we might want to say, all right, well, the two studios that produced

the most films, combined, how much did they produce?

It would be running sums. So let's start off with the 9.05% for Universal.

For 20th Century Fox, the next largest studio, we add in the 8.54%. For

Warner Brothers, add in another 8.54%. As we go further down that column,

adding in the films produced by the smaller studios,

we get closer and closer to 100%, and

that's what this cumulative distribution would show us.

So, on the X-axis or the horizontal axis,

one corresponds to the studio that produced the most films,

37 corresponds to the studio producing the fewest films.

And as we include more and more studios as I move from left to right on this graph,

it accounts for a larger share of the films that have been produced.
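The running sums described here can be sketched in a few lines of Python. This uses only the three percentages quoted in the lecture for the largest studios; the variable names are just illustrative, and a full cumulative column would continue through all 37 studios until it reaches 100%.

```python
from itertools import accumulate

# Per-studio shares (in percent) quoted in the lecture, largest studio first.
shares = [("Universal", 9.05), ("20th Century Fox", 8.54), ("Warner Brothers", 8.54)]

# accumulate() yields the running total after each studio is added in.
running = [round(x, 2) for x in accumulate(pct for _, pct in shares)]
cumulative = list(zip((name for name, _ in shares), running))

for rank, (name, cum_pct) in enumerate(cumulative, start=1):
    print(f"{rank:>2}  {name:<18}{cum_pct:>7.2f}%")
```

The rank printed in the first column plays the role of the horizontal axis on the cumulative chart: 1 is the studio with the most films, and each step to the right adds the next studio's share.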

4:56

Another way of looking at this data: numbers tend to be a bit sterile.

We might want to put that into a bar chart.

And so we can see how many films were produced by each studio.

So that frequency table can be reformatted and put into a bar chart.

And this is just focusing on a subset of those studios,

just those that had at least five films.

For delivering reports, it may be easier to have charts similar to this one

rather than including a massive table.

Another way that we might represent the distribution

of films across the studios would be with a pie chart.

And I have a little bit of a love/hate relationship with pie charts and

you start to see why in this case.

We've got a lot of studios that make up a very small slice.

Well, think of making a more and more narrow slice of the pie.

Try splitting that other category into the individual studios.

Fitting these data labels onto this chart is going to become very difficult.

So in this case, we've included the name of the studio on the pie chart.

We've included the percentage of the films that those

studios are producing and we can still see it on this chart.

As you add more and more studios,

as you have a categorical variable with more and more values,

these pie charts may become less useful because you can't visualize all of

the possible options.

A way around that might be to lump the options together.

So that's what we've done in this case with the 22% falling into

that other category.

6:39

All right.

So just a couple of words of caution with these charts, and we've talked about

categorical variables in the sense that a movie falls under a studio.

A movie falls only under one of the studios in our data set.

Well, when you're making bar charts, when you're making pie charts,

that's a requirement in the data that each observation can only fall into one of

those categories and all of your options are going to have to add up to 100%.
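One way to see why that requirement matters is a small check, sketched below with made-up labels. The `single` list has exactly one studio per film, so the shares total 100%. The `multi` list tags some films with more than one genre, so the shares overshoot 100% and a pie chart of them would be misleading. The helper names and both lists are hypothetical.

```python
from collections import Counter


def one_label_each(labels_per_item):
    """True if every observation carries exactly one category label."""
    return all(len(labels) == 1 for labels in labels_per_item)


def percent_shares(labels_per_item):
    """Percent of observations carrying each label."""
    counts = Counter(label for labels in labels_per_item for label in labels)
    n = len(labels_per_item)
    return {label: 100 * c / n for label, c in counts.items()}


# Hypothetical data: one studio per film versus possibly several genres per film.
single = [["Studio A"], ["Studio B"], ["Studio A"], ["Studio C"]]
multi = [["Comedy", "Romance"], ["Action"], ["Comedy"], ["Action", "Comedy"]]

for name, data in [("single", single), ("multi", multi)]:
    total = sum(percent_shares(data).values())
    print(f"{name}: one label each = {one_label_each(data)}, shares total {total:.0f}%")
```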

One of the other things to be careful of is, we focused just on studio.

What if I wanted to look at studio by rating?

So let's look at the movies that are PG-rated movies

from the different studios.

And I want to draw some comparisons between the PG and the PG-13 movies.

Well, a single pie chart is not necessarily the best way to go about

doing that.

I might have to do side-by-side pie charts or

side-by-side bar charts to make those comparisons.

All right, but these are just ways of summarizing the categorical data that's

available to us and visualizing that, and very helpful from a reporting perspective.

Bar charts, pie charts, the frequency tables:

these are ways of summarizing a single categorical variable.

But what about when we want to see the relationships that exist among

two categorical variables or even more general than that?

What if there are three or more categorical variables?

Well, one of the popular tools that we can use to do that is a contingency table or

a cross-tab table.

So using the data that we have available to us, we can put together these tables.

Say, we wanted to look at the studio and the genre of movies that were in there.

Perhaps we want to see how many movies of a particular genre are made by different

studios.

Maybe we want to look at the relationship between studio and ratings.

8:41

So this is one way of looking at this data.

This is the raw count of the data.

You'll notice, going across the rows, we have these studios.

If we look down the columns, this is looking at the movie rating.

So we have G, PG, PG-13 and R, and

these are the counts of how many movies of each rating were made by these studios.

This is the cross-tab, so we're trying to look at the relationship that exists

in terms of studio and ratings.

10:07

If we're looking at that last row, think of it as the margin of the table.

It's the marginal distribution for us.

And so, what fraction of movies were rated G?

23 out of 351.

What fraction of movies were rated R?

89 out of the 351.

If we wanted to look at how many of these movies came from a particular

studio with a particular rating,

that's when we're going to jump into the individual cells.

So Buena Vista movies rated G were 20 out of the 351.

If we look at Buena Vista overall, I could add up this entire row.

And that's going to tell me that they produced,

that studio produced 87 movies out of the 351.

So this tool, good for looking at two different variables,

in this case we're looking at just those counts.

This is produced entirely in Excel, it's using the pivot table feature,

very convenient as far as organizing data and providing quick summaries.
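The lecture builds this table with an Excel pivot table. For comparison, here is a minimal Python sketch of the same idea: count each (studio, rating) pair for the cells, then count each variable on its own for the margins. The `films` list and studio names are made up for illustration and are not the course data.

```python
from collections import Counter

# Hypothetical (studio, rating) pairs, one per film; not the course data.
films = [("Studio A", "G"), ("Studio A", "PG"), ("Studio A", "G"),
         ("Studio B", "R"), ("Studio B", "PG-13"), ("Studio B", "R"),
         ("Studio A", "PG-13"), ("Studio B", "PG")]
RATINGS = ["G", "PG", "PG-13", "R"]

cells = Counter(films)                      # joint counts: one per table cell
row_totals = Counter(s for s, _ in films)   # margin: films per studio
col_totals = Counter(r for _, r in films)   # margin: films per rating
grand_total = len(films)

# Print studios down the rows, ratings across the columns, totals on the margins.
print(f"{'':<10}" + "".join(f"{r:>7}" for r in RATINGS) + f"{'Total':>7}")
for studio in sorted(row_totals):
    row = "".join(f"{cells[(studio, r)]:>7}" for r in RATINGS)
    print(f"{studio:<10}{row}{row_totals[studio]:>7}")
print(f"{'Total':<10}" + "".join(f"{col_totals[r]:>7}" for r in RATINGS)
      + f"{grand_total:>7}")
```

A studio and rating combination that never occurs simply shows up as a zero cell, and both margins add up to the same grand total.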

11:16

Same data that we were looking at previously, but in this case,

I've just reformatted that data.

So instead of saying, let's count up the number of movies, in this case

it focuses on the fraction of the total.

So you'll recall from the previous slide, there were 351 movies.

Well, now we're looking at 100% of those movies.

So divide each entry by 351 and we can see what the percentages are.

So, movies rated G made up just about 6.55% of movies

released by these four studios.

Whereas movies released by Universal made up just shy of

22% of movies released by these four studios.
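The conversion from counts to percent of total is just a division by 351. The short Python sketch below applies it to the counts quoted earlier in the lecture; the dictionary labels are illustrative names, and only those four counts are used because the rest of the table is not spelled out here.

```python
TOTAL_FILMS = 351  # films from the four studios in the cross-tab

# Counts quoted in the lecture.
counts = {
    "rated G, all studios": 23,
    "rated R, all studios": 89,
    "Buena Vista, rated G": 20,
    "Buena Vista, all ratings": 87,
}

# Divide each entry by the grand total to get its percent of all films.
percent_of_total = {
    label: round(100 * n / TOTAL_FILMS, 2) for label, n in counts.items()
}

for label, pct in percent_of_total.items():
    print(f"{label:<26}{pct:>6.2f}%")
```

The first entry reproduces the 6.55% figure for G-rated movies.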