0:00

[MUSIC]

Welcome back.

I hope you were able to gain an understanding of the difference between

exploratory and explanatory analysis from my colleague Sook.

Moving forward, most of what we will be discussing is exploratory analysis,

since we don't necessarily have an idea of the questions we are trying to answer.

In this lesson, we're going to examine a classic case study.

It's called the Anscombe Quartet, and it revolutionized data analysis.

It was first published in 1973 by Francis Anscombe, who argued that you can't

just use summary statistics to understand the data; you have to visualize it.

In a world where we have data sets that could be in the trillions of records,

Anscombe's argument is even more relevant today.

It's not to say that summary statistics aren't important.

They are absolutely essential, but you must also visualize the data.

So we're going to do it with Tableau.

So this is how it's going to look when it's all done.

Looks exciting, right?

Well, let's get started.

1:12

These data are available to you in the resources.

But, let me first introduce it.

These are the data that Francis Anscombe used.

It's a very simple data set at first glance.

1:28

The x values are identical in x1, x2 and x3,

while x4 uses a different set of values.

The y values change depending

on whether they're in y1, y2, y3 or y4.

Now the crucial thing is that the summary statistics, the average, the variance,

the correlation and the linear regression slope are all identical.

So the means of x1, x2, x3 and x4 are all 9.

The means of y1, y2, y3, and y4 are all 7.5.

And similarly, the variances of the x's are all identical, and

the variances of the y's are all identical.

The correlations of each of those pairs, x1 and y1, x2 and

y2, x3 and y3, x4 and y4, are all identical,

which, together with the identical means and variances, means each pair has the same regression line.
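
As a rough sketch of that claim, here is one way you could check the summary statistics outside Tableau, in plain Python. The numbers are the commonly published values of the quartet, which I'm assuming match the file in the resources; the function name summarize is just for illustration.

```python
from math import sqrt

# Commonly published Anscombe values (assumed to match the course's
# resource file, which remains the authoritative copy).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return mean, sample variance, correlation and least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {k: summarize(xs, ys) for k, (xs, ys) in quartet.items()}
for k, s in summaries.items():
    # Round for display only; the four rows should look alike.
    print(k, {name: round(value, 2) for name, value in s.items()})
```

Rounded to two decimal places the four rows should match, which is exactly why the summary statistics alone can't tell these data sets apart.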

Now, we want to do this in Tableau.

This is a Tableau class, so

it's very important we do as much as we can in Tableau.

And it's going to be much easier to do the visualizations

if the data set is cleaned up to make it easier for us to use.

In this case, we're going to do what's called normalizing the data.

What that means is that each row contains only one piece of information, a single x and y pair.

So, the data set shown here is designed to be analyzed using summary

statistics in a statistical software package like Stata, SAS, R or SPSS.

But Tableau is not intended to be a statistical software application.

It's a visualization package, and thus, normalizing the data is

essentially done to maximize what you can do with the data in Tableau.

3:28

So although I'm going to do the visualization in Tableau,

I'm going to do the data preparation in Excel.

Because it's actually much more difficult to do it in Tableau and

Excel is really made to do this data manipulation.

Now to do this, I modified the spreadsheet, and you can do this however you want.

You can do it manually, you can do it through cut and paste, but

I'm not going to go through this exercise because this is not an Excel course.

I made the change to the spreadsheet.

So now there's a number going down the side, and there's an x and y column.

So instead of having x1, y1, x2, y2, etc., it's the number and then x and y.

So it's going down, so again, it's normalized.

4:18

Now this is interesting because there are two additional columns

that really require explanation.

One is called Column and one is called Row.

And yes, there's a column that I'm naming Row.

Each record is assigned a value in each of them, either first or

second, depending on where I want the particular chart to go in the visualization.
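
If you'd rather script the reshaping than do it by hand in Excel, here is a minimal sketch in Python. Treat the details as assumptions: I'm guessing that the number column identifies which of the four sets a record came from, the panel placement (set 1 in the upper left through set 4 in the bottom right) is my own choice, and the two wide rows are just a sample in a hypothetical layout.

```python
# Hypothetical wide layout: each spreadsheet row holds all four (x, y) pairs.
# Only two sample rows are shown to keep the sketch short.
wide_rows = [
    {"x1": 10, "y1": 8.04, "x2": 10, "y2": 9.14,
     "x3": 10, "y3": 7.46, "x4": 8, "y4": 6.58},
    {"x1": 8, "y1": 6.95, "x2": 8, "y2": 8.14,
     "x3": 8, "y3": 6.77, "x4": 8, "y4": 5.76},
]

# Assumed placement of each set in the 2-by-2 grid, as (Row, Column).
placement = {
    1: ("first", "first"),
    2: ("first", "second"),
    3: ("second", "first"),
    4: ("second", "second"),
}


def to_long(rows):
    """Turn wide rows into one record per (set, x, y) observation."""
    long_rows = []
    for number, (row_pos, col_pos) in placement.items():
        for wide in rows:
            long_rows.append({
                "Number": number,
                "x": wide[f"x{number}"],
                "y": wide[f"y{number}"],
                "Column": col_pos,
                "Row": row_pos,
            })
    return long_rows


long_rows = to_long(wide_rows)
for record in long_rows:
    print(record)
```

Each output record now carries one x and y pair plus the Row and Column labels that tell Tableau which panel of the grid it belongs in.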

And the reason why I have those two columns

is that we want to replicate what Francis Anscombe did back in 1973,

not only to prove a point about outliers and

the importance of exploratory analysis through visualization, but

also to give you a little taste of the cool tricks that you can do in Tableau

to get visualizations that you wouldn't normally be able to get just by

using some of the default choices that are available in Tableau.

Tableau is powerful enough to be able to allow these other innovations,

through various calculations, to make your visualizations look really cool.

5:18

Okay, so the data are ready for Tableau.

In the last course,

you spent some time walking through how to import the data from Excel.

That's very important, because it's really what you're going to be doing

most of the time.

But this data set isn't that big; it's actually considered tiny.

And so, this is an opportunity to show you another trick.

5:40

You need to copy the data from Excel, like I'm doing here.

Make sure you grab every row and column, don't forget anything,

just double check, and then do the copy in Excel.

I usually do Ctrl+C, if you have a PC, but you can do it from the drop downs as well.

So just copy that information down.

6:02

If you haven't already done so, please open Tableau.

Click on the Data menu, then click on Paste Data.

It will churn, but not for very long, and then voila, your data is now in Tableau.

It's really cool, because you don't have to do the importing,

you just paste it in there.

And it will work for small and medium-size data sets, so this is perfect for our purposes here.

So definitely take advantage of that.

If you just need to get that visualization done, just paste it in.

You don't even have to import it from an Excel document.

However, if there are changes that you're going to make to the Excel file,

it would be better to do it through a data connection.

6:53

It's nice to have that cross tab, maybe, but that's not what we want, of course.

So drag all the fields away.

Another way to do it is you would just go up to the drop downs here,

and you can just clear the worksheet.

8:15

Now we'll do a little bit of formatting here.

So let's change the marks to a circle.

Change the color to orange, and enlarge the circle a bit.

We're going to add a trend line here, but

I'm going to remove that confidence interval that it automatically puts there.

8:43

By their summary statistics, these data sets look not just virtually but

actually identical, but obviously they're not,

as evidenced through visualization and not through summary statistics.

The one on the bottom right, for example, fits exactly the same

9:02

regression line as the one in the upper left.

Yet the data sets are very different.

The one in the upper left is sort of a traditional linear correlation between

x and y, but it actually has exactly the same correlation as the bottom right.
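
To put a number on how much a single outlier can drive one of these fits, here is a small sketch in Python. It refits the third set after dropping the one point that sits farthest from the shared regression line. The values are the commonly published ones, which I'm assuming match the course file, and the helper name fit is only for illustration.

```python
from math import sqrt

# Third set of the quartet, commonly published values (assumed to match
# the course's resource file).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit(xs, ys):
    """Least-squares slope and intercept, plus the correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / sqrt(sxx * syy)


slope_all, intercept_all, corr_all = fit(x3, y3)

# The outlier is taken to be the point with the largest absolute residual
# from the line fitted to all eleven points.
outlier_index = max(
    range(len(x3)),
    key=lambda i: abs(y3[i] - (intercept_all + slope_all * x3[i])),
)

x_trim = [x for i, x in enumerate(x3) if i != outlier_index]
y_trim = [y for i, y in enumerate(y3) if i != outlier_index]
slope_trim, intercept_trim, corr_trim = fit(x_trim, y_trim)

print("all points: ", round(slope_all, 3), round(corr_all, 3))
print("outlier cut:", round(slope_trim, 3), round(corr_trim, 3))
```

With that one point removed, the remaining ten points fall almost exactly on a flatter line, which is the kind of thing the chart shows at a glance and the summary statistics hide.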

10:09

I'm assigning a couple of readings on Anscombe's Quartet and why it's important.

But in the meantime, the bottom line is this:

make sure you do exploratory work on your data.

In the next lesson we're going to show more ways of learning about your data through

exploratory work.

So I'll see you there.