In this lecture, we want to talk about attributes of a multivariate data visualization.

In previous modules, we've discussed univariate data visualization,

how to create graphs and different properties of graphs,

and all of those things focus sort of on looking at one variable in a data set.

However, data sets with a single variable are more and more uncommon.

If we think of different examples such as just our classroom,

where we might have different exams and quizzes and these things,

we can easily create a dataset

with several different variables that we may want to compare against.

For example, each student in class has a name.

We might have scores for quiz one,

for quiz two, for quiz three and so forth,

and for each of these, we may want to be able to create

different plots to try to find patterns and correlations.

And in previous lectures, we talked about things like making a histogram for 1D data.

So if I wanted to make a histogram for quiz one,

remember we broke up the data into different chunks.

So if the quiz was out of 100 points,

we might have bins in steps of 10, from 0 to 100,

and we would count how many students fell into each bin,

and draw different sized bars for the number of students that fell into a bin.

And this let us explore sort of the distribution of a single variable.
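
As a minimal sketch of the binning step just described, the counting might look like the following. The scores are made up for illustration, and the bin width of 10 points is an assumption, not something fixed by the lecture.

```python
from collections import Counter

# Hypothetical quiz one scores out of 100 (made up for illustration).
quiz1_scores = [55, 62, 68, 71, 74, 78, 81, 83, 85, 88, 90, 92, 97, 100]

BIN_WIDTH = 10  # assumed bin width


def histogram_counts(scores, bin_width=BIN_WIDTH, max_score=100):
    """Count how many scores fall into each bin, keyed by the bin's lower edge."""
    last_bin = max_score - bin_width  # a perfect score is placed in the top bin
    counts = Counter(min((s // bin_width) * bin_width, last_bin) for s in scores)
    return {lo: counts.get(lo, 0) for lo in range(0, max_score, bin_width)}


counts = histogram_counts(quiz1_scores)
for lo, n in counts.items():
    # One bar per bin; the bar length is the number of students in that bin.
    print(f"{lo:3d}-{lo + BIN_WIDTH:3d} | {'#' * n}")
```

Each row of the printed output plays the role of one bar in the histogram, so the shape of the distribution can be read off the bar lengths.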

In this lecture, what we're interested in is

trying to start thinking about what if we have more than one variable,

what might those plots look like?

And with multivariate data,

what we're interested in is really

any statistical technique used to analyze data from more than one variable.

And so we're going to talk not only about ways to interactively visualize the data,

but we also want to think about ways to analyze these.

One of the mantras of visual analytics

has always been: detect the expected, discover the unexpected.

And so we want to be able to explore, interact,

and find things that are interesting in the data,

and process this information in a meaningful way.

The problem is, the more dimensions that the data set has,

the less effective standard computational and statistical techniques become.

We wind up with problems,

like p-fishing, where you can find correlations between lots of random things.

For example, you can go online and find correlations between

gun violence and the altitude of the Pyrenees Mountains.

And those have no meaning, no meaningful correlation.

So we have to be careful about the things we find,

the statistics we use,

and the visualizations that we create,

so that we're not accidentally lying to people about what's within their data.
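
To make the p-fishing problem concrete, here is a small sketch under stated assumptions: it generates variables that are pure random noise, tests every pair, and reports the strongest correlation it finds. The number of variables, the number of observations, and the random seed are all arbitrary choices for illustration.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the sketch is repeatable
N_VARIABLES, N_OBSERVATIONS = 40, 10  # assumed sizes, chosen for illustration

# Every variable is independent random noise; none is related to any other.
variables = [[random.gauss(0, 1) for _ in range(N_OBSERVATIONS)]
             for _ in range(N_VARIABLES)]

# Fish through every pair of variables and keep the strongest correlation.
correlations = {(i, j): pearson(variables[i], variables[j])
                for i, j in combinations(range(N_VARIABLES), 2)}
(best_i, best_j), best_r = max(correlations.items(), key=lambda kv: abs(kv[1]))

print(f"{len(correlations)} pairs tested")
print(f"strongest: variables {best_i} and {best_j}, r = {best_r:.2f}")
```

With this many pairs and so few observations, some pair will typically look strongly correlated purely by chance, which is why a correlation found by searching needs a meaning behind it before it is reported.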

So while we want to discover the unexpected,

we also need to be careful and make sure that this has some sort of meaning behind it.

And so in univariate visualization,

we talked about things like histograms.

So here we're showing another example of the histogram,

looking at maybe how to fit probability distributions.

Here we can fit a curve to try to create some underlying probability distribution

through techniques such as kernel density estimation.
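
As a rough sketch of kernel density estimation, the following places one Gaussian bump on each observation and averages them. The scores are made up, and the Gaussian kernel and the bandwidth of 1.0 are assumptions; the lecture does not say which kernel or bandwidth was used for the figure.

```python
import math

# Hypothetical quiz scores out of 10 (made up for illustration).
scores = [4, 6, 6, 7, 7, 7, 8, 8, 9, 10]
BANDWIDTH = 1.0  # assumed smoothing width; larger values give a smoother curve


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density of `data` at a point x."""
    n = len(data)

    def density(x):
        # Sum one Gaussian bump centred on each observation, then normalise.
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / (n * bandwidth * math.sqrt(2 * math.pi))

    return density


density = gaussian_kde(scores, BANDWIDTH)
for x in range(0, 13, 2):
    print(f"x = {x:2d}  density = {density(x):.3f}")
```

Evaluating `density` on a fine grid of x values gives the smooth curve that would be drawn over the histogram.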

Here we're looking at line charts,

and they've been expanded to handle sort of three variables,

we're plotting three variables at once.

But here we can see as we plot more and more line charts,

we have more and more clutter on the screen.

We can see from this Excel graph,

we're not really looking at nice numbers on the plot.

We're not really able to read the graph all that nicely.

So when I think about those things,

here's our example of our box and whisker plot,

and we talked about how to create these, showing quantiles.

So here we have the median,

we have Q1, we have Q3.

So median, Q1, Q3,

as we discussed in previous lectures how to calculate those.
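
As a reminder of what goes into the box, here is a small sketch that computes the median, Q1, and Q3 with Python's standard library. The scores are hypothetical, and the "inclusive" quartile method is an assumption; several quartile conventions exist, and the one used in the earlier lectures may differ.

```python
import statistics

# Hypothetical quiz scores with a long tail toward low values (made up).
scores = [3, 6, 7, 8, 8, 9, 9, 10, 10]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # the height of the box

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"whisker range: {min(scores)} to {max(scores)}")
```

The median sits closer to the maximum than to the minimum in this made-up sample, which is the kind of skew toward the low side that the box plot makes visible.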

We can see a skewed distribution to the negative side here.

But we're also missing information like plot titles,

axis labels, and some of those other things to really tell us what these data sets are.

And these still only are really handling one variable.

So if we think about how we might want to plot a dataset like a class

or the nice example you'll see in other lectures about books from a library,

then how do we handle all this information,

we want to start thinking about new techniques for multivariate visualization.

And really there are two main ways of presenting multivariate datasets:

directly through a table, or symbolically through a graph.

And so again, we can think about that example I made up earlier,

where we might have everybody's name in class,

we might have their first quiz score,

their second quiz score,

their third quiz score.

So you might have Ross,

you have some scores like this.

We might have Jane,

and of course Jane has some scores,

and some other people.

And so we have this whole table.

And you can see right away part of the problem with this table is,

it's hard to see any trends.

If I have a whole lot of rows,

it's going to be difficult to sort of comprehend those.

If I have a whole lot of columns,

I'm going to have more and more variables.

So it becomes difficult to parse.

And if we want to figure out like trends over time,

let's say we were taking all these quizzes in a row,

so we may actually want to look at how students improve over time.

So we may have some sort of time series plot,

and we may want to see if there are some students that are trending downward,

sort of doing worse as the semester goes along.
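
One simple way to flag such a downward trend, sketched here under assumptions, is to fit a straight line to each student's scores in quiz order and look at the sign of the slope. Ross and Jane use the quiz scores given later in this lecture; Sam is a hypothetical student added for illustration, and a least-squares slope is only one of many possible trend measures.

```python
# Quiz scores in the order the quizzes were taken.
# Ross and Jane follow the lecture's example; Sam is hypothetical.
scores = {
    "Ross": [8, 9, 10],
    "Jane": [10, 10, 10],
    "Sam": [9, 7, 5],
}


def trend_slope(values):
    """Least-squares slope of values against quiz number 1, 2, 3, ..."""
    n = len(values)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


slopes = {name: trend_slope(vals) for name, vals in scores.items()}
declining = [name for name, slope in slopes.items() if slope < 0]

for name, slope in slopes.items():
    print(f"{name}: {slope:+.1f} points per quiz")
print("declining:", declining)
```

A time series plot shows the same thing visually: a line that slopes down from left to right.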

We have to think about how we might represent symbolically all of the data in the plot.

So how do we decide which to use and when?

So with tables, these are always our main sources of data.

This is sort of where the ground truth is stored.

We have some sort of document,

whether it's an Excel file,

a CSV file, a SQL table,

MongoDB, whatever sort of technology we're using,

somewhere somebody has captured data,

and has entered it into some format.

And this document contains individual values.

Like I said, this could be a name,

this could be a quiz,

this could be a quiz,

this could be a quiz and we have values in these.

And this document is going to be used to compare individual values.

I can get directly down to the number,

and precise values are required.

It doesn't have to just be quantitative data like we're showing here.

In the case of a name,

remember this is a qualitative variable.

So the order of the names doesn't matter,

and no one name is more important than another.

But we could also have things like,

your favorite size of coffee cup,

so small, medium or large,

for example. So ordinal data.
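
The difference between a name and a cup size shows up as soon as we sort. In this short sketch, the survey answers are made up for illustration; the only thing taken from the lecture is the order small, medium, large.

```python
SIZE_ORDER = ["small", "medium", "large"]  # the ordinal scale from the example
RANK = {size: position for position, size in enumerate(SIZE_ORDER)}

# Hypothetical survey answers (made up for illustration).
answers = ["large", "small", "medium", "small", "large", "large"]

alphabetical = sorted(answers)           # treats the labels as plain names
by_size = sorted(answers, key=RANK.get)  # respects small < medium < large

print("alphabetical:", alphabetical)
print("by size:     ", by_size)
```

Sorting alphabetically puts large before small, which is harmless for names but wrong for an ordinal variable, so the ordering has to be stored alongside the values.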

So we can contain all sorts of information in these tables,

but the precise values of each individual record is captured.

And within a table,

you could also have null values.

So you could have information that's not captured,

maybe a person didn't answer a question,

maybe a quiz grade wasn't entered correctly.

And oftentimes visualization is used to quickly identify errors in the data like that,

and to see if we can correct those or if they turn out to not be errors at all.
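
Before any plotting, a simple scan of the table can surface the same kinds of problems. The sketch below assumes quizzes out of 10 and uses `None` for a value that was never captured; Sam and Lee are hypothetical rows added to show a missing grade and a mistyped one.

```python
# Hypothetical table; None marks a value that was never captured.
table = [
    {"name": "Ross", "quiz1": 8, "quiz2": 9, "quiz3": 10},
    {"name": "Jane", "quiz1": 10, "quiz2": 10, "quiz3": 10},
    {"name": "Sam", "quiz1": 9, "quiz2": None, "quiz3": 5},
    {"name": "Lee", "quiz1": 7, "quiz2": 80, "quiz3": 8},
]
MAX_SCORE = 10  # assumed maximum score per quiz


def find_problems(rows, max_score=MAX_SCORE):
    """List (name, column, reason) for every missing or out-of-range score."""
    problems = []
    for row in rows:
        for column, value in row.items():
            if column == "name":
                continue
            if value is None:
                problems.append((row["name"], column, "missing"))
            elif not 0 <= value <= max_score:
                problems.append((row["name"], column, "out of range"))
    return problems


problems = find_problems(table)
for name, column, reason in problems:
    print(f"{name} {column}: {reason}")
```

A flagged value is only a candidate error; someone still has to decide whether it should be corrected or whether it is a real value after all.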

And so the quantitative information to be communicated

may then involve processing the data to present it in different ways.

And so, for a table,

it's sort of these precise comparisons.

Did Ross score more points than Jane on quiz 1?

We can do a direct comparison between that.
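
In code, this kind of precise comparison is just a lookup of two cells. The sketch below uses the scores from this lecture's example; the dictionary layout is only one assumed way to hold the table.

```python
# The example table, keyed by student name.
table = {
    "Ross": {"quiz1": 8, "quiz2": 9, "quiz3": 10},
    "Jane": {"quiz1": 10, "quiz2": 10, "quiz3": 10},
}

# Did Ross score more points than Jane on quiz 1?
ross_scored_more = table["Ross"]["quiz1"] > table["Jane"]["quiz1"]
difference = table["Jane"]["quiz1"] - table["Ross"]["quiz1"]

print("Ross scored more on quiz 1:", ross_scored_more)
print("Jane's lead on quiz 1:", difference)
```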

But if we're interested in sort of shapes,

and trends, and changes in the data,

we often want to use graphs,

where the message is contained in the shape of the values,

and the document is going to be used to reveal relationships among different data sets.

So for example, if I want to know the relationship between quiz 1 and quiz 2,

I might make a scatter plot.

We're going to talk about how to do these later.

But if I plot quiz 1 on one axis and quiz 2 on another,

each point here on the scatter plot is a particular person.

So for example, if this is five and this is five,

this person got five on Quiz 1 and five on quiz 2.

If this is one and one,

this person got a one on quiz 1 and a one on quiz 2.

And as we plot more dots,

we may find some sort of trend that

students that did better on quiz one also did better on quiz two.
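
Here is a small sketch of that idea: each student becomes one (quiz 1, quiz 2) point, the points are drawn on a text grid, and a Pearson correlation summarizes the trend. The points are hypothetical apart from the (1, 1) and (5, 5) examples above, and the text grid is only a stand-in for a real plotting library.

```python
import math

# Hypothetical (quiz 1, quiz 2) scores, one pair per student.
points = [(1, 1), (3, 2), (4, 5), (5, 5), (6, 7), (8, 7), (9, 10)]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


def ascii_scatter(pairs, size=10):
    """Text scatter plot: quiz 1 runs left to right, quiz 2 bottom to top."""
    occupied = set(pairs)
    rows = []
    for y in range(size, -1, -1):  # the top row is the highest quiz 2 score
        cells = "".join("*" if (x, y) in occupied else "." for x in range(size + 1))
        rows.append(f"{y:2d} {cells}")
    return "\n".join(rows)


plot = ascii_scatter(points)
r = pearson(points)
print(plot)
print(f"r = {r:.2f}")
```

The stars climb from the lower left to the upper right, which is the visual signature of students who did better on one quiz also doing better on the other.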

These things can be helpful for thinking about things like learning outcomes,

they can be helpful to identify correlations,

like if this is in the stock market,

you may be looking for correlation between time series to

help predict what a stock might do tomorrow,

and all sorts of different applications in this area.

And graphs are going to be visual displays that

illustrate one or more relationships among entities.

This is a shorthand way to present information,

and it lets us see these trends and patterns and easily sort of

get a comparison and try to comprehend what's going on.

So what we really want to think about when we're using graphs and

tables is thinking about what the task is.

What are we asking the end user to do?

Why do we need a graph?

What are we trying to show?

We talked about this with univariate data as well.

What questions are being answered in the graph?

Is this the right graph to make?

If you're not answering the questions that people want to ask,

then maybe it's not the right visual.

What data is going to be needed to answer those questions?

If we haven't captured that data,

then it doesn't matter what the graph is

because we won't be able to help people understand.

And finally we have to think about who's the audience for the data.

Who are we talking to?

How do we need to present this?

How are they going to understand these things?

What's their level of visual literacy?

And how do we help people get through these?

And so again, we talked about univariate graphs in the past.

For example, with histograms, we're plotting

probabilities and bin counts over time or over a particular dimension.

But, we can also move on to multiple variables as well.

So one simple example of moving on from a bar chart in

one dimension to two dimensions is what we call a stacked bar.

So let's think about how this was created,

and let's go back to our Quiz examples.

If we have a name, we have quiz one,

and we have quiz two,

we have quiz three,

and let's say each quiz is out of ten points,

so the maximum score you can get on any quiz is 10.

So we had Ross, remember we had an eight,

a nine and a 10.

Then we had Jane, and we had all tens here.

So what we want to think about is, when we're plotting these,

if we plot quiz 1 and stack the others on top, the most points we can have,

since all of these are out of 10,

is 30. That's the highest that all three of these could ever get to, right?

So if we plot Ross,

Ross is going to have his own bar,

and Ross's first bar for quiz 1 only goes up to eight.

So we wind up shading this in typically with some color.

His next bar adds 9 to there, so this is 17.

So this is 8, 17. And the final one is 27.

So each bar is slightly longer than the other.

Each one may have different colors or textures to represent things.

We can also then compare Jane,

and Jane had a 10 on quiz 1,

a 10 on quiz 2,

and a 10 on quiz 3.

So all the bars are equal size.

And then we can color those to help identify the different quizzes.

And we can directly compare these different elements from the two people.
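
The arithmetic behind the stacked bar is a running total: each segment starts where the previous one ended. A minimal sketch using the scores from this example (the function name is hypothetical):

```python
from itertools import accumulate

# Quiz scores from the lecture's example, in stacking order quiz 1, 2, 3.
scores = {"Ross": [8, 9, 10], "Jane": [10, 10, 10]}


def stacked_segments(values):
    """Return (bottom, top) for each segment of one stacked bar."""
    tops = list(accumulate(values))  # running totals: where each segment ends
    bottoms = [0] + tops[:-1]        # each segment starts where the last ended
    return list(zip(bottoms, tops))


segments = {name: stacked_segments(vals) for name, vals in scores.items()}
for name, bar in segments.items():
    print(name, bar)
```

Ross's segments end at 8, 17, and 27, and Jane's at 10, 20, and 30, matching the heights worked out above.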

And we can also do this for stock prices over time.

So this could have been three different stocks,

and this could have been day one,

day two, day three.

Now, this lets me directly see that Jane's quiz 1 is higher than Ross's quiz 1.

But it's really hard to compare elements in the middle,

because now they're not grounded at the same height.

So if I do a slightly better drawing,

it's easy to compare those two boxes and tell me which one is longer, right?

But if I do this,

is one longer than two or is two longer than one?

It gets difficult to tell,

and so oftentimes we add in interaction where we can swap the bottom boxes with the top,

and move things around and allow people to interactively explore different datasets.
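
To see why the middle segments are hard to compare, and what swapping does, here is a sketch that computes where each segment starts for a given stacking order. The scores are from the lecture's example; the function name and the particular swap are assumptions for illustration.

```python
from itertools import accumulate

# Quiz scores from the lecture's example.
scores = {
    "Ross": {"quiz1": 8, "quiz2": 9, "quiz3": 10},
    "Jane": {"quiz1": 10, "quiz2": 10, "quiz3": 10},
}


def segment_bottoms(person, order):
    """Map each quiz to the height where its segment starts for this order."""
    values = [scores[person][quiz] for quiz in order]
    tops = list(accumulate(values))
    return dict(zip(order, [0] + tops[:-1]))


original = ["quiz1", "quiz2", "quiz3"]
swapped = ["quiz2", "quiz1", "quiz3"]  # bring quiz 2 down to the baseline

for order in (original, swapped):
    for person in scores:
        print(order, person, segment_bottoms(person, order))
```

In the original order, Ross's quiz 2 segment starts at 8 and Jane's at 10, so their lengths cannot be compared by eye from a common baseline. After the swap both start at 0, and the comparison is direct again.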

And this is just one example of

sort of the multiple variables where we can

show different elements by stacking them on top of each other.

Along with stacking, we could also think about putting elements next to each other.

So for each person, we could have some sort of element like this.

If this is Ross, we could then have Jane, and so forth.

But again, we wind up taking up a lot of screen real estate to do this.

And so we're limited in the number of rows,

because each person is a row,

that we could possibly show here.

So again, this is one way to think about showing multiple sort of variables in a dataset.

And over the next few lectures,

we're going to talk about several different ways to show multiple variables.

Some of them are common ways from scatter plots to parallel coordinate plots,

to techniques for actually clustering and labeling data and finding patterns to them.

Think about how we can project those onto different scatter plots

and elements as well. Thank you.