In this module, we want to talk about a unique sort

of visualization space called parallel coordinate plots.

We're going to look at different attributes of this multivariate data visualization.

So much like the scatter plots and the other multivariate data visualization lectures,

again we're thinking about a dataset.

For example we could have our players again,

we could have batting average,

we could have percent on base,

we could have countries,

we could have income,

GDP, population, and so forth.

So we've got data sets with lots of variables,

lots of rows, lots of information,

and we want to think about how to create a plot

that lets people explore trends and relationships between those.

And in previous modules we talked about the scatter plot where we can

compare things like population size and GDP for example,

and each dot might be a particular country.

We might see some sort of trends.

We can extend those by changing size and shape

and color and adding more variables into our data sets.

And in parallel coordinate plots,

we have a similar sort of idea,

where now different variables can take different values with different ranges.

And what's going to happen is if we have a data set for example like country,

GDP, population, let's have some sort of measure like wellness,

maybe it's the average age people live to, life expectancy, and things like that.

Let's think of some other variables.

We will just call them variable one, variable two, and variable three.

So you can have a ton of variables about a country.

We could have size for example,

what's the landmass of the country,

number of provinces, or states or whatever in a country and so forth.

So we have all these different measures.

Well, for each measure, we have an axis.

So I've got population,

and maybe population ranges from zero to one and a half billion.

Now there's only a few countries that have one and a half billion

and there's a bunch of countries that maybe have several millions here,

so we wind up with this skewed distribution.

Now the other weird thing is that population goes from zero to one and a half billion.

But if we do life expectancy,

it maybe goes from zero to 100.

So how do I match 100 to one and a half billion?

These axes have such drastically different values,

so oftentimes we might try to normalize these from zero to one.

What I mean by that is I might find the maximum and I might

divide everything by the max in this column to try and normalize the data.
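
As a minimal sketch of that divide-by-the-max normalization, assuming a small made-up table (the country names, numbers, and the helper name `normalize_by_max` are hypothetical and only there for illustration):

```python
# Hypothetical toy table; the numbers are made up purely for illustration.
countries = {
    "A": {"population": 1_500_000_000, "life_expectancy": 75.0},
    "B": {"population": 30_000_000, "life_expectancy": 80.0},
    "C": {"population": 3_000_000, "life_expectancy": 60.0},
}


def normalize_by_max(table, column):
    """Divide every value in `column` by the column maximum.

    Assumes the values are non-negative, so results land in [0, 1].
    """
    col_max = max(row[column] for row in table.values())
    if col_max == 0:
        # Avoid dividing by zero when the whole column is zero.
        return {name: 0.0 for name in table}
    return {name: row[column] / col_max for name, row in table.items()}


norm_pop = normalize_by_max(countries, "population")
norm_life = normalize_by_max(countries, "life_expectancy")
```

After this, both columns top out at 1, so they can share the same axis height. Note that dividing by the max only pins the largest value to 1. The smallest value sits at 0 only if the column really contains a zero.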

And now what you're starting to see when I draw my graph like this,

is for each variable, I can create a line.

So for GDP, I can create a line.

For population, I can create a line.

For life expectancy, I can create a line.

For variable one, I create a line.

And the more variables we get,

the more of these axes we're going to have.

So these are my parallel coordinate axes.

Now, I want to plot countries for example.

So for every unique country,

the unique country has a GDP.

And remember I can just do the 1D case,

so every dot on this line is already a country's GDP,

and every dot on this line is a country's population.

Parallel coordinate plots connect the same countries together across each of these lines.

So this line now represents a single country,

and where it crosses is the values in our data table.

So we can figure out what that country's GDP is,

we can figure out what its population is and so forth tracking those.

And a parallel coordinate plot has a line for every country.
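
To make the "one line per country" idea concrete, here is a small sketch that turns one data record into the vertices of its line. The function name `polyline`, the axis ranges, and the record values are assumptions for illustration, not anything from a particular plotting library:

```python
def polyline(record, axis_order, axis_ranges, spacing=1.0):
    """Return the (x, y) vertices of one record's line.

    x is the horizontal position of the axis, and y is the record's
    value rescaled to [0, 1] using that axis's (low, high) range.
    """
    points = []
    for i, axis in enumerate(axis_order):
        lo, hi = axis_ranges[axis]
        # Place the value halfway up if the axis has no extent.
        y = (record[axis] - lo) / (hi - lo) if hi > lo else 0.5
        points.append((i * spacing, y))
    return points


# Hypothetical ranges and one made-up country record.
axis_ranges = {
    "gdp": (0, 100),
    "population": (0, 200),
    "life_expectancy": (50, 90),
}
country = {"gdp": 50, "population": 200, "life_expectancy": 50}

vertices = polyline(country, ["gdp", "population", "life_expectancy"], axis_ranges)
```

Drawing the full plot is then just a matter of computing one such vertex list per row and connecting the vertices with line segments.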

And again, what we're seeing is

these pairwise combinations on a parallel coordinate plot.

And so we're just putting all of our variables on different axes and

then we can connect based on our categorical variable of interest or things like that.

So if this is eight quizzes in class,

each line could be a student representing that student's record.

Now the thing is with parallel coordinate plots,

I didn't have to put V8 next to V7.

I could have moved V8 over to V4 and V4 back to V8.

What happens in a parallel coordinate plot is even though I can now

see all of the data records in this single view,

I can only see sort of pairwise correlations.

So here I can see most of the data from V1 to V2 has a downward trend.

Meaning that in general, V1,

if I do have a high value in V1,

I generally have a lower value in V2.

If I look at V4 to V5,

if I have a lower value in V4,

I have a higher value in V5.

So these trends let me still look at pairwise comparison and it

can even show V3 to V4 and V4 to V5.

So I can sort of look at two correlations at once,

I can reorder things,

but the order of these axes is going to greatly

influence what this visualization looks like.
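
One way to see why axis order matters is to list which pairs of variables end up next to each other, since those are the only correlations the plot shows directly. The sketch below assumes Pearson correlation as the measure and uses made-up columns. The helper names are hypothetical:

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    # A constant column has no defined correlation; report 0.0 here.
    return cov / (sx * sy) if sx and sy else 0.0


def adjacent_correlations(columns, axis_order):
    """Correlation for each pair of neighboring axes in the given order."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in zip(axis_order, axis_order[1:])
    }


# Made-up columns: V2 runs opposite to V1, V3 loosely follows V1.
columns = {"V1": [1, 2, 3, 4], "V2": [4, 3, 2, 1], "V3": [1, 3, 2, 4]}

original = adjacent_correlations(columns, ["V1", "V2", "V3"])
reordered = adjacent_correlations(columns, ["V1", "V3", "V2"])
```

With the first ordering the V1 to V3 relationship is never drawn at all. Swapping two axes brings it into view and hides V1 to V2 instead.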

Likewise, what's really important actually is the angle of the lines between the axes,

because these angles represent this level of correlation,

and we keep coming back and talking about correlation.

The reason why correlation is important is because it

allows me to create some sort of mathematical formula,

like X is equal to MY.

And if I can make some formula like this,

if I know Y,

I can predict X.

And if I can predict X, if I know something about the future,

if I know something about other things I haven't seen,

so for example if I'm trying to hire

a new baseball player and I want to guess how they might do and I don't have

any measurements on their batting average

but I have lots of measurements on their on-base percentage,

I can guess their batting average from this information.

For stock markets, if I can make some sort of

equation like this where I know how much stock Y has made,

I can guess how much stock X will make

and I might be able to predict things in the future,

I might be able to classify unknown information.
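
A rough sketch of that "X is equal to MY" idea: estimate M from records where both values are known, then use it to guess X for a new Y. Fitting M by least squares through the origin is an assumption here (the lecture does not say how M is found), and the numbers are made up:

```python
def fit_slope_through_origin(ys, xs):
    """Least-squares estimate of m in the model x ~ m * y (no intercept)."""
    denom = sum(y * y for y in ys)
    if denom == 0:
        raise ValueError("cannot fit a slope when every y is zero")
    return sum(x * y for x, y in zip(xs, ys)) / denom


def predict(m, y):
    """Guess x for a new y using the fitted slope."""
    return m * y


# Hypothetical paired measurements where both X and Y are known.
known_y = [1.0, 2.0, 3.0]
known_x = [2.0, 4.0, 6.0]

m = fit_slope_through_origin(known_y, known_x)
guess = predict(m, 4.0)
```

The stronger the correlation between the two variables, the more you can trust a guess like this. With a weak correlation the fitted M still exists, but the prediction is not worth much.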

So that's why looking and exploring data is really important,

and parallel coordinate plots let us

see different correlations between subsets of the data.

So we can even start thinking about is there a subset of the data,

like this subset here that is correlated and why is that

subset correlated where other subsets are not.

So we can start reasoning and exploring different chunks and we can see

which correlations along two axes are of interest.

And again here we're looking at a car data set.

We have the year the car was made,

its horsepower, its acceleration,

the number of cylinders.

So we can start exploring and extracting information.

And what's interesting is we can even think about looking at how these elements cluster,

and we can see the visual clustering of the data in the parallel coordinate plot,

and so we can apply color and opacity based on line density.

So the more elements across a particular chunk,

the more dense they can get.

We can compute local density for each line and average the density values,

and we can apply color and opacity based on user specifications.
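
Here is one possible reading of that density idea as a sketch. It assumes values already scaled to [0, 1], bins each segment by where it starts and ends, and uses the number of segments sharing a bin as the "local density". The binning scheme, the `bins` and `min_alpha` parameters, and the function name are all assumptions:

```python
from collections import Counter


def line_opacities(lines, bins=4, min_alpha=0.1):
    """Map each line to an opacity based on how crowded its path is.

    lines: list of equal-length value lists (at least two axes),
    with every value already scaled to [0, 1].
    """
    def bin_of(v):
        # Clamp so that v == 1.0 falls in the last bin.
        return min(int(v * bins), bins - 1)

    n_gaps = len(lines[0]) - 1

    def cell(line, g):
        return (g, bin_of(line[g]), bin_of(line[g + 1]))

    # Count how many segments fall into each (gap, left bin, right bin) cell.
    counts = Counter(cell(line, g) for line in lines for g in range(n_gaps))

    # Per-line density: the average of its segments' cell counts.
    densities = [
        sum(counts[cell(line, g)] for g in range(n_gaps)) / n_gaps
        for line in lines
    ]
    top = max(densities)
    return [max(min_alpha, d / top) for d in densities]


# Made-up data: three lines travel together, one goes the other way.
lines = [[0.1, 0.9], [0.15, 0.95], [0.2, 0.85], [0.9, 0.1]]
alphas = line_opacities(lines)
```

Lines that travel with the crowd come out opaque and the stray line comes out faint. Flipping that mapping would instead highlight the outliers, which is the kind of choice you would leave to the user.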

So we can start filtering things out, looking for trends.

We don't even have to draw straight lines.

As you see here, we can try to curve things

to show different patterns and get different visual aesthetics,

because one of the big challenges with parallel coordinate plots is,

if I have a really large set of

lines and a really large set of countries or baseball players,

I can wind up with plots like this where it's just really hard to see anything,

I have too many lines that overlap.

So we started thinking about how we might be able to bundle these together.

How can we use color and opacity to help show trends?

How can we allow the user to select different things that are important?

They may say well, I'm only really interested in countries with

a low GDP and a high life expectancy,

because those things are sort of interesting.

I wonder why that might occur.

And so again we can allow user interaction to filter, to give information in a tooltip,

to even reorganize axes in the parallel coordinate plot.
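
That kind of selection, often called brushing, can be sketched as a simple range filter. The function name `brush`, the thresholds, and the records below are hypothetical, and values are assumed to be normalized to [0, 1]:

```python
def brush(records, ranges):
    """Keep records whose value on every brushed axis lies in [lo, hi]."""
    return [
        r for r in records
        if all(lo <= r[axis] <= hi for axis, (lo, hi) in ranges.items())
    ]


# Made-up, already-normalized country records.
records = [
    {"name": "A", "gdp": 0.10, "life_expectancy": 0.90},
    {"name": "B", "gdp": 0.80, "life_expectancy": 0.95},
    {"name": "C", "gdp": 0.15, "life_expectancy": 0.40},
    {"name": "D", "gdp": 0.05, "life_expectancy": 0.85},
]

# "Low GDP and high life expectancy" expressed as two axis ranges.
selected = brush(records, {"gdp": (0.0, 0.2), "life_expectancy": (0.8, 1.0)})
selected_names = [r["name"] for r in selected]
```

In an interactive plot the ranges would come from the user dragging along an axis, and only the selected lines would stay highlighted.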

And just like we talked about scatter plots,

we can look at screen-space metrics to calculate insight into these plots.

In scatter plots we talked about things like skewness, clumpiness,

and striations. With scatter plots,

once we draw the plot,

we're trying to measure what this geometry looks like,

and the way the geometry looks,

may give us insight into whether this plot is interesting for humans to look at.

By that same token,

with parallel coordinate plots,

once we draw our geometry,

we can start doing some sort of screen-space metrics

to try to understand what this might look like.

Likewise, we can also think about how we

could create lower dimensional projections of the data.

So, taking all of these N variables,

and reducing them to just the two or three most important variables,

so that we only have to plot the ones that would give the maximum insight into the data.

This would help us optimize the parameter space

for things like pixel based orientations and visualizations.

And we can have metrics then based on particular views of the parallel coordinate plots.

The problem is that this also depends on the size of the display,

and the space between the axes is where interesting patterns occur.

But the more variables I have,

the more axes I have to draw.

And sometimes then if I have a lot of axes I have to draw,

I may not have a lot of space.

Likewise, if this gets really long,

but I have a lot of screen space,

I might have screen space up here that I don't even wind up using.

I can take this and I can rotate it,

so I don't have to draw my parallel axes vertically.

I can do my parallel coordinate plots this direction,

and I can draw my lines as well.

But now I'm losing out on this screen space over here.

So again, thinking about trade offs between

these different visualizations of scatter plots and parallel coordinate plots.

And with parallel coordinate plots, basically again,

each connection between a pair of axes

gives us some information about the data.

Well, we can use a variety of metrics to try and optimize the use of screen space.

For example, we can look at histogram distance,

like recording the slope of the lines between the axes.

We can use paired histograms or a histogram

of all the lines covering both axes to try to determine,

you know, should I put these two axes further apart,

and these two axes closer together?

What sort of information is there?

Can I delete this axis altogether because it's not interesting?
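
As a rough sketch of recording the slopes between two axes, the code below bins the slope of every line segment into a histogram. It assumes both axes are normalized to [0, 1] and one unit apart, so a slope is just right value minus left value. The function name and bin count are assumptions:

```python
def slope_histogram(pairs, bins=4):
    """Histogram of segment slopes between two adjacent axes.

    pairs: (left, right) values scaled to [0, 1], so each slope
    (right - left) lies in [-1, 1]. Bin 0 holds the steepest
    downward lines and the last bin the steepest upward ones.
    """
    counts = [0] * bins
    for left, right in pairs:
        slope = right - left
        # Shift [-1, 1] to [0, 1], then clamp the top edge into the last bin.
        idx = min(int((slope + 1.0) / 2.0 * bins), bins - 1)
        counts[idx] += 1
    return counts


# Made-up segments: two steep rises, one steep fall, one flat line.
pairs = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0), (0.5, 0.5)]
hist = slope_histogram(pairs)
```

A histogram piled into one or two bins means the lines mostly move together, which suggests a pattern worth giving space to. A flat, spread-out histogram suggests there is not much structure between those two axes.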

So, we can start using these different metrics

and information to again sort of analyze the data first,

present what's interesting, let the user filter, analyze again,

and try to help them explore,

form hypotheses, and understand what's going on within the data set.

And again there's a variety of metrics to optimize the use of

screen space like line crossing.

So we can interpret each line between a pair of axes as a directed interval,

and sort of count the number of times that the lines cross.

And again, think about it this way.

If we have no line crossings between two axes,

this means that every time a value goes from low to high,

or could have been from high to low,

we easily can sort of sense this pattern.
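
A small sketch of the line-crossing count between one pair of axes. Two lines cross exactly when their order on the left axis is reversed on the right axis. The brute-force pairwise loop and the function name are assumptions chosen for clarity, not speed:

```python
from itertools import combinations


def count_crossings(pairs):
    """Count crossings among segments drawn between two adjacent axes.

    pairs: (left, right) values, one tuple per data record.
    """
    crossings = 0
    for (a1, b1), (a2, b2) in combinations(pairs, 2):
        # Opposite signs mean the order flips between the two axes.
        if (a1 - a2) * (b1 - b2) < 0:
            crossings += 1
    return crossings


# Made-up examples: one set where the order is preserved, one where it flips.
order_kept = [(0.1, 0.2), (0.3, 0.5), (0.6, 0.9)]
order_flipped = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1)]

kept_count = count_crossings(order_kept)
flipped_count = count_crossings(order_flipped)
```

Zero crossings means the ranking of the records is the same on both axes. When every pair crosses, the ranking is completely reversed, which is just as clear a pattern.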

We can also use the angles of crossing, measuring the angles at which the lines cross.

So if I have two axes and I'm getting all of these crossed lines,

I can measure the angle between each cross pair,

and use that as some sort of metric as well.
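
The crossing-angle measurement can be sketched the same way. This version assumes the two axes are `spacing` units apart in the same units as the values, and it reports the angle between the two segment directions for each crossing pair. The function name is hypothetical:

```python
import math
from itertools import combinations


def crossing_angles(pairs, spacing=1.0):
    """Angle in degrees at each crossing between two adjacent axes.

    pairs: (left, right) values, one tuple per data record.
    spacing: horizontal distance between the two axes (must be > 0).
    """
    angles = []
    for (a1, b1), (a2, b2) in combinations(pairs, 2):
        if (a1 - a2) * (b1 - b2) < 0:  # the two segments cross
            t1 = math.atan2(b1 - a1, spacing)
            t2 = math.atan2(b2 - a2, spacing)
            angles.append(math.degrees(abs(t1 - t2)))
    return angles


# One rising and one falling line that cross in the middle.
angles = crossing_angles([(0.0, 1.0), (1.0, 0.0)])
```

Because the angle depends on `spacing`, this is truly a screen-space measure: moving two axes further apart flattens every line and shrinks every crossing angle.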

And we don't have to do this only for quantitative data,

we can actually do this for sets of data as well.

And so, Robert Kosara introduced this idea of parallel sets,

where we can adopt

this parallel coordinate layout by using a frequency-based interpretation.

So for example, there's a nice data set about the Titanic.

So, how many male passengers were in first class and survived the Titanic,

versus second class, or third class, or so on.

So our data set winds up looking something like this.

So, we have first class,

second class, third class.

We have male, we have female,

we may also have some sort of a secondary role like survived or not survived.
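
A frequency-based view like this boils down to counting how many records share each combination of categories. The sketch below uses a handful of made-up passenger records (these are not the real Titanic numbers), and the helper name `ribbon_counts` is hypothetical:

```python
from collections import Counter

# Made-up records for illustration only; not the actual Titanic data.
passengers = [
    {"class": "first", "sex": "female", "survived": "yes"},
    {"class": "first", "sex": "male", "survived": "no"},
    {"class": "second", "sex": "male", "survived": "yes"},
    {"class": "third", "sex": "male", "survived": "no"},
    {"class": "third", "sex": "male", "survived": "no"},
    {"class": "third", "sex": "female", "survived": "yes"},
]


def ribbon_counts(records, dim_a, dim_b):
    """Count records for each (category on dim_a, category on dim_b) pair."""
    return Counter((r[dim_a], r[dim_b]) for r in records)


class_to_sex = ribbon_counts(passengers, "class", "sex")
sex_to_survived = ribbon_counts(passengers, "sex", "survived")
```

Each count would set the width of one band between two neighboring category axes, so instead of one line per passenger you get one band per combination.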