Introduction to Data Exploration,
Color Schemes and Design.
In this module, our goal is to
identify appropriate color schemes for different data types.
We've talked about different visual variables for different data elements.
For example, we keep going back to talking about what
if we have a data set of our course,
where we have your name,
your quiz grades, and those things.
How can I map those to visual attributes?
And we talked about position along a common axis, size, shape.
And then, the one we sort of glossed over so far has been color.
And partly, it's because we need to identify
appropriate color schemes for different data types.
Color is one of the fundamental things we use in visualization.
But oftentimes, it's been found to be lacking.
We saw that with Cleveland, when he measured perception of color:
people had a hard time mapping color to a value.
So, how can we help improve that?
What are some of the design principles for choosing appropriate color schemes?
And we really have three different design principles for color.
Given a univariate data type,
we want to have order, separation, and aesthetics.
What do I mean by a univariate data type?
Well, this is something like our coffee cup size.
We have small, medium, and large.
We want to make sure that the color we pick has an order for this univariate data.
The color scale that's chosen needs to map to the data, and it
must represent a perceived ordering if there is order in the data.
If we were mapping color to our names,
we don't want a perceived ordering.
Separation means that the color scale that's chosen must represent
a perceived ordering and have the correct amount of separation between its steps.
That means, if I have light red to darker red,
I want people to be able to perceive that the change in
red between those two boxes is approximately the same.
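One rough way to check separation is to measure how light each color in a ramp looks and compare the size of the steps. The sketch below is only an illustration: it assumes the colors are sRGB values in the range 0 to 1, it uses the standard CIE L* lightness formula as a stand-in for perceived difference (lightness only, which is a simplification), and the function names and the example red ramp are hypothetical.

```python
def srgb_to_linear(c):
    """Convert one sRGB channel in [0, 1] to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def lightness(rgb):
    """Approximate CIE L* (0 = black, 100 = white) of an sRGB color."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b  # relative luminance
    if y > 216 / 24389:
        return 116 * y ** (1 / 3) - 16
    return (24389 / 27) * y


def lightness_steps(colors):
    """Return the change in L* between each pair of neighboring colors."""
    levels = [lightness(c) for c in colors]
    return [b - a for a, b in zip(levels, levels[1:])]


# Hypothetical dark-red to light-red ramp, made by stepping the channels evenly.
red_ramp = [(0.4, 0.0, 0.0), (0.7, 0.1, 0.1), (1.0, 0.4, 0.4), (1.0, 0.7, 0.7)]
for step in lightness_steps(red_ramp):
    print(round(step, 1))
```

If the printed steps are very uneven, the ramp probably does not have the separation we want, even though the raw channel values were spaced evenly.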
Then for aesthetics, I want color to be pleasing.
It needs to contain a maximum perceptual resolution.
The ordering should be intuitive and it should be enjoyable for people to see.
So we have three different major univariate color schemes,
starting with what we're going to call the Rainbow Color Scheme,
or the Qualitative Color Scheme.
Now currently, the Rainbow Color Scale is one of the most commonly used.
You're going to see this on weather maps.
You're going to see this in a variety of scientific visualizations.
But oftentimes this is a very poor choice
and a very poor color map in a large variety of domain problems.
The ordering of hues is unintuitive.
Often, this is a low value to a high value.
So, in weather maps this is cold and this is hot.
Why should red be hot and purple be cold?
And why is green sort of in between?
Is that intuitive for everybody?
Is it intuitive for different cultures?
Do people understand this?
So, in previous lectures we talked about having nominal data,
ordinal data, interval, and ratio data.
These sorts of Rainbow Color Schemes and Qualitative Color Schemes are good
for nominal data where some sort of underlying ordering is not implied.
The Qualitative Color Scheme is just the broad category, and the
Rainbow Color Scheme is a particular type of Qualitative Color Scheme.
A Qualitative Color Scheme,
you see again, has different hues with no implied ordering.
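As a small illustration of a qualitative scheme, the sketch below gives each category its own hue by spacing hues evenly around the color wheel while holding saturation and brightness fixed. It uses Python's standard colorsys module. The function names, the default saturation and value, and the example student names are assumptions made for this example.

```python
import colorsys


def qualitative_palette(n, saturation=0.65, value=0.9):
    """Return n RGB tuples with evenly spaced hues and equal saturation/value."""
    return [colorsys.hsv_to_rgb(i / n, saturation, value) for i in range(n)]


def assign_colors(categories):
    """Map each distinct category to one color from the palette."""
    unique = sorted(set(categories))  # sorted only to make the result repeatable
    return dict(zip(unique, qualitative_palette(len(unique))))


# Hypothetical nominal data: student names, where no ordering is implied.
colors = assign_colors(["R", "J", "A", "F", "R"])
for name, rgb in colors.items():
    print(name, tuple(round(c, 2) for c in rgb))
```

Only the hue changes from one category to the next, so no category should look like it is more or less than another.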
Our second type of univariate color scheme to discuss is the Sequential Color Scheme.
Now, notice the Sequential Color Scheme is just a changing of shades along one sort of
color, and grayscale is a particular type of Sequential Color Scheme.
Sequential maps represent ordered data.
Oftentimes, we go from light to dark,
so low to high.
Again, with low to high, we can always flip it around to high to low as well.
Dark colors typically represent our higher ranges,
brighter colors our lower ranges.
The benefit is that the scale is often pretty
perceptually intuitive, but it has a weakness in
that only a limited number of distinguishable colors can be represented.
It's not that I can't make every shade of color from white to green.
It's that people can't perceive the difference,
so that becomes an issue.
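Because people can only tell a handful of shades apart, one common way to build a sequential scale is to group the values into a few classes and give each class one shade. The sketch below assumes equal-width classes and a straight-line blend from white to a dark green. The function names, the number of classes, and the end colors are all hypothetical choices.

```python
def lerp_color(start, end, t):
    """Blend two RGB tuples; t = 0 gives start, t = 1 gives end."""
    return tuple(s + (e - s) * t for s, e in zip(start, end))


def sequential_bins(values, n_bins=5, light=(1.0, 1.0, 1.0), dark=(0.0, 0.4, 0.0)):
    """Color each value (non-empty list) by its equal-width class, light = low."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    colors = []
    for v in values:
        # Class index in 0 .. n_bins - 1; the maximum value joins the top class.
        idx = min(int((v - lo) / span * n_bins), n_bins - 1)
        t = idx / (n_bins - 1) if n_bins > 1 else 0.0
        colors.append(lerp_color(light, dark, t))
    return colors


# Hypothetical quiz scores out of 10.
for score, rgb in zip([1, 3, 5, 7, 10], sequential_bins([1, 3, 5, 7, 10])):
    print(score, tuple(round(c, 2) for c in rgb))
```

However many data values go in, only n_bins different shades come out, which keeps the shades far enough apart to be told apart.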
Grayscale is nice because it's the simplest map, where
a variable is just mapped to brightness, so bright white to dark. It prints well.
Again, we get a limited number of distinguishable colors.
So, that means we have to sort of bin our data together to form it into
groups, and we can also wind up with problems in Grayscale.
Grayscale actually can wind up creating different perceptual illusions.
So, if we look at our cube here,
we can ask ourselves how many different colors of gray we see.
Well, here we've got one color of gray.
Here, we've got two.
Here, this may look like three different colors of gray.
And this may look like four,
five and this may look like a different color as well.
Oftentimes, if people are asked how many different colors of gray there are on this cube,
they say six, but this is due simply to having these gray squares next to each other.
And when they're next to each other,
they wind up implying some sort of shadowing color.
What we find is we actually only have four different shades of gray on here.
Due to how we've organized the different shades,
we wind up perceiving different sorts of shadows and different processes there.
Adding things like a thin border can maybe help reduce the illusion that we're
seeing but it causes other perceptual problems where things might look blocky and chunky.
Thinking about how we can overcome some of these illusions,
also tells us we need to be careful about accidentally creating these.
Now, the third color scheme we want to talk about is very similar to
the Sequential Color Scheme in that we're mapping different hues.
Except, we're really sort of taking two different hues and
combining them together around a central midpoint.
This is good because it provides means for variable comparison.
So I can compare above or below a certain mark.
For example, if we're mapping our quizzes,
and we want to compare how many students got above a C
and below a C, I make C my divergent point here,
and A can be on this side of the scale,
and F can be on this side.
It allows me to compare across those ranges.
We could have that same point for temperature.
We could have this be our zero temperature point and this could be hot because it's red.
This could be cold because it's blue.
This is best suited for ratio data where there's some meaningful,
not necessarily zero point,
but some meaningful comparison point.
The scale lacks the natural ordering of colors,
but it's ordered from
that zero point and careful choices need to be made in choosing high and low ends.
Often, we use the concept of cool blues, and warm reds,
and yellows for these different colors but this allows us to create
this nice combination of sequential scales to do this comparison for divergent schemes.
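The idea of joining two sequential scales at a midpoint can be sketched in a few lines of Python. This is only an illustration under stated assumptions: white is used as the neutral midpoint color, pure blue and pure red are used for the cool and warm ends, and the function names and the grade-point numbers are hypothetical.

```python
def lerp_color(start, end, t):
    """Blend two RGB tuples; t = 0 gives start, t = 1 gives end."""
    return tuple(s + (e - s) * t for s, e in zip(start, end))


def divergent_color(value, low, mid, high,
                    cool=(0.0, 0.0, 1.0),
                    neutral=(1.0, 1.0, 1.0),
                    warm=(1.0, 0.0, 0.0)):
    """Color a value by how far it sits below (cool) or above (warm) mid."""
    if value <= mid:
        t = (mid - value) / (mid - low) if mid != low else 0.0
        return lerp_color(neutral, cool, min(max(t, 0.0), 1.0))
    t = (value - mid) / (high - mid) if high != mid else 0.0
    return lerp_color(neutral, warm, min(max(t, 0.0), 1.0))


# Temperature around a zero point: cold is blue, hot is red.
print(divergent_color(-10, low=-10, mid=0, high=10))
print(divergent_color(0, low=-10, mid=0, high=10))
print(divergent_color(10, low=-10, mid=0, high=10))

# Quiz grades around a C, assuming grade points F = 0, C = 2, A = 4.
print(divergent_color(3, low=0, mid=2, high=4))
```

Each half is its own sequential scale, so the two sides are scaled separately. That matters when the comparison point is not halfway between the low and high ends.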
So, again, we have three different major types of color schemes for univariate data.
We have our Qualitative or Rainbow Color Scheme,
primarily for nominal data.
We have our Sequential Color Scheme, used for ordered data, or for interval and ratio data.
And then, we have our Divergent Color Scheme, which is
for ratio data specifically, where we want to
compare to some sort of midpoint and see what's above and what's below.
We have some sort of natural zero.
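That summary can be written down as a small lookup. The sketch below only restates the rule of thumb from this lecture. The function name and the has_midpoint flag are hypothetical, and real design choices will depend on more than the data type.

```python
def choose_scheme(level, has_midpoint=False):
    """Suggest a univariate color scheme for a level of measurement.

    level is one of "nominal", "ordinal", "interval", "ratio".
    has_midpoint marks data with a meaningful comparison point.
    """
    level = level.lower()
    if level == "nominal":
        return "qualitative"
    if level in ("ordinal", "interval", "ratio"):
        if level == "ratio" and has_midpoint:
            return "divergent"
        return "sequential"
    raise ValueError(f"unknown level of measurement: {level!r}")


print(choose_scheme("nominal"))                    # student names
print(choose_scheme("ordinal"))                    # small, medium, large
print(choose_scheme("ratio", has_midpoint=True))   # scores compared to a midpoint
```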
Now, that doesn't mean that we can't do Multivariate Color Schemes as well.
We can try to map color across two or more dimensions.
The problem is, then people have to think about
the separability of these colors and it can increase the cognitive load.
So, again, if I have a data set where I have your scores for quiz one,
quiz two, and quiz three,
and we have students R, J, A,
we can say, "Okay,
if this student got perfect scores,
the quizzes are out of 10.
The student got mid scores,
and this person got low scores.
What color would they be and how would you do this?"
What if we just want to map a color to our first two variables?
What we do is we create an x-axis and a y-axis,
and try to create some sort of color scale like this.
A person that's got 10 on quiz one and a 10 on quiz two,
would wind up here so that would be R.
The person that got a one on quiz one and a one on quiz
two would wind up there, and a person in the middle would be here.
We have another person that wound up with a five,
a one, and a seven or something like that.
On quiz one, they had a five,
so they might wind up here.
They had a one on quiz two, so that would be here.
So person F might be that color.
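A bivariate scale like this can be sketched by letting each quiz drive one color channel. The colors on the slide are not captured in this transcript, so the choice here is purely an assumption: quiz one controls red and quiz two controls blue. The function name is hypothetical as well.

```python
def bivariate_color(quiz1, quiz2, max_score=10):
    """Map two scores to one RGB color: quiz1 drives red, quiz2 drives blue."""
    x = min(max(quiz1 / max_score, 0.0), 1.0)
    y = min(max(quiz2 / max_score, 0.0), 1.0)
    return (x, 0.0, y)


# Quiz one and quiz two scores for the students in the example.
students = {"R": (10, 10), "J": (5, 5), "A": (1, 1), "F": (5, 1)}
for name, (q1, q2) in students.items():
    print(name, bivariate_color(q1, q2))
```

Reading a color back into two scores means mentally pulling apart how much red and how much blue is in it, which is where the extra cognitive load comes from.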
We can do this for all three variables as well so we have axes in this dimension.
Since R got a 10, 10, and a 10,
here's 10 on variable one and here's 10 on variable two,
so R might wind up here. J got five on V1,
five on V2, and five on V3, so somewhere maybe in this range for J.
A got one, one, one, and that becomes tricky, so it's one on V1,
one on V2, and one on V3.
It's sort of here for A.
Now, F got five on V1,
so somewhere in the middle, one on V2,
and seven on V3, so up here.
So we're at five, one, and seven.
What color should that be?
And that becomes really difficult in
trying to interpret these multivariate color schemes.
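To see why this is hard to read, here is a minimal sketch that maps the three quiz scores straight onto the red, green, and blue channels. That channel assignment is an assumption made for illustration, not the scale drawn in the lecture, and the function name is hypothetical.

```python
def trivariate_color(v1, v2, v3, max_score=10):
    """Map three scores to RGB: v1 drives red, v2 green, v3 blue."""
    def clamp(score):
        return min(max(score / max_score, 0.0), 1.0)
    return (clamp(v1), clamp(v2), clamp(v3))


# The four students from the example, with scores on quizzes one to three.
students = {"R": (10, 10, 10), "J": (5, 5, 5), "A": (1, 1, 1), "F": (5, 1, 7)}
for name, scores in students.items():
    print(name, scores, trivariate_color(*scores))
```

With this mapping R, J, and A all land on the gray diagonal from white toward black, while F comes out as a purplish color. A viewer would have to estimate three channel amounts from one swatch to recover the scores.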
And, in general, you won't see these very much used in practice.
You may see sort of what we would call a bivariate color map here.
And I'll see these sometimes in
geographic visualization and choropleth map representations.
But, it's very rare that I see a trivariate color scheme or higher used.