A very important step when you are designing or using graphs

is the step called data transformation.

So as I've shown you before,

we have two steps.

The first one is selection and transformation,

and the second one is choosing or designing an appropriate representation.

So far, I've been talking about how to choose and design a representation.

Now, we are going back to the first step,

and trying to look in more detail at what this means.

Okay. So, let's start with selection. What is selection?

The idea here is that,

when you are designing a new visualization,

you are, virtually all the time,

starting from a data set,

a table that contains multiple attributes.

But every single representation represents only a small fraction of these attributes,

a small subset of attributes.

This means in turn,

that every time you are designing a new visual representation,

you also have to choose

which attributes are going to be used for this visual representation.

This process or this step is called selection,

selecting which attributes you need to create the representation that you want to create.

Let me give you an example.

Starting from the food data set that we used before,

say that you want to answer this question,

in the food data set,

visualize the relationship between carbohydrates and calories,

and see how it is affected by food categories,

where every single item in your data set is a food product.

One way to do that, is to use a Scatter plot,

as we have seen before,

where on the X-axis,

you have amount of carbohydrates, on the Y-axis,

you have calories, and every dot is colored with the food category.

So, think about it. What are we doing here?

We are starting from a large data table with lots of attributes,

and selecting three attributes out of them,

which are the attributes that we need in order to answer this question.

Carbohydrates, energy in terms of calories, and product type.
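
As a minimal sketch of this selection step, assume the data set is held as a list of Python dictionaries, one per food product. The rows, attribute names, and values below are made up for illustration and are not taken from the actual food data set; `select` is a hypothetical helper.

```python
# Hypothetical rows standing in for the food data set (values are invented).
foods = [
    {"name": "Apple", "category": "Fruit", "calories": 52,
     "carbohydrates": 14, "protein": 0.3, "fat": 0.2},
    {"name": "Cheddar", "category": "Dairy", "calories": 403,
     "carbohydrates": 1, "protein": 25, "fat": 33},
    {"name": "White rice", "category": "Grains", "calories": 130,
     "carbohydrates": 28, "protein": 2.7, "fat": 0.3},
]


def select(rows, attributes):
    """Keep only the named attributes from every row."""
    return [{attr: row[attr] for attr in attributes} for row in rows]


# Selection: three attributes out of the many the table contains.
selected = select(foods, ["carbohydrates", "calories", "category"])

# One scatter-plot point per product: (x position, y position, color key).
points = [(r["carbohydrates"], r["calories"], r["category"]) for r in selected]
```

Each tuple in `points` carries exactly what the scatter plot needs: the X position, the Y position, and the category that decides the color of the dot.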

Let me give you another example.

Starting from the same data set,

now imagine that you want to answer a different question.

You want to see how the average amount of calories distributes across food categories.

That is for every food category,

I want to know on average,

how many calories there are across all the food products.

To do that, once again,

I have to select two attributes. Which attributes?

Well, I have to select the food category and the amount of calories.

And I want to create a chart like the one that you see here,

where every single bar represents one food category,

and the height of the bar represents the average amount of calories.

But now, you may have noticed that I need an intermediate aggregation step.

You can't just go from the table to this graph.

Why? Because this table contains information about every single product.

But here in this graph,

I need to aggregate information across all products,

in order to calculate the average amount for every single category.

Once again, going from here to here,

I need an intermediate step.

So, let me show you what this intermediate step is.

Starting from the original table,

once again, first I have to select two columns,

but I also have to aggregate

these two columns together to generate the information that I need.

In this case, what do I need?

I need to group all the products by category,

and calculate the new number,

where this number is the average across

all food products that belong to each of these categories.

Once I do that, I am able to generate the final chart.
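
The intermediate step can be sketched roughly as follows, assuming the two selected columns are available as (category, calories) pairs. The pairs below are invented for illustration, and `average_by_category` is a hypothetical helper, not part of any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (category, calories) pairs: the two selected columns.
products = [
    ("Fruit", 50), ("Fruit", 90),
    ("Dairy", 60), ("Dairy", 400),
    ("Grains", 250), ("Grains", 350),
]


def average_by_category(rows):
    """Group the calorie values by category, then average each group."""
    groups = defaultdict(list)
    for category, calories in rows:
        groups[category].append(calories)
    return {category: mean(values) for category, values in groups.items()}


# One number per category: this is what the height of each bar encodes.
averages = average_by_category(products)
```

The result has one entry per food category instead of one row per product, which is exactly the shape the bar chart needs.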

So, as I just said, virtually all graphs require the selection step.

Typically, you have more attributes than you need to visualize.

So, you first have to figure out which of these attributes you need to select,

in order to create the visualization that you need,

and many visualizations require an intermediate step,

that typically is aggregation,

or other transformations that I'm going to talk about in a moment.

Let me just go over the aggregation step

once again to make sure you understand what I'm talking about.

This is a little table where I have, in one column,

a categorical attribute, and in the other column,

I have a quantitative attribute,

and say that I want to create,

once again, a bar chart for every single category.

So, what do I need to do?

I need to aggregate all the items that have the same category,

as you see in this picture,

and calculate some statistic,

or some function that takes all the values,

all the quantitative values that belong to one category,

and aggregates them together.

In the previous example,

we used the average, but in general,

there are a lot of different functions or aggregate statistics that you can apply,

or you may want to apply to these numbers.

Common aggregation functions that are used are the sum,

the maximum, the minimum, the average,

the median, and the standard deviation,

but there may be situations where you need to calculate some other type of information.

These are just the most common ones.
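
To make this concrete, here is a small sketch that applies each of these common functions to the same grouped values. The table is made up, `aggregate` is a hypothetical helper, and the standard deviation shown is the sample standard deviation from Python's `statistics` module, which needs at least two values per category.

```python
from statistics import mean, median, stdev

# Common aggregation functions, keyed by a display name.
AGGREGATES = {
    "sum": sum,
    "max": max,
    "min": min,
    "average": mean,
    "median": median,
    "std": stdev,  # sample standard deviation; needs >= 2 values per group
}


def aggregate(rows, func):
    """Group the quantitative values by category and reduce each group with func."""
    groups = {}
    for category, value in rows:
        groups.setdefault(category, []).append(value)
    return {category: func(values) for category, values in groups.items()}


# Hypothetical table: one categorical column, one quantitative column.
rows = [("A", 2), ("A", 4), ("A", 9), ("B", 10), ("B", 20)]

# One aggregated table per function, all computed from the same rows.
summary = {name: aggregate(rows, func) for name, func in AGGREGATES.items()}
```

Swapping the function changes what the bars mean while the grouping stays the same, so if you need some other type of information you can pass in any function that maps a list of values to a single number.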