0:04

So we've looked at the histogram as a way of visually exploring our data.

If you wanted to summarize the data numerically,

let's talk about first the measures of central tendency

that we can use to to explain the data.

So we can we often look at the mean or the average of the data.

Another metric that you might use is the median or what is the middle value.

That is if you rank order your data from highest to lowest,

one observation falls in the middle of that data such that 50 percent

of observations are greater than that 50 percent of observations

are going to be less than or equal to that.

So we'll look at both of those commands both mathematically and within Excel.

There are also measures of dispersions that we can use to summarize the data.

The most common probably being looking at the standard deviation

or the variance that's observed in the data

that's giving us a measure of dispersion from the mean.

So from the average value, you might also look at the range

or the interquartile range to get a sense for how much variation is there in your data.

So these are the formulas that we can both the mathematical expression

for calculating the average as well as how we can derive

that using Excel the command equals the average

and then you would enter the cell range for where your data resides.

What Excel is going to do?

It's going to add up all of those values in that range,

divide by the number of observations to come up with the average.

The median if you wanted to calculate

that equals median and then the highlighted data range.

One thing to note and we'll take a look at this an example to customer value

in the second is that the median is not going to be sensitive to the extreme values.

So if you've got a lot of variation in the value of your customer some are really high,

but most customers are really low.

When you calculate the average that's going to be highly sensitive

to those extreme observations.

The median is not going to be sensitive because it doesn't care what the exact values are.

It just cares about finding the point where 50 percent

are above 50 percent are below that value.

Doesn't matter how far above or how far below

all that it's concerned with is the rank order.

All right.

So here's an example that's been where we've done some work looking at the customer base

of new customers signing up for telecom service

and then we calculate their lifetime value,

techniques that we'll look at when we examine customer analytics

with regards to studying, retention, and driving customer value,

really the story that we get depends on the metrics that we're using.

If we look at the average customer lifetime value based on this histogram turns out

that it's around $1200.

But if we look at the median customer lifetime value

that is you know the point at which half of my customers are going to be worth

less than this amount, half my customers are going to be more worth more than this amount.

Turns out that it's only $800.

That difference between the median

and the mean indicative of having a skewed relationship.

In this case, the data is going to be right skewed

if you look at our distribution as we go from left to right from low value to high value,

we see fewer and fewer observations.

The data seems to fall off.

So that's indicative of that right skewed relationship.

So which customers would we want to target?

Well, if if there was a way for us to identify the customers

who have that high lifetime value it'd be great for us to focus our efforts

on these customers but we need some way of identifying those individuals.

Maybe it's on the basis of demographic variables,

maybe it happens to be the region that they're in, maybe it's the method

of acquisition so if we have additional variables

that tell us something about the individuals

who have that high lifetime value then we can make an informed decision

of do we want to focus on those individuals where they are very valuable

but there aren't that many of them?

Or do we want to focus on customers who may not be as valuable to us,

but maybe it's easier for us to acquire them and there are a lot more of those customers.

So that's what we're going to look at when we talk about the customer analytics.

How hard is it to acquire these different customer groups?

What's the difference in value going to be

and what are the levers that we can pull to try to attract them?

All right.

Looking at dispersion, common measure is going to be the variance.

So I've given you here the mathematical derivation for the variance

as well as the Excel notation equals var.

And then highlighting the data range and you'll see

that in the derivation for the variance,

what we're really focused on is the key here is this piece.

It's the observation minus the average.

Now the reason that it's squared is we really don't care how whether it's a positive

or negative deviation from the average or from the mean.

So what we're concerned about is the magnitude of that deviation.

So the bigger the more dispersion there is around the mean

the higher the variance is going to be.

Variance can also be reported in terms of the standard deviation.

So stdev which is really just the square root of the variance.

Other formulas that you might use to cut to understand

how much dispersion is in your data we can calculate the range.

So what's the largest value minus the smallest value.

So here we can use the max and min commands to calculate just the size of that range.

You might also look at what's referred to as the IQR where the interquartile range.

So this is the range that contains the middle 50 percent of the data.

There's not a built in command to say give me the IQR

so we can do is use the percentile command.

And so what we're doing in this equation is say let's extract from the data

what value is that the 75th percentile and we'll subtract from that

the value that's at the 25th percentile that difference giving us the interquartile range.

So these are methods for summarizing data both in terms of the average,

the median, the central tendency and in terms of dispersion.

One thing that we always have to contend with are outliers in our data.

Those extreme observations that can heavily influence our data.

In some cases those outliers might be caused by coding errors.

You know sometimes in data you'll see that the code 9 9 9 9

is used to refer to data that's missing.

Well, we actually have to remove that before we conduct our analysis

because otherwise it's going to skew things.

So when we detect those outliers it may be a reason to dig in further

and understand what's going on.

It is possible that there are just mistakes

but any assumptions that you're making about how you deal with those outliers those

are conclusions that should be noted in any report that you're producing.

Another aspect of quantitative data might be time series or a temporal dimension.

So these are time series plots that were produced using television tuning data.

So data collected from set top boxes for two television programs

in the- I believe early 2000s when we're collecting the data America's Next Top Model

and Pussycat Dolls Presents Girlicious,

the dips in the audience level or the dips in the ratings

caused by commercial breaks so the audience staying tuned to the program,

commercial break comes on, people tune away, come back when the program resumes.

Well that's a characteristic of this type of data.

Audience levels decline during commercial breaks.

So we were in any type of analysis,

that's something that we are going to need to control for.

Another popular time series data is stock performance data.

So we can see how do stocks or do indices tend to move together

or are they tending to move apart.

Well getting to looking at time series data

and building forecasting models in a little bit.

But just to foreshadow the direction we're going in,

one component that we're going to look for is the trend in the data.

So for new product that's launched as time goes on sales may ramp up.

Something else that we might also look at is a cyclical pattern.

Is it related to seasonality?

Is there a pattern in here related to the general state of the economy?

These are factors that we need to do that we need to know about

if they're influencing in this case sales

because if we're going to be making forecasts going forward those are variables

that have to be included in the predictions that we're making.

All right.

We've looked already at how do we examine relationships

that may exist among categorical variables.

Well we want to do the same thing when we look at relationships

that exist among quantitative variables.

A scatterplot is the common visual way of doing that.

So in this particular case, comparing the daily returns

for the S&P 500 with the Verizon stock.

So on the x axis it's the daily return for the S&P,

on the y axis the daily return for a Verizon.

And you'll see from the trend line that's been layered on here

that there is generally a positive relationship.

On days when the S&P went up, Verizon also tended to go up.

So we can look at the direction of the relationship.

In this case, we fit a linear curve a linear trend to that may not always be linear.

That's one thing that we're going to keep in mind.

We're also going to look at how much dispersion there is.

There seems to be a lot of dispersion think of around this line.

So from this point on all the way down to this point, a lot of dispersion

that we're seeing in that scatterplot means

that if I wanted to know what the Verizon price was going to be on a given day,

well I can't just use the S&P because it seems like that's not explaining everything.

When we get into talking about linear regression

that's one of the pieces that that's going to be key for us.

It's trying to reduce the amount of uncertainty that we have.

The more information we have the better fit we should see.

If we're looking at trying to quantify the relationship between two variables,

a common technique that we're going to use is to look at the correlation.

What the correlation does is assess the linear relationship and that's going to be key

because not all relationships are going to be linear, as I'll show you in a second,

sometimes that correlation is going to be indicative of the nature of the relationship.

Other times, it might not be that informative.

So we really have to conduct exploratory analysis to understand the patterns

that exist between quantitative variables.

Within Excel, equals correl, that's going to be the command

that we use to calculate correlation, highlight the x range in the data,

then we highlight the y range in the data.

Those are going to be separated by a comma

and all that we're going to get as output is a number between negative 1 and 1.

So negative 1 indicating a perfect negative linear relationship,

positive one indicating a perfect positive linear relationship.

Mathematically and this is something that I hope you never have to calculate

by hand but this is the mathematical derivation behind the correlation coefficient

and the piece that I want to just draw your attention to just to give you some intuition

for what this number really means is going to be this set of terms.

So what we're doing is for each observation we're calculating

the deviation from the average x value

and we're calculating the deviation from the average y value

and multiplying them together adding that piece up.

Well, what does the positive correlation mean?

Well a positive correlation means that when the x value is above average as we see

which will be given by this first term the y value also tends to be above average.

Both of them are positive.

When the value is below the average, it's going to be negative.

When the y value is below its average value that's going to be negative.

The product of the two giving us a positive relationship.

If we have a negative relationship when X is above its average value,

Y is below its average value and vice versa.

So that's that's where the correlation concept is coming from.

So one you know we can look at this in terms

of the financial performance so I pulled information.

This is data from 2010 on the stock performance of some telecom companies.

And if we look at Verizon, Comcast, and AT&T positively correlated with each other,

also positively correlated with the S&P 500.

So probably no surprise that these finance these stocks

and the S&P 500 generally move in the same direction.

Right.

So this is intended to give you a sense for the extent of dispersion

that you might see depending on the level of correlation,

the middle plot that has an r value of 0.9, you know, very high correlation there,

very tight cluster looks like pretty much a straight line whereas

when we have a lower correlation coefficient the 0.4 on the right or the

or the -0.7, you see there's more dispersion there,

so there still is a linear relationship.

If I were to superimpose a trend line, we would capture a negative relationship.

But the fact that there is a lower correlation coefficient time

is that there is going to be more dispersion there.

So these are all examples when the data actually suggests a linear relationship.

But as I said earlier it's not always going to be a linear relationship

and that's where we need to be careful.

So one example of how the correlation can be misleading

is that we have outliers in our data.

We've got a massive points in this case near the origin, near the lower left

and then we've got a single outlier.

Well if we try to draw best fitting line because that outlier

is going to influence the trajectory of that line,

we're actually going to be told that you know

what there is a strong linear relationship here,

but in reality, it's that one outlier that's doing it.

If we were to discard that and just look at that massive points,

we would not get nearly as strong a linear relationship.

The other piece that we have to be careful about

is when we're dealing with non-linearities.

So in this case we have if we look at the lower right we have a correlation of zero.

Well that doesn't mean that there's no relationship between x and y,

what it means is that there is not a linear relationship between x and y.

You know this relationship is a negative quadratic relationship.

So, y is related to x squared.

Well, it's a very strong quadratic relationship but it's not a linear relationship.

That's why we're getting the correlation of zero.

So doing this visual inspection using scatter plots,

using histograms a very important part of the data analysis that we need to do.

If we were to just look at the correlation coefficient,

we see that there's no linear relationship, we might erroneously conclude

that there's no relationship at all.

All right.

So to recap the techniques that we've looked at we've talked about techniques

for examining individual quantitative variables.

We've also looked at techniques for examining the relationships

among the quantitative variables and correlation and scatterplots good for looking

at the relationship among these two quantitative variables.

We'll come back to examining these relationships more formally

when we start getting into prediction and we start using linear regression

as one of our go to tools in that forecasting.