Covering the tools and techniques of both multivariate and geographical analysis, this course provides hands-on experience visualizing data that represents multiple variables. This course will use statistical techniques and software to develop and analyze geographical knowledge.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

Huan Liu

Professor: Computer Science and Engineering School of Computing, Informatics, and Decision Systems Engineering (CASCADE)

In previous modules in geographic analysis and visualization,

we talked about ways to create

choropleth maps and we've talked about different sorts of measures

for looking at just regular datasets like covariance, correlation, and entropy.

And then we started talking about,

how do we create this sort of weights matrix?

Remember we talked about if we're given some sort of geographic region and

a bunch of different counties in the region like this county one,

two, three, four, five we can create some sort of connectivity matrix between those to

augment calculations for spatial correlations

as opposed to we'd talked about time series correlations and things in the past.

In this module, we want to talk about,

how we can actually extend those and what those formulas are going to look like.

And so we want to explain some specific calculations for

spatial statistics and the most common one is Spatial Autocorrelation.

And now there are global and local measures of spatial autocorrelation.

We're going to focus primarily on these global measures of spatial autocorrelation.

So, if this is a map and each county,

each polygon in the map is colored based on a variable.

If the map looks like this where the things are very dispersed and spread out,

we consider this negatively autocorrelated, so a value near minus one.

Here things are very grouped together.

It's highly clustered, so a value near one, and if things are more random, we get a value near zero.

So, this gives us an idea of sort of this range of

spatial autocorrelation on a global sense so it's measuring

how grouped are the different variables in relation to everything else.

So, if these are all touching counties that have similar measures,

then we have high global autocorrelation.

And we can calculate this very similarly to how we calculated correlation in time series,

it's just a slight modification where

the main spatial autocorrelation measure is extended now with two summations.

In time we just had one summation.

We're just looking at sort of this single dimension.

In space we have to look at two dimensions and then notice this W here,

that's our weights matrix that we talked about how to calculate.

Then we have x sub i,

x sub j, and the mean of x.
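
For reference, the formula being described can be written out as below. This is the standard textbook form of global Moran's I, reconstructed from the spoken description rather than copied from the slide, so the slide's notation may differ slightly. Here N is the number of regions, w_ij is the weights matrix entry for regions i and j, and x bar is the mean of x.

```latex
I = \frac{N}{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}} \cdot
    \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{N} (x_i - \bar{x})^2}
```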

So, what we mean by this formula is,

say all these boxes are counties.

This is one, two, three,

four, five, six, seven, eight, nine.

So I've got some variable. Since these are counties,

I might have the average family income.

Okay, so I can calculate.

Let's say this is for the state of Arizona.

So, I can calculate the mean family income across all counties in Arizona.

So that's x bar. All right.

So, now I know x bar,

and x bar is the same here,

here, and here in my formula.

Right? Now, w sub i j is the connectivity matrix.

So, that matrix since we have nine counties,

that matrix is going to be nine by

nine matrix and it's going to tell us which counties are adjacent to each other.

Now, remember it can also be based on distance,

length, and so forth.

So, we can figure out our weights matrix and so now,

i ranges from one to nine,

j ranges from one to nine, since we have nine counties.

And so basically, we start with i equal to one

and j equal to one.

Then we fill out our formula.

We do the summations and we just walk through

this value, and what this is going to do is Moran's I is going to give

us a global measurement

of whether the pattern expressed in the underlying dataset for example,

average income is clustered, dispersed or random.

So, whether a map might look like this versus this.

And given a set of features and an associated attribute,

this global Moran's value I is going to indicate that.
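
The walk-through above can be sketched in a few lines of Python. This is a minimal illustration using NumPy, not the code used in the course; the function name `morans_i` and the four-region example data are made up for the sketch.

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I for values x (length n) and an n-by-n weights matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()                   # deviations from the mean, x_i - x_bar
    num = (w * np.outer(z, z)).sum()   # double sum of w_ij * z_i * z_j
    den = (z ** 2).sum()               # sum of squared deviations
    return (len(x) / w.sum()) * num / den

# Hypothetical example: four regions in a row, each adjacent to its neighbours.
w_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])
clustered = [1, 1, 5, 5]     # similar values sit next to each other
alternating = [1, 5, 1, 5]   # neighbours always differ

print(morans_i(clustered, w_line))
print(morans_i(alternating, w_line))
```

With these made-up values the clustered pattern gives roughly 0.33 and the alternating pattern gives -1, matching the idea that clustering pushes the value toward one and dispersion pushes it toward minus one.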

So, values near one indicate clustering,

values near negative one indicate dispersion,

and then we can also calculate a Z score to indicate whether we

can reject the null hypothesis that there is no spatial clustering.

So, we can even determine if there is spatial clustering in the data just by

using this sort of formula and we don't have to necessarily even program this by hand.

This is available in Python,

R, and similar tools.

It's just a matter of thinking about,

how do we fill in all of the different variables?

And the most important one is the weights matrix.

You have to decide, do we want to use

queen's contiguity for connection? Is one connected to two,

four, and five, or is one only connected to two and four?

Do I want to use distance for my weights matrix?

So I have some sort of decay function, so the further apart the centroids are, the smaller the weight.

So one could still be connected to nine but it may be a weight of

point five as opposed to a weight of one in the case of five.

So, we have to think about how to calculate weights matrix and fill in all these values.
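
The choices just described can be sketched for the three-by-three grid of counties numbered one through nine, row by row. This is an illustrative sketch only: the helper names (`grid_weights`, `neighbours`, `inverse_distance_weights`) are hypothetical, and inverse distance is just one possible decay function, assumed here for the example.

```python
import numpy as np

def grid_weights(n_rows, n_cols, queen=False):
    """Binary contiguity matrix for a regular grid numbered row by row."""
    n = n_rows * n_cols
    w = np.zeros((n, n), dtype=int)
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue              # a cell is not its own neighbour
                    if not queen and dr != 0 and dc != 0:
                        continue              # rook: skip the diagonals
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        w[i, rr * n_cols + cc] = 1
    return w

def neighbours(w, region):
    """1-based region numbers adjacent to the 1-based `region`."""
    return [int(j) + 1 for j in np.flatnonzero(w[region - 1])]

def inverse_distance_weights(centroids):
    """Weights that decay as 1 / distance between centroids (an assumed decay)."""
    pts = np.asarray(centroids, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    w = np.zeros_like(d)
    mask = d > 0                              # leave the diagonal at zero
    w[mask] = 1.0 / d[mask]
    return w

rook = grid_weights(3, 3)
queen = grid_weights(3, 3, queen=True)
print(neighbours(rook, 1))    # [2, 4]
print(neighbours(queen, 1))   # [2, 4, 5]

# Centroids of the 3x3 grid cells, in the same row-by-row order.
centroids = [(c, r) for r in range(3) for c in range(3)]
w_dist = inverse_distance_weights(centroids)
print(w_dist[0, 8] / w_dist[0, 4])   # weight of region 9 relative to region 5, seen from region 1
```

With inverse distance, region nine is twice as far from region one as region five is, so its weight comes out at half of region five's weight, which is the kind of relationship described above.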

Now, I mentioned a Z score.

So, a Z score is a statistical test to

evaluate a null hypothesis, and it is associated with the normal distribution.

Basically the Z score is measured in standard deviations.

So, we're trying to determine whether this distribution of

patterns could have occurred by chance at, say, a 95 percent confidence level.

What I mean is, if I had a bunch of income data distributed randomly,

how likely is it that this pattern I'm seeing would have occurred by chance or not?

And so Z scores allow us to test this,

get a measure on how critical this

Moran's I might have been or how unlikely it might have been to occur.
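
In symbols, the Z score compares the observed Moran's I with what would be expected if there were no spatial autocorrelation. The expected value shown here is the standard result for N regions. The variance term depends on the assumption used (normality or randomization) and is not spelled out in the lecture, so it is left symbolic.

```latex
z = \frac{I - \mathrm{E}[I]}{\sqrt{\mathrm{Var}[I]}},
\qquad
\mathrm{E}[I] = -\frac{1}{N - 1}
```

As a rule of thumb for a two-sided test at the 95 percent level, an absolute value of z above about 1.96 would lead us to reject the null hypothesis.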

And so, if we're given a bunch of regions with a bunch of measurements,

we can calculate our Moran's I.

So, here's an example of the spatial distribution in X and Y.

So, at the location this is one,

two, three, four.

One, two, three.

So at X is one, Y is one, we had no values there.

At X is one Y is two. We have 4.55.

At X 1, Y 3 we have 5.54 and we can walk through

our Moran's I calculation to do this.

And we also have to calculate the spatial weights matrix as well.

So, we have 10 different spatial regions.

So we have a 10 by 10 weights matrix.

So, box number one is next to two and is next to four.

Notice, we didn't make it next to three, so we're doing what we call rook's contiguity.

Okay. Now, let's walk through number two together.

Box two is connected to one and it's connected to five in Rook's contiguity.

All right? So, take a minute to think about three.

Three is connected to four and six.

And so hopefully you can go through and see how they did the rest of this weights matrix.

So, it's also sometimes referred to as the adjacency matrix.

And this is for Rook contiguity.

So, you may say, well why didn't they want seven and one connected to three?

Again it was just the choice of Rook's contiguity.

So, this refers to movements in chess

where the rook can only move left and right up and down.

Queen's contiguity would have been like I've drawn here,

where every touching element, including the diagonals, is connected.

A similar measure to Moran's I is Geary's C. Geary's C,

you'll see a very similar formula.

We've got our two summations.

We've got our weights matrix and the value of

Geary's C though instead of going from minus one to one,

lies between zero and two.

Here, a value of one means no spatial autocorrelation.

If things are similar,

values smaller than one mean positive spatial autocorrelation, so clumpiness.

Values that are larger than one mean they have negative spatial autocorrelation.
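
For comparison, the standard form of Geary's C is written out below. It is reconstructed from the general definition rather than from the slide, using the same N, w_ij, and x bar as in Moran's I. Note that the numerator uses squared differences between pairs of regions instead of products of deviations from the mean.

```latex
C = \frac{(N - 1)\,\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(x_i - x_j)^2}
         {2\,\Bigl(\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\Bigr)\,\sum_{i=1}^{N} (x_i - \bar{x})^2}
```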

This measure was designed to be more sensitive to local spatial autocorrelation.

So, it gives us some different information than Moran's I.

It doesn't mean one is better or worse to use for spatial autocorrelation,

it's just another metric you can look at to gain

information, and it's calculated very similarly, needing all the same information.
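
Since Geary's C needs the same inputs as Moran's I, a sketch of it looks very similar. As before, this is a minimal NumPy illustration with a hypothetical function name and made-up data, not the course's own code.

```python
import numpy as np

def gearys_c(x, w):
    """Global Geary's C for values x (length n) and an n-by-n weights matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    diff_sq = (x[:, None] - x[None, :]) ** 2          # (x_i - x_j)^2 for every pair
    num = (len(x) - 1) * (w * diff_sq).sum()
    den = 2.0 * w.sum() * ((x - x.mean()) ** 2).sum()
    return num / den

# Hypothetical example: four regions in a row, each adjacent to its neighbours.
w_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])

print(gearys_c([1, 1, 5, 5], w_line))   # below one: positive autocorrelation
print(gearys_c([1, 5, 1, 5], w_line))   # above one: negative autocorrelation
```

For these made-up values the clustered pattern gives 0.5 and the alternating pattern gives 1.5, on either side of the no-autocorrelation value of one.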

Again, Moran's I is a global measure of this correlation.

Now, the individual components in a map can also be mapped and tested for

significance to provide an indication of clustering patterns within the study region.

But we may also be interested in these local examples

too, and so we have things like the Getis and Ord statistics. Unlike Moran's I,

Getis and Ord wanted to develop a statistic to

identify the degree to which high or low values cluster together.

So, whether or not we see patterns within

the map as opposed to a global pattern that may look like this.

These are global patterns.

Getis and Ord are looking for whether or not we see smaller chunks.

Whether we'll see a whole bunch of counties that have similar values in some region

here and here and you may have dispersion in between them.
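
One simple member of the Getis and Ord family is the local G statistic, which for each region takes the weighted sum of its neighbours' values as a share of the total over all other regions. The sketch below assumes that basic form and assumes positive values. In practice the statistic is usually standardized into a Z score before being mapped, which is not shown here. The function name and data are hypothetical.

```python
import numpy as np

def local_g(x, w):
    """Local G_i = sum_{j != i} w_ij * x_j / sum_{j != i} x_j (x assumed positive)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float).copy()
    np.fill_diagonal(w, 0.0)     # region i itself is excluded from its own sum
    num = w @ x                  # weighted sum of neighbouring values
    den = x.sum() - x            # total of all other regions
    return num / den

# Hypothetical example: six regions in a row with a pocket of high values on the left.
n = 6
w_line = np.eye(n, k=1) + np.eye(n, k=-1)
x = [10, 10, 1, 1, 1, 1]

g = local_g(x, w_line)
print(np.round(g, 3))
```

The regions inside the high-value pocket get the largest values, which is the local clustering of high values the statistic is meant to pick out.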

And so, there are a lot of different significance tests for

autocorrelation as well because we may want to know whether or

not this could have occurred by chance or how unlikely it is to have occurred by chance.

An autocorrelation coefficient can be

tested for statistical significance under assumptions

of normality and we also assume that values are independent and identically distributed.

So, we take observed patterns for a set of values and this might

just be one realization from all these possible random permutations.

So often people will do some Monte Carlo testing

comparing this to random distributions of other data points.
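
A Monte Carlo test of this kind can be sketched by shuffling the observed values across the regions many times and recomputing Moran's I each time. This is a minimal illustration assuming a one-sided test for clustering; the function names, the number of permutations, and the example data are all choices made for the sketch.

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I for values x and weights matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    return (len(x) / w.sum()) * (w * np.outer(z, z)).sum() / (z ** 2).sum()

def permutation_test(x, w, n_perm=999, seed=0):
    """Compare the observed Moran's I with values from random reshufflings of x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = morans_i(x, w)
    sims = np.array([morans_i(rng.permutation(x), w) for _ in range(n_perm)])
    # Pseudo p-value: share of shuffles at least as clustered as the observed map.
    p_value = (np.sum(sims >= observed) + 1) / (n_perm + 1)
    # Z score of the observed value relative to the simulated distribution.
    z_score = (observed - sims.mean()) / sims.std()
    return observed, z_score, p_value

# Hypothetical example: twelve regions in a row with steadily increasing values.
n = 12
w_line = np.eye(n, k=1) + np.eye(n, k=-1)
x = np.arange(n)

observed, z_score, p_value = permutation_test(x, w_line)
print(observed, z_score, p_value)
```

A small pseudo p-value says that very few random arrangements of the same values look as clustered as the observed one, so the pattern is unlikely to have occurred by chance.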

But the goal is to try to see if we can again analyze

the data first to point people to interesting things within their dataset,

and the reason we want to cover this in class is to give you

some idea of different analyses you might be

able to apply to spatial data to find things that are interesting within the data.

These are just some measures of global autocorrelation.

There are other local measures, like the Getis and

Ord statistic we talked about, and we're going to even talk about other metrics

as well for finding significant clusters of events and things within the data.

So, again these are all tools for your bag of data exploration tricks.

You can start thinking about,

how could you use different methods to analyze the data first to then show

people or help guide people to things that might be

interesting and important within the data. Thank you.
