[MUSIC]

We continue our look at different types of random sampling.

So remember, with a simple random sample, where we attach a known, equal,

non-zero probability of selection to each individual,

just by chance,

we may end up with a sample which is not fully representative of the population.

For example, our sample may tend to have a gender bias,

maybe there are more males or females within our sample.

Maybe it tends to be a somewhat younger or

older group relative to the population as a whole.

Maybe there are more of one nationality than another.

So although we like this concept of removing selection bias by choosing

the population members probabilistically, are there ways which we can try and

perhaps increase the chances of a representative sample?

Well, the answer of course,

is yes there are, and we will review these in this section.

So firstly, let us consider systematic sampling.

So remember, for all kinds of random sampling, we do require a sampling frame,

a list of all population members, numbered from 1 up to N,

where we let capital N denote our population size.

So for a systematic sample, we need something called a sampling interval,

which we can derive from both our population size,

as well as our desired sample size.

For example, let's imagine our population had 100,000 members.

So capital N = 100,000, and let's suppose we wish to sample 1% of those.

Namely, we'd like a sample size n, little n = 1,000.

So our sampling interval

refers to the population size divided by the sample size.

And hence, in this illustration, 100,000 population size divided by 1,000,

our sample size, will give us a sampling interval of 100.

So how do we then select this sample systematically?

Well, using our random or pseudo-random number generator,

we'll let the computer choose at random a number between 1 and

100, that is, between 1 and the sampling interval.

[SOUND] Once this value has been chosen, we would then select

that observation, and then every 100th observation thereafter.

So for example, if the computer gave us a random number, let's say of 23.

Then we would wish to observe the 23rd person in our list

then the 123rd, the 223rd, etc.

So you can see once this starting point has been randomly chosen,

it then fixes every other member of the sample because systematically we will look

at every 100th individual thereafter.
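
As a rough illustration of the procedure just described, here is a minimal Python sketch. The function name `systematic_sample` and the fixed seed are assumptions made for the example, and the sketch assumes the population size divides exactly by the sample size, as it does with N = 100,000 and n = 1,000.

```python
import random

def systematic_sample(population_size, sample_size, rng=None):
    """Return the 1-based list positions chosen by a systematic sample.

    Assumes population_size is an exact multiple of sample_size.
    """
    rng = rng or random.Random()
    interval = population_size // sample_size  # sampling interval, N / n
    start = rng.randint(1, interval)           # random start between 1 and the interval
    # The start fixes every other member: start, start + interval, ...
    return [start + i * interval for i in range(sample_size)]

# N = 100,000 and n = 1,000 give a sampling interval of 100.
positions = systematic_sample(100_000, 1_000, random.Random(0))
print(positions[:3])
```

Whatever starting point the generator returns, consecutive positions are exactly 100 apart, which is the sense in which the random start fixes the rest of the sample.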

So how could this perhaps increase the representativeness of our sample?

Well, imagine we considered ordering

the observations within our sampling frame by the characteristic of interest.

Perhaps let's say we knew the age of everyone in our sampling frame.

And let's say we order them from youngest to oldest.

So, if we wanted to solicit opinions or views on some topic,

maybe a political policy, say, and hence gauge support for that,

it would be helpful if we could have a broad cross section of the electorate by age.

So if they are arranged from youngest to oldest, then by systematically

considering, let's say, the 23rd, the 123rd, the 223rd, etc.,

we ensure we have a fairly broad cross section of ages,

from younger members in our population through to the more elderly members.

Of course though, one should be conscious of any cyclicality which may exist within

our data set, which might actually serve to just decrease the representativeness.

So an example of this might be,

let's say we recorded the daily sales turnover in a supermarket.

And hence we had our daily sales, Monday, Tuesday, Wednesday, Thursday, Friday,

Saturday, Sunday, going into the next week and weeks thereafter.

So, of course, with something like a supermarket,

there would be some cyclicality about certain days being more

popular to do your weekly food shop than others.

Perhaps days towards the end of the week,

families may wish to stock up on food maybe for the weekend.

So there, if we had a sampling interval let's say of 7,

then of course we would have the same day of the week.

It's that day's sales which would be observed each time.

For example, if we randomly had Monday as our starting point,

then every seventh day thereafter would give a series of Mondays.

And hence the sales turnover on a Monday may not represent what we see throughout

the course of a week.
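
The cyclicality problem can be seen in a small Python sketch. The helper `sampled_weekdays` is hypothetical, and the sketch assumes a daily sales series whose first entry falls on a Monday. An interval of 7 only ever lands on one weekday, whereas an interval that is not a multiple of 7 (10 is used here purely as a contrast) cycles through all of them.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def sampled_weekdays(start, interval, n_days):
    """Weekday names hit by a systematic sample of a daily series.

    Positions are 1-based and day 1 is assumed to be a Monday.
    """
    return [DAYS[(pos - 1) % 7] for pos in range(start, n_days + 1, interval)]

# 52 weeks of daily sales, starting on a Monday.
every_7 = set(sampled_weekdays(start=1, interval=7, n_days=364))
every_10 = set(sampled_weekdays(start=1, interval=10, n_days=364))

print(every_7)        # only one weekday appears
print(len(every_10))  # all seven weekdays appear
```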

So that was our look briefly at systematic sampling.

We will consider two more kinds of random sampling.

The next one is stratified random sampling.

Now this is really the probabilistic equivalent of quota

sampling, which we saw a couple of sessions ago,

which of course, was an example of non-random sampling.

Whereby we said, if we knew the distribution of some characteristic

in a population e.g. gender.

Let's say 50-50 split between males and females,

then we may wish to replicate that by setting quotas of 50% males, and

50% females within our observed sample.

But of course with quota sampling it was up to the individual researcher to choose

the people to take part in our survey.

And hence it would be liable and susceptible to a degree of selection bias.

So this is the probabilistic version of quota sampling, whereby things we might

use as a quota control, such as gender, maybe an age group,

nationality, you name it,

would be called stratification factors

in a stratified sampling situation.

So a nice example might be,

if we were interested in student satisfaction at a university.

So of course the university will have many students, but

they're each going to be studying at a particular degree level.

There may be the undergraduate students, maybe the master's students, and

let's say the PhD community as well.

And we may suspect that student satisfaction with their program of study

may depend on their level of study.

So if we simply took a simple random sample from

the registry database of all students at this university.

By chance, we might just end up with purely undergraduates or

maybe just master's students in our sample.

And hence, without the views of the other groups of the student community,

we may not get a representative snapshot of student satisfaction at

the institution as a whole.

So in this case we may look to stratify students at this university.

Let's say by level of study.

Namely, we have as our strata the different levels of programme:

the undergraduate students studying for a bachelor's degree, those studying for

a master's degree, and the PhD community as well.

Such that, with this stratification by program level, we are

partitioning the total list of students into what we call MECE,

mutually exclusive and collectively exhaustive groups.

Mutually exclusive means that students will belong to at

most one of those groups.

For example, you cannot be studying let's say for an undergraduate qualification and

a post graduate qualification at the same university at the same point in time.

And collectively exhaustive in that each student must belong to one or

other of those strata.

Hence, every student must be studying one of those degrees, and

at most one of those degrees.

So once we've segregated our students out by the level of study, we would

then just take a series of simple random samples, one from each of those strata.

And hence this guarantees we get a representative sample in terms of drawing

from the full cross section of levels of study within the university.

Of course, a natural question to ask is, how many should we choose from

each of those groupings, the undergraduate, Masters and PhD?

Well, for that we may wish to consider the relative size of the strata.

Let's imagine a university had 50% undergraduates,

40% master's students, and 10% PhD students.

So given that

the undergraduates would be the largest cohort of students at the institution,

it would make sense that they should equivalently make up 50%

of our overall sample size.

And similarly, 40% of the sample should be drawn from the master's community.

And the remaining 10% from the PhD community.

So this is so-called proportionate stratified sampling,

where we take into account the relative size of those groups.

Of course if each group was of the same size then we would

take exactly the same number of members from each of those strata.
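
To make the proportionate case concrete, here is a minimal Python sketch. The function names, the made-up student identifiers, and the overall sample size of 200 are all assumptions for illustration; the 50/40/10 split is the one from the example. Note that simple rounding is used, so for awkward proportions the allocations may not sum exactly to the requested sample size.

```python
import random

def proportionate_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to each stratum's size."""
    total = sum(strata_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}

def stratified_sample(strata, sample_size, rng=None):
    """Take a simple random sample from each stratum, sized proportionately."""
    rng = rng or random.Random()
    sizes = {name: len(members) for name, members in strata.items()}
    allocation = proportionate_allocation(sizes, sample_size)
    return {name: rng.sample(strata[name], allocation[name]) for name in strata}

# Hypothetical registry: 50% undergraduates, 40% master's, 10% PhD.
strata = {
    "undergraduate": [f"UG{i}" for i in range(5000)],
    "masters": [f"MSc{i}" for i in range(4000)],
    "phd": [f"PhD{i}" for i in range(1000)],
}
sample = stratified_sample(strata, 200, random.Random(1))
counts = {name: len(chosen) for name, chosen in sample.items()}
print(counts)  # {'undergraduate': 100, 'masters': 80, 'phd': 20}
```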

Of course one further consideration might be the standard deviation

of the characteristic of interest within each of those groups.

Because if we had a group which had quite a small standard deviation for

the characteristic of interest, that just means there's not

much variation of that characteristic within that particular natural grouping.

And hence there would be no need to sample a particularly large number

from that group because if you just observe a handful,

that's likely to give you a fairly clear picture of the group as a whole.

And hence, so-called disproportionate stratified sampling

would take into account not just the size of the strata but

also the degree of variation which exists within each group, i.e.

by taking into account the standard deviation

of the characteristic of interest in each group.
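
The lecture does not name a specific formula for the disproportionate case. One common choice, used here as an assumption, is Neyman allocation, where each stratum's share of the sample is proportional to its size multiplied by its standard deviation. The standard deviations below are made-up numbers purely for illustration.

```python
def neyman_allocation(strata_sizes, strata_sds, sample_size):
    """Allocate sample_size in proportion to (stratum size x stratum SD)."""
    weights = {name: strata_sizes[name] * strata_sds[name]
               for name in strata_sizes}
    total = sum(weights.values())
    return {name: round(sample_size * w / total)
            for name, w in weights.items()}

sizes = {"undergraduate": 5000, "masters": 4000, "phd": 1000}
# Hypothetical standard deviations of a satisfaction score in each stratum.
sds = {"undergraduate": 2.0, "masters": 1.0, "phd": 1.0}

allocation = neyman_allocation(sizes, sds, 300)
print(allocation)  # {'undergraduate': 200, 'masters': 80, 'phd': 20}
```

With a purely proportionate allocation the same 300 would split 150, 120 and 30, so the more variable undergraduate stratum gains sample at the expense of the less variable groups.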

And finally, we move on to cluster sampling.

So let's return to this case of investigating student satisfaction.

Now, of course, we may not be interested in a single university, but rather,

all students at any university in a particular country.

So we might choose different institutions,

different universities as so-called clusters.

Now, I agree that no two universities are identical, but

they will be fairly similar to each other.

Take two different universities:

they're both going to have some undergraduate students, master's,

and PhD students,

and no doubt studying a wide range of different disciplines as well.

So with cluster sampling, there are different forms of cluster sampling.

But in the simplest case, we would consider dividing our total population,

for example, students, into these mutually exclusive and

collectively exhaustive clusters, i.e., the different universities.

And then to try and save some time and

money, we wouldn't necessarily consider students at all institutions, but

rather we could take a random sample of all of these clusters.

For example, using simple random sampling.

And once that particular subset of universities within a country has

been chosen, then in a one-stage cluster sample,

we would then consider all students within those chosen universities.

But of course we know universities can have large numbers of students, and

hence we may wish to have multiple stages in our sampling,

whereby within those selected institutions we

introduce some multistage sampling and do some further stratification.

From those selected institutions,

we then may wish to stratify by undergraduate programs, master's, and PhD.

And of course we could refine this process at many stages if so required.
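
A minimal Python sketch of the one-stage and multistage ideas follows. The function names and the six made-up universities are assumptions for illustration. For simplicity, the second stage here is a plain simple random sample of students within each chosen university rather than the stratified second stage described above.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, rng=None):
    """Randomly choose clusters, then keep every member of each chosen cluster."""
    rng = rng or random.Random()
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng=None):
    """Randomly choose clusters, then take a simple random sample within each."""
    rng = rng or random.Random()
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name],
                             min(n_per_cluster, len(clusters[name])))
            for name in chosen}

# Six hypothetical universities with 50 students each.
universities = {f"University {c}": [f"{c}-student-{i}" for i in range(50)]
                for c in "ABCDEF"}

one_stage = one_stage_cluster_sample(universities, 2, random.Random(2))
two_stage = two_stage_cluster_sample(universities, 2, 10, random.Random(2))
print({name: len(members) for name, members in one_stage.items()})
print({name: len(members) for name, members in two_stage.items()})
```

The one-stage version surveys all 50 students in each of the two chosen universities, while the two-stage version surveys only 10 per chosen university, which is where the saving in time and money comes from.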

So that's just to give you a taste of the different kinds of sampling techniques

out there.

Please do review the online materials for this MOOC, which consider the relative

advantages and disadvantages of the random versus non-random forms of sampling.

And the different constituent parts thereof.

[MUSIC]