0:56
Why is that?
Well, the first two techniques, exact matching and probabilistic matching,
they assume that the linked records belong to the same unit.
So ideally, you would see a record in data set one from Frauke Kreuter,
born (not going to tell you when), and some other identifying variable.
Or maybe my first name and last name combined is already unique enough that
you would, in the second data set, find that case as well, right?
1:30
If it's the social security number, that presumably is unique for
a particular person, and you could find that in both data sets.
So that would be an exact matching.
Sometimes, if you don't have as good an identifier, you can't do that.
And so you need to move to probabilistic matching.
Statistical matching is a totally different game.
The key difference here is that you try to find similar cases, even though
these are completely different data sets.
So, I'm going to talk in more detail about these, but
I wanted to make that distinction clear so that you know the broader
2:10
difference between these techniques as we revisit this.
So on exact linkage, as we said,
the link is established based on a single unique identifier.
Although, you could imagine creating a single unique identifier out
of several variables, right?
So first name, last name, birthdate, birthplace.
That together could be a single unique identifier for a case.
2:57
The purely deterministic approach requires an exact one-to-one match,
where you have one case here, one case there, you combine the two.
The key here is, if you know eventually you want to link to
a certain administrative database, that you request that
particular identifier from the respondent prior to linkage.
So, it's good to plan your whole data collection at once.
And make sure you have the proper identifying variables for
each data set in place, as you go into this.
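To make that concrete, here is a minimal sketch in Python of what building such a composite identifier and doing a strict one-to-one merge could look like. The field names and toy records are invented for illustration, and pandas is just one convenient way to do it.

```python
import pandas as pd

# Toy survey and administrative records; field names are made up for illustration.
survey = pd.DataFrame({
    "first": ["Frauke", "Anna"],
    "last": ["Kreuter", "Schmidt"],
    "dob": ["1900-01-01", "1985-06-30"],   # placeholder birthdates
    "income_reported": [50000, 42000],
})
admin = pd.DataFrame({
    "first": ["Frauke", "Peter"],
    "last": ["Kreuter", "Mueller"],
    "dob": ["1900-01-01", "1972-03-15"],
    "benefit_record": [True, False],
})

# Build a single composite identifier from several fields, then do a strict 1:1 merge.
for df in (survey, admin):
    df["key"] = (df["first"].str.lower() + "|" +
                 df["last"].str.lower() + "|" + df["dob"])

linked = survey.merge(admin[["key", "benefit_record"]], on="key",
                      how="inner", validate="one_to_one")
print(linked)
```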
3:35
Now, this also assumes that you've recorded the identifier without error.
If I misremember my birthdate, or you have a telephone interview,
and you mishear my name, and write it down a different way,
then it will be very hard to find me in the database.
And you will need to move to something that's more probabilistic or
predictive linkage.
Now, the term predictive linkage, I should say, is not necessarily
a fixed term the way probabilistic record linkage is.
These approaches come more out of computer science,
using techniques like machine learning, or
techniques from database and information retrieval systems.
They are basically techniques that are used for prediction.
That's why I use this kind of labeling here.
4:54
What happens when you do this kind of probabilistic linkage is
that you use a set of available attributes for linking.
That could be personal information, names, addresses, date of birth, and so on.
And then you calculate match weights for attributes.
So this might be a little bit confusing, because I just said earlier that, for exact
linkage, you could create a unique key out of those variables.
That's right; in that case I actually make a string out of the entire
set of variables, which then is unique.
Here, the notion is more that you could link on names, on addresses,
on birthdates, and each of them should have a different weight.
Because you assume every one of these variables could have an error.
So my name could be typed wrong, my birthdate could be wrong,
the birthplace could be wrong, the address could be wrong.
And so you wouldn't want all of them to go equally strongly into the probabilistic
record linkage algorithm, because you would assume, well,
a name change is less likely to happen, right?
Whereas an address change, maybe more so, right?
So if I have the same name and birthdate but
two different addresses, then maybe the address shouldn't have as much weight.
Or depending on what it is, maybe you have different decisions going on here, right?
Or you think okay, well typos in name happen quite frequently,
but in the database I have, addresses are always validated and so
therefore the address is good.
Then you might have a different rationale here, right?
So you gotta think ahead of time what that might be,
what these weights are for the particular variables.
And then eventually you sum over all of them and
you create a score on which you base the merge.
Not going to go into the statistical details here.
That's not the right course for this.
But it gives you a sense of what's behind this technique.
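Just to give a flavor of that scoring idea, here is a minimal sketch along Fellegi-Sunter lines: each attribute gets an agreement and a disagreement weight, and the weights are summed into a match score. The m- and u-probabilities, field names, and toy records below are invented for illustration, not estimated from real data.

```python
import math

# Illustrative per-attribute probabilities; in a real Fellegi-Sunter setup these
# are estimated, here they are just plausible guesses.
M_PROB = {"name": 0.95, "birthdate": 0.90, "address": 0.70}   # P(agree | true match)
U_PROB = {"name": 0.01, "birthdate": 0.05, "address": 0.10}   # P(agree | non-match)

def match_score(rec_a, rec_b):
    """Sum log-likelihood-ratio weights over the comparison attributes."""
    score = 0.0
    for field in M_PROB:
        if rec_a[field] == rec_b[field]:
            score += math.log2(M_PROB[field] / U_PROB[field])              # agreement weight
        else:
            score += math.log2((1 - M_PROB[field]) / (1 - U_PROB[field]))  # disagreement weight
    return score

a = {"name": "frauke kreuter", "birthdate": "1900-01-01", "address": "main st 1"}
b = {"name": "frauke kreuter", "birthdate": "1900-01-01", "address": "park ave 5"}
print(match_score(a, b))   # still a high score despite the address disagreement
```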
And then likewise, on the predictive linkage approaches,
most of them are sort of driven by machine learning techniques.
There are a lot of papers by Bill Winkler from the US Census Bureau,
that you can retrieve from the US Census Bureau website,
on using these techniques for various data linkage endeavors.
And as always in machine learning, you can use both techniques,
supervised learning and unsupervised learning.
Supervised learning is basically like regression techniques:
you predict something using training data. So you have a subset of your data set
where a link already exists and you know, okay, these really are true matches.
And then you try to learn from these training data
what the good predictive variables for these matches are, and use those.
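As a rough sketch of that supervised idea, assuming scikit-learn is available: candidate record pairs are described by agreement indicators, a small set of labeled pairs serves as training data, and a logistic regression learns which comparisons predict a true match. The training pairs below are invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Each row describes a candidate record pair by agreement indicators
# (name agrees, birthdate agrees, address agrees); the label says whether
# the pair is a known true match. These training data are invented.
X_train = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],   # labeled true matches
    [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],   # labeled non-matches
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

clf = LogisticRegression().fit(X_train, y_train)

# Predicted match probability for a new pair that agrees on name and
# birthdate but not on address.
print(clf.predict_proba([[1, 1, 0]])[0, 1])
```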
8:20
Now, the challenge with these techniques is that if there is no
unique identifier available, then this can be quite difficult.
But more likely, the real-world data are dirty,
as I said before. Now, in the Fellegi-Sunter approach,
you want to assign different weights even if you are not dealing with dirty data,
because it might be more important to match on name and birthdate and
not on address, because addresses change, right?
And that's not even accounting for typos.
But if all of this is filled with typographical errors and
missing values or out-of-date values and things like that,
a lot of preprocessing might be necessary.
So let's take my name again, right?
So sometimes you will see Frauke Kreuter, PhD.
Sometimes you will see Doctor Frauke Kreuter, and
sometimes you will see Professor Doctor Frauke Kreuter,
because the Germans like to spell out the full titling.
And so if you were to take that string that's in your data set and
match it with the same string that only has Frauke Kreuter,
you wouldn't even see that this is the same case, right?
So there would need to be parsing going on, where you drop the professor and
you drop the doctor and then you merge these two pieces, right?
So there's a lot of work going on ahead of time.
Don't underestimate that when you try to create a record linkage project.
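Here is a tiny sketch of that kind of parsing step, with an illustrative and certainly incomplete list of titles to strip before comparing name strings.

```python
import re

# Titles to strip before comparing name strings; this list is illustrative,
# not exhaustive.
TITLES = {"prof", "professor", "dr", "doctor", "phd"}

def normalize_name(raw):
    """Lowercase, drop punctuation, and remove academic titles."""
    tokens = re.sub(r"[.,]", " ", raw.lower()).split()
    return " ".join(t for t in tokens if t not in TITLES)

print(normalize_name("Professor Doctor Frauke Kreuter"))   # 'frauke kreuter'
print(normalize_name("Frauke Kreuter, PhD"))                # 'frauke kreuter'
```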
9:46
Also there could be imperfect matches obviously,
as a function of this dirty data.
Now, the other piece that's problematic is
if you really try to find a one-to-one match, right?
This doesn't scale, right?
Because just think about it: you take a case and
then you search through the entire other database, where of course everybody is
there and available. This is computationally very, very effortful,
and it can very quickly end up exceeding
any computational power you have.
So there are techniques out there that reduce
duplicates in the data set beforehand, and techniques like blocking,
where you only search within a certain area.
So if you decide, okay, city is the important
10:35
variable, then you would just search within Washington, D.C.
rather than across everybody, and things like that.
So there are techniques to make that computational task a little easier,
something to take a look at there.
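As a minimal sketch of blocking on a single variable, city in this case: the toy records are invented, and real blocking schemes are usually more elaborate (multiple passes, phonetic keys, and so on).

```python
from collections import defaultdict
from itertools import product

# Toy records; blocking on city means we only compare pairs that share a city.
file_a = [{"id": 1, "name": "frauke kreuter", "city": "washington"},
          {"id": 2, "name": "anna schmidt", "city": "berlin"}]
file_b = [{"id": 9, "name": "frauke kreutter", "city": "washington"},
          {"id": 8, "name": "peter mueller", "city": "munich"}]

def block_by(records, key):
    """Group records into blocks by the value of one blocking variable."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks

blocks_a, blocks_b = block_by(file_a, "city"), block_by(file_b, "city")

# Candidate pairs only within matching blocks, instead of all len(a) * len(b) pairs.
candidates = [(a, b)
              for city in blocks_a.keys() & blocks_b.keys()
              for a, b in product(blocks_a[city], blocks_b[city])]
print(len(candidates))   # 1 candidate pair instead of 4 full comparisons
```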
There are textbooks out there, like the one from Christen from 2012, published with
Springer, and then an intro chapter in the Taylor and Francis edited volume from
Ian Foster and others that's coming out in 2016, which Josh Tokle and Jeff wrote.
They give you a good sense of this.
We'll put the links up on the Coursera website when the book is out;
obviously something interesting for you to look at.
11:46
Rainer Schnell, over at the University of Duisburg and
currently in London with the European Social Survey,
and the institute the ESS is housed in,
they're working on research that's using privacy-preserving record linkage,
so that you don't even need to use that identifying information in the clear; it gets encrypted,
so that it can be more easily used for these record linkage endeavors.
And then, of course,
there is the issue of consent, which is the next piece we're going to talk about.
Now circling back to the preprocessing, Tokle & Bender,
they have a workflow schematic here that the German Record Linkage Center
sort of used to portray all the work that needs to be done.
And basically you think of having a raw data file, where you first have to
think about and capture the data definitions and be clear what these are.
You might have birthdates stored as MM/DD/YY in one data set,
and in the other data set, because it comes from Europe,
as DD/MM/YY, or even with a four-digit year.
And of course all of that needs to be made common,
so you need to learn about these data definitions ahead of time.
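A small sketch of that harmonization step, assuming you know each source's date format and want to bring everything to one common representation:

```python
from datetime import datetime

def to_iso(date_str, fmt):
    """Convert a birthdate recorded in a source-specific format to ISO 8601."""
    return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")

# US-style MM/DD/YY in one file, European DD.MM.YYYY in the other;
# both end up in the same comparable representation.
print(to_iso("06/30/85", "%m/%d/%y"))    # '1985-06-30'
print(to_iso("30.06.1985", "%d.%m.%Y"))  # '1985-06-30'
```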
Then there's the parsing; I already gave that example with the names.
So extensive data cleaning happens, and
in some instances you also need to normalize, which means that you break up
a data table that has a lot of redundancy into separate data tables.
So, for example, if you have survey data with one record for each household
member and then an additional record for the household information, it would often
be helpful to spread these tables out into a household-level file
and a person-level file. Then you might match address information, or
geocoded positions, to the household-level file, because then
you have just one row, not duplicated rows, for each geocode,
or address for that matter, and much faster record linkage is possible that way.
And the personal information
you just keep in a separate person-level file.
So that would be the step of normalization.
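As a rough illustration of that normalization step, assuming a flat file where the household information is repeated on every person row (the variable names are made up):

```python
import pandas as pd

# One flat survey file with household information repeated on every person row.
flat = pd.DataFrame({
    "hh_id":   [1, 1, 2],
    "person":  [1, 2, 1],
    "age":     [42, 40, 35],
    "address": ["main st 1", "main st 1", "park ave 5"],
})

# Normalize: household-level attributes go into one table (one row per household),
# person-level attributes into another. Geocodes or linked addresses then only
# need to be matched once per household.
households = flat[["hh_id", "address"]].drop_duplicates()
persons = flat[["hh_id", "person", "age"]]
print(households)
print(persons)
```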
14:08
Then you might have new variables that you derive,
because you take certain variables and combine them, or what have you.
You might decide that you only link on a subset, so
there could be filtering going on.
And then of course, the data linkage itself.
We talked about saving resources through record linkage, but
these are all elements that add additional resources and time,
the programming time that you might need to account for in order to do this.
There's software that can help you with this.
On the left you see a bunch of packages,
the R package RecordLinkage probably being a popular one.
But look at the others too,
see what speaks to you and fits your purposes best.
I'm most familiar with the Merge ToolBox,
MTB, which is a Java application developed at the German Record Linkage Center.
The last version update is from November 2012.
This is free for anyone who does research.
Unfortunately not for non-academic purposes, but
you can find information on their website, and you can get
advice from the German Record Linkage Center on how to use that software.
15:30
Here, with statistical matching, you combine distinct data sources, okay?
So remember earlier we said we have the same cases and they are linked together.
Here we have a situation where we have disjoint data sets,
and you try to find well-fitting cases across the two data sets.
Here records are linked if they are similar, based on similarity measures.
But how you create these similarity measures is up to you, right?
There are a lot of different techniques available.
Nearest neighbor would be one, regression-based techniques another;
many use propensity score matching.
There is a whole set of literature that you wouldn't even see when you look
at the record linkage literature, but that you can read up on if you look closely.
16:29
Causal inference literature and
anything that has to do with generalization of experimental results.
So think about it: you have an experiment that NIH, the National Institutes of Health,
or someone else does, and it's a small set of cases.
Or, actually more common,
you have some cases that have undergone
a treatment, but weren't randomized.
And you would like to see, well,
what would have happened had these people not undergone this treatment?
And so you're looking for a control group to compare them to.
So you're looking for cases that are just like the ones that have been exposed
to a certain training, have been exposed to labor market participation program or
something like that.
And you then later on want to analyze what the effect is, comparing those two groups.
So that would be a typical application for these statistical matching techniques and
as I said, there's a ton of literature out there.
The steps, when you do that, are also pretty straightforward.
You have to determine a set of covariates to use.
You define a distance measure.
You select a matching method and carry it out.
And then most importantly, you diagnose the quality of your matches,
17:54
often graphically, but other statistical techniques are there too.
And then you carry out the analysis on the matched samples.
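To illustrate those steps, here is a minimal sketch of nearest-neighbor matching on an estimated propensity score, assuming scikit-learn and NumPy. The covariates and treatment assignment are simulated, the matching is greedy with replacement, and the only diagnostic shown is propensity score balance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated covariates (say, age and prior earnings, standardized) and a
# treatment indicator whose probability depends on the first covariate.
X = rng.normal(size=(200, 2))
treated = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

# Steps 1-2: covariates chosen above; the "distance" is the estimated propensity score.
pscore = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 3: greedy 1-nearest-neighbor matching of each treated unit to a control
# (with replacement, for simplicity).
treated_idx = np.where(treated == 1)[0]
control_idx = np.where(treated == 0)[0]
matches = {t: control_idx[np.argmin(np.abs(pscore[control_idx] - pscore[t]))]
           for t in treated_idx}

# Step 4: a very simple diagnostic -- mean propensity score among treated units
# versus their matched controls should be close after matching.
matched_controls = [matches[t] for t in treated_idx]
print(pscore[treated_idx].mean(), pscore[matched_controls].mean())
```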
Here is just one visual graph that sort of shows the treatment and control units, and
in gray on the left side you see all the cases for which no matches were found.
So later on in this analysis that Elizabeth Stuart and
Don Rubin did in this paper, they would do this analysis only for
the cases that had treatment and control units that could be matched to each other.
Any of the papers by Elizabeth Stuart at Johns Hopkins University
are very accessible on this topic.
There are a couple of overview papers that you can see on her website, and
I highly recommend them if you want to dive into that particular topic.