Welcome. Let's talk a bit about the challenges in record linkage. No matter how large your data are, there are some traditional challenges when linking records from multiple data sources. One big one is the lack of unique identifiers. Take, for example, Julia Lane, a dear colleague of mine and a professor at NYU's Wagner School in New York. If I search for her on the Internet using Google image search, two other pictures immediately appear. They share one feature with her, all three are women, but they are clearly different people. That's not surprising, because Julia Lane is not a particularly unique name. The same can happen with firms. Summit Consulting is a statistical consulting company here in DC, but if we look for Summit Consulting on the Internet, we will find another company in Texas. So the lack of unique identifiers across different files can be a real challenge for anyone who wants to make sure that the same entities are matched to each other.

Another challenge is what's often referred to as dirty data. Data can be quite messy due to typographical errors and variations in how records are captured. Assume, for example, that marital status changed over time, or that someone moved, so the address will differ across records even though it is the same entity. Values can be out of date in a similar way. And of course values can be missing, in which case we hope we can fill them in from one database to the next. Different coding schemes often appear when we deal with dates; that can be a big problem, because dates are recorded in one format in one data source and in a different format in the next. On the right-hand side you see some examples: John Smith spelled differently, capitalized and not, Charles abbreviated to Chuck, the same with William and Bill, often the same person across data sets, and the list goes on. Street names, too, can be recorded in very different ways, Road or Drive being abbreviated, and so on. And you see in this example a couple of different ways to write down dates as they might appear in different databases. These examples on the right are taken from an NIH resource; the full reference is given at the bottom of the slide, and you can look at this online book yourself.

Another challenge we are aware of, and that often arises, is the issue of privacy and the nature of the data to be linked. On the right-hand side you see a chart from the Eurobarometer, where people were asked how sensitive they would consider different types of information to be. You can see that financial information and medical records are considered sensitive by a large fraction of the respondents in this survey. In many countries you have to obtain informed consent, and we have a separate manual on that. And of course, the intended use of the data affects all of these issues.

In the big data context, a couple of new challenges arise. One important one is scalability. The naive approach of comparing each record to every other record explodes: there are a lot of records in all of these databases, and if you try to do pairwise comparisons to decide whether two records refer to the same entity, that will take a very long time. So we need to move to approaches that are much more efficient, and research is still needed on those techniques. Also, we sometimes have networks and complicated relationships that need to be taken into account when matching these records.
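To make the name, street, and date variations just described more concrete, here is a minimal standardization sketch of the kind of cleanup that preprocessing typically involves. It is only an illustration under simple assumptions: the nickname table, street abbreviations, and date formats are small made-up examples rather than complete reference lists, and real systems use much larger lookup tables and more careful parsing.

```python
from datetime import datetime
from typing import Optional

# Illustrative lookup tables (assumptions, not a complete reference list).
NICKNAMES = {"chuck": "charles", "bill": "william", "liz": "elizabeth"}
STREET_ABBREV = {"rd": "road", "rd.": "road", "dr": "drive", "dr.": "drive",
                 "st": "street", "st.": "street"}
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y", "%B %d, %Y"]

def standardize_name(name: str) -> str:
    """Lower-case, strip periods, and map common nicknames to a full form."""
    tokens = name.lower().replace(".", "").split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

def standardize_street(address: str) -> str:
    """Expand common street-type abbreviations."""
    tokens = address.lower().split()
    return " ".join(STREET_ABBREV.get(t, t) for t in tokens)

def standardize_date(value: str) -> Optional[str]:
    """Try several input formats and return an ISO date, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable dates for manual review

print(standardize_name("Chuck Smith"))      # charles smith
print(standardize_street("12 Main Rd."))    # 12 main road
print(standardize_date("01/23/1985"))       # 1985-01-23
```

The point is simply that variants such as Chuck versus Charles, or 01/23/1985 versus 1985-01-23, can be mapped to a common form before any comparison between files takes place.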
And perhaps the biggest challenge is that we often do not know the true entity, so there aren't really good training sets available for the algorithms to learn what is a correct match and what is not. This is why there is a lot of new research being done in this field at the moment.

The extended record linkage process, as depicted by Bender and his colleagues, is therefore not a one-step affair where you simply compare data file A to file B. Instead, a lot of preprocessing goes on in both files before you compare them. The preprocessing could include the parsing of individual names, correction of typos, anything of that nature. Ideally, you also reduce the search space: if you know that someone is in Chicago, maybe you can limit the search to the Chicago area, and this is referred to as blocking. You then do your comparisons within those smaller sets before classifying record pairs as true links or non-links. To put this in perspective, just as with a lot of the analyses we do, roughly 75% of the work, according to Gill in 2001, goes into this cleaning and parsing step, roughly 20% into checking that the matching process is correct, and only 5% into the actual linkage effort. The importance of preprocessing can't be overstated. I'd like you to read these two quotes by Winkler on your own.

One last note on identifiers. Typically these are first and last name, address, birth date, and the like, and across different records you might have different identifiers. As I said, if the records come from multiple sources, you might see changes, different spellings, and so on. It's a good idea not to throw these variants away, even if you decide there is a core version that belongs to all of them, because if you add additional data sources later, the original identifiers can help you link new records to one of these varieties in your data set. Variation within a given unit can arise almost anywhere, and as I said, it's helpful to keep those original values.

In summary, Christen for that reason makes this process even more elaborate, I would say, allowing for a couple of feedback loops that look at the matches, the non-matches, and the potential matches, together with a clerical review, and then see whether you can improve the process going forward. How you get to matches, non-matches, and potential matches is what we'll address in the next segment.
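To illustrate the blocking, comparison, and classification flow just described, here is a minimal self-contained sketch. The records, the blocking key, the similarity function, and the thresholds are all illustrative assumptions on my part, not the actual procedures of Bender or Christen; real linkage systems use carefully tuned comparison functions and decision rules.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Toy records pooled from two hypothetical files (IDs starting with A and B);
# "city" serves as the blocking key, mirroring the "limit the search to
# Chicago" idea from the lecture.
records = [
    {"id": "A1", "name": "charles smith",  "city": "chicago"},
    {"id": "A2", "name": "julia lane",     "city": "new york"},
    {"id": "B1", "name": "charles smith",  "city": "chicago"},
    {"id": "B2", "name": "charlene smith", "city": "chicago"},
    {"id": "B3", "name": "julia lane",     "city": "boston"},
]

def block_by(recs, key):
    """Group records that share a blocking key; only pairs inside the same
    block are compared, which shrinks the quadratic search space."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[rec[key]].append(rec)
    return blocks

def name_similarity(a: str, b: str) -> float:
    """Generic string similarity, standing in for Jaro-Winkler or similar."""
    return SequenceMatcher(None, a, b).ratio()

# Compare within blocks and sort pairs into links, potential links
# (sent to clerical review), and non-links.  The thresholds are arbitrary.
for key, block in block_by(records, "city").items():
    for r1, r2 in combinations(block, 2):
        score = name_similarity(r1["name"], r2["name"])
        if score >= 0.9:
            label = "link"
        elif score >= 0.5:
            label = "potential link -> clerical review"
        else:
            label = "non-link"
        print(key, r1["id"], r2["id"], round(score, 2), label)

# Note: the two "julia lane" records fall into different blocks (she moved),
# so they are never compared; blocking on an error-prone key can miss
# true matches.
```

The sketch also shows why the classification step typically produces three outcomes rather than two: clear links, clear non-links, and an in-between band of potential matches that is passed to clerical review, which is exactly the feedback loop in Christen's depiction.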