We have defined data integration and data fragmentation. We know the problem is that data reside in disparate sources and often need to be integrated or harmonized. Then we moved into a more detailed section on data mapping. Now, we will build on data mapping and get even further into the details of one important analytical process that is almost always an issue with data integration: entity resolution, or record linkage. This topic is a form of data mapping; we want to link data from different data sources. What we are doing is getting deeper into the technical language of how people talk about and perform complex linking and mapping tasks. This lesson will provide you with the definition of record linkage and show you what processes are necessary to perform effective linkages. Once complete, you will be able to communicate the technical terms that are used to describe and perform record linkages.

Entity resolution and record linkage are two terms that generally mean the same thing. The objective of this process is to merge records for real-world entities captured in one database with records in other databases. For example, if we want to improve patient care, we might need to merge administrative data with clinical data. The analyses, however, depend on data from each domain and on how they are linked by some type of patient ID. Sometimes this is easy, because concise IDs with the same structure are used in both databases. However, there are often different ways to define entities. In more specific terms, entity resolution is the process of, first, finding non-identical duplicates among data sources and, second, merging the duplicates into a single record or entity.

First, let's provide some concrete examples for entity resolution. An entity is a real-world object such as a specific patient, provider, or facility. Resolution is when programmers or analysts decide whether specific objects in the databases are the same real-world entity. For example, is provider A in both databases really the same real-world provider of health care services? People come up with new technologies and processes and use a variety of names. Thus, you'll hear people talk about entity matching, record linkage, and duplicate identification. In addition, in a few minutes, I will mention merging, matching, and linking. In general, these are all the same conceptual process, but sometimes there are important differences. For example, a match occurs if records refer to the same entity. This is usually unknown, and determining it is the objective of the merging process. Thus, one might talk about the likelihood that the records are a match. A link is when the analyst decides that the records refer to the same entity. This is a known outcome of the merging process. For example, one might talk about a linked dataset.

With better definitions in hand, let's now talk about the attributes that define entities. It is important to note that classes of objects in the real world have attributes. Birds have feathers; people have names and addresses; providers and patients have names, IDs, and addresses. In the domain of entity resolution, attributes of an entity become identifiers of specific real-world objects. This means that the values of attributes describe particular records in a database. For example, people can be defined by their name, age, address, and associated IDs. One might say that a name is all one needs to define a person, but a database might have 20 different people with the same name of John Smith.
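To make the idea of identifying attributes concrete, here is a small Python sketch; the records, IDs, and field names are invented for illustration only, not taken from any real database.

```python
# A small illustrative sketch (made-up records and field names) of why a name
# alone is a weak identifier while additional attributes discriminate.
records = [
    {"id": "A-001", "name": "John Smith", "dob": "1957-03-02", "city": "Sacramento"},
    {"id": "B-417", "name": "John Smith", "dob": "1988-11-19", "city": "Pittsburgh"},
]

# Grouping on name alone lumps two different people into one "entity".
by_name = {}
for rec in records:
    by_name.setdefault(rec["name"], []).append(rec["id"])
print(by_name)        # {'John Smith': ['A-001', 'B-417']}

# Adding date of birth keeps the two real-world people apart.
by_name_dob = {}
for rec in records:
    by_name_dob.setdefault((rec["name"], rec["dob"]), []).append(rec["id"])
print(by_name_dob)    # two separate keys, one per real person
```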
Thus, some attributes are better than others at discriminating between different entities, and name is usually better than age at identifying specific people. As another example, street addresses are usually more discriminating among specific people than the more general information about the city they are from.

Let me show you an example. Entity resolution is the process of figuring out which records represent the same thing and merging the relevant records together. In this example, all of these records refer to the same real-world object: Brian Paciotti, a person in Pennsylvania. However, the database records are not identical. If the structure of the data is different, it becomes even harder.

Another question might be, why should we merge data? Well, it is obvious here, since we are talking about the need to integrate disparate data sources, and we often need to merge real-world entities among different sources. In other words, clinical and organizational decisions require information that can only be found once data from different sources are integrated. To achieve these objectives, records or entities from a variety of data sources are merged. This process is confusing because the words matching and linking get used interchangeably. The next question is, what entities are merged? Usually, data records or rows, but the important concept is the real-world entity that we want to merge among data sources: for example, the patients, the providers, or the facilities.

With the merging process, it is important to first identify any duplicates within each source. It can get confusing when merging datasets if the same entities are included more than once. If data source one has two duplicated John Smith patient records, a merge with the other data source will duplicate the John Smith records there as well. Duplicates cause analytical programmers a great deal of pain.

I will provide another merging example. Imagine a clinical system, such as a hospital, whose goal is to merge patient records from an EHR system with death information from a state vital statistics file. Clinical records from a patient demographics table within the EMR must be merged with demographic information from the state vital statistics file, such as name, address, and social security number. The analysts will identify possible matching records using the attributes of name, address, and social security number. Once the merge is complete and the process has been validated, the programmer can load the linked data into the warehouse for analysis. Although errors could still be present, the analysts can then present the outcome of the project, in this case the linked records.

Now, let's compare matching approaches: deterministic versus probabilistic. For deterministic matching, all the chosen variables or fields that define attributes must match exactly. Thus, in deterministic matching, you have criteria which are applied to each record pair. For example, you can look for exact matches on first name, last name, address, and gender. If the pair of entities or records satisfies the criteria, it is a link; if not, it is a non-link. The process is easily automated by a computer program. The example here shows exact matches on attributes for a person who was in two facilities. This image provides an example of entities that should match but do not because of data quality issues and missing data. In the first example, a deterministic match would not link the records because of a mistake in how the SSN was typed. In the second example, the match would not occur because the SSN was missing in Facility 2. When we manually review entities, we use intuition to help us identify positive matches for records containing slight variations or missing information between the two files. In this example, there is a typo in the social security number due to the transposition of digits. A manual reviewer might see this and deem the pair a match.
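As a minimal sketch of the deterministic rule just described, and of how a single typo or a missing value blocks a link, here is a short Python example; the field names and records are hypothetical.

```python
# A minimal sketch of deterministic matching: every chosen field must agree
# exactly, so one transposed digit in the SSN is enough to block the link.
# Field names and record values are made up for illustration.

MATCH_FIELDS = ["first_name", "last_name", "ssn", "gender"]

def deterministic_match(rec_a, rec_b, fields=MATCH_FIELDS):
    """Return True only if every chosen field is present and agrees exactly."""
    for field in fields:
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None or a != b:
            return False
    return True

facility_1 = {"first_name": "BRIAN", "last_name": "PACIOTTI",
              "ssn": "123-45-6789", "gender": "M"}
facility_2 = {"first_name": "BRIAN", "last_name": "PACIOTTI",
              "ssn": "123-45-6798", "gender": "M"}   # transposed digits in SSN

print(deterministic_match(facility_1, facility_2))   # False: the typo blocks the link
```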
In probabilistic matching, the objective is to estimate the probability, or likelihood, that two entities or records are the same and hence can be linked. Probabilistic matching does not require exact matches among the fields. Unlike deterministic matching, concepts from probability theory are used for the linkage process. This is a much more complex method, but it is more likely to achieve matches when there are errors or other data quality problems.

Now, let me go through some of the general steps used in probabilistic matching. In step one, it is important to identify fields or variables for matching. For example, which attributes discriminate among various records and have high data quality? In step two, one can use computer code or a linkage tool to identify records that are likely matches. Many software programs, such as Link Plus, are designed for linking data. In step three, analysts calculate a score that indicates, for each record pair, the likelihood that both records refer to the same entity. Software can create what are known as M and U probabilities. The M probability, or match probability, is the chance that a field agrees when the records truly refer to the same entity; it is similar to sensitivity. The U probability, or unmatched probability, is the chance that a field agrees by coincidence when the records refer to different entities; one minus the U probability is similar to specificity. The scoring algorithm turns these probabilities into weights that are calculated for each attribute comparison. The total score for a linkage between any two records is the sum of the scores generated from matching the individual fields. As mentioned earlier, some attributes have more discriminating power and thus contribute more to the final score. For example, discrimination between different entities improves when records are linked using more uncommon data values (such as the name Paciotti versus Smith), more specific variables (such as social security number and address as compared to gender and race), and multiple variables rather than one. For example, an analyst could use social security number, name, and gender rather than simply the patient's name.

Moving on with the merging process, in step four it is a good idea to sort the output linked dataset by total score. Then, in step five, it is necessary to define a cut-point for determining what will be considered a matched record. In this process, it is useful to set aside matched pairs that are unlikely to be true linked entities. With this strategy, the set of matched pairs near the cut-point may be more uncertain, and these will be good candidates for manual review in the final step.

Although I won't go into the details here, it is important to mention the concept of blocking. With so many comparisons, large files can make impossible resource demands. For example, two files with only 10,000 rows each can lead to 100 million comparisons. Blocking is an initial probabilistic linkage step that reduces the number of record comparisons between files. The process involves sorting and matching the data files by one or more identifying blocking variables.
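To tie the five steps and the blocking idea together, here is a minimal Python sketch using classic Fellegi-Sunter-style log2(m/u) agreement weights; the m and u probabilities, field names, records, cut-point, and blocking variable are all invented for illustration, not values from any real linkage project.

```python
# A minimal sketch of probabilistic record linkage with blocking.
# All probabilities, field names, and records below are hypothetical.
from math import log2
from itertools import product

# Step 1: chosen matching fields with assumed m (match) and u (unmatch) probabilities.
FIELDS = {
    "ssn":        {"m": 0.95, "u": 0.001},
    "last_name":  {"m": 0.90, "u": 0.01},
    "first_name": {"m": 0.90, "u": 0.05},
    "gender":     {"m": 0.98, "u": 0.50},
}

def pair_score(rec_a, rec_b):
    """Step 3: sum field weights -- log2(m/u) on agreement,
    log2((1-m)/(1-u)) on disagreement (missing values count as disagreement)."""
    total = 0.0
    for field, p in FIELDS.items():
        if rec_a.get(field) and rec_a.get(field) == rec_b.get(field):
            total += log2(p["m"] / p["u"])
        else:
            total += log2((1 - p["m"]) / (1 - p["u"]))
    return total

def candidate_pairs(file_a, file_b, block_key="birth_year"):
    """Blocking: only compare record pairs that share the blocking variable."""
    for rec_a, rec_b in product(file_a, file_b):
        if rec_a.get(block_key) == rec_b.get(block_key):
            yield rec_a, rec_b

def link(file_a, file_b, cut_point=5.0):
    """Steps 2, 4, 5: score candidate pairs, sort by score, keep pairs above the cut-point."""
    scored = [(pair_score(a, b), a, b) for a, b in candidate_pairs(file_a, file_b)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(s, a, b) for s, a, b in scored if s >= cut_point]

file_a = [{"ssn": "123-45-6789", "last_name": "PACIOTTI", "first_name": "BRIAN",
           "gender": "M", "birth_year": "1957"}]
file_b = [{"ssn": None, "last_name": "PACIOTTI", "first_name": "BRIAN",
           "gender": "M", "birth_year": "1957"}]
print(link(file_a, file_b))   # missing SSN lowers the score but need not block the link
```

Pairs that land just above or below the cut-point are the ones most worth setting aside for manual review.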
Now, let's talk about measuring success. Matching attribute values between two entities is the most common basis for making a linkage decision. For example, one might choose only highly scored record pairs in the probabilistic method, or simply use deterministic linkage, and report the total proportion of records that matched. If one can find additional information about the true entities within the data, sensitivity and specificity can be calculated. Remember, sensitivity and specificity can only be calculated if, for a sample of records, the real entities are clearly known and can be compared. If such data are available, false positive and false negative rates can be calculated. Here, false negatives are real matches that were not linked, whereas false positive links are non-equivalent entities that were linked. Users of the merged data will want to know the validity of the record linkage product. It may be worth the effort to measure success, even if it means taking a sample of the linked records and carefully evaluating false negatives and false positives.

From a more social and legal perspective, I will finish this section by mentioning that merging projects should always comply with HIPAA regulations and any other state or organizational policies. The data integration process must adhere to all regulations. This is especially important because there is a risk that data could become identifiable as a result of the integration process. For example, a file of encrypted patient IDs, once matched with identifiers such as name and address, becomes a much riskier file in the event of a data breach.

As a final note, I will make a comment about the likely value of data integration. Value is the ratio of benefits to costs. There are lots of tools and methodologies to help, but manual effort will usually be required. Entity resolution with large, complex data sets can be difficult and time consuming, especially if the data are of low quality. Overall, I believe that data integration can result in high payoffs for healthcare organizations, but the effort will likely be large. I suspect that many of you will be at least tangentially involved in data integration projects. Enjoy the challenge, and good luck to you.