Hello, this lesson builds on our data harmonization lesson. Building on the reasons why data are often fragmented, I will continue the theme by describing approaches to data integration. I will illustrate why data integration is often challenging and costly, yet often worth the effort. At the end of the lesson, you will be able to provide concrete examples of why it is often necessary to integrate data and why data coming from disparate sources will often lead to challenging data conflicts that require resolution.
To start, it is important to ask ourselves, what is data integration? I offer a definition I found in an IBM analytics white paper. The authors define data integration as the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information. Moreover, a complete data integration solution delivers trusted data from a variety of sources. I like this definition because it makes clear that data integration involves both technical and human processes. In our case, the business processes relate to healthcare organizations and how they provide clinical care or how they finance that care. It is also important that the data are trusted. This means that users should have confidence that the data are of high quality and result in valid and reliable information.
Next, it is important to ask ourselves why we should consider integrating data within a healthcare environment. The first part of the answer is that data are often the fuel of business intelligence applications, and business intelligence applications in healthcare are becoming more and more important as they support clinical processes, finance, research, and many other aspects of delivering and paying for care. Second, although data are the fuel of applications, most organizations have data in multiple systems. These include the newer EMR systems along with the important legacy systems.
Finally, what I term high-value information, or information that can actually lead to actionable improvements in processes, often requires data from many of the data sources. As an example, hospitals care about their processes, and it is often ineffective to analyze data within silos, such as the clinical, operational, and financial domains.
The next question might be, why is data integration hard? Well, if it were easy or trivial, this would not be as important a topic, and few jobs would be tasked with or related to the endeavor. But as all programmers know, data integration is often one of the most challenging aspects of data-related projects. Moreover, it is worth noting that integrating just two sources of data is difficult, but it is often necessary to integrate multiple sources.
Here I list a few reasons why data integration is challenging. First, database schemas are often heterogeneous. In other words, database structures such as tables and fields differ. Next, data may come from databases, spreadsheets, and other sources, so it can be difficult to get all the data into one place. Third, many data sources are unstructured, meaning there are no discrete fields to query. For example, all of the important data might be written in detail in unstructured doctors' notes. Fourth, data may come from different parts of the organization where the team has no authority over the data. Sometimes there are political issues with giving analysts access to data within their own organization. A final point is that a large amount of tedious, manual effort is often required to create mappings. We will cover these topics in more detail in a few moments.
To further illustrate the challenges of data integration, it is useful to review different types of conflicts. First, communication conflicts relate to the different methods required simply to access the data sources. For example, some databases may allow simple open connections such as ODBC, whereas others require more sophisticated access through APIs.
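As a rough illustration of the heterogeneous schemas and hand-built mappings mentioned a moment ago, here is a minimal sketch in Python. The two sources, their field names, and the sample values are all hypothetical, made up for this example; a real project would involve far more fields and many more sources.

```python
from datetime import datetime

# Hypothetical rows from two systems that describe the same patient.
# The field names, types, and formats differ between the sources.
ehr_rows = [{"patient_id": 101, "birth_date": "1980-04-12", "sex": "F"}]
billing_rows = [{"PatID": "00101", "DOB": "04/12/1980", "Gender": "female"}]

# Hand-written mappings: target field -> (source field, converter).
# Writing and maintaining these is the tedious, manual part.
EHR_MAPPING = {
    "patient_id": ("patient_id", int),
    "birth_date": ("birth_date",
                   lambda s: datetime.strptime(s, "%Y-%m-%d").date()),
    "sex": ("sex", lambda s: s.upper()),
}
BILLING_MAPPING = {
    # Stored as a character field with leading zeros in this source.
    "patient_id": ("PatID", int),
    "birth_date": ("DOB",
                   lambda s: datetime.strptime(s, "%m/%d/%Y").date()),
    # Spelled out in this source; keep only the first letter.
    "sex": ("Gender", lambda s: s[0].upper()),
}


def to_common_schema(row, mapping):
    """Rename and convert one source row into the shared target schema."""
    return {
        target: convert(row[source])
        for target, (source, convert) in mapping.items()
    }


# Apply each source's mapping, then combine the rows in one place.
integrated = (
    [to_common_schema(r, EHR_MAPPING) for r in ehr_rows]
    + [to_common_schema(r, BILLING_MAPPING) for r in billing_rows]
)
```

After mapping, both rows share the same field names and types, which is what makes it possible to compare or combine them later. Notice that every entry in the two mappings had to be written by hand, and that is with only three fields and two sources.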
Next, database schema conflicts are a broad topic. One example is that different tables store database information in various ways. For example, fields from various sources may measure the same concepts, yet the fields store the data in different places or use different formats, such as character or numeric field types. Third, data values can also differ among sources. For example, one database might spell out "avenue" in a street address whereas others use abbreviations. When I discuss entity resolution in the upcoming lessons, we will see how these subtle differences can cause us problems. Fourth, semantic conflicts relate to the meaning of the data. For example, the meaning of a field called place of injury could be confusing. Does this mean the location on the body, or the state in which the person was hurt? Finally, databases come in many different types. For example, an organization might have an object-oriented, relational, document-oriented, or graph database. Any integration requires reconciling some of the major differences in how data are stored among these database types.
Let us look at an example. Health systems such as hospitals have many data sources collected by different units or divisions within the organization. These include electronic health records, billing systems, and research systems. Moreover, many of these systems might have different modules with distinct, non-integrated databases. For example, research systems might have separate clinical trials management systems and research volunteer registries. Almost all hospitals, including the one I've worked for, have legacy databases that predate the EHR. In my experience, executives are often interested in gaining organizational value by integrating many or all of these systems. Next, allow me to provide an example from health insurers or health plans.
Healthcare payers that pay providers for services, for example Blue Shield or Kaiser Permanente in the United States, often have a lot of data in numerous sources. Many of these organizations realize that data integration is necessary to improve efficiency, reduce costs, and improve quality. Healthcare payers mainly deal with claims records related to fee-for-service payments, yet other databases are important for integration. These include encounter records related to managed care systems, government databases such as hospital discharge datasets, member satisfaction data, and finally, clinical data from providers. Using clinical information to improve quality and reduce costs sometimes depends on the data integration process. For example, outcomes research based on claims might require clinical information to create predictor or independent variables.
Okay, that was a very quick review of data integration and of how resolving conflicts between data is a common, yet often challenging, task. In the next lesson, we will move into a more detailed topic about how to resolve data conflicts. This is the area of data mapping. See you soon.