Now, I want to switch gears and talk about a more general subject referred to as big data. Although this topic is related to the massive clinical datasets found in hospital EHR systems, here we will focus on the new technologies that evolved to store huge datasets in a new manner. I will come back to health care to remind us that we often use huge datasets with billions of rows and a huge number of dimensions to create small subsets or cohorts of patients. Thus, big data is often used to create small and precise populations. After the lesson, you will be able to describe how big data technologies differ from common relational database technologies, which require a lot of effort in database modeling and planning. So, you might ask, what in the world does big data mean in the context of healthcare? The definition of big data is rapidly evolving. At one point in time, it was purely a volume metric. Today, it refers to both large data volumes and inconsistent, rapidly changing data structures. Although formal definitions are evolving, let me offer some of the common terms that are used to describe big data. Most of these terms start with the letter V. First, volume is the well-accepted idea that big data is just that: a lot of data. For example, a hospital that I've worked with has over 2.3 million patients in its EHR system and many, many billions of rows of data. Second, variety. Variety refers to data coming from various sources and thus having various formats. Think about all the forms of social media. Users are demanding more and more functionality in the form of new data types and multimedia. This level of volatility in the data types that need to be persisted, or saved, has never been seen before. As such, new technologies are arising to better handle this new reality.
However, most of them do so at the expense of the formal modeling found in legacy data store systems. This means the data is merely saved; it is not broadly accessible. To relate this to medicine, consider progress notes written by your doctor. Most progress notes agree on basic note sections, typically subjective, objective, assessment, and plan. However, notes vary widely. Consider factors such as the level of detail, use of abbreviations, language differences, regional influences, and doctor-specific habits. All of these factors also interact with the continuously evolving nature of medicine. Third, velocity gives the notion of movement. Indeed, most healthcare organizations with large databases are constantly receiving new data and sometimes even modifying old data, and often these changes happen daily. Fourth, veracity (sometimes called verification) captures the idea that the data are complex and often changing; in combination with volume and variety, it can be very difficult to verify that the data are really valid. There are other components to the definition, but I encourage you to read about those details on your own. With the definition of big data in hand, now allow me to focus on the technologies that were created specifically to account for these new and unique data demands. These new data store technologies, commonly referred to as NoSQL or non-relational technologies, are also sometimes referred to as big data appliances. Previous storage technologies require a formal description of the data, a process known as data modeling, prior to accepting or storing the data. In these new big data appliances, data is usually stored in a more natural form, eliminating the need to incur the cost of the data modeling step. This, however, comes at a great expense, since the storage strategy itself does not formally describe the data. The data is only valuable to the process that created it or to those who are willing to spend a lot of time analyzing it.
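To make that trade-off concrete, here is a minimal Python sketch of schema-free storage. The note contents, field names, and the use of a plain list as the "store" are all invented for illustration: two progress notes are saved as JSON documents with no up-front data model, so the store happily accepts both shapes, and every question about the data must be answered by code written after the fact.

```python
import json

# Two progress notes saved with no up-front data model.
# Each author structured the note differently; the store accepts both.
note_a = {"patient_id": 101, "subjective": "Headache x3 days",
          "objective": "BP 128/82", "assessment": "Tension headache",
          "plan": "Ibuprofen PRN"}
note_b = {"pt": 102, "SOAP": "S: cough. O: afebrile. A: URI. P: fluids, rest."}

store = [json.dumps(note_a), json.dumps(note_b)]  # the data is "merely saved"

# Because no schema describes the data, every consumer must
# handle each shape explicitly to make the data accessible.
def patient_id(raw):
    doc = json.loads(raw)
    return doc.get("patient_id", doc.get("pt"))

print([patient_id(raw) for raw in store])  # [101, 102]
```

Note how even a question as simple as "which patients are in the store?" already requires knowledge that lives outside the data store itself.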
Now I will introduce the most common big data appliances at a very high level, simply to provide you some context. I encourage you to research these more deeply if you have specific interests. I hope only to introduce the major strengths and weaknesses of each of these technologies. I'll briefly mention three big data appliances: key-value stores, document stores, and graph databases. Key-value stores are the simplest of the three, offering extremely fast performance but absolutely no data visibility. Document stores are less blind to the data, but this comes at a slight performance and configuration cost compared to the key-value store. Graph databases are the best for modeling data in its natural state. However, gaining an understanding of the data that are present and the relationships among them can be a very time-consuming process. In general, these technologies don't currently have much presence in the healthcare field. There is one exception: a specific technology known as Hadoop, a form of document data store, is beginning to have an impact in unstructured data analytics. Going back to key-value stores, these offer an extremely simple data model: a key is uniquely associated with a value. This approach requires no formal data modeling but does provide fast and reliable storage. The use or understanding of the data must be done completely outside of the data store. This is very much the same as saving a document or any other file on your computer. These characteristics have made key-value stores very popular for rapid object-oriented application development. The reason for this is that the key-value store places no limitations on what data type the value can be; it can range from a string all the way to a complex object. Oftentimes, object-oriented programmers will simply save their objects into this database, making retrieval very fast.
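The key-value pattern described above can be sketched in a few lines of Python. This is only an illustration: the `Appointment` class and the keys are invented, and a plain dict stands in for a real key-value store, which behaves the same way at this level of abstraction. Note how the value can be a simple string or a serialized application object, and how the store itself knows nothing about what the bytes mean.

```python
import pickle

class Appointment:
    """An application object we want to persist as-is (hypothetical example)."""
    def __init__(self, patient, when):
        self.patient = patient
        self.when = when

# A key-value store maps an opaque key to an opaque value.
# A plain dict stands in for the store in this sketch.
kv_store = {}

# The store places no limitations on the value's data type:
kv_store["greeting"] = "hello"                                    # a string
kv_store["appt:101"] = pickle.dumps(Appointment("Ada", "09:30"))  # a whole object

# Retrieval is by exact key only; interpreting the value is entirely
# the application's job, outside the data store.
appt = pickle.loads(kv_store["appt:101"])
print(appt.patient)  # Ada
```

This is why key-value stores pair so naturally with object-oriented development: the programmer serializes an object under a key and gets it back intact, with no data modeling step in between.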
Now I will mention how we can move from the topic of big data to the new and exciting field of precision medicine. Big data are impressive, and they do offer opportunities. But it is important to remind ourselves that sometimes the goal of big data is to have enough data that we can understand rare diseases and problems in smaller sub-populations. The relatively new field of precision medicine integrates molecular and clinical research with patient data, placing the patient at the center of all elements. It means identifying those who respond to treatments and then targeting, or being more precise about, our therapies. Typically it involves the use of multiple molecular techniques to provide a profile of a disease, which predicts outcomes or progression. This then allows for directed or targeted therapy. Typically, the measurements that we see done today are molecular tests that involve measuring methylation of DNA, counting gene copy number, identifying the number and types of mutations, transcription issues, and of course malformed proteins, or proteomics. We're now measuring many of these. Biomarkers are emerging, and work is being done to correlate them with disease outcomes and progression when exposed to different therapies. People are also starting to invent therapies that target specific mutations, for example, problematic transcriptions. In sum, many of these measurements, or discovered biomarkers, if they are molecular targets, turn out to be good targets for therapies. Excellent, we have now completed this module, and I hope that you have a greater appreciation for the variety of data types and formats that are common in healthcare. As data scientists and analysts, I cannot stress enough how important it is to understand the context of data. If you know why data are collected and for what purpose, you will be far ahead in your work.