In this brief lesson, my objective is to provide a little bit of information about a vast field: the data and information that come from genomic analyses. As with the other lessons we just covered, we do not have time to go very deep into the subject, but we can provide a foundation for you to do your own homework and learn more. The world has been experiencing a dramatic reduction in the cost of DNA sequencing and protein detection, and a dramatic rise in the amount of data that comes out of these devices. This data is going to be increasingly used in health care. Even though its current impact is not high, you may see an increasing impact from genomic data over the course of the next decade. In sum, what's coming is likely to be an onslaught of genomic data. At the end of this lesson you will not be a geneticist, but you will be able to identify the common ways that various types of genomic data are stored in computer-readable files. My purpose in this lesson is to spark your interest in the future of genomics in health care. We won't go into the details of genomics, but let's get started with some definitions that I borrowed word for word from various sources. The World Health Organization defines genetics as the study of heredity. Genomics is defined as the study of genes and their functions. They state that the main difference between genomics and genetics is that genetics scrutinizes the functioning and composition of the single gene, whereas genomics addresses all of the genes and their interrelationships in order to identify their combined influence on the growth and development of the organism. I liked the definitions provided by the National Human Genome Research Institute, so here I will read some basic genomics from their paper called A Brief Guide to Genomics. DNA molecules are made of two paired strands, often referred to as a double helix.
Each DNA strand is made of four chemical units called nucleotide bases, which comprise the genetic "alphabet". The bases are adenine, A; thymine, T; guanine, G; and cytosine, C. Bases on opposite strands pair specifically: an A always pairs with a T, and a C always pairs with a G. The order of the A's, T's, C's, and G's determines the meaning of the information encoded in that part of the DNA molecule, just as the order of letters determines the meaning of a word. An organism's complete set of DNA is called its genome. Virtually every single cell in the body contains a complete copy of the approximately three billion DNA base pairs, or letters, that make up the human genome. With its four-letter language, DNA contains the information needed to build the entire human body. A gene traditionally refers to the unit of DNA that carries the instructions for making a specific protein or a set of proteins. Each of the estimated 20,000 to 25,000 genes in the human genome codes for an average of three proteins. Located on 23 pairs of chromosomes packed into the nucleus of a human cell, genes direct the production of proteins with the assistance of enzymes and messenger molecules. With that brief review of genomics, let's now consider how the science of genomics offers real-world value to patients suffering from diseases. Genomics has progressed from having its value in understanding population heritability to a current phase where there is potential to understand the impact of specific genes on diseases. With this comes the hope of a new form of personalized medicine, or precision medicine, where clinicians can better predict which specific treatments or medications will work most effectively for subpopulations based on genes, lifestyle, and environmental factors. There are now over 6,000 diseases that are tied to genetic inheritance, and over 1,500 clinically relevant traits that are being studied in the context of genomics.
In sum, medicine in the future really could become much more precise as treatments target very specific genetic traits. Okay. Great. We know genomics has potential in the context of precision medicine, but you might ask, isn't it too expensive to get genomic data for clinical use? Actually, genomic data is likely to become much more common, not only because of its likely value to patients, but also because the drop in costs makes the data much easier to produce. The National Human Genome Research Institute tracks the costs associated with DNA sequencing. They use this information to evaluate how improvements in DNA sequencing technologies will impact their programs. The NHGRI provides a few nice cost graphics. The first is cost per megabase of DNA sequence. Their site describes this as the cost of sequencing one megabase, which is a million bases. The second, cost per genome, is the cost of sequencing a human-sized genome. Each chart has a comparison line to reflect Moore's Law, or the doubling of compute power every two years. They state on their website that technology improvements that keep up with Moore's Law are widely regarded to be doing exceedingly well, making it a useful comparison. They also note that both graphs use a logarithmic scale on the y-axis. The main idea here is to see how DNA sequencing technology has substantially outpaced Moore's Law beginning in 2008. Most of you have read about advances in genomics and its applicability to health. I will now review a few points from a nice paper called Big Data Analytics in Genomic Medicine. The authors of this paper argue that next-generation sequencing techniques, such as whole genome sequencing, are leading us towards cheaper and more reliable information from genomic data. They discuss the challenges associated with the huge files generated from whole genome sequences, but seem optimistic that big data technology and informatics can deliver this information into the clinical setting.
As in our past lessons, the authors go on to talk about complex EHR systems and some of the challenges of storing and processing clinical data, yet they do think progress is being made in creating genomic output files that can be integrated into electronic health records. Continuing on with genomics and big data analytics, it is important to think about how massive volumes of data can be stored and analyzed within various health care environments. It is true that the volume of data decreases as it moves through the genomic analytic pipeline, but technical challenges remain. To give some context here, consider that the raw sequence data is in the gigabyte range per individual genome. The paper I just mentioned states that the raw whole genome files can be 100 gigabytes. As the text sequence data is processed and normalized, the files get much more manageable. Variant call files are about one gigabyte per genome. Finally, when you get to the interpretation stage, which covers the variations in sequence between the subject's DNA and the reference DNA, the files are smaller still. In terms of what would be stored for the patient, it's really a matter of how much disk space one has, and whether one sees value in storing the raw sequence data. Generally, there is so much raw sequence data being generated that it is currently impractical to store it all. In most cases, people are storing the text sequence data and certainly the variation-in-sequence data, which is much smaller. The implication for anyone doing analytics is that you may be asked to do population-based comparisons of variations in sequence, but you are much less likely to be called upon to analyze the raw data. Let me briefly mention genomic file formats. One of the most common file formats is FASTQ, which is a text-based data format. This format stores both a biological sequence, usually a nucleotide sequence, and its corresponding quality scores, all encoded as ASCII text.
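To make the FASTQ description a little more concrete, here is a minimal Python sketch of how such a file could be read. It is an illustration under stated assumptions, not a production parser: it assumes the simple layout of four lines per record (a header starting with `@`, the sequence, a separator starting with `+`, and the quality string), and it assumes the quality characters use the common Phred+33 ASCII encoding. The names `FastqRecord`, `read_fastq`, and `EXAMPLE_FASTQ`, and the sample read itself, are hypothetical and made up for this example.

```python
import io
from dataclasses import dataclass
from typing import Iterator, List, TextIO

# Assumption: qualities are Phred scores stored as ASCII characters
# with an offset of 33 (the common Sanger-style encoding).
PHRED_OFFSET = 33


@dataclass
class FastqRecord:
    name: str             # header text without the leading "@"
    sequence: str         # nucleotide letters, e.g. "GATTACA"
    qualities: List[int]  # one decoded quality score per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(ch) - PHRED_OFFSET for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A made-up, single-read example (not real sequencing data).
EXAMPLE_FASTQ = "@read1\nGATTACA\n+\nIIIIII#\n"

for record in read_fastq(io.StringIO(EXAMPLE_FASTQ)):
    print(record.name, record.sequence, min(record.qualities))
```

Each character in the quality line maps to one base in the sequence line, which is why the two lines must have the same length. Real files are usually far too large to load at once, so the sketch reads one record at a time rather than the whole file.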
Most sequencers will output FASTQ format today. There are also SAM files. These are human-readable files used for analysis. SAM is the output of aligners that read FASTQ files and assign each sequence a position with respect to a known reference genome. BAM files are binary versions of SAM files. These large files can be over 100 gigabytes per genome. Let's look at cancer for a minute. You would normally want one BAM file for the normal tissue and one BAM file for the tumor tissue. If you have 1,000 patients, each with one BAM file for the normal tissue and one for the tumor tissue, you can see that a significant amount of data is required for the cancer patients alone, never mind everyone else who comes into your system for care. Clearly, this is only the start of your adventures in genomic data, but hopefully it motivates you to learn more. With a single genome generating up to 100 gigabytes of data, this is a nice introduction to the topic of big data. Let's move on to that topic in the next lesson. We'll see you soon.