So hopefully in the last video I motivated you, or at least scared you, into thinking that reproducible research is important. So how do you achieve reproducibility? The first component is the data, and the first thing you need is a data sharing plan: how are you going to take the raw data you collect or receive from somebody, turn it into processed data, and then turn around and share it with other people? There should be four parts to this. First, you need the raw data: whatever data you took in, without doing anything to it. Second, a tidy data set: what you did to the data, how you processed it and got it into a neat format you could analyze. Third, a code book describing each variable and its values in the tidy data set. This can include things that don't necessarily appear in the data set itself, say the units of the measurements, what machine they were run on, who collected them, or any other information you can't pack into the data set. And fourth, an explicit and exact recipe you used to go from the raw data to the tidy data set and the code book.

I want to go through each of these in a little more detail, just to make sure you know what we're talking about. So first, the raw data. "Raw" is actually a relative term; it's relative to who you are. If you get the image files from a genomic sequencing experiment, then the image files are the raw data to you. If you get the FASTQ files, then the FASTQ files are the raw data to you. Regardless of what you received, the way you know the data are raw to you is that you did no processing on them: you didn't compute anything or merge anything together, you didn't summarize them in any way, and you didn't delete or remove any values. That's the raw data. So when you share an analysis you've done, you always need to include the raw data as a component of it.
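One simple way to back up the "no processing" claim is to record a checksum of each raw file the moment you receive it; anyone you share with can then verify that the raw data is byte-for-byte unchanged. This is just a sketch, not something from the lecture, and the helper name is our own:

```python
# Sketch: fingerprint raw files so collaborators can verify that what you
# shared is exactly what you received, with no processing applied.
import hashlib

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()
```

You might store the digest of every raw file in a small manifest that travels with the shared data set.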
The second component is the tidy data set. This is the data set that's actually organized, shareable, and ready to be used. In general, a tidy data set has one variable per column, one observation per row, and one table per kind of variable. We'll talk about how, in genomics, there are usually three main types of tables that you'll collect in the tidy data set, and you want linking indicators that connect these different tables so that people can analyze the data across all of them.

Alongside the raw data and the tidy data set, you need a code book. The code book explains things like: if a variable is coded as 1 and 2, you want to know that 1 means schizophrenia patients and 2 means controls. You also want variable descriptions (what the variables mean, how they were measured, the units) and any study design quirks. For example, if some measurements are set to 0 because they are below the limit of detection, you want to note that in the code book.

Then you also need the recipe. The recipe takes the raw data as input, does some processing, and produces the tidy data set as output. The raw data should be the only thing that goes into the code, and the tidy data should be the only thing that comes out. You shouldn't have to set anything, no parameters; everything should be fixed, and you should always get the same tidy data set out if you put the same raw data set in. Now, if you don't actually have a script to do this, if you haven't written some R code or some Python code, the best alternative is explicit instructions. This is actually really, really hard to do well. You have to record the version of every piece of software you used, every parameter you might have set on every piece of software, and the order of every step has to be right.
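As a concrete illustration of such a recipe script (sketched in Python here, though R is equally common), the function below turns hypothetical wide raw rows into tidy rows with one observation per row and one variable per column. The column names are invented, and nothing is configurable, so the same raw input always yields the same tidy output:

```python
# Sketch of a "recipe": raw data in, tidy data out, no settable parameters.
# The wide layout ({'sample': ..., 'gene_A': ..., 'gene_B': ...}) is invented.

def make_tidy(raw_rows):
    """Convert wide raw rows into tidy rows:
    one observation per row, one variable per column."""
    tidy = []
    for row in raw_rows:
        for key, value in row.items():
            if key == "sample":
                continue  # 'sample' is the linking indicator, not a measurement
            tidy.append({"sample": row["sample"],
                         "gene": key.replace("gene_", ""),
                         "count": int(value)})
    # Sort so identical raw input always produces identical tidy output.
    tidy.sort(key=lambda r: (r["sample"], r["gene"]))
    return tidy
```

Because the raw rows are the only input and there are no options to set, rerunning the recipe on the same raw data reproduces the tidy data set exactly.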
You have to record basically everything you did in a document like the recipe document. That's much harder, so it's highly, highly recommended that you just write a script that will reproduce the results. The thing you definitely have to avoid is any kind of vague instruction. You can't just say "I used software X," because if software X changes, if there's a new version or a new function has been added to it, people won't be able to reproduce what you did. And make sure you don't skip any steps, even the tiny ones: "oh, I had to merge column 1 and column 2 into one column, and then I ran the software on it." You have to record all of that if you don't have a script that will do it. So those are the four components you need: the raw data, the tidy data, the code book, and the recipe.

Then, in terms of sharing code, there are a couple of different ways you could do it. One is to just share raw files, for example a .R file: just a raw R code file that you share with people. This can be good, but you should make sure you comment all of the code. In other words, explain what each function does and how the functions work together. An even better way is what's called literate programming. An example in R is an R Markdown document; an example in Python is an IPython notebook. It's a document that mixes writing and code, so people can read what you did: you explain it in plain language, follow that immediately with the code, and follow that with the results. Because it's all interwoven together, it's easy for people to get started looking at the data. So when you distribute your analysis for this class, or for anything you do in the future, make sure you have a complete data set, along with either your raw code, whatever kind of code you have, or a literate programming document if you can do that.
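As a small illustration of the "comment everything" advice, here is a hypothetical analysis fragment; the data and the summary step are invented. In a literate programming document, each of these step comments would become a plain-language paragraph above its own code chunk, with the results shown right after:

```python
# Hypothetical analysis: compare a measured value between cases and controls.

# Step 1: the tidy data set (hard-coded rows stand in for reading a file).
records = [
    {"group": "case", "value": 4.0},
    {"group": "case", "value": 6.0},
    {"group": "control", "value": 2.0},
    {"group": "control", "value": 4.0},
]

# Step 2: summarize with one mean per group; this is the result we report.
def group_means(rows):
    """Return {group: mean of 'value'} for the given tidy rows."""
    totals, counts = {}, {}
    for r in rows:
        totals[r["group"]] = totals.get(r["group"], 0.0) + r["value"]
        counts[r["group"]] = counts.get(r["group"], 0) + 1
    return {g: totals[g] / counts[g] for g in totals}
```

The comments explain why each step exists, not just what the code does, which is exactly the information a reader of your shared analysis needs.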