Now we're going to get into the substance of how to formulate a data query. The first step in formulating a data query is answering the where: where is the data? What kind of sources can we bring to bear on our question? What sources are available? Do I have access to those sources? Do I have access to the full source or just a limited subset of it?
The second step in formulating a data query is understanding whether we want all the results or some of the results. In the previous slide deck, we talked about narrowing a dataset. For example, let's find all patients with high blood pressure. That is narrowing based on an observational criterion, either laboratory or diagnostic. That's called a filter.
Step three is aggregate. Do we aggregate the data or process it? For example, let's find the average height of patients in Baltimore, Maryland. Well, that doesn't return all the patients in Baltimore, Maryland. It results in one number. So that's a process. Similarly, there are other processes that fall under this rubric of aggregate. So, do we aggregate the final results? Step four: do we filter that aggregate? For example, what is the average height of patients from Baltimore who are between the ages of 50 and 60? That would apply both a process, since we're calculating an average, and a filter.
Step five is sorting the output. What makes it most readable? Do we sort it on patient ID? Well, to most humans, patient ID is a useless piece of information. We typically sort things in alphabetical order. That's how our minds are trained to sort a list or look at things. But there might be other ways to sort the data to make it as readable as possible.
And step six, the final one, is the output. Which columns, or which data elements, I should say, do you want to output? A dataset might have hundreds or thousands of columns or pieces of specific data, but I only want three.
If I'm making a mailing list, for example, of patients who came to a clinic, I don't need to know how many square feet their living space is, and I don't need to output what kind of car they drive. I just want their name and address to generate a mailing list. So what parts are most important to the query?
So let's summarize these steps. Step one, where? Where do we pull the data from? Step two, do we need to filter? Do we reduce the dataset, or are there filters on the selection or on the pulling of the data that we need to apply? Step three, do we need to process or aggregate or collect the data together to generate the result, or can we just look at the data as is? Step four, do we filter that final aggregate result? Do we only want a subset of that final result? Step five, how do we sort the data? Do we want to present it in ways that are intuitive and readable to our particular customer? Finally, step six: what do we output? What are the pieces of data that are important to have in the result set?
Not only does this framework answer the basic questions, but it also aligns with the foundations of the SQL language, the Structured Query Language used with relational databases. The sources, step one, align with what's called the From clause of a Select statement. That's where you pick your data sources. The filter aligns with the Where clause. A Where clause is where you set up conditions or filters that will reduce the dataset. Aggregation is a Group By, with or without Having; you can further reduce your aggregate through the Having clause. The sort operation is done with the Order By clause. And finally, counter-intuitive to the way it's written, the last intellectual or thought-process step is called Select, where you're designating columns or elements for output. Again, this course will not go into how exactly to write all those clauses of a Select statement.
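To make that alignment concrete, here is a minimal sketch in Python using the built-in sqlite3 module. The patients table, its columns, the sample rows, and the two-patient threshold in the Having clause are all made up for illustration; they do not come from any real dataset. Each clause is labeled with the step it roughly corresponds to.

```python
import sqlite3

# Hypothetical patient table; every name and value here is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (name TEXT, city TEXT, age INTEGER, height_cm REAL)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [
        ("Avery", "Baltimore", 52, 170.0),
        ("Blake", "Baltimore", 58, 180.0),
        ("Casey", "Baltimore", 34, 165.0),
        ("Devon", "Annapolis", 55, 160.0),
        ("Emery", "Annapolis", 57, 170.0),
        ("Finley", "Columbia", 51, 175.0),
    ],
)

# The clauses are written in SQL's required order; the comments give the thought-process step.
query = """
SELECT city, AVG(height_cm) AS avg_height_cm   -- step 6: which elements to output
FROM patients                                  -- step 1: where is the data?
WHERE age BETWEEN 50 AND 60                    -- step 2: filter the rows
GROUP BY city                                  -- step 3: aggregate
HAVING COUNT(*) >= 2                           -- step 4: filter the aggregate
ORDER BY city                                  -- step 5: sort the output
"""

rows = conn.execute(query).fetchall()
print(rows)
```

Notice that Select is written first even though it is the last thought step, and that the Having condition is applied to each city's group rather than to individual patients.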
Even without the details of the syntax, it's worthwhile knowing that the Select statement itself forces a discipline on how to ask a data question. The Select statement lends itself not just to relational models of data; it also enforces a discipline on which you can build structured data questions, even for non-relational sets of data.
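As one possible illustration of that last point, the sketch below walks the same six steps over a plain Python list of records, with no database and no SQL involved. The records, the field names, and the two-patient threshold are hypothetical, chosen only to show the order of the steps.

```python
from collections import defaultdict
from statistics import mean

# Step 1 (where is the data?): a hypothetical list of records, e.g. parsed from a JSON export.
records = [
    {"name": "Avery", "city": "Baltimore", "age": 52, "height_cm": 170.0, "car": "sedan"},
    {"name": "Blake", "city": "Baltimore", "age": 58, "height_cm": 180.0, "car": "truck"},
    {"name": "Casey", "city": "Baltimore", "age": 34, "height_cm": 165.0, "car": "none"},
    {"name": "Devon", "city": "Annapolis", "age": 55, "height_cm": 160.0, "car": "sedan"},
    {"name": "Emery", "city": "Annapolis", "age": 57, "height_cm": 170.0, "car": "van"},
    {"name": "Finley", "city": "Columbia", "age": 51, "height_cm": 175.0, "car": "none"},
]

# Step 2 (filter): keep only patients between the ages of 50 and 60.
filtered = [r for r in records if 50 <= r["age"] <= 60]

# Step 3 (aggregate): collect heights per city so they can be averaged.
heights_by_city = defaultdict(list)
for r in filtered:
    heights_by_city[r["city"]].append(r["height_cm"])

# Step 4 (filter the aggregate): keep only cities with at least two patients.
summary = [
    {"city": city, "avg_height_cm": mean(heights)}
    for city, heights in heights_by_city.items()
    if len(heights) >= 2
]

# Step 5 (sort): alphabetical by city, which is easier to read than an ID order.
summary.sort(key=lambda row: row["city"])

# Step 6 (output): only the two elements we care about; fields like "car" are dropped.
result = [(row["city"], row["avg_height_cm"]) for row in summary]
print(result)
```

The data here never touches a relational database, yet the question is still asked in the same order: source, filter, aggregate, filter the aggregate, sort, output.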