Welcome to Component 24, Health Care and Data Analytics, Unit 2 Working with Data. This unit is a laboratory exercise using publicly available data sets and Microsoft Excel. The activities focus on preparing data for analysis, and performing simple analyses. The objectives for this unit, working with data, are to describe reasons why data need to be cleaned or modified before analysis. Demonstrate the ability to identify and correct basic errors in data. Demonstrate the ability to perform descriptive analysis. Demonstrate the ability to use pivot tables. And describe the relationship between a database in a health IT system and data analysis tools. Common technologies and tools used for data analytics include spreadsheet programs such as Microsoft Excel, and dedicated statistical programs such as R, SAS, SPSS and Stata. Note that database management systems such as MySQL and Microsoft SQL Server can perform some basic analysis, but this is not their strength. Typically, data will be retrieved from a database, such as an EHR's database using a SQL query. The data will then be imported into another application for the statistical analysis. There are also specialized business intelligence applications. Some examples are Tableau, QlikView, and IBM's Cognos. For the exercises in this unit, you will use Microsoft Excel because it is widely available. The Analysis ToolPak is a Microsoft Excel add-in program that is available when you install Microsoft Office or Excel. To use it in Excel, however, you need to load it first. Open Excel, click the File tab, and then select Options. Click Add-Ins, and then in the Manage box, select Excel Add-ins. Click Go. In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK. If Analysis ToolPak is not listed in the Add-Ins available box, click browse to locate it. If you get prompted that the Analysis ToolPak is not currently installed on your computer, click yes to install it. After you load the Analysis ToolPak, the Data Analysis command is available in the Analysis group on the Data tab. Note, to include Visual Basic for application VBA functions for the analysis tool pack, you load the Analysis ToolPak hyphen VBA add-in, the same way that you load the Analysis ToolPak. In the Add-Ins available box, select the Analysis ToolPak hyphen VBA check box, and then click OK. Most if not all datasets contain errors or empty fields. These can be due to a wide variety of reasons, such as keyboard errors, or fields left blank because the patient could not provide the information. Before you start doing any analysis you must examine your data, identify any problems, and determine how to address them. The first task is to identify the errors. You can run descriptive statistics, check counts of categorical data, and create pivot tables to examine your data. If or rather when you find errors or blank entries, you have to decide what to do. Can you determine the correct value from another source? For example, if the patients height is missing but you have the medical record number, can you look it up on another system? You may also have to infer or impute values. Perhaps the gender field on your patient is blank, and you cannot determine the gender from the name. But you see that the patient had a diagnosis of prostate cancer. You retrieve a pathology report and see that this is your patient, and can therefore safely change the blank gender field to male. If you cannot correct the record then you may have to make the decision to delete the record from your dataset. Good data governance principles are beyond the scope of this unit. But in general, always work with a copy of your original set of data. And keep a record or a log of every change you make on your data. Now let's take a look at descriptive statistics, which give a basic overview of the data, including the mean or average, the median or the midpoint, and the mode, the most frequently occurring value. The smallest and largest values and the total number of records. Running descriptive statistics is an excellent way to quickly check if the values are reasonable. For example, this set of data of 500 patient weights shows a mean weight of 189 pounds, with a minimum of 89 and a maximum of 565. To generate descriptive statistics in Excel, click data, then Data Analysis, then descriptive statistics. You will probably have data that is not numeric, and so counts of how many times certain values occur can be very helpful in not only understanding the data but also to identify errors in the data. Take a look at this list of 10 patient genders. If we use the COUNTIF function on the range of cells from B1 to B10 and specifically that we want to know how many M's there are we will get 5, for entries of F for female we will get 3, and for U or unknown we will get 1. Now, we have a problem. 5 plus 3 plus 1 only adds to 9, and we have 10 records. Obviously, in this example, we can see that there is an entry with a D in it. Which may have been a data entry error. But if we were working with thousands or millions of records, we would not be able to scroll through a list and look for the incorrect records. We need additional tools. So, let's look at another Excel tool called Filter. Excel's filter function let's you display only records that meet certain criteria. To filter records, click a cell in the column to be filtered. In this case we are using the gender column. And then click the Filter icon on the Data tab. Nothing will appear to happen other than a drop down arrow will now appear at the right hand of the column heading. Indicated here with the red arrow. Click the drop down arrow and we'll look what happens on the next slide. If you click the drop down arrow on the column header, you should now see a dialog box with several options. First, at the bottom of the dialog box is the list of all the values that were present in the gender column. The M, F, U and the D that isn't in correct value. To see only the the records with a D in the gender column, uncheck the F, M and U, and leave only the D checked. You can also specify additional filtering criteria such as whether the entry in a cell equals a value, does not equal a value, begins or end with a string of characters, and so on. Let's take a look at the weights for these 10 patients. We could graph them, but this isn't particularly helpful. Individual weights on a graph plotted against patient number doesn't give us insight into, we could graph them but this isn't particularly helpful. Individual weights on a graph plotted against patient number doesn't give us any insight into categories of weights, such as one category for 200 to 249 pounds, another category for 250 to 299 pounds, and so on. That's a job for our frequency histogram, which looks like a column graph but shows distributions. Let's look at frequencies and histograms now. Frequencies answer the question, how many of something is there? Recall that nominal data are things like names, labels, or categories of data. Examples of nominal data are patient names, genders, such as male, female, and unknown, and marital status, married, single, divorced, widowed. Frequencies are primarily shown in a table format and also in a histogram, which looks like a column chart so don't get confused. A histogram is a graph of the number of times values occur in a set of data, in other words, the frequency of the values. It looks like a column chart in that it is a visualization of the distribution of values. Let's look at an example now. On the left is a frequency table, and on the right is a histogram of the same data. The categories or bins from the frequency table are the numbers along the horizontal x-axis of the graph, and the frequencies or counts are the numbers along the vertical or y-axis of the graph. Let's step through the process of creating that table in histogram, we want to know how many patients fall into each category, and then we want to see this as a graph. We'll take our 500 patient weights and set up categories in roughly 50 pound divisions. Add a column to your Excel spreadsheet with the bins that you want to use to categorize the patient weights. Note that in this example only a few of the 500 patient weights are shown. To run frequencies in Microsoft Excel, click the Data tab, then click Data Analysis, and then choose histogram. In the input range, enter the range of cells that contain the weights, in this case, the data in H2, and continue through H501. In the Bin Range field, enter the cells that contain the category bins that you created. In this example, the bins are in cells L2 through L12. Finally, determine what type of output you want. Let's click Chart Output for this example. Excel will produce a new worksheet in your file with a table on the histogram. Sorting the histogram organizes the categories or bins by how often they occur. The most frequent bins will be to the left of the graph. Pivot tables are an Excel tool that you, Pivot tables are an Excel tool that let you summarize, analyze, and create different views of your data, you can arrange how the data is displayed. Pivot tables are very useful for identifying trends or relationships among data in large datasets. Here is an example of a raw set of data on the left and a pivot table of some of the data on the right. Along with a column chart of the data in the pivot table. You can practice creating this pivot table in the laboratory activity associated with this unit. Now think back to the section in this component on types of data. Recall that we discussed nominal or null interval and ratio data. Nominal data included things that were names, or categories, such as gender, race, religion, smokes, yes or no, and so on. Another name for nominal data is categorical data. And this term is typically used in statistics. So think about the example of gender. Let's say that we have three categories, male, female, or unknown. You can do simple analysis, such as counting how many of each type of gender in the hospital, but it doesn't make sense to do many kinds of calculations on categorical data. For example, it doesn't make sense to try to calculate the average gender of the patients in the hospital. However, you may want to know if there is a relationship between two categorical variables. For example, you may want to know if there is a relationship between patient gender, we'll simplify it to just male and female for this example, and smoking status, yes, no. The chi-square test is used to determine whether there is a relationship between two categorical or nominal variables. In other words, it checks to see if the two variables are independent of each other. You'll see this frequently reported in literature. This concludes component 24, healthcare and data analytics unit two, working with data. In summary, before you start doing any analysis, you must examine your data, identify any problems, and determine how to address them. We discussed descriptive statistics, pivot tables, histograms, frequency tables, and the chi-square test. Pivot tables describe the relationship between a database in an HIT system and data analysis tools. You can practice some of these data tools in the laboratory activities associated with this unit.