Two other types of plots that are very useful for displaying the distribution of the data are the box plot and the violin plot. The box plot describes the range of the data by dividing the data into four quarters, each holding 25 percent of the values. Those quarters are then drawn on a chart along with some additional information. If we look at the slide, you can see that we've taken the sale prices from our earlier fast food example, and the box has been drawn from the first quartile, Q1, to the third quartile, Q3. The distance between Q1 and Q3 is what we call the interquartile range. The interquartile range is where the middle 50 percent of the data values live. 25 percent of the values fall below Q1, in the first quarter, and 25 percent fall above Q3, in the fourth quarter. So within the interquartile range, we are basically looking at the second and third quarters combined.
From this, we can determine what the overall spread of the data might be. If we've got a very narrow interquartile range, that tends to indicate the data is quite tightly centered around the median, the middle value. Whereas if the interquartile range is quite large, then the data is more spread out.
Additionally, there is a whisker drawn from each end of the box out to the min and max lines. We've got a whisker here and a whisker here, and these minimum and maximum values are not the lowest and highest values in the data set. Instead, they are calculated values that help us to more easily identify any outliers. In order to calculate the minimum value, we take the value of Q1 and subtract from it 1.5 times the interquartile range. Whatever the width of that interquartile range is, we multiply it by 1.5 and subtract the result from Q1, and anything that is less than that value would be considered an outlier.
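To make the arithmetic concrete, here is a minimal sketch in Python using only the standard library. The sale prices are made-up illustration values, not the data from the slide, and the quartile method ("inclusive") is an assumption, since different tools compute quartiles slightly differently. The upper bound mirrors the lower one: Q3 plus 1.5 times the interquartile range.

```python
from statistics import quantiles

# Hypothetical sale prices, for illustration only (not the slide's data).
prices = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 19.0]

# quantiles(..., n=4) returns the three cut points: Q1, the median, and Q3.
# method="inclusive" is an assumption; other quartile conventions exist.
q1, median, q3 = quantiles(prices, n=4, method="inclusive")

iqr = q3 - q1                  # interquartile range: the middle 50 percent
lower_bound = q1 - 1.5 * iqr   # the calculated "minimum" line
upper_bound = q3 + 1.5 * iqr   # the calculated "maximum" line

# Anything outside the two bounds would be flagged as an outlier.
outliers = [p for p in prices if p < lower_bound or p > upper_bound]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"bounds=({lower_bound}, {upper_bound}), outliers={outliers}")
```

With these values the single very high price lands above the upper bound and is the only point flagged.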
Also, to the right, we have the maximum value. Here we take the value of Q3 and add 1.5 times the interquartile range to it. That gives us the maximum value, and anything above that maximum would be considered an outlier.
One of the key advantages of the box plot, compared with looking at a histogram, is that the box plot makes outliers more obvious. It's very good for that type of analysis, and it's very good for comparing summary statistics between different data sets, because it gives us the median of the data set, it gives us the interquartile range with Q1 and Q3, and it gives us the min and max values that help us to identify the outliers as well.
But what the box plot does not do is show the density of the data at each particular value. In other words, how many values were at this point, or how many values were at this point, is not actually reflected. The thickness of the box effectively has no meaning. The dimension along the value axis has meaning, but the other one, the height in this case, has none. These box plots can be drawn horizontally or vertically, it doesn't matter. But understand that they're simply showing summary statistics of the distribution and not the density of the values at each data value. They also have no way of illustrating whether the data is unimodal or whether it might be bimodal or multimodal. That's the weakness, you could say, of the box plot.
Now, the violin plot helps to overcome that weakness. With the violin plot, we're not just plotting the summary statistics like we do with the box plot. The violin plot also includes the probability density.
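The weakness of the box plot can be illustrated with a small sketch. The two data sets below are made up for this purpose: one has a single peak and the other has two, yet both produce exactly the same minimum, Q1, median, Q3, and maximum, so a box plot built from those numbers would look the same for both. The helper name `five_number_summary` is hypothetical.

```python
from statistics import multimode, quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot is built from."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Made-up data: one peak around 5 versus two peaks at 4 and 6.
one_peak = [1, 3, 4, 5, 5, 5, 6, 7, 9]
two_peaks = [1, 4, 4, 4, 5, 6, 6, 6, 9]

print(five_number_summary(one_peak))
print(five_number_summary(two_peaks))

# The most frequent values differ, but the summaries above cannot show it.
print(multimode(one_peak), multimode(two_peaks))
```

The summaries match even though the shapes of the two data sets are quite different, which is the information the density in a violin plot is meant to recover.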
The violin plot builds on the box plot. You can see on the diagram that within the violin plot there actually is a box plot, where we have the box, the whiskers on the box, and the median value, and of course the end of each of those whiskers represents the minimum and maximum values. We still have all the same summary statistics that we would have had with a box plot, but now we're also showing the probability density of where the different values would probably fall.
This is done using what's called kernel density estimation. What this does is basically take something like a histogram of the data and smooth its shape out into a curve, and those smooth curves are then drawn as the sides of the violin plot. Now, really you could show the violin plot with just one side if you wanted. But in order to make it read better, the information is mirrored on both sides of the axis of the plot. By doing this, we can actually see where there is a higher probability of the data occurring and where there is a lower probability of the data occurring.
Now, just because the plot makes it look like some of the sale prices could be negative, that doesn't mean we're going to be paying people to take our food. In fact, there are settings in the plotting library that cause the plot to cut off the values below zero. Remember, this is not saying that there are negative values in the data, just that the smoothed estimate assigns some probability to values in that region.
Beyond showing where the values are more or less probable, the advantage of the violin plot is that if the data distribution happens to be bimodal, you'll be able to see that. For example, you might have a second lump here in the shape of this violin plot, and now of course we've got one mode here and one mode here.
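As a rough sketch of how the smooth sides of a violin can be produced, the following implements a basic Gaussian kernel density estimate by hand. The prices, the bandwidth of 0.8, and the names `gaussian_kde`, `outline`, and `clipped` are all assumptions chosen for illustration. Plotting libraries pick the bandwidth automatically and may use different defaults.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x.

    Each data point contributes a small bell curve of width `bandwidth`,
    and the estimate is the average of those curves.
    """
    n = len(data)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / norm

    return density


# Made-up bimodal sale prices: one cluster near 1.5 and one near 6.5.
prices = [1.0, 1.2, 1.5, 1.8, 2.0, 6.0, 6.3, 6.5, 6.8, 7.0]
density = gaussian_kde(prices, bandwidth=0.8)  # bandwidth is an assumption

# Evaluate on a grid from -2.0 to 10.0 and mirror the curve on both sides
# of the axis, which gives the violin its symmetric outline.
grid = [i * 0.5 for i in range(-4, 21)]
outline = [(x, -density(x), density(x)) for x in grid]

# One simple way to avoid showing impossible negative prices is to keep
# only the part of the outline at or above zero.
clipped = [row for row in outline if row[0] >= 0]

for x in (-0.5, 1.5, 4.0, 6.5):
    print(f"x={x:5.1f}  density={density(x):.4f}")
```

The estimate is small but positive just below zero even though every price is positive, and it dips between the two clusters, which is the two-lump shape that signals a bimodal distribution.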
That is something you can see on the violin plot, but it would not be visible if you simply used the box plot. So one of the big advantages of the violin plot is that you can actually see the density distribution of the data, and this helps to reveal things like whether the data is multimodal or not.
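To compare the two plots side by side, here is a minimal sketch that assumes matplotlib is installed. The data is randomly generated bimodal data, not the sale prices from the slide. Note that matplotlib's `violinplot` draws median and extrema lines rather than a full inner box, so the result will look a little different from the diagram described above.

```python
import random

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

# Made-up bimodal data: two clusters of prices, seeded for repeatability.
rng = random.Random(0)
prices = [rng.gauss(3.0, 0.5) for _ in range(200)]
prices += [rng.gauss(7.0, 0.7) for _ in range(200)]

fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True, figsize=(8, 4))

# The box plot shows only the summary statistics.
box_parts = ax_box.boxplot(prices)
ax_box.set_title("Box plot")
ax_box.set_ylabel("Sale price")

# The violin plot adds the mirrored density estimate.
violin_parts = ax_violin.violinplot(prices, showmedians=True)
ax_violin.set_title("Violin plot")

# Call fig.savefig("box_vs_violin.png") or plt.show() to view the figure.
```

On data like this, the box plot shows one wide box, while the violin shows two distinct lumps.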