0:00

Hi, everyone. This lesson is about the ggplot2 package.

I'm going to talk about what it is and how to

make some basic plots with it.

In the next lecture I'll go into a little more detail

about how it's designed and how you

can extend the various ggplot2 plotting functions.

So the first question is very basic: what is ggplot2?

It's a package

in R that you can download from CRAN.

It implements what's called the

Grammar of Graphics, which was originally developed by

Leland Wilkinson and is described in his book, The Grammar of Graphics.

Now, the Grammar of Graphics is a description of how

graphics can be broken down into abstract concepts.

Think of the grammar of a language like English.

You have things like verbs, nouns, and adjectives.

So

the question is, what are the

verbs, nouns, and adjectives of a data graphic?

The Grammar of Graphics describes those basic elements

so that you can put them together to make new types of graphics.

Just as you could take a verb, a noun, and an

adjective and make a sentence that maybe no one has ever heard before,

you could take the grammar of graphics and put together various aspects

of plots to make a graphic that no one has ever seen before.

That's the basic idea.

It's a very powerful concept for organizing all kinds of data graphics.

Until recently there was no specific implementation of it

in R, but Hadley Wickham, when he was a graduate

student at Iowa State, implemented the Grammar of Graphics as an

R package called ggplot; its current implementation is called ggplot2.

1:40

So you can think of this as almost

a third graphics system in R,

even though it's built on top of the

grid graphics system, which comes with R.

It's a third mode of plotting that has become very popular.

Think of the first mode as

the base plots, using functions like plot, hist, and

boxplot, and the second mode as the lattice plots,

using xyplot and other trellis-type functions.

The third mode is ggplot2.

You get the package from CRAN using install.packages.

It installs on almost all systems; I imagine on all systems.

You can also go to the ggplot2 website, which is ggplot2.org.
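As a rough sketch of the installation step just described, assuming an internet connection and a configured CRAN mirror. The `requireNamespace` guard is an addition for illustration and is not part of the lecture:

```r
# Install ggplot2 from CRAN only if it is not already available.
# (The guard is an illustrative addition; a plain install.packages("ggplot2")
# does the same job.)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package so its plotting functions are available in this session.
library(ggplot2)
```

Installation is needed once per R library; the `library(ggplot2)` call is needed in every new session.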

2:18

The nice thing about ggplot2 is that it is based

on this grammar of graphics, so,

in a sense, there's a theory behind the graphics.

You can take this theory and

reassemble the different pieces to make new types of plots.

As Hadley Wickham says in his book, the basic idea

is that you want to shorten the distance from the mind to the page.

If you have some data that you're looking at

and you think of a way that you want to visualize that data,

you want to be able to rapidly take those

ideas and turn them into a picture on your screen.

2:51

From the ggplot2 book,

this sentence summarizes the basic idea.

The grammar tells us that

a statistical graphic is a mapping from data to aesthetic attributes

(color, shape, and size)

of geometric objects (points, lines, and bars). The plot may also

contain statistical transformations of the data and

is drawn on a specific coordinate system.

So we have a

mapping from data to aesthetics, we have geometric objects, we have statistics,

and we have a coordinate system.
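The four pieces named in that sentence can be lined up against code. The following is only a sketch: it assumes the `mpg` data set that ships with ggplot2, and the choice of columns (`displ`, `hwy`, `drv`) and of a linear smoother is made up for illustration, not taken from the lecture.

```r
library(ggplot2)

# mpg is a data frame bundled with ggplot2 (an assumption for this example).
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +      # mapping: data -> aesthetics
  geom_point(size = 2) +                                       # geometric object: points
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +    # statistical transformation
  coord_cartesian()                                            # coordinate system

print(p)
```

Each `+` adds one element of the grammar, which is what makes it possible to recombine the pieces into new kinds of plots.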

3:24

In this lecture I just want to talk about

the qplot function, which is the most basic

function and probably the best place to start for

someone who is transitioning from, say, the base plotting system.

In the base plotting system the workhorse function

is plot. qplot, which you can think of as

standing for "quick plot," is the workhorse function for

ggplot2, and it's analogous to the plot function in the base system.
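To make the analogy concrete, here is a small side-by-side sketch. It assumes the `mpg` data set bundled with ggplot2, which is used here only as an example. Recent ggplot2 releases mark `qplot()` as deprecated, so the call may emit a warning, but it still produces a plot.

```r
library(ggplot2)

# Base graphics: pass the two vectors directly to plot().
plot(mpg$displ, mpg$hwy)

# ggplot2: name the variables and say which data frame they live in.
p <- qplot(displ, hwy, data = mpg)
print(p)
```

Both calls draw a scatterplot of the same two variables; the difference is that `qplot` returns a plot object that can be stored and printed later.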

One

key difference that you have to get used to when you're using

ggplot2 is that when you make a plot and pass

data to the qplot function, you want to tell it where the

data come from, and the data should come from a data frame.

So your data have to be organized in a data frame,

and when you plot variables, those

variables are going to come from that data frame.

Now, you don't have to specify a data frame.

If you don't, the qplot function, or

any of the plotting functions, will look for the data in your workspace.

But it's generally a good idea to specify the data frame.

That way, when you read the code that generated

the plot, you know exactly where the data came from.
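A minimal sketch of the two ways of supplying data, using a small made-up data frame (`df`, `xs`, and `ys` are hypothetical names chosen for this example):

```r
library(ggplot2)

# A small made-up data frame for illustration.
df <- data.frame(x = c(1, 2, 3, 4, 5),
                 y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Preferred: the data argument records where x and y come from.
p_explicit <- qplot(x, y, data = df)

# Also works: with no data argument, the variables are looked up
# in the workspace, so the source of the data is less obvious.
xs <- df$x
ys <- df$y
p_workspace <- qplot(xs, ys)

print(p_explicit)
```

The two plots show the same points, but only the first call documents which data frame was used.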

4:30

So the data frame is

very important to organize before you start plotting.

Once you start plotting, the plots are made up of aesthetics

and geoms. Aesthetics are things like the size, shape,

and color of

points, and the geoms are

the objects that you're plotting.

Are you plotting points, are you plotting

lines, are you plotting bars, and so on.
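As an illustration of how aesthetics and geoms show up as arguments to `qplot`, here is a short sketch. It assumes the `mpg` data set bundled with ggplot2; the column choices are illustrative only.

```r
library(ggplot2)

# Aesthetics: the color and shape of each point are driven by the drv column.
p_points <- qplot(displ, hwy, data = mpg,
                  color = drv, shape = drv, geom = "point")

# Geoms: same function, different geometric object
# (bars counting the rows for each value of drv).
p_bars <- qplot(drv, data = mpg, geom = "bar")

print(p_points)
print(p_bars)
```

Changing the `geom` argument changes what is drawn, while arguments such as `color` and `shape` change how it looks.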

4:51

One aspect that's important for the qplot function, and is similarly

important when you're using lattice functions,

is the idea of using factor variables.

Factors are very important because they indicate subsets of your data.

If you imagine a data frame with a y variable, an x

variable, and a factor variable, the factor will

indicate subsets of the data in the data frame.

For example, you might have a factor that indicates gender,

so you have a bunch of

males and a bunch of females.

Those are subsets of your data, and you

might want to plot a certain relationship separately for

the various subsets, or you might want to color

5:27

certain points depending on whether they're male or female.

So the categories indicated by

factor variables can be useful for annotating a plot.

One thing that's

important about this feature is that when

you have factor variables in a data set,

you want to make sure that they're properly labeled.

It's usually not useful to label the levels of a factor

as one, two, and three, even if you have three categories.

One, two, and three are not particularly informative.

Usually you want to give them more informative labels

so that you know what those factor variables are trying to encode.
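One way to attach informative labels in base R is the `labels` argument of `factor`. The codes and label names below are invented for illustration:

```r
# Hypothetical group codes stored as 1, 2, 3.
codes <- c(1, 3, 2, 1, 2)

# Uninformative: the levels are just "1", "2", "3".
grp_raw <- factor(codes)

# Informative: map each code to a descriptive label (made-up names).
grp <- factor(codes,
              levels = c(1, 2, 3),
              labels = c("control", "low dose", "high dose"))

levels(grp)
table(grp)
```

When a factor like `grp` is used to color or split a plot, the legend and panel titles show the descriptive labels instead of bare numbers.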

6:04

Now, the qplot function is a fairly straightforward function

to use. I think it's very easy to pick up.

It hides a lot of the details of what

ggplot2 is doing underneath, which is fine for many cases.

But the ggplot function is really the core function of the system.

It's very flexible, and you can use it to do

a lot of things that qplot can't do.
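As one hedged example of the extra flexibility, the `ggplot` function lets each layer carry its own aesthetics, which is awkward to express in a single `qplot` call. The sketch assumes the `mpg` data set bundled with ggplot2, and the particular layers are chosen only for illustration:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Only the points are colored by drv ...
  geom_point(aes(color = drv)) +
  # ... while a single black trend line is fitted to all the data together.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black")

print(p)
```

Here the color mapping applies to the point layer alone, so the smoother is not split by group.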
