This tutorial was created based closely on the “Introduction” tutorial provided with the ggformula
package. If desired you can access the original tutorial by running the code below in RStudio.
unless you want to view the original tutorial! Instead, click “Next Topic” below to continue with this tutorial…
library(ggformula)
library(learnr)
learnr::run_tutorial("introduction", package = "ggformula")
The plot below shows the number of live births in the United States each day of 1978. We are going to use it to learn how to create plots using the ggformula
package.
But before we do that, take a look at the plot.
What patterns do you see in the plot?
Can you conjecture reasons for those patterns?
How might we improve the plot to test your conjectures?
To get R (or any software) to create this plot (or do anything else, really), there are two important questions you must be able to answer. Before continuing, see if you can figure out what they are.
To get R (or any software) to create this plot, there are two important questions you must be able to answer:
To make this plot, the answers to our questions are
A. Make a scatter plot (i.e., a plot consisting of points)
A. The data used for the plot:
We just need to learn how to tell R these answers.
We will provide answers to our two questions by filling in the boxes of this important template:
We just need to identify which portions of our answers go into which boxes.
It is useful to provide names for the boxes:
These names can help us remember which things go where. (The ...
indicates that there are some additional arguments we will add eventually.)
Sometimes we will add or subtract a bit from our formula. Here are some other forms we will eventually see.
# simpler version
goal( ~ x, data = mydata )
# fancier version
goal( y ~ x | z , data = mydata )
# unified version
goal( formula , data = mydata )
box | fill in with | purpose |
---|---|---|
goal
|
gf_point
|
plot some points |
y
|
births
|
y-axis variable |
x
|
date
|
x-axis variable |
mydata
|
Births1978
|
name of data set |
Put each piece in its place in the template below and then run the code to create the plot.
goal(y ~ x, data = mydata)
If you get an “object not found” or “could not find function” error message, that indicates that you have not correctly filled in one of the four boxes from the template.
Note: R is case sensitive, so watch your capitalization.
For the record, here are the first few rows of Births1978
.
The most distinctive feature of ggformula
plots is the use of formulas to describe the positional information of a plot. Formulas in R always involve the tilde character, which is easy to overlook. It looks like this:
Make sure you know where the tilde is located on your computer’s keyboard! It is often near the upper left-hand corner on American keyboards.
Most gf_
functions take a formula that describes the positional attributes of the plot. Using one of these functions with no arguments will show you the “shape” of the formula it requires.
Run this code to see the formula shape for gf_point()
.
gf_point()
You should see that gf_point()
’s formula has the shape y ~ x
, so the y-variable name goes before the tilde and the x-variable name goes after. (Think: “y depends on x”. Also note that the y-axis label appears farther left than the x-axis label.)
Reverse the roles of the variables, changing births ~ date
to date ~ births
. How does the plot change?
gf_point(births ~ date, data = Births1978)
There is a reason that key is biggest – you should use it a lot!.
R is not very picky about spaces.
My advice is to use spaces liberally. Even if R doesn’t care, it makes your code easier for people to read.
Always put at space around things like +
, *
, ~
etc. (This is a place where R doesn’t care whether you have a space or not, but I recommend you do.)
Feel free to use extra spaces to “line things up” to make them easier to read.
Mimic the examples you see in this tutorial.
So far, we have created scatter plots using the function gf_point()
. But there are many other gf_
functions that create different types of plots. There are also helper functions that can customize axis labels, make multi-panel plots, and more. The following sections will help you explore some of these gf_
functions.
Before getting started, just to get an idea of what is included in the ggformula package, run the code below to get a list of all the gf_
functions that exist (don’t worry - we won’t cover them all today!):
# list all functions starting gf_
apropos("gf_")
In addition to a traditional scatterplot, there are other ways to represent the relationship between two quantitative variables. To make a differnt kind of plot, we simply replace gf_point()
with a different goal.
Replace the second gf_point()
with one of the following and see what happens:
gf_line()
– connect the points with line segments (but don’t draw the points)gf_smooth()
– a “smoother”gf_spline()
– a different kind of smoothergf_lm()
– lm stands for “linear model”Which do you like best? Why?
gf_point(births ~ date, data = Births1978)
gf_point(births ~ date, data = Births1978)
Note: You can create multiple plots in the same code box, just put each one on a separate line. Then you can see multiple plots at once and compare them.
Want to see two (or more) plots one “on top of” the other? Just add %>%
at the end of one plot, and the next will be layered on top of it.
gf_point(births ~ date, data = Births1978) %>%
gf_line(births ~ date, data = Births1978)
Change gf_line()
to gf_smooth()
.
Try adding a third layer.
Color can be a very important elemnt of a plot.
We can use color just to make our plots more beautiful (or to match our favorite color)
We can use color to communicate more about the data.
We can adjust the colors (and several othre plot attributes) using the ...
part of our template.
The general form for things in ...
is attribute = value
.
For example,
color = "red"
or fill = "navy"
(note quotes) can be used to change the colors of things. (fill
is typically used for regions that are “filled in” and color
for dots and lines.)
alpha = 0.5
(or any number between 0 and 1) will set the opacity for the objects being plotted (0 is completely transparent and 1 is completely opaque).
Here are some examples. Adjust color and alpha options to see how things change. We’ve inserted some line breaks to make the options easier to locate in the code.
gf_point(births ~ date, data = Births1978,
color = "navy")
gf_point(births ~ date, data = Births1978,
alpha=0.5, color = "red")
gf_point(births ~ date, data = Births1978,
size = 3, shape = 18,
alpha = 0.25, color = "purple")
You can learn about the options available for a given plot layer using the “quick help” for a gf_
plotting function. You can find out more by reading the detailed help file produced with ?
.
The code below will display the “quick help”. Have a look, then change it to use ?
and get the full help file.
# "quick help" for gf_point()
gf_point()
Curious to know all the available color names? Run this code.
colors()
A guide to R colors is also available online at http://research.stowers.org/mcm/efg/R/Color/Chart/ColorChart.pdf, if you want to preview which color goes with which name.
The births data in 1978 contains two clear “waves” of dots. One conjecture is that these are weekdays and weekends. We can test this conjecture by putting different days in different colors.
In the lingo of ggformula
, we need to map color to the variable wday
. Mapping and setting attributes are different in an important way.
color = "navy"
sets the color to “navy”. All the dots will be navy.
color = ~ wday
maps color to wday
. This means that the color will depend on the values of wday
. A legend (aka, a guide) will be automatically included to show us which days are which.
Change the color argument so that it maps color to wday
.
~
).gf_point(births ~ date, data = Births1978, color = "navy")
fill
and color
work for other types of plot too. Experiment.
gf_line(births ~ date, data = Births1978, color = ~ wday)
Time for some new data.
The NHANES
dataset contains measurements from 10,000 human subjects in the National Health and Nutrition Evaluation Survey. To learn more about the data, try one or more of these:
?NHANES
names(NHANES)
glimpse(NHANES)
inspect(NHANES)
?NHANES
Because NHANES
has so many more (and more kinds of) variables, we can create even more kinds of plots with NHANES
.
A histogram can help you visualize the distribution of one quantitative variable.
Histograms are bit different from scatter plots:
Instead of using gf_point()
, we will use gf_histogram()
. (You probably guessed that.)
We only need to tell R about one variable (the x-variable). R will calculate the information for the y-variable for us!
So we use a formula with only one variable in it: ~ x
. (Notice that x
goes on the right side.)
Try the example below as-is by hitting “Run Code”.
Now experiment with plotting different quantitative variables from the NHANES
dataset. (Remember you can use ?
or names(NHANES)
or glimpse(NHANES)
to get the variable names and more information.)
You can change the size of the bins using either bins
(the number of bins) or binwidth
(the width of the bins). Experiment with different bins. (If you don’t provide bins
or binwidth
information, R will just make something up. You can usually do better if you take control.)
To get density instead of counts on the y-axis, switch from function gf_histogram()
to gf_dhistogram()
.
gf_histogram( ~ BPSysAve, data = NHANES, bins = 30)
A density plot is a smoothed contour tracing the shape of a dataset’s distribution. The gf_density()
and gf_dens()
functions produce these plots (in slightly different ways). To compare with a histogram apples-to-apples, we will use a density histogram (gf_dhistogram()
).
gf_dhistogram( ~ BPSysAve, data = NHANES)
gf_dens( ~ BPSysAve, data = NHANES)
Use %>%
to overlay the two layers.
Use size = 2
or size = 3
to make the density curve thicker.
Change gf_dens()
to gf_density()
.
Can you make the density (use gf_density()
) plot a different color and partly transparent? That will make it easier to see the histogram.
Replace BPSysAve
with a different quantitative variable.
Boxplots and violin plots provde a quick comparison of the distribution of a quantitative variable in different groups.
Run the code below to see how this works (You’re plotting systolic blood pressure for seven groups who watch different amounts of TV each day).
Change gf_boxplot()
to gf_violin()
and see what happens.
Try changing the order of x
and y
in the formula – what happens?
If you are interested, try different variables. Remember, y
must be quantitative (that is what you’ll make boxplots of) and x
must be categorical (that defines the groups).
Change gf_boxplot()
to gf_point()
. What happens? Is this a useful plot?
Now try gf_jitter()
. Can you figure out what this does? (You might like to set alpha
to a smallish number.)
gf_boxplot(BPSysAve ~ TVHrsDay, data = NHANES)
Note: If you haven’t seen these plots before, we will talk more about what they are later.
Both gf_boxplot()
and gf_violin()
require a two-sided formula, so if you want to plot just one quantitative variable you have to replace the \(x\) in the formula with empty quotes “”.
gf_boxplot( BPSysAve ~ "", data = NHANES)
Boxplots are usually most useful to compare the distribution of a variable between several groups. (If we have only one variable, we can use histograms or density plots instead.)
Bar graphs help visualize the distribution of a categorical variable, and we can create them with gf_bar()
.
gf_bar( ~ TVHrsDay, data = NHANES)
gf_bar( ~ TVHrsDay, data = NHANES)
What if you want to show a set of proportions for each of several groups? For example, the NHANES data also includes a variable Marijuana
that indicates whether the person has used marijuana. Is there a relationship?
gf_bar( ~ TVHrsDay | Marijuana, data = NHANES)
What if, instead of one figure panel per group, you want to see a stacked bar graph for the same data? Here’s an example. You use the input
fill= ~ variable_name
to specify the name of the variable that defines the groups (here, Marijuana
).gf_bar( ~ TVHrsDay, fill = ~ Marijuana, data = NHANES)
What if, instead of stacked bars, you want side-by-side bars? Simply add the additional argument
position='dodge'
.gf_bar( ~ TVHrsDay, fill = ~ Marijuana, data = NHANES, position = 'dodge')
Try to make a similar plot to one or more of the bar graph variations shown above, but using a variable other than Marijuana
to specify the groups. First, we will look at the head()
of the dataset to see what variables are there - then it’s up to you to choose one to plot, and modify the code above to make the figure.
head(NHANES)
If we want to look at all 20 years of birth data, overlaying the data is likely to put too much information in too little space and make it hard to tell which data is from which year. (Even with good color and symbol choices, deciphering 20 colors or 20 shapes is hard.) Instead, we can put each year in separate facet or sub-plot. By default the coordinate systems will be shared across the facets which can make comparisons across facets easier.
There are two ways to create facets. The simplest way is to add a vertical bar |
to our formula.
gf_point(births ~ day_of_year | year, data = Births, size = 0.5)
The second way is to add on a facet command using %>%
:
gf_point(births ~ day_of_year, data = Births, size = 0.5) %>%
gf_facet_wrap( ~ year)
Edit the plot below to:
wday
)year
gf_point(births ~ day_of_year, data = Births,
size = 0.5, color = "blue")
gf_point(births ~ day_of_year | wday, data = Births,
size = 0.5, color = ~ year)
The faceting we did on the previous page is called facet wrapping. If the facets don’t fit nicely in one row, the facets continue onto additional rows.
A facet grid uses rows, or columns, or both in a fixed way.
Notice that after the |
there is now a formula instead of a single variable. Try generating the plot below - can you figure out what the formula does? If you need a hint, try changing year ~ wday
to wday ~ year
and see what happens…
gf_point(births ~ day_of_year | year ~ wday, data = Births, size = 0.5)
Hopefully, you figured out that the facet grid formula is interpreted as “row variable ~ column variable” – the resulting plot will have one row of facets for every value of the first variable, and one column of facets for every value of the second variable.
Recreate the plot below using gf_facet_grid()
. This works much like gf_facet_wrap()
and accepts a formula with one of three shapes
y ~ x
(facets along both axes)~ x
(facets only along x-axis)y ~ .
(facets only along y-axis; notice the important dot in this one)(These three formula shapes can also be used on the right side of |
.)
gf_bar( ~ TVHrsDay | Marijuana ~ Gender, fill = ~ Marijuana, data = NHANES)
Feel free to try some other variables or other types of plots.
If you got this far, maybe it’s time to explore on your own! Here are three data sets you can use.
HELPrct
has data from a study of people addicted to alcohol, cocaine, or heroineKidsFeet
has information about some kids’ feet.NHANES
has lots of physiologic and other measurements from 10,000 subjects in the National Health and Nutrition Evaluation Survey.To find out more about the data sets use ?HELPrct
, ?KidsFeet
, or ?NHANES
. To see the first few rows of the data, you can use head()
.
To get a list of functions available in ggformula
, run this code chunk.
# list all functions starting gf_
apropos("gf_")
Make some plots to explore one or more of these data sets.
gf_point(length ~ width, data = KidsFeet)
head(KidsFeet)
?KidsFeet
Post your favorite plot here by making a new slide in this google presentation:
If you like you plots “just so”, keep reading to learn some ways to customize your plots. Then come back and tweak your plots.
You may be wondering how to have more control over things like:
As you can imagine, all of these things can be adjusted pretty much however you like. We will cover a few of the most common options here.
One of the most common customizations you will want to make to your plots will be to change the title, subtitle, and axis labels (and maybe add a caption). All these things can be done by chaining (%>%
) the function gf_labs()
with a plot layer.
Check out the example below, and try changing the text labels to ones that make sense to you. Note that all the input arguments to gf_labs
are optional. So, for example, you could alter only the x-axis label by chaining the command gf_labs(x='My X Axis Label')
with your plot.
gf_bar(~Race1, data=NHANES) %>%
gf_labs(title = "Race in NHANES Data",
subtitle = "(2009-2012)",
caption = "These data were collected by the US National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys since the early 1960's.",
x = "", # empty quotes here results in no x-axis label!
y = "Number of Observations"
)
One way to zoom in (or out) is to filter the dataset so that only the data you wish to plot is included.
You can also use gf_lims()
to set custom x and y axis limits.
gf_point(births ~ day_of_year, data = Births1978) %>%
gf_lims(x = c(100, 200), y = c(8000, 9000))
A number of predefined themes exist that control the appearance of non-data elements of plots. In this tutorial, we set the default theme using
theme_set(theme_bw())
Other themes include theme_minimal()
, theme_classic()
, theme_gray()
, theme_light()
, theme_map()
, and theme_quickmap()
. The ggthemes package includes some additional themes, including theme_economist()
, theme_economist_white()
, theme_excel()
, theme_fivethirtyeight()
, theme_stata()
, theme_tufte()
, and theme_wsj()
. Many of these themes mimic the look of other software packages or of popular publications.
The theme can be set for an individual plot using gf_theme()
– for example:
gf_bar(~ Race1, data = NHANES) %>%
gf_theme(theme = theme_wsj())
Choose some different themes in place of theme_wsj()
and see how the resulting plot changes.
gf_bar(~ Race1, data = NHANES) %>%
gf_theme(theme = theme_wsj())
Themes have optional arguments. One of the most useful is base_size
, which controls the base font size (in points). (base_size
is an input argument to the theme, so it goes inside the parentheses after the theme name… ) Try adjusting the font size and theme in the figure below - what theme and size do you think are optimal?
gf_bar(~ Race1, data = NHANES) %>%
gf_theme(theme = theme_wsj(base_size = 6))
Numerical summaries can be obtained using the same template we used for plots, so there is almost nothing to learn.
Look how similar these are. (I’ve inserted some spaces to make it even more obvious.) The first of these creates side by side boxplots. The second does side-by-side means.
gf_boxplot(BPSysAve ~ TVHrsDay, data = NHANES)
mean(BPSysAve ~ TVHrsDay, data = NHANES)
Try replaceing mean()
with some other functions:
median()
max()
min()
sd()
(standard deviation)iqr()
(interquartile range)If we only have one variable, we can mimic histograms or barplots.
gf_histogram( ~ BPSysAve, data = NHANES)
mean( ~ BPSysAve, data = NHANES)
tally()
is a useful function for tallying a categorical variable. (It can handle two categorical variables, too.)
gf_bar( ~ Race1, data = NHANES)
tally( ~ Race1, data = NHANES)
The df_stats()
function allows us to compute multiple summaries simultaneously – great for all you multi-taskers out there. (It also saves paper if you are printing your work.)
df_stats( ~ BPSysAve, data = NHANES, mean, median)
Add on max
, and min
to the list of summaries in the example above.
Now delete all the summaries (get rid of mean
, median
, etc.) to see what the default summaries are. (It’s a pretty useful set of summaries.)
Now create a boxplot on a separate line using the same formula and data. Do you see how where the values corresponding to the edges of the boxes show up in the default summary of df_stats()
?
The counts()
and props()
summaries compute counts and proportions for categorical variables.
df_stats( ~ Race1, data = NHANES, counts)
df_stats( SmokeNow ~ Race1, data = NHANES, counts)
Change counts
to props
to see what happens.
What happens if you switch the order of SmokeNow
and Race1
with counts
?
What happens if you switch the order of SmokeNow
and Race1
with props
?
Can we get counts and proportions at the same time? Try it.
Not quite as good as Frodo’s ring, but our template is pretty powerful.
We can use it for all sorts of numerical and graphical summaries of data. (And later in the course we will use it statistical modeling as well.)