library(palmerpenguins)
library(ggplot2)
penguins <- penguins
Data Visualization
ggplot2
Data and Aesthetic Layers
Geometries Layer
Facets and Statistics Layer
Coordinates Layer
Theme Layer
Common Types of Plots
Data visualization is an essential skill for anyone working with data, and it combines design principles with statistical understanding. In general, there are two purposes for visualizing data graphically:
In this class, we will learn about the fundamentals of data visualization using the ggplot2 package. This is by far the most popular package for data visualization in R.
You have already seen and used ggplot2 in previous classes, but now we will cover how to actually use this package.
The elements for creating a ggplot were largely inspired by the work of Leland Wilkinson (Grammar of Graphics, 1999), who formalized two main principles in plotting data:
In this framework, the essential grammatical elements required to create any data visualization are:
We will use a data set from the palmerpenguins package.
Go ahead and load the palmerpenguins and ggplot2 packages using library().
Additionally, let’s make the penguins data set that is loaded with palmerpenguins visible in the environment by explicitly assigning it to an object.
The Data Layer specifies the data object that is being plotted.
It is the first grammatical element that is required to create a plot:
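A minimal sketch of the data layer on its own, assuming the palmerpenguins and ggplot2 packages are installed. With only a data object supplied, ggplot() has nothing to draw yet, so the result is a blank panel. The object name p is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Data layer only: tell ggplot2 which data frame is being plotted
p <- ggplot(data = penguins)

# Printing the object renders a blank panel (no aesthetics or geoms yet)
p
```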
The next grammatical element is the aesthetic layer, or aes for short. This layer specifies how we want to map our data onto the scales of the plot.
The aesthetic layer maps variables in our data onto scales in our graphical visualization, such as the x and y coordinates. In ggplot2 the aesthetic layer is specified using the aes() function.
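As a rough illustration of the aesthetic layer, the sketch below maps two penguins variables onto the x and y positions with aes(). The object name p is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Map bill length to the x-axis and flipper length to the y-axis
p <- ggplot(data = penguins,
            mapping = aes(x = bill_length_mm, y = flipper_length_mm))

# Axes and scales now appear, but no data are drawn because there is no geom yet
p
```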
You can see we went from a blank box to a graph with the variable and scales of bill_length_mm mapped onto the x-axis and flipper_length_mm on the y-axis.
The aesthetic layer also maps variables in our data to other elements in our graphical visualization, such as color, size, fill, etc.
These other elements are useful for adding a third variable onto our graphical visualizations. For instance, we can add the variable of species by mapping species onto the color aesthetic.
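A short sketch of adding species as a third variable through the color aesthetic. Because no geometry has been added at this point, the colors only become visible once a geom is included; the object name p is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# species is mapped to color in addition to the x and y positions
p <- ggplot(penguins,
            aes(x = bill_length_mm, y = flipper_length_mm, color = species))
p
```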
The next essential grammatical element for graphical visualization is the geometries layer or geom for short. This layer specifies the visual elements that should be used to plot the actual data.
There are a lot of different types of geoms to use in ggplot2. Some of the most commonly used ones are:
Points or jittered points: geom_point() or geom_jitter()
Lines: geom_line()
Bars: geom_bar()
Violin: geom_violin()
Error bars: geom_errorbar() or geom_ribbon()
For a full list see the ggplot2 documentation
For now, let’s demonstrate this using geom_point(). We will create what is called a scatterplot: a plot of the individual data points for two continuous variables.
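A minimal scatterplot sketch using geom_point(), with species mapped to color in the main aes() call. The object name p is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Data + aesthetics + geometry: one point per penguin
p <- ggplot(penguins,
            aes(x = bill_length_mm, y = flipper_length_mm, color = species)) +
  geom_point()

# Rows with missing measurements are dropped with a warning when the plot is drawn
p
```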
You can also specify the color = species aesthetic mapping on the geometric layer.
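The same scatterplot can be sketched with the color mapping placed inside the geom instead of the main ggplot() call. In this form the mapping applies only to that one layer. The object name p is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# x and y are shared by all layers; color is mapped only within geom_point()
p <- ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point(mapping = aes(color = species))
p
```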
Note
Note that ggplot2 uses a special notation that is similar to the pipe operator |> seen in previous classes, except that in ggplot2 you connect layers with a plus sign +.
Besides mapping variables in your data to certain aesthetics, you can change the aesthetic properties of geometrical elements. Common aesthetic properties include:
Color: applies to most geoms; geom_(color = )
Fill: applies to most geoms; geom_(fill = )
Transparency: applies to most geoms; values range from 0 (transparent) to 1 (opaque); geom_(alpha = )
Shape: geom_point(shape = ), geom_jitter(shape = )
Size: geom_point(size = ), geom_jitter(size = )
Line Type: geom_line(linetype = )
Line Width: geom_line(linewidth = )
Width: geom_errorbar(width = ), geom_jitter(width = )
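The properties listed above are set outside of aes(), so they apply to every point rather than varying with the data. A small sketch, with arbitrarily chosen values:

```r
library(ggplot2)
library(palmerpenguins)

# Fixed aesthetic properties: same color, transparency, size, and shape for all points
p <- ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point(color = "steelblue", alpha = 0.5, size = 3, shape = 17)
p
```

Shape 17 is a filled triangle; the specific values here are illustrative choices, not recommendations.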
Besides the data, aesthetics, and geometries layers, there are often other types of elements you may want to include.
The facets layer allows you to create panels of subplots within the same graphic object
The previous three layers are the essential layers. The facet layer is not essential, but it can be useful when you want to communicate the relationship among 4 or more variables.
Let’s create a facet layer of our scatterplot with different panels for sex
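A sketch of the faceted scatterplot using facet_wrap(), one panel per level of sex. Note that sex has missing values in the penguins data, so an NA panel is expected as well. The object name p is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# One panel per value of sex; species is still shown through color
p <- ggplot(penguins,
            aes(x = bill_length_mm, y = flipper_length_mm, color = species)) +
  geom_point() +
  facet_wrap(~ sex)
p
```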
See the ggplot2 documentation on facet_grid and facet_wrap
The statistics layer allows you to plot aggregated statistical values calculated from your data
The statistics layer is used in combination with a geom to plot values that are a function (e.g., mean) of the values in your data. The two main stat functions are:
geom_(stat = "summary")
stat_smooth()
Statistics are often evaluated at the aggregate level (e.g., think mean differences between groups). We can calculate summary statistics inside of the geom functions using stat = "summary". There are two main arguments you need to specify:
stat: set this to "summary" to calculate a summary statistic
fun: the function used to calculate an aggregated summary statistic. Functions like mean, sum, min, max, and sd can all be specified. You can then specify additional arguments that should be passed into these functions as regular arguments in geom_()
For example, we can plot the average (mean) body_mass_g for each species.
Using multiple geoms you can plot both the raw values for each individual penguin and the summary statistic
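The two ideas above can be sketched as follows: first the mean alone, then the raw values with the mean drawn on top. The object names and the styling values (jitter width, transparency, color, size) are illustrative assumptions.

```r
library(ggplot2)
library(palmerpenguins)

# Mean body mass per species, drawn as one point per species.
# na.rm = TRUE silently drops penguins with a missing body mass.
p_mean <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE)

# Raw values (jittered) plus the species mean as a larger red point
p_both <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_jitter(width = 0.1, alpha = 0.3, na.rm = TRUE) +
  geom_point(stat = "summary", fun = mean,
             color = "red", size = 4, na.rm = TRUE)

p_mean
p_both
```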
The fun = argument returns only a single summary statistic value (e.g., a mean). However, some geoms actually require two values. For instance, when plotting error bars you will need both ymin and ymax values returned. For these types of cases, you need to use the fun.data argument instead:
mean_cl_normal is a function to calculate 95% confidence limits from your data (it requires the Hmisc package to be installed).
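A sketch of fun.data in use. To avoid the extra Hmisc dependency, this example swaps in mean_se(), which ships with ggplot2 and returns the mean plus and minus one standard error; replacing it with mean_cl_normal would give 95% confidence limits instead. The object name p is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# fun returns one value (the mean); fun.data returns y, ymin, and ymax
p <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE) +
  geom_errorbar(stat = "summary", fun.data = mean_se,
                width = 0.1, na.rm = TRUE)
p
```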
stat_smooth(method = "lm") is used in scatterplots to plot the regression line on your data.
You can add separate regression lines if other variables are mapped to aesthetics and/or are wrapped in different facets
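Both cases can be sketched as below: a single regression line for all penguins, then separate lines per species within panels for each sex. The choice of variables and the object names are illustrative; formula = y ~ x is spelled out only to make the fitted model explicit.

```r
library(ggplot2)
library(palmerpenguins)

# One regression line fitted to all penguins
p_single <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +
  stat_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# Mapping species to color and faceting by sex gives one line
# per species within each panel
p_groups <- ggplot(penguins,
                   aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point(na.rm = TRUE) +
  stat_smooth(method = "lm", formula = y ~ x, na.rm = TRUE) +
  facet_wrap(~ sex)

p_single
p_groups
```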
The coordinate layer allows you to adjust the x and y coordinates
There are two main groups of functions that are useful for adjusting the x and y coordinates.
coord_cartesian() for adjusting the axis limits (zoom in and out)
scale_x_ and scale_y_ for setting the axis ticks and labels
You can adjust limits (min and max) of the x and y axes using coord_cartesian(xlim = c(), ylim = c())
Important
If you want to compare two separate graphs, then they need to be on the same scale. This is an important design principle in graphical visualization.
library(dplyr)  # provides filter() for data frames

# Split the data by sex so each group gets its own plot
male <- filter(penguins, sex == "male")
female <- filter(penguins, sex == "female")

# Both plots use the same y-axis limits so they can be compared directly
p1 <- ggplot(male, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  labs(title = "male")

p2 <- ggplot(female, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  labs(title = "female")
You can adjust the scale (major and minor ticks) of the x and y axes using the scale_x_ and scale_y_ sets of functions. The two main sets of functions to know are for continuous and discrete scales:
continuous: scale_x_continuous(breaks = seq()) and scale_y_continuous(breaks = seq())
discrete: scale_x_discrete(breaks = c()) and scale_y_discrete(breaks = c())
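A short sketch of both kinds of scale function on one plot. The break values and the choice of which species to label are arbitrary assumptions for illustration: the continuous y-axis gets a tick every 500 g, and the discrete x-axis shows tick labels only for the listed categories.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE) +
  coord_cartesian(ylim = c(3000, 6000)) +
  # Continuous scale: numeric breaks, here one tick every 500 g
  scale_y_continuous(breaks = seq(3000, 6000, by = 500)) +
  # Discrete scale: breaks are category names given as a character vector
  scale_x_discrete(breaks = c("Adelie", "Gentoo"))
p
```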
The theme layer refers to visual elements that are not mapped to the data but control the overall design, colors, and labels on the plot
There are four main sets of functions that we can use to control the theme layer:
Color: the scale_color_ set of functions will change the color scheme of the geometric elements
Labels: labs() is a convenient function for labeling the title, subtitle, axes, and legend
Theme templates: there are predefined theme templates that come with ggplot2
Other theme elements: theme() can be used to further customize the look of your plot
The RColorBrewer package offers several color palettes for R.
Also, check out the ggsci color palettes inspired by scientific journals, science fiction movies, and TV shows.
You can access these palettes using scale_color_brewer(palette = "palette name")
ggplot(penguins, aes(species, body_mass_g, color = sex)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500)) +
  scale_color_brewer(palette = "Set1")
Changing labels and adding titles is easy using labs()
ggplot(penguins, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)")
To change labels for legends, you need to refer to the aesthetic mapping that was defined in aes() (e.g., color, shape).
adelie <- filter(penguins, species != "Gentoo")
ggplot(adelie, aes(species, body_mass_g, color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1")
Here are some themes that come loaded with ggplot2
theme_bw()
theme_light()
theme_dark()
theme_minimal()
theme_classic()
theme_void()
Using a theme template is straightforward
adelie <- filter(penguins, species != "Gentoo")
ggplot(adelie, aes(species, body_mass_g, color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1") +
  theme_classic()
In addition to using a pre-defined theme template, you may also want to tweak other design elements on your plot. You can do this using theme()
adelie <- filter(penguins, species != "Gentoo")
ggplot(adelie, aes(species, body_mass_g, color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1") +
  theme_classic() +
  theme(legend.title = element_text(face = "bold"))
Here is a list of different elements you can change. They are organized into text, line, and rectangle elements:
Text, line, and rectangle elements each have a corresponding element function: element_text(), element_line(), and element_rect().
There are a lot of different theme elements you can tweak, and it is hard to memorize them all. Make use of Google, the ggplot2 documentation, and generative AI tools for assistance.
Often you may want to apply the same customized theme elements to multiple plots and even across multiple projects.
One convenient way of doing so is to use theme_set().
theme_set() will automatically apply the same theme settings across all ggplots created in a document.
For instance, if you want to make sure all your ggplots have a bolded legend title and use theme_classic(), you can create a theme to do that:
Then you need to set the theme that will be applied across all ggplots
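A minimal sketch of both steps, creating the theme and then setting it. The object names my_theme and old_theme are hypothetical and chosen only for this example.

```r
library(ggplot2)

# Step 1: build a theme object (hypothetical name) that combines
# theme_classic() with a bold legend title
my_theme <- theme_classic() +
  theme(legend.title = element_text(face = "bold"))

# Step 2: make it the default for all plots created from here on.
# theme_set() returns the previously active theme, saved here as old_theme.
old_theme <- theme_set(my_theme)
```

Keeping the returned value makes it possible to restore the earlier default later with theme_set(old_theme).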
Now any ggplots you create will be given this theme setting without you having to include it in the actual ggplot.
ggplot(adelie, aes(species, body_mass_g,
                   color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1")
# A reusable set of theme tweaks: larger text, extra spacing, optional bold titles
theme_spacious <- function(font_size = 14, bold = TRUE) {
  # Legend text and captions are drawn slightly smaller than the base font
  key_size <- trunc(font_size * .8)
  face.type <- if (bold) "bold" else "plain"

  theme(text = element_text(size = font_size),
        # Extra margin pushes the axis titles away from the tick labels
        axis.title.x = element_text(margin = margin(t = 15, r = 0,
                                                    b = 0, l = 0),
                                    face = face.type),
        axis.title.y = element_text(margin = margin(t = 0, r = 15,
                                                    b = 0, l = 0),
                                    face = face.type),
        legend.title = element_text(face = face.type),
        legend.spacing = unit(20, "pt"),
        legend.text = element_text(size = key_size),
        # Center the title and subtitle above the plot
        plot.title = element_text(face = face.type, hjust = .5,
                                  margin = margin(b = 10)),
        plot.subtitle = element_text(hjust = .5),
        plot.caption = element_text(hjust = 0, size = key_size,
                                    margin = margin(t = 20)),
        # Plain white facet strips with black text
        strip.background = element_rect(fill = "white", color = "white"),
        strip.text = element_text(color = "black",
                                  face = face.type))
}
output_theme <- theme_linedraw() +
  theme_spacious(font_size = 12, bold = TRUE) +
  theme(panel.border = element_rect(color = "gray"),
        axis.line.x = element_line(color = "gray"),
        axis.line.y = element_line(color = "gray"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())

theme_set(output_theme)
ggplot(adelie, aes(species, body_mass_g,
                   color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1")
When you have a categorical (nominal or ordinal) variable on the x-axis and you want to plot it against a continuous variable on the y-axis, this is usually done in the form of a bar, point, or line plot.
Bar plots
Point plots
Line plots
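The three plot types can be sketched from the same base plot of mean body mass per species. The object names are illustrative assumptions. The line plot needs aes(group = 1) so that ggplot2 connects the means across the discrete x-axis categories.

```r
library(ggplot2)
library(palmerpenguins)

# Shared data and aesthetic layers
base <- ggplot(penguins, aes(x = species, y = body_mass_g))

# Bar plot of the species means
p_bar <- base +
  geom_bar(stat = "summary", fun = mean, na.rm = TRUE)

# Point plot of the species means
p_point <- base +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE)

# Line plot connecting the species means
p_line <- base +
  geom_line(aes(group = 1), stat = "summary", fun = mean, na.rm = TRUE)

p_bar
p_point
p_line
```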
When you have a continuous (interval or ratio) variable on the x-axis and you want to plot it against a continuous variable on the y-axis, this is known as a scatterplot.
ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  stat_smooth(method = "lm", color = "forestgreen")