Data Visualization

Class 8

The main objective of this class is to get familiar with the basics of creating data visualizations in R.

🖥️ Slides

Prepare

Before starting this class:

📦 Install patchwork, RColorBrewer, and ggsci
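
A minimal sketch of one way to install these packages from CRAN, assuming you have an internet connection and a CRAN mirror configured. The names `pkgs` and `to_install` are only illustrative.

```r
# Packages used later in this class
pkgs <- c("patchwork", "RColorBrewer", "ggsci")

# Keep only the packages that are not installed yet
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!is_installed]

# Install whatever is missing (does nothing if everything is present)
if (length(to_install) > 0) install.packages(to_install)
```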


Data visualization is an essential skill for anyone working with data and requires a combination of design principles with statistical understanding. In general there are two purposes for needing to graphically visualize data:

  1. Data exploration: It is difficult to fully understand your data just by looking at numbers on a screen arranged in rows and columns. Being skilled in the graphical visualization of data will help you better understand patterns and relationships that exist in your data.
  2. Explain and Communicate: Data visualization is the most effective way of explaining and communicating your statistical findings to colleagues, in scientific presentations and publications, and especially to a broader non-academic audience.

Setup Quarto Document

To follow along, you should create an empty Quarto document for this class.

YAML

Set up the YAML section:

---
title: "Document Title"
author: Your Name
date: today
theme: default
format:
  html:
    code-fold: true
    code-tools: true
    code-link: true
    toc: true
    toc-depth: 1
    toc-location: left
    page-layout: full
    df-print: paged
execute:
  error: true
  warning: true
self-contained: true
editor_options: 
  chunk_output_type: console
---

Headers

Create a level 1 header for a Setup section to load packages and import data

# Setup

Then as you follow along you can create more level 1 and level 2 headers as you see fit.

R Code Chunks

Add an R code chunk below the level 1 header

In the toolbar: Insert -> Executable Cell -> R

or

Mac: ⌘⌥ I (Cmd + Option + I)

Windows: ⌃⎇ I (Ctrl + Alt + I)

Load the palmerpenguins and ggplot2 packages and create a data frame called penguins in your environment. The penguins data set is loaded with palmerpenguins; we are just explicitly making it visible in our environment.

library(palmerpenguins)
library(ggplot2)

penguins <- penguins

ggplot2

In this class, we will learn about the fundamentals of data visualization using the ggplot2 package. This is by far the most popular package for data visualization in R.

You have already seen and used ggplot2 in previous classes, but now we will cover how to actually use this package. The elements for creating a ggplot were largely inspired by the work of Leland Wilkinson (Grammar of Graphics, 1999), who formalized two main principles in plotting data:

  1. Layering: The idea of layering involves building plots by adding different layers of grammatical elements. Each layer can consist of components such as points, lines, bars, etc., and can be combined to create complex plots.
  2. Mapping: This principle involves mapping variables in your data to aesthetic properties of the graphical objects, such as size, shape, color, position, and the scales on the x and y axes.
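
As a rough sketch of these two principles, a plot can be stored in an object and built up one piece at a time. The object name `p` is just an illustration, and `na.rm = TRUE` is added only to silence warnings about penguins with missing measurements.

```r
library(ggplot2)
library(palmerpenguins)

# Start with the data only
p <- ggplot(data = penguins)

# Mapping: tie two variables to the x and y positions
p <- p + aes(x = bill_length_mm, y = flipper_length_mm)

# Layering: add a layer of points on top of what is already there
p <- p + geom_point(na.rm = TRUE)

p
```

Each `+` adds one more element to the same plot object, which is why the order of the sections below follows the order in which a plot is usually built.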

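In this framework, the essential grammatical elements required to create any data visualization are the data, the aesthetics, and the geometries. Optional elements such as facets, statistics, coordinates, and themes can then be added on top.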

Let’s take a look at how these elements work to create a simple visualization of data. In Class 1 I introduced you to the fun palmerpenguins data set. I will use this data set to illustrate how ggplot2 works.

Data layer

The Data Layer specifies the data object that is being plotted.

It is the first grammatical element that is required to create a plot:

Note

Create a new level 1 header and below it insert an R code chunk

ggplot(data = penguins)

You can see that we only have a blank square. This is because we have only specified the data layer and have not added any other layers yet. ggplot() doesn’t yet know how to map the variables onto the axis scales. That is where the aesthetic mapping layer comes in.

Aesthetics Layer

The next grammatical element is the aesthetic layer, or aes for short. This layer specifies how we want to map our data onto the scales of the plot.

The aesthetic layer maps variables in our data onto scales in our graphical visualization, such as the x and y coordinates. In ggplot2 the aesthetic layer is specified using the aes() function. Let’s create a plot of the relationship between bill_length_mm and flipper_length_mm, putting them on the x- and y-axes, respectively.

ggplot(penguins, 
       mapping = aes(x = bill_length_mm, y = flipper_length_mm))

You can see we went from a blank box to a graph with the variable and scales of bill_length_mm mapped onto the x-axis and flipper_length_mm on the y-axis.

The aesthetic layer also maps variables in our data to other elements in our graphical visualization, such as color, size, fill, etc. These other elements are useful for adding a third variable onto our graphical visualizations. For instance, we can add the variable of species by mapping species onto the color aesthetic.

ggplot(penguins, 
       mapping = aes(bill_length_mm, flipper_length_mm, color = species))

You will notice that the plot has not changed. Species is not plotted by color. This is because ggplot() does not know the geometrical form the data should take - a bar plot, line plot, dot plot, etc.? It cannot add color to a geometrical form that is not specified yet. That is where the geometries layer comes in.

Geometries Layer

The next essential grammatical element for graphical visualization is the geometries layer or geom for short. This layer specifies the visual elements that should be used to plot the actual data.

There are a lot of different types of geoms to use in ggplot2. Some of the most commonly used ones are:

  • Points or jittered points: geom_point() or geom_jitter()

  • Lines: geom_line()

  • Bars: geom_bar()

  • Violin: geom_violin()

  • Error bars: geom_errorbar() or geom_ribbon()

For a full list see the ggplot2 documentation
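
To illustrate how geoms can be swapped or combined, here is a small sketch that draws the same data and mapping with two of the geoms listed above. It assumes the palmerpenguins data used throughout this class; the object name `p_geoms` is only illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Same data and aesthetic mapping, two geoms layered together
p_geoms <- ggplot(penguins, aes(species, body_mass_g)) +
  geom_violin(na.rm = TRUE) +                        # shape of each distribution
  geom_jitter(width = .1, alpha = .3, na.rm = TRUE)  # individual penguins

p_geoms
```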


For now, let’s demonstrate this using geom_point(). We will create what is called a scatterplot - plotting the individual data points for two continuous variables.

To create a scatterplot we can simply add geom_point() to our ggplot.

Note

Note that in ggplot2 there is a special notation that is similar to the pipe operator |> seen before. Except in ggplot2 you have to use a plus sign + .

ggplot(penguins, aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point()

Note you can also specify the color = species aesthetic mapping inside the geometries layer instead of inside ggplot():

ggplot(penguins, aes(bill_length_mm, flipper_length_mm)) +
  geom_point(mapping = aes(color = species))

Besides mapping variables in your data to certain aesthetics, you can change the aesthetic properties of geometrical elements. Common aesthetic properties include color, fill, size, shape, and alpha (transparency).
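
As a rough illustration, fixed aesthetic properties are passed as regular arguments to the geom, outside of aes(), so every point gets the same value. The specific values chosen here are arbitrary.

```r
library(ggplot2)
library(palmerpenguins)

# color, size, alpha, and shape are set to constants rather than mapped to data
p_fixed <- ggplot(penguins, aes(bill_length_mm, flipper_length_mm)) +
  geom_point(color = "steelblue", size = 2, alpha = .5, shape = "triangle",
             na.rm = TRUE)

p_fixed
```

Because nothing is mapped to a variable here, no legend is drawn.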

Facets Layer

The facets layer allows you to create panels of subplots within the same graphic object

The previous three layers are the essential layers. The facet layer is not essential, but it can be useful when you want to communicate the relationship among 4 or more variables.

Let’s create a facet layer of our scatterplot with different panels for sex

# first let's remove any missing values for sex
library(dplyr)
penguins <- filter(penguins, !is.na(sex))

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  facet_grid(cols = vars(sex))

You can see we are now conveying information about 4 different variables in our data: bill_length_mm, flipper_length_mm, species, and sex.

See the ggplot2 documentation on facet_grid and facet_wrap
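
For comparison, here is a hedged sketch of the two faceting functions side by side. It uses island as an extra faceting variable purely for illustration, and the object names are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# Drop penguins with missing sex so no NA panel appears
penguins_sex <- subset(penguins, !is.na(sex))

base_plot <- ggplot(penguins_sex,
                    aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point(na.rm = TRUE)

# facet_wrap(): one panel per level of a single variable
p_wrap <- base_plot + facet_wrap(vars(island))

# facet_grid(): a grid with one variable in rows and another in columns
p_grid <- base_plot + facet_grid(rows = vars(island), cols = vars(sex))

p_wrap
p_grid
```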

Statistics Layer

The statistics layer allows you to plot aggregated statistical values calculated from your data

The statistics layer is used in combination with a geom to plot values that are a function (e.g., mean) of the values in your data. The two main stat functions are:

  • geom_(stat = "summary")

  • stat_smooth()

geom_(stat = “summary”)

In the previous plots, we plotted the raw values for each individual penguin. However, statistics are often evaluated at the aggregate level (e.g., think mean differences between groups). We can calculate summary statistics inside of the geom functions using stat = "summary". There are two main arguments you need to specify:

  • stat: set this to “summary” to calculate a summary statistic

  • fun: The function used to calculate an aggregated summary statistic. Functions like mean, sum, min, max, and sd can all be specified. Additional arguments for that function can be supplied as a list through fun.args (e.g., fun.args = list(trim = .1)), and na.rm = TRUE can be given directly to the geom to drop missing values silently.

In the penguins data set, we have the raw values of body_mass_g for each individual penguin but perhaps we are just interested in the average (mean) values for each species. This can be done easily using geom_point(stat = "summary", fun = mean)

ggplot(penguins, aes(species, body_mass_g)) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE,
             shape = "diamond", size = 5, color = "firebrick")

Using multiple geoms you can plot both the raw values for each individual penguin and the summary statistic

ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, size = .75, alpha = .2) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE,
             shape = "diamond", size = 5, color = "firebrick")
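
When the summary function needs an extra argument, one option is the fun.args argument of stat_summary(). The sketch below plots a 10% trimmed mean; the choice of a trimmed mean and the name `p_trim` are only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# fun.args is a list of extra arguments handed to the summary function,
# so each species is summarised with mean(x, trim = 0.1)
p_trim <- ggplot(penguins, aes(species, body_mass_g)) +
  stat_summary(fun = mean, fun.args = list(trim = .1),
               geom = "point", size = 4, na.rm = TRUE)

p_trim
```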

The fun = argument returns only a single summary statistic value (e.g., a mean). However, some geoms actually require two values. For instance, when plotting errorbars you will need both ymin and ymax values returned. For these types of cases, you need to use the fun.data argument instead:

mean_cl_normal is a function that calculates the mean and its 95% confidence limits from your data. It relies on the Hmisc package, so install Hmisc if you get an error about it.

ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, size = .75, alpha = .2) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE,
             shape = "diamond", size = 5, color = "firebrick") +
  geom_errorbar(stat = "summary", fun.data = mean_cl_normal, width = .1)
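
To make the fun.data requirement concrete, here is a minimal sketch of a hand-written summary function. `mean_sd_range()` is a hypothetical helper, not part of ggplot2; it returns a one-row data frame with y, ymin, and ymax, which is the shape that fun.data expects.

```r
library(ggplot2)
library(palmerpenguins)

# Hypothetical helper: mean plus/minus one standard deviation
mean_sd_range <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  data.frame(y = m, ymin = m - s, ymax = m + s)
}

p_sd <- ggplot(penguins, aes(species, body_mass_g)) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE) +
  geom_errorbar(stat = "summary", fun.data = mean_sd_range,
                width = .1, na.rm = TRUE)

p_sd
```

Note that these error bars show standard deviations, not the 95% confidence limits that mean_cl_normal produces.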

stat_smooth(method = “lm”)

stat_smooth(method = "lm") is used in scatterplots to plot the regression line on your data.

ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point() +
  stat_smooth(method = "lm")
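
As a rough check on what is being drawn, the same simple regression can be fit directly with lm(). The sketch below also uses se = FALSE to hide the shaded confidence band; the object names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# The model behind the line: flipper length predicted from bill length
fit <- lm(flipper_length_mm ~ bill_length_mm, data = penguins)
coef(fit)  # intercept and slope of the regression line

# se = FALSE removes the confidence band around the line
p_lm <- ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point(na.rm = TRUE) +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE)

p_lm
```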

You can add separate regression lines if other variables are mapped to aesthetics and/or are wrapped in different facets

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  facet_grid(cols = vars(sex)) +
  stat_smooth(method = "lm")

Coordinates Layer

The coordinate layer allows you to adjust the x and y coordinates

There are two main groups of functions that are useful for adjusting the x and y coordinates.

  • coord_cartesian() for adjusting the axis limits (zoom in and out)

  • scale_x_ and scale_y_ for setting the axis ticks and labels

axis limits

You can adjust limits (min and max) of the x and y axes using the coord_cartesian(xlim = c(), ylim = c()) function.
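
One reason to prefer coord_cartesian() for zooming is that it leaves the underlying data untouched, whereas setting limits on the scale drops values outside the range before any statistics are calculated. The sketch below contrasts the two; the limits and object names are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

base_means <- ggplot(penguins, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point", na.rm = TRUE)

# Zoom only: every penguin still contributes to its species mean
p_zoom <- base_means + coord_cartesian(ylim = c(3000, 5000))

# Scale limits: penguins outside 3000-5000 g are removed before the
# means are computed, so the plotted means change
p_clip <- base_means + scale_y_continuous(limits = c(3000, 5000))

p_zoom
p_clip
```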

Note

If you want to compare two separate graphs, then they need to be on the same scale. This is an important design principle in graphical visualization.

Compare these two sets of plots

male <- filter(penguins, sex == "male")
female <- filter(penguins, sex == "female")

p1 <- ggplot(male, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(2000, 10000))

p2 <- ggplot(female, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) + 
  coord_cartesian(ylim = c(3000, 5000))
# patchwork can be used to combine multiple plots into one image
library(patchwork)

p1 + labs(title = "male") + p2 + labs(title = "female")

From a cursory look at these plots, you might conclude a couple of things:

  • Female Gentoo penguins have the largest body mass

  • There is a larger difference in body mass, relative to the other penguin species, for the Female Gentoo penguins than for Male Gentoo penguins.

These are both false! Take a closer look at the y-axis on the two plots. Let’s plot the exact same data but make the scales on the y-axis the same.

p1 <- ggplot(male, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  labs(title = "male")

p2 <- ggplot(female, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) + 
  coord_cartesian(ylim = c(3000, 6000)) +
  labs(title = "female")
# patchwork can be used to combine multiple plots into one image
library(patchwork)

# the titles were already added with labs() above
p1 + p2

Note

patchwork is a convenient package to combine multiple plots into one image. The package can be used to create a more complex arrangement of multiple plots but the simplest use of it is to add plots side-by-side by simply using the + notation that is already used to add additional layers to a ggplot()

plot1 + plot2
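
Beyond side-by-side placement, patchwork also provides the / operator for stacking and plot_annotation() for an overall title. The following is a small sketch with two throwaway plots; plot1 and plot2 here are illustrative objects, not ones defined earlier.

```r
library(ggplot2)
library(patchwork)
library(palmerpenguins)

plot1 <- ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, na.rm = TRUE)
plot2 <- ggplot(penguins, aes(bill_length_mm, flipper_length_mm)) +
  geom_point(na.rm = TRUE)

# Side by side
side_by_side <- plot1 + plot2

# Stacked vertically, with a shared title for the whole figure
stacked <- plot1 / plot2 +
  plot_annotation(title = "Two views of the penguins data")

side_by_side
stacked
```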

axis ticks and labels

You can adjust the scale (major and minor ticks) of the x and y axes using the scale_x_ and scale_y_ set of functions. The two main sets of functions to know are for continuous and discrete scales:

  • continuous: scale_x_continuous(breaks = seq()) and scale_y_continuous(breaks = seq())

  • discrete: scale_x_discrete(breaks = c()) and scale_y_discrete(breaks = c())
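
For a discrete axis, the breaks are category names rather than a numeric sequence. Here is a hedged sketch that reorders the species on the x-axis with limits and relabels them with labels; the new label text is made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_discrete <- ggplot(penguins, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point", na.rm = TRUE) +
  scale_x_discrete(
    # order in which the categories appear on the axis
    limits = c("Gentoo", "Chinstrap", "Adelie"),
    # named vector: names are the data values, values are the tick labels
    labels = c(Adelie = "Adelie (A)",
               Chinstrap = "Chinstrap (C)",
               Gentoo = "Gentoo (G)")
  )

p_discrete
```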

For example:

p1 <- ggplot(male, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500))

p2 <- ggplot(female, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) + 
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500))
p1 + labs(title = "male") + p2 + labs(title = "female")

Theme Layer

The theme layer refers to visual elements that are not mapped to the data but control the overall design, colors, and labels on the plot

There are four main sets of functions that we can use to control the theme layer:

  • Color: the scale_color_ set of functions will change the color scheme of the geometric elements

  • Labels: labs() is a convenient function for labeling the title, subtitle, axes, and legend

  • Theme templates: There are predefined theme templates that come with ggplot2

  • Other theme elements: theme() can be used to further customize the look of your plot

    • theme(legend.title = element_text(face = "bold"))

Color

Changing the color theme can get complicated but is an important design element in your plot.

The RColorBrewer package offers several color palettes for R.
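
To see which palettes are available, the RColorBrewer package can list and display them. This is a brief sketch and assumes RColorBrewer was installed as described in the Prepare section; `set1_colors` is an illustrative name.

```r
library(RColorBrewer)

# Table of palette names with their size, category, and colorblind-friendliness
head(brewer.pal.info)

# Hex codes for the first three colors of the "Set1" palette
set1_colors <- brewer.pal(n = 3, name = "Set1")
set1_colors

# Draw every palette in the plotting window
display.brewer.all()
```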

You can access these palettes using scale_color_brewer(palette = "palette name")

ggplot(penguins, aes(species, body_mass_g, color = sex)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500)) +
  scale_color_brewer(palette = "Set1")

Note

Check out the ggsci color palettes inspired by scientific journals, science fiction movies, and TV shows.
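
As a quick sketch of how a ggsci palette is applied, the package provides scale_color_ functions that are used like scale_color_brewer(). The example uses scale_color_npg(), one of the journal-inspired palettes, and assumes ggsci is installed.

```r
library(ggplot2)
library(ggsci)
library(palmerpenguins)

p_npg <- ggplot(penguins,
                aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point(na.rm = TRUE) +
  scale_color_npg()  # palette inspired by Nature Publishing Group journals

p_npg
```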

Labels

Changing labels and adding titles is easy using labs()

ggplot(penguins, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)")

To change labels for legends you need to refer to the aesthetic mapping that was defined in aes() (e.g., color, shape).

adelie <- filter(penguins, species != "Gentoo")

ggplot(adelie, aes(species, body_mass_g, 
                     color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1")

Theme templates

Here are some themes that come loaded with ggplot2

  • theme_bw()
  • theme_light()
  • theme_dark()
  • theme_minimal()
  • theme_classic()
  • theme_void()

Using a theme template is straightforward

adelie <- filter(penguins, species != "Gentoo")

ggplot(adelie, aes(species, body_mass_g, 
                     color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1") +
  theme_classic()

Other theme elements

In addition to using a pre-defined theme template, you may also want to tweak other design elements on your plot. You will mostly do this using theme()

For instance, to give the legend titles a bold face font:

adelie <- filter(penguins, species != "Gentoo")

ggplot(adelie, aes(species, body_mass_g, 
                     color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1") +
  theme_classic() +
  theme(legend.title = element_text(face = "bold"))

The elements you can change are organized into text, line, and rectangle elements.

Text, line, and rectangle elements each have their corresponding element function e.g., element_text()
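
To show how the element functions pair with theme elements, here is a small sketch with one example of each type. The particular elements and values are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_elements <- ggplot(penguins,
                     aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point(na.rm = TRUE) +
  theme(
    # text element: axis titles in bold
    axis.title = element_text(face = "bold", size = 12),
    # line element: light major grid lines
    panel.grid.major = element_line(color = "gray90"),
    # rectangle element: white panel with a black border
    panel.background = element_rect(fill = "white", color = "black"),
    # element_blank() removes an element entirely
    panel.grid.minor = element_blank()
  )

p_elements
```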

There are a lot of different theme elements you can tweak, and it is hard to memorize them all. Make use of Google, the ggplot2 documentation, and generative AI tools for assistance.

Here is the ggplot2 documentation on theme elements

Create your own theme template

Oftentimes you may want to apply the same customized theme elements to multiple plots and even across multiple projects.

One convenient way of doing so is to use theme_set()

theme_set() will automatically apply the same theme settings across all ggplots created in a document.

For instance, if you want to make sure all your ggplots have a bolded legend title and use theme_classic() you can create a theme to do that:

bold_legend <- theme(legend.title = element_text(face = "bold"))

plot_theme <- theme_classic() + bold_legend

Then you need to set the theme that will be applied across all ggplots

theme_set(plot_theme)

Now any ggplots you create will be given this theme setting without you having to include it in the actual ggplot.

ggplot(adelie, aes(species, body_mass_g, 
                     color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1")
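
If you later want to go back to the previous theme, note that theme_set() returns the theme that was active before the call, so it can be saved and restored. A minimal sketch, with `old_theme` as an illustrative name:

```r
library(ggplot2)

# Set a new default theme and keep the previous one
old_theme <- theme_set(theme_classic())

# ... plots created here use theme_classic() ...

# Restore the theme that was active before
theme_set(old_theme)
```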

You can also create theme functions to give you more flexibility from one document/project to another (change the font size, whether to use “bold” fonts for titles or not, etc.).

Here is a custom theme I made. One of the things it does is increase the amount of white space between plot elements, such as the axis labels and the axis ticks (it annoys me how close the y-axis label is to the axis tick labels by default).

theme_spacious <- function(font_size = 14, bold = TRUE) {
  key_size <- trunc(font_size * .8)
  if (bold == TRUE) {
    face.type <- "bold"
  } else {
    face.type <- "plain"
  }

  theme(text = element_text(size = font_size),
        axis.title.x = element_text(margin = margin(t = 15, r = 0,
                                                    b = 0, l = 0),
                                    face = face.type),
        axis.title.y = element_text(margin = margin(t = 0, r = 15,
                                                    b = 0, l = 0),
                                    face = face.type),
        legend.title = element_text(face = face.type),
        legend.spacing = unit(20, "pt"),
        legend.text = element_text(size = key_size),
        plot.title = element_text(face = face.type, hjust = .5,
                                  margin = margin(b = 10)),
        plot.subtitle = element_text(hjust = .5),
        plot.caption = element_text(hjust = 0, size = key_size,
                                    margin = margin(t = 20)),
        strip.background = element_rect(fill = "white", color = "white"),
        strip.text = element_text(color = "black",
                                  face = face.type))
}
output_theme <- theme_linedraw() + 
  theme_spacious(font_size = 12, bold = TRUE) +
  theme(panel.border = element_rect(color = "gray"),
        axis.line.x = element_line(color = "gray"),
        axis.line.y = element_line(color = "gray"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())

theme_set(output_theme)
ggplot(adelie, aes(species, body_mass_g, 
                     color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1")

Common Types of Plots

Histogram

ggplot(penguins, aes(body_mass_g)) +
  geom_histogram(bins = 20, fill = "white", color = "black")

Bar, point, and line

When you have a categorical (nominal or ordinal) variable on the x-axis and you want to plot that against a continuous variable on the y-axis this is usually done in the form of a bar, point, or line plot.

Bar plot

ggplot(penguins, aes(species, body_mass_g)) +
  geom_bar(stat = "summary", fun = mean)

Point plot

ggplot(penguins, aes(species, body_mass_g)) +
  geom_point(stat = "summary", fun = mean)

Point plot - with raw values

ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, size = .75, alpha = .2) +
  geom_point(stat = "summary", fun = mean, 
             size = 4, color = "steelblue")

Line plot

ggplot(penguins, aes(species, body_mass_g)) +
  geom_line(stat = "summary", fun = mean, group = 1)

Line plot - with raw values

ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, size = .75, alpha = .2) +
  geom_line(stat = "summary", fun = mean, group = 1) +
  geom_point(stat = "summary", fun = mean, 
             size = 4, color = "steelblue")

Scatterplot

When you have a continuous (interval or ratio) variable on the x-axis and you want to plot it against a continuous variable on the y-axis this is known as a scatterplot.

ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  stat_smooth(method = "lm", color = "forestgreen")