Class 8

Data Visualization

Outline

 

  • ggplot2

  • Data and Aesthetic Layers

  • Geometries Layer

  • Facets and Statistics Layer

  • Coordinates Layer

  • Theme Layer

  • Common Types of Plots

ggplot2

Data Visualization

 

Data visualization is an essential skill for anyone working with data and requires a combination of design principles with statistical understanding. In general there are two purposes for needing to graphically visualize data:

 

  1. Data exploration: It is difficult to fully understand your data just by looking at numbers on a screen arranged in rows and columns. Being skilled in the graphical visualization of data will help you better understand patterns and relationships that exist in your data.
  2. Explain and Communicate: Data visualization is the most effective way of explaining and communicating your statistical findings to colleagues, in scientific presentations and publications, and especially to a broader non-academic audience.

 

In this class, we will learn about the fundamentals of data visualization using the ggplot2 package. This is by far the most popular package for data visualization in R.

 

You have already seen and used ggplot2 in previous classes, but now we will cover how to actually use this package.

The elements for creating a ggplot was largely inspired from the work of Leland Wilkinson (Grammar of Graphics, 1999), who formalized two main principles in plotting data:

  1. Layering
  2. Mapping

In this framework, the essential grammatical elements required to create any data visualization are:

palmerpenguins

 

We will use a data set from the palmerpenguins package

 

palmerpenguins

 

  • Go ahead and load the palmerpenguins and ggplot2 packages using library() .

  • Additionally, let’s make the penguins data set that is loaded with palmerpenguins visible in the environment by explicitly assigning it to an object.

library(palmerpenguins)
library(ggplot2)

penguins <- penguins

Data and Aesthetic Layers

Data Layer

The Data Layer specifies the data object that is being plotted.

 

 

It is the first grammatical element that is required to create a plot:

ggplot(data = penguins)

Aesthetic Layer

The next grammatical element is the aesthetic layer, or aes for short. This layer specifies how we want to map our data onto the scales of the plot.

 

 

The aesthetic layer maps variables in our data onto scales in our graphical visualization, such as the x and y coordinates. In ggplot2 the aesthetic layer is specified using the aes() function.

 

ggplot(penguins, mapping = aes(x = bill_length_mm, y = flipper_length_mm))

Aesthetic Layer

 

ggplot(penguins, mapping = aes(x = bill_length_mm, y = flipper_length_mm))

You can see we went from a blank box to a graph with the variable and scales of bill_length_mm mapped onto the x-axis and flipper_length_mm on the y-axis.

Aesthetic Layer

 

  • The aesthetic layer also maps variables in our data to other elements in our graphical visualization, such as color, size, fill, etc.

  • These other elements are useful for adding a third variable onto our graphical visualizations. For instance, we can add the variable of species by mapping species onto the color aesthetic.

ggplot(penguins, 
       mapping = aes(bill_length_mm, flipper_length_mm, color = species))

Geometries Layer

Geometries Layer

The next essential grammatical element for graphical visualization is the geometries layer or geom for short. This layer specifies the visual elements that should be used to plot the actual data.

 

There are a lot of different types of geoms to use in ggplot2. Some of the most commonly used ones are:

  • Points or jittered points: geom_point() or geom_jitter()

  • Lines: geom_line()

  • Bars: geom_bar()

  • Violin: geom_violin()

  • Error bars: geom_errobar() or geom_ribbon()

For a full list see the ggplot2 documentation

geom_point()

 

For now, let’s demonstrate this using geom_point(). We will create what is called a scatterplot - plotting the individual data points for two continuous variables.

 

ggplot(penguins, aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point()

 

You can also specify the color = species aesthetic mapping on the geometric layer

ggplot(penguins, aes(bill_length_mm, flipper_length_mm)) +
  geom_point(mapping = aes(color = species))

 

Note

Note that in ggplot2 there is a special notation that is similar to the pipe operator |> seen in previous classes. Except in ggplot2 you have to use a plus sign + .

geom_point()

 

ggplot(penguins, aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point()

Aesthetic Properties of geoms

Besides mapping variables in your data to certain aesthetics, you can change the aesthetic properties of geometrical elements. Common aesthetic properties include:

 

  • Color: applies to most geoms geom_(color = )

  • Fill: applies to most geoms geom_(fill = )

  • Transparency: applies to most geoms. values can range from 0 to 1, 0 = transparent; 1 = opaque; geom_(alpha = )

  • See a full list of R colors here

  • Shape: geom_point(shape = ), geom_jitter(shape = )

  • Size: geom_point(size = ), geom_jitter(size = )

  • Line Type: geom_line(linetype = )

  • Line Width: geom_line(linewidth = )

  • Width: geom_errobar(width = ), geom_jitter(width = )

Aesthetic Properties of geoms

 

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point(shape = "diamond filled", size = 3, fill = "white")

More ggplot2 layers

 

Besides the data, aesthetics, and geometries layers, there are often other types of elements you may want to include.

Facets and Statistics Layer

Facets Layer

The facets layer allows you to create panels of subplots within the same graphic object

 

The previous three layers are the essential layers. The facet layer is not essential, but it can be useful when you want to communicate the relationship among 4 or more variables.

 

Let’s create a facet layer of our scatterplot with different panels for sex

# first let's remove any missing values for sex
library(dplyr)
penguins <- filter(penguins, !is.na(sex))

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  facet_grid(cols = vars(sex))

See the ggplot2 documentation on facet_grid and facet_wrap

Facets Layer

 

# first let's remove any missing values for sex
library(dplyr)
penguins <- filter(penguins, !is.na(sex))

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  facet_grid(cols = vars(sex))

Statistics Layer

The statistics layer allows you plot aggregated statistical values calculated from your data

 

The statistics layer is used in combination with a geom to plot values that are a function (e.g., mean) of the values in your data. The two main stat functions are:

  • geom_(stat = "summary")

  • stat_smooth()

geom_(stat = “summary”)

 

Statistics are often evaluated at the aggregate level (e.g., think mean differences between groups). We can calculate summary statistics inside inside of the geom functions using stat = "summary". There are two main arguments you need to specify:

  • stat: set this to “summary” to calculate a summary statistic

  • fun: The function used to calculate an aggregated summary statistic. Functions like mean, sum, min, max, sd can all be specified. You can then specify additional argument that should be passed into these functions as regular arguments in geom_()

To plot the average (mean) body_mass_g for each species.

ggplot(penguins, aes(species, body_mass_g)) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE,
             shape = "diamond", size = 5, color = "firebrick")

 

Using multiple geoms you can plot both the raw values for each individual penguin and the summary statistic

ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, size = 1, alpha = .2) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE,
             shape = "diamond", size = 5, color = "firebrick")

geom_(stat = “summary”)

 

ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, size = 1, alpha = .2) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE,
             shape = "diamond", size = 5, color = "firebrick")

geom_(stat = “summary”)

 

The fun = argument returns only a single summary statistic value (e.g., a mean). However, some geoms actually require two values. For instance, when plotting errorbars you will need both ymin and ymax values returned. For these types of cases, you need to use the fun.data argument instead:

 

mean_cl_normal is a function to calculate 95% confidence limits from your data.

ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, size = .75, alpha = .2) +
  geom_point(stat = "summary", fun = mean, na.rm = TRUE,
             shape = "diamond", size = 5, color = "firebrick") +
  geom_errorbar(stat = "summary", fun.data = mean_cl_normal, width = .1)

stat_smooth(method = “lm”)

 

stat_smooth(method = "lm") is used in scatterplots to plot the regression line on your data.

 

ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point() +
  stat_smooth(method = "lm")

stat_smooth(method = “lm”)

 

You can add separate regression lines if other variables are mapped to aesthetics and/or are wrapped in different facets

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  facet_grid(cols = vars(sex)) +
  stat_smooth(method = "lm")

stat_smooth(method = “lm”)

 

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  facet_grid(cols = vars(sex)) +
  stat_smooth(method = "lm")

Coordinates Layer

Coordinates Layer

The coordinate layer allows you to adjust the x and y coordinates

 

 

There are two main groups of functions that are useful for adjusting the x and y coordinates.

  • coord_cartesian() for adjusting the axis limits (zoom in and out)

  • scale_x_ and scale_y_ for setting the axis ticks and labels

axis limits

You can adjust limits (min and max) of the x and y axes using coord_cartesian(xlim = c(), ylim = c())

 

Important

If you want to compare two separate graphs, then they need to be on the same scale. This an important design principle in graphical visualization.

axis limits

 

You can adjust limits (min and max) of the x and y axes using coord_cartesian(xlim = c(), ylim = c())

male <- filter(penguins, sex == "male")
female <- filter(penguins, sex == "female")

p1 <- ggplot(male, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  labs(title = "male")

p2 <- ggplot(female, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) + 
  coord_cartesian(ylim = c(3000, 6000)) +
  labs(title = "female")

axis limits

 

axis limits

 

axis ticks and labels

 

You can adjust the scale (major and minor ticks) of the x and y axes using the scale_x_ and scale_y_ set of functions. The two main set of functions to know are for continuous and discrete scales:

 

  • continuous: scale_x_continuous(breaks = seq()) and scale_y_continuous(breaks = seq())

  • discrete: scale_x_discrete(breaks = seq()) and scale_y_continuous(breaks = seq())

For example:

ggplot(male, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500))

axis ticks and labels

 

ggplot(male, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500))

Theme Layer

Theme Layer

The theme layer refers to visual elements that are not mapped to the data but controls the overall design, colors, and labels on the plot

 

There are three main set of functions that we can use to control the theme layer:

  • Color: scale_color_ set of functions will change the color scheme of the geometric elements:

  • Labels: labs() is a convenient function for labeling the title, subtitle, axes, and legend

  • Theme templates: There are predefined theme templates that come with ggplot2

  • Other theme elements: theme() can be used to further customize the look of your plot

Color

 

The RColorBrewer package offers several color palettes for R:

 

Also, check out the ggsci color palettes inspired by scientific journals, science fiction movies, and TV shows.

Color

You can access these palettes using scale_color_brewer(palette = "palette name")

ggplot(penguins, aes(species, body_mass_g, color = sex)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500)) +
  scale_color_brewer(palette = "Set1")

Labels

Changing labels and adding titles is easy using labs()

 

ggplot(penguins, aes(species, body_mass_g)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 6000)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)")

Labels

To change labels for legends you need to refer to the aesthetic mapping that was defined in aes() (e.g., color, shape).

 

adelie <- filter(penguins, species != "Gentoo")

ggplot(adelie, aes(species, body_mass_g, color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1")

Theme templates

 

Here are some themes that come loaded with ggplot2

  • theme_bw()
  • theme_light()
  • theme_dark()
  • theme_minimal()
  • theme_classic()
  • theme_void()

Theme templates

 

Using a theme template is straightforward

adelie <- filter(penguins, species != "Gentoo")

ggplot(adelie, aes(species, body_mass_g, color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1") +
  theme_classic()

Other Theme Elements

In addition to using a pre-defined theme template, you may also want to tweak other design elements on your plot. You can do this using theme()

adelie <- filter(penguins, species != "Gentoo")

ggplot(adelie, aes(species, body_mass_g, color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1") +
  theme_classic() +
  theme(legend.title = element_text(face = "bold"))

Other Theme Elements

Here is a list of different elements you can change. They are organized into text, line, and rectangle elements:

 

Other Theme Elements

Text, line, and rectangle elements each have their corresponding element function e.g., element_text()

 

Obviously, there are a lot of different theme elements you can tweak and it is hard to memorize them all. Make use of Google, ggplot2 documentation, and Generative AI’s for assistance.

Here is the ggplot2 documentation on theme elements

Create your own theme template

Often times you may want to apply the same customized theme elements to multiple plots and even across multiple projects.

 

  • One convenient way of doing so is to use theme_set()

  • theme_set() will automatically apply the same theme settings across all ggplots created in a document.

Create your own theme template

 

For instance, if you want to make sure all your ggplots have a bolded legend title and use theme_classic() you can create a theme to do that:

 

bold_legend <- theme(legend.title = element_text(face = "bold"))

plot_theme <- theme_classic() + bold_legend

 

Then you need to set the theme that will be applied across all ggplots

theme_set(plot_theme)

 

Now any ggplots you create will be given this theme setting without you having to include it in the actual ggplot.

Create your own theme template

 

ggplot(adelie, aes(species, body_mass_g, 
                     color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1")

Create your own theme template

 

Show theme_spacious()
theme_spacious <- function(font_size = 14, bold = TRUE) {
  key_size <- trunc(font_size * .8)
  if (bold == TRUE) {
    face.type <- "bold"
  } else {
    face.type <- "plain"
  }

  theme(text = element_text(size = font_size),
        axis.title.x = element_text(margin = margin(t = 15, r = 0,
                                                    b = 0, l = 0),
                                    face = face.type),
        axis.title.y = element_text(margin = margin(t = 0, r = 15,
                                                    b = 0, l = 0),
                                    face = face.type),
        legend.title = element_text(face = face.type),
        legend.spacing = unit(20, "pt"),
        legend.text = element_text(size = key_size),
        plot.title = element_text(face = face.type, hjust = .5,
                                  margin = margin(b = 10)),
        plot.subtitle = element_text(hjust = .5),
        plot.caption = element_text(hjust = 0, size = key_size,
                                    margin = margin(t = 20)),
        strip.background = element_rect(fill = "white", color = "white"),
        strip.text = element_text(color = "black",
                                  face = face.type))
}
output_theme <- theme_linedraw() + 
  theme_spacious(font_size = 12, bold = TRUE) +
  theme(panel.border = element_rect(color = "gray"),
        axis.line.x = element_line(color = "gray"),
        axis.line.y = element_line(color = "gray"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())

theme_set(output_theme)

Create your own theme template

 

ggplot(adelie, aes(species, body_mass_g, 
                     color = sex, shape = island)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
  coord_cartesian(ylim = c(3000, 4250)) +
  scale_y_continuous(breaks = seq(3000, 4250, by = 250)) +
  labs(title = "A Plot Title", subtitle = "A subtitle", tag = "A)",
       x = "Species", y = "Body Mass (g)", color = "Sex", shape = "Island") +
  scale_color_brewer(palette = "Set1")

Common Types of Plots

Histogram

 

ggplot(penguins, aes(body_mass_g)) +
  geom_histogram(bins = 20, fill = "white", color = "black")

Bar, point, and line

When you have a categorical (nominal or ordinal) variable on the x-axis and you want to plot that against a continuous variable on the y-axis this is usually done in the form of a bar, point, or line plot.

 

  • Bar plots

  • Point plots

  • Line plots

Bar plot

 

ggplot(penguins, aes(species, body_mass_g)) +
  geom_bar(stat = "summary", fun = mean)

Point plot

 

ggplot(penguins, aes(species, body_mass_g)) +
  geom_point(stat = "summary", fun = mean)

Point plot - with raw values

 

ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, size = .75, alpha = .2) +
  geom_point(stat = "summary", fun = mean, 
             size = 4, color = "steelblue")

Line plot

 

ggplot(penguins, aes(species, body_mass_g)) +
  geom_line(stat = "summary", fun = mean, group = 1)

Line plot - with raw values

 

ggplot(penguins, aes(species, body_mass_g)) +
  geom_jitter(width = .1, size = .75, alpha = .2) +
  geom_line(stat = "summary", fun = mean, group = 1) +
  geom_point(stat = "summary", fun = mean, 
             size = 4, color = "steelblue")

Scatterplot

When you have a continuous (interval or ratio) variable on the x-axis and you want to plot it against a continuous variable on the y-axis this is known as a scatterplot

 

ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  stat_smooth(method = "lm", color = "forestgreen")