Class 1

An Introduction to Working with Data in R



Before starting this class:

📖 Read the Prerequisites page



  • R Basics

  • Data Frames

  • Brief intro to

    • Data transformation

    • Graphical visualization

    • Statistical analysis

R Basics

Script Files


To create a new R script go to

File -> New File -> R Script


This course will refer to two types of R script files:

  • Reproducible script file: Script file for actually processing and analyzing your data. Can reproduce your steps of processing and analysis.

    • This is the file you actually save to process your data

    • Is commented and polished enough to share with others.

  • Scratchpad script file: A script file for testing, debugging, and exploring your data.

    • Often saved as Untitled.R or not saved at all
    • Alternatively, you can just execute code directly in the console

Running R Code

Two ways of executing R code


  1. Typing code in an R script file and executing it line by line with Ctrl + Enter.
  2. Typing code directly in the console window
1 + 2
3 * 9

Creating R Objects


  • Objects are created using the assignment operator, <-.

  • The general form is object_name <- value, where the value can be typed in directly or returned by a function.

Go ahead and type the following two lines of code in your scratchpad script file/console and execute the lines of code.


my_first_object <- "hello"
my_second_object <- c(5,6,7,8)


  • You should now see my_first_object and my_second_object in your Environment window

  • Note that R is case sensitive
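
As a small illustration of the points above, the sketch below recreates the two objects and then checks case sensitivity. The misspelled name My_First_Object is hypothetical and is assumed not to exist in your session.

```r
# Create two objects with the assignment operator
my_first_object <- "hello"
my_second_object <- c(5, 6, 7, 8)

# Typing an object's name prints its value
my_first_object
my_second_object

# Names are case sensitive: My_First_Object is a different name that
# was never created, so exists() reports FALSE for it
exists("my_first_object")
exists("My_First_Object")
```

Asking for My_First_Object directly would stop with an "object not found" error, which is a common first stumbling block.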

Using Functions

Almost everything you do in R is done by calling functions


  • Learning R is learning what functions are available and how to use them.

  • Example: there is a function to create a sequence of numbers, seq().

seq(1, 100, by = 10)


  • Functions take arguments

  • If you don’t label argument names, then the order of arguments matters!

seq(from = 1, to = 100, by = 10)
seq(to = 100, by = 10, from = 1)
seq(1, 100, 10)
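
To make the point about argument order concrete, here is a short sketch that compares the three calls above and then swaps the unnamed arguments. The object names (a, b, positional, swapped) are made up for illustration.

```r
# Named arguments can be given in any order
a <- seq(from = 1, to = 100, by = 10)
b <- seq(to = 100, by = 10, from = 1)

# Unnamed arguments are matched by position: from, to, by
positional <- seq(1, 100, 10)

identical(a, b)
identical(a, positional)

# Swapping unnamed arguments changes their meaning:
# here from = 10, to = 100, by = 1
swapped <- seq(10, 100, 1)
length(a)
length(swapped)
```

The first three calls give the same ten numbers, while the swapped call counts from 10 to 100 in steps of 1.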

Helper Function



  • You should make frequent use of the help operator, ?

    • e.g., ?seq

  • The help page for a function tells you:

    • The names of arguments

    • What the arguments do

    • Argument default values

  • You don’t have to, and almost never will, specify all the possible arguments.

  • In some cases it is important to know what the default value of an argument is.
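
A minimal sketch of looking up arguments and relying on a default. It assumes you want the arguments of seq.default(), the method that seq() uses for ordinary numbers. The help calls are left as comments because they open a help page rather than return a value.

```r
# Open the help page (run one of these in the console):
# ?seq
# help("seq")

# args() prints a function's argument names and default values
args(seq.default)

# When only from and to are supplied, seq() steps by 1
default_step <- seq(1, 5)
default_step

# Supplying by overrides that default
custom_step <- seq(1, 5, by = 2)
custom_step
```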

R Packages


  • Functions are organized in R packages

  • R comes with a set of R packages and functions

  • Developers and other researchers have created a lot of R packages specifically for use in psychology research

  • Most R packages are hosted on The Comprehensive R Archive Network - CRAN.

    • Others may be hosted on GitHub

Install and Load Packages


Installing packages from CRAN is easy



  • Installing a package downloads it onto your computer; you only need to do this once

  • When you want to use the functions in a package, you need to load the package into your current R session with library()
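
The two steps can be sketched as follows. The package name is an assumption taken from the data set used later in this class, and the repos URL is one CRAN mirror chosen for illustration. Any CRAN package is installed the same way.

```r
# Install once per computer (needs an internet connection).
# requireNamespace() checks whether the package is already installed,
# so the download is skipped when it is.
if (!requireNamespace("palmerpenguins", quietly = TRUE)) {
  install.packages("palmerpenguins", repos = "https://cloud.r-project.org")
}

# Load in every new R session before using the package
library(palmerpenguins)
```

In an interactive session, install.packages("palmerpenguins") on its own is usually enough.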


Data Frames

Example Data Set

We will use a data set from the palmerpenguins package


Data Frames


  • Let’s create a new script - a reproducible script to use for this class

  • You should always load packages at the top of the script

# load packages
library(palmerpenguins)  # penguins data set
library(dplyr)           # mutate()
library(ggplot2)         # ggplot()

# import data
data_import <- penguins




It is a good idea to comment your code: comments organize the script and make clear what each step is doing.

Viewing the Data

In scratchpad script / console





Get column names

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"


Sneak peek at the data

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
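
The two outputs above are what calls like the following would print. This is a sketch that assumes colnames() and head() were used. names() would give the same column names.

```r
library(palmerpenguins)

data_import <- penguins

# Column names as a character vector
colnames(data_import)

# First six rows of the data frame
head(data_import)
```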

Viewing the Data

In scratchpad script / console


  • Use $ to refer to a column in a data frame

Get unique values in a column

## [1] Adelie    Gentoo    Chinstrap
## Levels: Adelie Chinstrap Gentoo
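
A sketch of how $ and unique() fit together, assuming the output above came from the species column. The name species_column is made up for illustration.

```r
library(palmerpenguins)

data_import <- penguins

# $ extracts a single column as a vector, one value per row
species_column <- data_import$species
length(species_column)

# unique() keeps each distinct value once
unique(species_column)
```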

Types of Values


Classes are the types of values that exist in R. Here is a list of some common value types:

  • character (text) "hello", "goodbye"

  • double (or numeric) 2, 32.55

  • integer 5L, 99L (the L suffix marks an integer; a plain 5 is stored as a double)

  • logical TRUE, FALSE

  • missing NA (NaN marks an undefined numeric result, such as 0/0)
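
These types can be checked directly in the console with typeof(). A small sketch using base R only:

```r
typeof("hello")   # "character"
typeof(32.55)     # "double"
typeof(2)         # "double": plain numbers are doubles by default
typeof(5L)        # "integer": the L suffix requests an integer
typeof(TRUE)      # "logical"

# NA marks a missing value; NaN is an undefined numeric result
is.na(NA)
is.nan(0 / 0)
```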

Types of Values

In scratchpad script / console


To evaluate the type of values in a column you can use typeof()

## [1] "double"
## [1] "integer"
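
Calls like the following would produce the two results above. Which columns were actually inspected is an assumption. The two chosen here are one double column and one integer column from the data set.

```r
library(palmerpenguins)

data_import <- penguins

typeof(data_import$bill_length_mm)     # measurements with decimals
typeof(data_import$flipper_length_mm)  # whole-number measurements
```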


To change the class of values in an object you can use the as.character(), as.numeric(), as.double(), as.integer(), and as.logical() functions.
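
A short sketch of these conversion functions on made-up values. The object names are hypothetical.

```r
as_text <- as.character(c(5, 6, 7))       # "5" "6" "7"
as_number <- as.numeric(c("1.5", "2"))    # 1.5 2.0
as_whole <- as.integer(3.9)               # 3: the decimal part is dropped
as_flag <- as.logical(c("TRUE", "no"))    # TRUE NA

as_text
as_number
as_whole
as_flag
```

Values that cannot be converted, such as "no" above, become NA, so it is worth checking the result after converting.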

In scratchpad script / console


## [1] "integer"
## [1] Adelie    Gentoo    Chinstrap
## Levels: Adelie Chinstrap Gentoo
## [1] TRUE
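
The three results above are consistent with inspecting the species column as sketched below. The exact calls are an assumption. The sketch shows that a column can print as text while being stored as integers, which is the signature of a factor.

```r
library(palmerpenguins)

data_import <- penguins

typeof(data_import$species)     # underlying storage type
unique(data_import$species)     # distinct values, with their levels
is.factor(data_import$species)  # is this column a factor?
```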


  • Factors are a special type of column that represents the levels of a category, with an order to those levels.

  • The values you build a factor from can be of type character, double, integer, or logical; R stores the factor itself as integer codes with a label for each level.

  • Factors become especially important in data visualization and statistical analysis.

Creating Factors

In scratchpad script / console


You can set a column of values as a factor by using factor()

factor(data_import$year, levels = c(2007, 2008, 2009))
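
To see what factor() does with its levels argument, here is a sketch on a small made-up vector of years. The names years and year_factor are hypothetical.

```r
years <- c(2008, 2007, 2009, 2007)
year_factor <- factor(years, levels = c(2007, 2008, 2009))

# Levels are stored as labels, in the order given
levels(year_factor)

# Each value is stored as the position of its level
as.integer(year_factor)   # 2 1 3 1

# Values not listed in levels become NA
factor(c(2007, 2010), levels = c(2007, 2008, 2009))
```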

Data Transformation



Let’s take a look at three variables

  • species

  • flipper length

  • body mass

Data Transformation


  • Compute the mean flipper length and body mass for each species of penguin
# transform data
data_plot <- data_import |>
  mutate(.by = species,
         flipper_length_mm.mean = mean(flipper_length_mm, na.rm = TRUE),
         body_mass_g.mean = mean(body_mass_g, na.rm = TRUE))


  • Strategy: Stay in the data frame!

    • Store variables and computations in columns in the data frame

Not advised

flipper_length_mm.mean <- mean(data_import$flipper_length_mm, na.rm = TRUE)
body_mass_g.mean <- mean(data_import$body_mass_g, na.rm = TRUE)
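
A sketch of the difference between the two approaches, assuming dplyr 1.1.0 or later (needed for the .by argument) and the palmerpenguins data. The name overall_mean is made up for illustration.

```r
library(palmerpenguins)
library(dplyr)

data_import <- penguins

# Free-standing object: one overall mean, detached from the data
overall_mean <- mean(data_import$flipper_length_mm, na.rm = TRUE)

# In the data frame: one mean per species, stored next to each row
data_plot <- data_import |>
  mutate(.by = species,
         flipper_length_mm.mean = mean(flipper_length_mm, na.rm = TRUE))

# Each species has its own value
data_plot |>
  distinct(species, flipper_length_mm.mean)
```

Because the means live in columns, they travel with the data into later steps such as plotting.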

Data Transformation


  • Compute the mean flipper length and body mass for each species of penguin
data_plot <- data_import |>
  mutate(.by = species,
         flipper_length_mm.mean = mean(flipper_length_mm, na.rm = TRUE),
         body_mass_g.mean = mean(body_mass_g, na.rm = TRUE))


  • The pipe operator, |>, passes data_import into the mutate() function as its first argument

  • Then the result of mutate() is assigned to a new object, data_plot
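
The pipe can be tried on a simpler case first. This sketch uses a made-up vector and base R functions, and assumes R 4.1 or later, where |> was introduced.

```r
values <- c(1, 4, 9)

# These two lines do the same thing
nested <- sqrt(values)
piped <- values |> sqrt()
identical(nested, piped)

# Pipes can be chained: take square roots, then add them up
chained <- values |> sqrt() |> sum()
chained
```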


Visualizing Data

ggplot2 - data and aesthetics layer


# visualize data
ggplot(data_plot, aes(x = flipper_length_mm, y = body_mass_g,
                      color = species, shape = species))

ggplot2 - geometries layer


# visualize data
ggplot(data_plot, aes(x = flipper_length_mm, y = body_mass_g,
                      color = species, shape = species)) +
  geom_point()

ggplot2 - theme


# visualize data
ggplot(data_plot, aes(x = flipper_length_mm, y = body_mass_g,
                      color = species, shape = species)) +
  geom_point() +
  theme_classic()  # example theme (an assumption); any ggplot2 theme_*() works



# visualize data
ggplot(data_plot, aes(x = flipper_length_mm, y = body_mass_g,
                      color = species, shape = species)) +
  geom_point() +
  geom_hline(aes(yintercept = body_mass_g.mean, color = species)) +
  geom_vline(aes(xintercept = flipper_length_mm.mean, color = species)) +
  theme_classic()  # example theme (an assumption); any ggplot2 theme_*() works

Statistical Analysis



# statistical analysis
cor.test(data_import$body_mass_g, data_import$flipper_length_mm)
##  Pearson's product-moment correlation
## data:  data_import$body_mass_g and data_import$flipper_length_mm
## t = 32.722, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.843041 0.894599
## sample estimates:
##       cor 
## 0.8712018
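
The printed summary above can also be stored and queried piece by piece. A sketch, with result as a made-up object name:

```r
library(palmerpenguins)

data_import <- penguins

# Store the test result instead of only printing it
result <- cor.test(data_import$body_mass_g, data_import$flipper_length_mm)

result$estimate   # correlation coefficient
result$p.value    # p-value
result$conf.int   # 95 percent confidence interval
```

Storing the result makes it easy to reuse individual numbers later in the script.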

Class 1: Reproducible Script


# load packages
library(palmerpenguins)  # penguins data set
library(dplyr)           # mutate()
library(ggplot2)         # ggplot()

# import data
data_import <- penguins

# transform data
data_plot <- data_import |>
  mutate(.by = species,
         flipper_length_mm.mean = mean(flipper_length_mm, na.rm = TRUE),
         body_mass_g.mean = mean(body_mass_g, na.rm = TRUE))

# visualize data
ggplot(data_plot, aes(x = flipper_length_mm, y = body_mass_g,
                      color = species, shape = species)) +
  geom_point() +
  geom_hline(aes(yintercept = body_mass_g.mean, color = species)) +
  geom_vline(aes(xintercept = flipper_length_mm.mean, color = species)) +
  theme_classic()  # example theme (an assumption); any ggplot2 theme_*() works

# statistical analysis
cor.test(data_import$body_mass_g, data_import$flipper_length_mm)