|>
data |>
step1 |>
step2 step3
Importing, Merging, and Restructuring Data
Before starting this class:
📦 Install readr
, dplyr
, tidyr
, ggplot2
, Hmisc
, RColorBrewer
Download sample data files: (right-click to download linked file)
Import: readr
File paths
read_csv()
and read_delim()
Merge: dplyr
bind_()
functions
join_()
functions
Restructure: tidyr
pivot_wider()
pivot_longer()
The pipe operator allows you to chain together a set of functions to conduct a sequence of manipulations on it. Conceptually, here’s what code written using the pipe operator looks like:
We start with our data. Then we do step1. Then we do step2. Then we do step3. The pipe ties it all together, enabling us to do multiple things to our data, all in one execution of code.
There are different approaches to writing code that performs multiple functions on the same object.
Here is the standard, non-pipe operator way:
An alternative is to use the pipe operator |>
With the pipe operator, the result of the previous line gets passed (or piped) onto the next function.
The first line in this example is simply specifying the data frame that is being passed from one line to the next.
Notice how I did not have to specify data
inside the filter()
, mutate()
, and select()
, functions. This makes the code more concise and easier to read.
The end result of the last function, then gets assigned to data_new <-
.
R needs to know the full file path to the file on your computer in order to import it
On Macs:
Users/username/projects/project_name/a_file.csv
On Windows:
C:\username\projects\project_name\a_file.csv
setwd()
here()
Never… ever… ever… use option 1, setwd()
.
Instead, you should use RProjects and here()
. But we will not cover this until Class 5.
For now, we can just use the RStudio Import Dataset GUI to find the absolute file path.
Select Browse on the top right and select the data file you want to import.
The Data Preview window will let you see if it is importing it in the right format.
You can change the Import Options below.
Click on the 📋 icon above the Code Preview window to copy the code.
Click on Cancel to exit out of the Import GUI window
Paste the code into your Untitled.R script
Exploring Your Data
When you want to just explore some data and don’t care about creating a reproducible script it can be perfectly acceptable to not copy and paste the code from Code Preview window and just select the Import button.
csv
tab-delimited
csv
Data Files
CSV stands for “Comma-Separated Values.”
csv
files are typically saved with .csv file extension (e.g., datafile.csv)
csv
files are by far the easiest files to import into R and most software programs.
I suggest any time you want to save/output a data file to your computer, do it in csv
format.
tab-delimited
Data Files
tab-delimited
files are saved with the more standard .txt file extension (e.g., datafile.txt)
There are a lot of different types of delimiters
tab-delimited
files are a little more tedious to import
You have to memorize more arguments to import tab-delimited
files
readr
Use the RStudio Import Dataset GUI to get the filepath that you will need
In Class 5 you will learn a better way to specify file paths using RProjects and here::here()
You might find yourself in a situation where you need to import multiple data files and merge them into a single data frame. There are two general classes of merging data
Bind
Join
Combining data frames together by stacking either the rows or columns
Row Bind: same columns but different rows and stack them on top of each other
Combining data frames together by stacking either the rows or columns
Column Bind: same rows but different columns and stacks them side-by-side. This is a much less common situation than a row bind and can usually be accomplished with a join instead
dplyr
Merging data frames together that have at least one column in common with a mix of shared and unique entries in that column (e.g. Subject IDs).
For the most part you can get away with just knowing how to do a full join using full_join()
from the dplyr
package
You need to specify what are the key column(s) to join by - columns that are common between the data frames.
Often times there is more than one key column that the data frames need to be joined by:
The exact same data can be structured in different ways
There are two main formats that any data set can be structured as:
Wide: Variables are spread out across columns, making the data frame wider
ID | Stress | Creativity | Memory |
---|---|---|---|
1 | 5 | 7 | 8 |
2 | 3 | 6 | 7 |
3 | 4 | 8 | 6 |
Long: Variables and values are spread across rows, making the data frame longer
ID | Test Type | Score |
---|---|---|
1 | Stress | 5 |
1 | Creativity | 7 |
1 | Memory | 8 |
2 | Stress | 3 |
2 | Creativity | 6 |
2 | Memory | 7 |
3 | Stress | 4 |
3 | Creativity | 8 |
3 | Memory | 6 |
And actually, you can have a mix of wide and long formatted data in a single data frame.
Participant ID | Session | Stress Level | Creativity | Memory |
---|---|---|---|---|
1 | 1 | 5 | 7 | 8 |
1 | 2 | 4 | 8 | 9 |
1 | 3 | 3 | 9 | 10 |
2 | 1 | 6 | 6 | 7 |
2 | 2 | 5 | 7 | 8 |
2 | 3 | 4 | 8 | 9 |
3 | 1 | 4 | 8 | 6 |
3 | 2 | 3 | 9 | 7 |
3 | 3 | 2 | 10 | 8 |
A good rule of thumb for formatting data is to have your variables (IVs and DVs) each have their own column.
This often results in:
Measured variables in wide format
Experimental conditions or repeated-measures in long format
Restructuring data involves changing the structure from long-to-wide (wider) or wide-to-long (longer)
The tidyr
package provides useful functions to do this:
pivot_wider()
“widens” data from long-to-wide
pivot_longer()
“lengthens” data from wide-to-long
pivot_wider()
Using the example data sets on previous slides, let’s restructure the long data frame to a wide format
First let’s create the data frame - you can just copy and paste this code in an Untitled.R script:
pivot_wider()
The two main arguments to specify in pivot_wider()
are
names_from: The column name that contains the variables to create new columns by (e.g. “Test Type”). The values in this column will become column names in the wider data format.
values_from: The column name that contains the values (e.g. “Score”).
Column Names
R does not like column names to have spaces in them, but it does allow it.
Notice how in data_wide
the column Stress Level
contains a space. This is because the value in data_long
had a space, which was not a problem then.
clean_names()
from janitor
provides a convenient way to get rid of spaces and replace them with an _
pivot_longer()
The three main arguments to specify in pivot_longer()
are:
cols: The column names that will be restructured to a longer format
names_to: The new column name that will contain values which correspond to the column names in the wide data
values_to: The new column name that will contain the actual values in the wide data
Converting from a wide data format to a mix of wide and long can be more complicated
Let’s say we have a wide data set with multiple sessions of Stress
, Creativity
, and Memory
.
Copy and paste this:
data_sessions_wide <- tibble(
ParticipantID = 1:3,
StressLevel_S1 = c(5, 6, 4),
Creativity_S1 = c(7, 6, 8),
Memory_S1 = c(8, 7, 6),
StressLevel_S2 = c(4, 5, 3),
Creativity_S2 = c(8, 7, 9),
Memory_S2 = c(9, 8, 7),
StressLevel_S3 = c(3, 4, 2),
Creativity_S3 = c(9, 8, 10),
Memory_S3 = c(10, 9, 8)
)
View(data_sessions_wide)
Let’s restructure this to 4 columns: Session
, Stress Level
, Creativity
, and Memory
.
There are two strategies for doing this:
pivot_longer()
, separate()
, and then pivot_wider()
pivot_longer()
# 1. Using pivot_longer(), separate(), and then pivot_wider()
data_sessions_1 <- data_sessions_wide |>
pivot_longer(cols = contains("_S"),
names_to = "Session",
values_to = "Score") |>
separate(Session, into = c("Test", "Session")) |>
pivot_wider(names_from = Test,
values_from = Score)
# 2. Use more complicated syntax in pivot_longer()
data_sessions_2 <- data_sessions_wide |>
pivot_longer(cols = contains("_S"),
names_to = c(".value", "Session"),
names_pattern = "(.*)_(S\\d)")
Note
You may find yourself in a situation where you need to select multiple columns in a argument but it can be tedious to type each every column, especially if the number of columns is larger.
There are what is known as tidy select functions to easily select column names, particularly if there is a consistent pattern to those column names.
This can be very useful in functions like pivot_longer()
. In the examples provided on previous slides, we used some tidy select functions: any_of()
and contains()
.
To learn more about what tidy select functions are available and what they do read the tidy select documentation.