Sorting the mess: Data Wrangling in Tidyverse
Introduction
There are times when you have a data set with several cases (rows) and variables (columns) but you are interested in a subset of these. What do you do that case?
Suppose you want to create a new variable by transforming an existing variable. What do you do? Manipulating a data set into a format that suites your analysis is called data wrangling. Here are the basic techniques used in data wrangling:
- select
- mutate
- filter
- arrange
- “summarize”, and
- group_by
- drop_na
Read on to learn how to use these functions:
Select
The select function is used to select specific variables (columns) from a larger data set.
Suppose you have a data set named Data with several variables (a, b, c, d, e, f, g, and h). Suppose you are interested in creating a new data set named New_Data with variables a,b, and c only. To create this new data set, you use the following command:
New_Data <- Data %>%
select (a, b, c)
Note:
- After running the above code, a new data set named New_Data will be created with only three variables a, b, and c.
- The symbols <- is used to designate a new object while %>% is a “pipe” that sends anything on the its left to the commands on the right side. In this case, we pipe Data into a select function, then we state what columns we want to select.
Next, we consider the mutate function.
Mutate
The mutate function alters (mutates) an existing variable in some way and creates a new column(variable) with a name that you designate. Suppose you want to create a new variable called sum that adds the values of variable a and b for each case. Suppose your data set is called Data and you want to use New_Data_2
New_Data_2 <- Data %>%
mutate (sum = a+b)
Note:
- If you want to delete the original variable from the data set, you can use the “transmute” instead of “mutate”.
- The above example assumes that variables a and b are both numerical, otherwise, addition may not be possible.
- You can do other operations besides addition.
Filter
Just like we can select certain variables (columns) from a data set, it is possible to filter a certain rows (cases/observations) in a data set.
Suppose you have a variable called sex and you want to isolate cases for which sex is female. Suppose also that you want to name your new data set as New_Data_3. Use the code
New_Data_3 <- Data %>%
filter (sex == "female")
Note:
You can also filter using multiple criteria. For example, if you have another variable called age and you want females that are above 30, you can use the code:
New_Data_3 <- Data %>% filter (sex == "female", age > 30)
We use double equal signs “==” to specify the levels of the categorical variable. The level specified must be put in quotes.
The new data set will have fewer cases than the original data set.
If you want to have an “OR” condition (e.g., cases that are female OR aged above 30), you use “|” instead of a comma to separate the variables.
Group_by
Group_by is a special kind of filtering that is commonly used alongside a summarize function. Suppose you want to compute the average age by sex (are males older than female cases on average?). Here, you first use the group_by function to group the cases by sex, then run the summarize function and call the mean. See code below:
Dataset %>%
group_by (sex) %>%
summarize(mean(age))
Note:
The group_by function makes R to think about the data as different data frames.
The above code returns the means for each group in a single row. It essentially flattens the data (i.e., the summarize function)
If, for some reason, you want to have a new column in the data set with the mean score for each group, you can replace the summarize function with mutate as follows:
New <- Data %>% group_by(sex) %>% mutate(mean=mean(age))
This will add a new column named mean to your data set (now named New) with the mean age for each group.
drop_na
Some statistics may not be computed with if the variable of interest has missing values (these appear as na ). In such cases, it might be necessary to drop missing values before moving forward with your analysis. To drop missing values from a given column such as age, you use the na function as shown below:
New_Data <- Data %>%
drop_na(age)
If you goal was to, say, compute the mean age, you do not need to do a two-step operation, you can leverage the pipe operator and drop na then compute the mean as an extra step:
New_Data <- Data %>%
drop_na(age) %>%
summarize(mean(age)) # adding an extra line to compute mean
Citation
@online{geteregechi2022,
author = {Geteregechi, Joash},
title = {Sorting the Mess: {Data} {Wrangling} in {Tidyverse}},
date = {2022-11-15},
url = {https://jmochogi.quarto.pub/posts/2023-07-06-Data-Wrangling/},
langid = {en}
}