Data Wrangling in Tidyverse
Data wrangling is one of the most important skills required in statistics and data science
There are times when you have a data set with several cases (rows) and variables (columns) but you are interested in a subset of these. What do you do that case?
Suppose you want to create a new variable by transforming an existing variable. What do you do? Manipulating a data set into a format that suites your analysis is called data wrangling. In this post, I provide the basic data wrangling techniques commonly used by data scientists and statisticians. The techniques include:
- select
- mutate
- filter
- arrange
- “summarize”, and
- group_by
Read on to learn how to use these functions:
Select
The select function is used to select specific variables (columns) from a larger data set.
Suppose you have a data set named Data with several variables (a, b, c, d, e, f, g, and h). Suppose you are interested in creating a new data set named New_Data with variables a,b, and c only. To create this new data set, you use the following command:
New_Data <- Data %>% select (a, b, c)
Note:
- After running the above code, a new data set named New_Data will be created with only three variables a, b, and c.
- The symbols <- is used to designate a new object while %>% is a “pipe” that sends anything on the its left to the commands on the right side. In this case, we pipe Data into a select function, then we state what columns we want to select.
Next, we consider the mutate function.
Mutate
The mutate function alters (mutates) an existing variable in some way and creates a new column(variable) with a name that you designate. Suppose you want to create a new variable called sum that adds the values of variable a and b for each case. Suppose your data set is called Data and you want to use New_Data_2
New_Data_2 <- Data %>% mutate (sum = a+b)
Note:
- If you want to delete the original variable from the data set, you can use the “transmute” instead of “mutate”.
- The above example assumes that variables a and b are both numerical, otherwise, addition may not be possible.
- You can do other operations besides addition.
Filter
Just like we can select certain variables (columns) from a data set, it is possible to filter a certain rows (cases/observations) in a data set.
Suppose you have a variable called sex and you want to isolate cases for which sex is female. Suppose also that you want to name your new data set as New_Data_3. Use the code
New_Data_3 <- Data %>% filter (sex == "female")
Note:
You can also filter using multiple criteria. For example, if you have another variable called age and you want females that are above 30, you can use the code:
New_Data_3 <- Data %>% filter (sex == "female", age > 30)
We use double equal signs “==” to specify the levels of the categorical variable. The level specified must be put in quotes.
The new data set will have fewer cases than the original data set.
If you want to have an “OR” condition (e.g., cases that are female OR aged above 30), you use “|” instead of a comma to separate the variables.
Group_by
Group_by is a special kind of filtering that is commonly used alongside a summarize function. Suppose you want to compute the average age by sex (are males older than female cases o average?).Here, you first use the group_by function to group the cases by sex, then run the summarize. See code below:
Dataset %>% group_by (sex) %>% summarize(mean(age))
Note:
The group_by function makes R to think about the data as different data frames.
The above code returns the means for each group in a single row. It essentially flattens the data (i.e., the summarize function)
If for some reason you want to have a new column in the data set with the mean score for each group, you can replace the summarize function with mutate as follows:
New <- Data %>% group_by(sex) %>% mutate(mean=mean(age))
This will add a new column named mean to your data set (now named New) with the mean age for each group.