Data Manipulation in R
Data manipulation involves the process of transforming and restructuring data to prepare it for analysis or to extract insights from it.
There are 8 fundamental data manipulation verbs. Most are in the dplyr
package.
filter()
picks rows based on their values.distinct()
removes duplicate rowsarrange()
reorders the rowsselect()
selects variables by their names.rename()
renames columnsmutate()
and.transmutate()
adds/creates new variablessummarise()
compute statistical summaries.
Subsetting and Filtering
Selecting specific subsets of data or filtering rows based on certain criteria to focus on relevant observations.
subset()
Is a base R function but for data frames filter()
is preferred.
create a subset of a data frame based on specified conditions.
Syntax:
subset(x, subset, select, ...)
x
: The data frame you want to subset.subset
: The condition or logical expression that defines which rows to select. Rows for which thesubset
condition isTRUE
will be included in the subset.select
: Optional. Specifies which columns (variables) to include in the subset. You can provide a vector of column names or indices. If omitted, all columns will be included....
: Additional arguments to pass to thesubset
function. These are not commonly used.
Example:
fair_diamond <- subset(diamonds, quality == "Fair")
filter()
It is part of the dplyr
package
library(dplyr)
# Create a subset using filter() and select()
subset_data <- data_frame %>%
filter(condition) %>%
select(column1, column2)
Subsetting data frame rows in R
The following functions are used:
slice()
extracts rows by positionfilter()
extracts rows that meet a certain logical criteriasample_n()
randomly select n rowssample_frac()
randomly select a fraction of rowstop_n()
select top n rows ordered by a variable.
Variable Selection and Renaming
rename()
part of dplyr
# Rename columns using the rename() function
data <- data %>%
rename(NewName1 = OldName1, NewName2 = OldName2)
Rename using the assignment operator (<-
)
# Create a sample data frame
data <- data.frame(
OldName1 = c(1, 2, 3),
OldName2 = c("A", "B", "C")
)
# Rename columns using the assignment operator
colnames(data) <- c("NewName1", "NewName2")
select()
part of the dplyr
package
It is used to select specific columns (variables) from a data frame or data table.
selected_data <- data %>%
select(Name, Age, Score)
Here are some common use cases for select()
:
Select specific columns by name.
Use the: operator to select a range of columns.
Use
-
to exclude specific columns.Combine with other
dplyr
functions likemutate()
orfilter()
for complex data manipulation.
Helper functions with select.
They are not part of dplyr
package but are in the tidyverse package.
starts_with("prefix")
: Select columns that start with a specified prefix.select(data, starts_with("prefix"))
ends_with("suffix")
: Select columns that end with a specified suffix.select(data, ends_with("suffix"))
contains("substring")
: Select columns that contain a specified substring.select(data, contains("substring"))
matches("regex")
: Select columns that match a specified regular expression.select(data, matches("regex"))
one_of("col1", "col2", "col3")
: Select a specific set of columns.select(data, one_of("col1", "col2", "col3"))
num_range("x", 1:5)
: Select columns with a numeric range in their names.select(data, num_range("x", 1:5))
everything()
: Select all columns not explicitly mentioned.select(data, -column_to_exclude)
where(predicate)
: Select columns based on a custom logical condition.select(data, where(~is.numeric(.)))
using []
diamonds[3,1]
3 - represents the rows
1 - represents the columns.
Creating Derived Variables
mutate()
creates a new column or modifies the existing one. It retains the original column in the output.
transmute()
creates new column or modify existing ones. It only keeps the new or modified ones and does not include the original column.
Both work based on existing data.
Aggregation and Summarization
distinct()
: This function is primarily used for data manipulation. It selects unique rows from a data frame or data table, which is often done as part of data exploration and analysis.count()
: It counts the number of occurrences of each unique combination of variables.summarize()
function in dplyr is commonly used to compute summary statistics within each group.aggregate()
group_by()
Grouping data by one or more columns.
Arrange Data
Data can be ordered using the arrange()
function.
arrange(desc(Score))
Joining and Merging Data
Combining data from multiple sources or tables into a single dataset for analysis.
1. merge()
It is used to combine two or more data frames based on common columns or variables. It performs a similar operation to SQL JOINs, allowing you to merge data frames horizontally.
merged_data <- merge(df1, df2, by = "ID")
2. data.table
Package:
The
data.table
package provides a high-performance method for joining large datasets.The
data.table
class has amerge()
method similar to the base Rmerge()
function.You can use the
on
argument to specify the columns to join on.
library(data.table)
# Inner join two data tables
merged_data <- data.table1[data.table2, on = "common_column"]
3. dplyr
Package:
The
dplyr
package provides several functions for data manipulation, including joining data frames.inner_join()
,left_join()
,right_join()
, andfull_join()
functions allow you to perform different types of joins.You can use the
by
argument to specify the columns to join on.
library(dplyr)
# Inner join two data frames
merged_data <- inner_join(data_frame1, data_frame2, by = "common_column")
4. cbind()
concatenate data frames column-wise
rbind()
concatenate data frames row-wise
5. bind_rows()
concatenate data frames row-wise
bind_cols()
concatenate data frames column-wise
library(dplyr)
# Concatenate data frames row-wise
combined_data <- bind_rows(data_frame1, data_frame2)
# Concatenate data frames column-wise
combined_data <- bind_cols(data_frame1, data_frame2)
Reshaping Data
spread()
is a helpful function for reshaping data when you want to work with it in a different structure, such as for creating summary tables or for visualization.
spread(data, key, value)
data
: The data frame or tibble you want to reshape.key
: The column in the data frame that contains the values you want to use as column headers.value
: The column in the data frame that contains the values to fill in the new columns.
Here's a step-by-step example to illustrate how spread()
works:
Suppose you have a data frame like this in long format:
# Example data in long format
data_long <- data.frame(
Student = c("Alice", "Bob", "Charlie", "Alice", "Bob", "Charlie"),
Subject = c("Math", "Math", "Math", "Science", "Science", "Science"),
Score = c(90, 88, 92, 78, 85, 88)
)
It looks like this:
Student Subject Score
1 Alice Math 90
2 Bob Math 88
3 Charlie Math 92
4 Alice Science 78
5 Bob Science 85
6 Charlie Science 88
# Using spread() to transform data to wide format
library(dplyr)
library(tidyr)
data_wide <- data_long %>%
spread(Subject, Score)
The resulting data_wide
data frame will look like this:
Student Math Science
1 Alice 90 78
2 Bob 88 85
3 Charlie 92 88
# Using separate() to split Full_Name into First_Name and Last_Name
library(tidyr)
data_split <- data %>%
separate(Full_Name, into = c("First_Name", "Last_Name", sep = " ")