Data Manipulation in R

Data manipulation involves the process of transforming and restructuring data to prepare it for analysis or to extract insights from it.

There are 8 fundamental data manipulation verbs. Most are in the dplyr package.

  1. filter() picks rows based on their values.

  2. distinct() removes duplicate rows

  3. arrange() reorders the rows

  4. select() selects variables by their names.

  5. rename() renames columns

  6. mutate() and. transmutate() adds/creates new variables

  7. summarise() compute statistical summaries.

Subsetting and Filtering

Selecting specific subsets of data or filtering rows based on certain criteria to focus on relevant observations.

subset()

Is a base R function but for data frames filter() is preferred.

create a subset of a data frame based on specified conditions.

Syntax:

subset(x, subset, select, ...)
  • x: The data frame you want to subset.

  • subset: The condition or logical expression that defines which rows to select. Rows for which the subset condition is TRUE will be included in the subset.

  • select: Optional. Specifies which columns (variables) to include in the subset. You can provide a vector of column names or indices. If omitted, all columns will be included.

  • ...: Additional arguments to pass to the subset function. These are not commonly used.

Example:

fair_diamond <- subset(diamonds, quality == "Fair")

filter()

It is part of the dplyr package

library(dplyr)

# Create a subset using filter() and select()
subset_data <- data_frame %>%
  filter(condition) %>%
  select(column1, column2)

Subsetting data frame rows in R

The following functions are used:

  1. slice() extracts rows by position

  2. filter() extracts rows that meet a certain logical criteria

  3. sample_n() randomly select n rows

  4. sample_frac() randomly select a fraction of rows

  5. top_n() select top n rows ordered by a variable.

Variable Selection and Renaming

rename()

part of dplyr

# Rename columns using the rename() function
data <- data %>%
  rename(NewName1 = OldName1, NewName2 = OldName2)

Rename using the assignment operator (<-)

# Create a sample data frame
data <- data.frame(
  OldName1 = c(1, 2, 3),
  OldName2 = c("A", "B", "C")
)

# Rename columns using the assignment operator
colnames(data) <- c("NewName1", "NewName2")

select()

part of the dplyr package

It is used to select specific columns (variables) from a data frame or data table.

selected_data <- data %>%
  select(Name, Age, Score)

Here are some common use cases for select():

  1. Select specific columns by name.

  2. Use the: operator to select a range of columns.

  3. Use - to exclude specific columns.

  4. Combine with other dplyr functions like mutate() or filter() for complex data manipulation.

Helper functions with select.

They are not part of dplyr package but are in the tidyverse package.

  1. starts_with("prefix"): Select columns that start with a specified prefix.

     select(data, starts_with("prefix"))
    
  2. ends_with("suffix"): Select columns that end with a specified suffix.

     select(data, ends_with("suffix"))
    
  3. contains("substring"): Select columns that contain a specified substring.

     select(data, contains("substring"))
    
  4. matches("regex"): Select columns that match a specified regular expression.

     select(data, matches("regex"))
    
  5. one_of("col1", "col2", "col3"): Select a specific set of columns.

     select(data, one_of("col1", "col2", "col3"))
    
  6. num_range("x", 1:5): Select columns with a numeric range in their names.

     select(data, num_range("x", 1:5))
    
  7. everything(): Select all columns not explicitly mentioned.

     select(data, -column_to_exclude)
    
  8. where(predicate): Select columns based on a custom logical condition.

     select(data, where(~is.numeric(.)))
    

using []

diamonds[3,1]

3 - represents the rows

1 - represents the columns.

Creating Derived Variables

mutate() creates a new column or modifies the existing one. It retains the original column in the output.

transmute() creates new column or modify existing ones. It only keeps the new or modified ones and does not include the original column.

Both work based on existing data.

Aggregation and Summarization

  • distinct(): This function is primarily used for data manipulation. It selects unique rows from a data frame or data table, which is often done as part of data exploration and analysis.

  • count(): It counts the number of occurrences of each unique combination of variables.

  • summarize() function in dplyr is commonly used to compute summary statistics within each group.

  • aggregate()

  • group_by() Grouping data by one or more columns.

Arrange Data

Data can be ordered using the arrange() function.

arrange(desc(Score))

Joining and Merging Data

Combining data from multiple sources or tables into a single dataset for analysis.

1. merge() It is used to combine two or more data frames based on common columns or variables. It performs a similar operation to SQL JOINs, allowing you to merge data frames horizontally.

merged_data <- merge(df1, df2, by = "ID")

2. data.table Package:

  • The data.table package provides a high-performance method for joining large datasets.

  • The data.table class has a merge() method similar to the base R merge() function.

  • You can use the on argument to specify the columns to join on.

library(data.table)

# Inner join two data tables
merged_data <- data.table1[data.table2, on = "common_column"]

3. dplyr Package:

  • The dplyr package provides several functions for data manipulation, including joining data frames.

  • inner_join(), left_join(), right_join(), and full_join() functions allow you to perform different types of joins.

  • You can use the by argument to specify the columns to join on.

library(dplyr)

# Inner join two data frames
merged_data <- inner_join(data_frame1, data_frame2, by = "common_column")

4. cbind()

concatenate data frames column-wise

rbind()

concatenate data frames row-wise

5. bind_rows()

concatenate data frames row-wise

bind_cols()

concatenate data frames column-wise

library(dplyr)

# Concatenate data frames row-wise
combined_data <- bind_rows(data_frame1, data_frame2)

# Concatenate data frames column-wise
combined_data <- bind_cols(data_frame1, data_frame2)

Reshaping Data

spread() is a helpful function for reshaping data when you want to work with it in a different structure, such as for creating summary tables or for visualization.

spread(data, key, value)
  • data: The data frame or tibble you want to reshape.

  • key: The column in the data frame that contains the values you want to use as column headers.

  • value: The column in the data frame that contains the values to fill in the new columns.

Here's a step-by-step example to illustrate how spread() works:

Suppose you have a data frame like this in long format:

# Example data in long format
data_long <- data.frame(
  Student = c("Alice", "Bob", "Charlie", "Alice", "Bob", "Charlie"),
  Subject = c("Math", "Math", "Math", "Science", "Science", "Science"),
  Score = c(90, 88, 92, 78, 85, 88)
)

It looks like this:

   Student Subject Score
1    Alice    Math    90
2      Bob    Math    88
3  Charlie    Math    92
4    Alice Science    78
5      Bob Science    85
6  Charlie Science    88
# Using spread() to transform data to wide format
library(dplyr)
library(tidyr)

data_wide <- data_long %>%
  spread(Subject, Score)

The resulting data_wide data frame will look like this:

   Student Math Science
1    Alice   90      78
2      Bob   88      85
3  Charlie   92      88
# Using separate() to split Full_Name into First_Name and Last_Name
library(tidyr)

data_split <- data %>%
  separate(Full_Name, into = c("First_Name", "Last_Name", sep = " ")