Missing Data in R

Missing data can arise for a multitude of reasons, from data collection errors to the inherent nature of some observations.

In R, missing values are represented with NA.

Testing for missing values

is.na()

This function checks whether each value is missing (i.e., NA). Applied to a vector, it returns a logical vector in which TRUE marks a missing value; applied to a data frame, it returns a logical matrix of the same shape.
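
As a quick illustration, here is a minimal sketch of is.na() on made-up data. The names `x` and `df` are hypothetical and not part of any real dataset.

```r
# Hypothetical example data
x <- c(4, NA, 7, NA)

# Logical vector: TRUE where the value is missing
missing_flags <- is.na(x)

# Summing the logical vector counts the missing values
n_missing <- sum(missing_flags)

# On a data frame, is.na() returns a logical matrix of the same shape
df <- data.frame(age = c(21, NA, 35), city = c("Oslo", "Lima", NA))
missing_matrix <- is.na(df)
```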

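The expression below uses map() from the purrr package to apply `sum(is.na(.))` to each column of a data frame. It calculates the count of missing values in each column.

Inside the formula, '.' represents the column currently being processed, not all the columns at once.

map(data, ~ sum(is.na(.)))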
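
The per-column count can be sketched as follows, assuming the purrr package is installed. The data frame `df` is made up for illustration; map_int() and colSums() are shown only as alternatives that return a vector instead of a list.

```r
library(purrr)

# Hypothetical example data
df <- data.frame(
  age  = c(21, NA, 35, NA),
  city = c("Oslo", "Lima", NA, "Pune")
)

# map() returns a list with one element per column
na_counts <- map(df, ~ sum(is.na(.)))

# map_int() returns the same counts as a named integer vector
na_counts_int <- map_int(df, ~ sum(is.na(.)))

# Base R equivalent, for comparison
na_counts_base <- colSums(is.na(df))
```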

n_miss()

This function is part of the naniar package.

It counts the missing values in your dataset.

The output is a single number: the total count of missing values across the whole data frame or vector. For a count per variable, use miss_var_summary(), covered under Visualizing Missing Values.

n_miss(data)
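
A minimal sketch, assuming the naniar package is installed. It uses the built-in airquality dataset rather than a dataset from these notes, and n_complete() is included only to show the complementary count.

```r
library(naniar)

# airquality ships with R and contains missing values
total_missing <- n_miss(airquality)

# n_miss() also works on a single column
ozone_missing <- n_miss(airquality$Ozone)

# n_complete() is the complement: the number of non-missing values
total_complete <- n_complete(airquality)
```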

Visualizing Missing Values

The gg_miss_var() function plots the number (or percentage) of missing values in each column.

It is also part of the naniar package.

gg_miss_var(data, show_pct = TRUE)

miss_var_summary() provides summaries of missing data at the variable level, including the number and proportion of missing values.

miss_case_summary() gives summaries at the case (row) level, showing which cases have missing data and how many values are missing in each.
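
The three naniar helpers above can be tried together on the built-in airquality dataset. This is a sketch under the assumption that naniar (and its ggplot2 dependency) is installed; the object names are hypothetical.

```r
library(naniar)

# Plot of missingness per variable, shown as percentages
p <- gg_miss_var(airquality, show_pct = TRUE)

# One row per variable, with the count and percentage of missing values
var_summary <- miss_var_summary(airquality)

# One row per case (row of the data), with the same two measures
case_summary <- miss_case_summary(airquality)
```

Printing `p` draws the plot; printing the two summaries shows tibbles sorted with the most incomplete variables or cases first.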

How to deal with missing data

  • Check with the data collection source to understand why values are missing. The mechanism is usually classified as one of the following:

    • Missing completely at random (MCAR): no systematic relationship between the missingness and any other values, observed or unobserved. Example: data entry errors.

    • Missing at random (MAR): a systematic relationship between the missingness and other observed values.

    • Missing not at random (MNAR): a systematic relationship between the missingness and unobserved values, including the missing value itself.

  • Drop the missing values

    If the missing data is minimal and doesn't introduce significant bias, you can remove rows or columns containing missing values.

  • drop_na()

    This function is part of the tidyr package and is used to drop rows with missing values in a data frame.

  • Replace the missing values

    • Replace with the mean, median, or mode

    • Replace with zero

    • Replace using other methods, e.g. regression imputation, k-nearest neighbors imputation, or interpolation.
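
The drop and replace options above can be sketched on a small made-up data frame, assuming tidyr is installed. The names `df`, `dropped`, `imputed`, and `interpolated` are hypothetical, and the interpolation uses base R's approx() as one simple stand-in for the more elaborate methods listed.

```r
library(tidyr)

# Hypothetical example data with gaps in 'score'
df <- data.frame(day = 1:5, score = c(10, NA, 30, NA, 50))

# Option 1: drop every row that contains a missing value
dropped <- drop_na(df)

# Option 2: replace missing scores with the median of the observed scores
imputed <- df
imputed$score[is.na(imputed$score)] <- median(df$score, na.rm = TRUE)

# Option 3: linear interpolation between neighbouring observed points;
# approx() discards the NA pairs and evaluates the line at every day
interpolated <- df
interpolated$score <- approx(df$day, df$score, xout = df$day)$y
```

Which option is appropriate depends on how much data is missing and why; dropping rows is the simplest but discards the observed values in those rows as well.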

replace_na()

This function is part of the tidyr package and replaces missing values with a value you specify.
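
A minimal sketch of replace_na(), assuming tidyr is installed. The data frame and the replacement values (0 and "unknown") are made up for illustration.

```r
library(tidyr)

# Hypothetical example data
df <- data.frame(
  age  = c(21, NA, 35),
  city = c("Oslo", NA, "Pune")
)

# On a data frame, pass a named list with one replacement per column
filled <- replace_na(df, list(age = 0, city = "unknown"))

# On a vector, pass a single replacement value
ages_filled <- replace_na(df$age, 0)
```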

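Conditional replacing

Missing values can also be replaced conditionally, for example with a statistic computed within each group: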

    # Load the dplyr package
    library(dplyr)

    # Assuming you have a dataset called 'df' with columns 'Age' and 'Group'
    # Replace missing age values with the median age for each group

    df <- df %>%
      group_by(Group) %>%
      mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
      ungroup()

Excluding Missing Values

na.rm = TRUE

This argument is available in functions such as mean() and sum() to exclude missing values when performing calculations on numeric vectors. Without it, a single NA in the input makes the result NA.

mean(my_vector, na.rm = TRUE)
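
The effect is easy to see on a small vector. The values in `my_vector` below are made up for illustration.

```r
# Hypothetical example vector
my_vector <- c(2, 4, NA, 6)

# Without na.rm, the NA propagates to the result
mean_with_na <- mean(my_vector)

# With na.rm = TRUE, the NA is dropped before the calculation
mean_without_na <- mean(my_vector, na.rm = TRUE)
sum_without_na  <- sum(my_vector, na.rm = TRUE)
```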

na.omit()

This function performs listwise deletion: it removes every row in which any variable has a missing value and returns the remaining rows.

na.omit(data)
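
A short base R sketch of listwise deletion. The data frame below is made up, and the names `complete_rows` and `removed` are hypothetical.

```r
# Hypothetical example data
data <- data.frame(
  id    = 1:4,
  score = c(90, NA, 75, 60),
  grade = c("A", "B", NA, "C")
)

# Keep only the rows with no missing values in any column
complete_rows <- na.omit(data)

# The indices of the removed rows are kept in the "na.action" attribute
removed <- attr(complete_rows, "na.action")
```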

Handling Missing Categorical Values

  1. Create a new category, such as "unknown" or "missing".

  2. Impute using the most frequent category (the mode).
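
Both options can be sketched in base R. The vector `colour` is made up, and taking the mode with table() and which.max() is just one simple way to find the most frequent category (ties go to the first level in sorted order).

```r
# Hypothetical example data
colour <- c("red", "blue", NA, "red", NA)

# Option 1: treat missingness as its own category
colour_unknown <- ifelse(is.na(colour), "unknown", colour)

# Option 2: impute with the mode; table() ignores NA by default
mode_value  <- names(which.max(table(colour)))
colour_mode <- ifelse(is.na(colour), mode_value, colour)
```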

Further Reading

https://epirhandbook.com/en/missing-data.html

https://r4ds.hadley.nz/missing-values