Missing data can arise for a multitude of reasons, from data collection errors to the inherent nature of some observations.
In R, missing values are represented with NA.
Testing for missing values
is.na()
This function checks whether values are missing (i.e., NA) in a vector or data frame. For a vector it returns a logical vector, and for a data frame it returns a logical matrix, with TRUE indicating missing values.
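As a minimal sketch of is.na() in base R, the example below uses a toy vector and a hypothetical data frame (the names x and df and their values are made up for illustration):

```r
# Toy vector with two missing values
x <- c(4, NA, 7, NA, 1)

na_flags <- is.na(x)       # logical vector, TRUE where the value is NA
n_missing <- sum(na_flags) # TRUE counts as 1, so this is the number of NAs
has_missing <- anyNA(x)    # TRUE if at least one value is missing

# Hypothetical data frame used only for illustration
df <- data.frame(Age = c(23, NA, 31), Group = c("a", "b", NA))
na_matrix <- is.na(df)     # logical matrix with one cell per value in df
```

Summing the logical result is the usual way to turn the TRUE/FALSE flags into a count.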
map() from the purrr package applies `sum(is.na(.))` to each column of a data frame, giving the count of missing values in each column.
'.' represents the current column being processed, not all the columns at once.
data %>% map(~sum(is.na(.)))
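For readers without purrr installed, the same per-column count can be sketched in base R. This is an alternative to the map() call, not a description of how purrr works internally, and the data frame df is hypothetical:

```r
# Hypothetical data frame used only for illustration
df <- data.frame(
  Age   = c(23, NA, 31, NA),
  Group = c("a", "b", NA, "b"),
  Score = c(1.5, 2.0, 3.5, 4.0)
)

# is.na(df) is a logical matrix, so colSums() counts the TRUEs per column
na_counts <- colSums(is.na(df))

# Equivalent column-by-column version, closer in spirit to the map() call
na_counts_sapply <- sapply(df, function(col) sum(is.na(col)))
```

Both results are named vectors with one entry per column.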
n_miss()
This function is part of the naniar package.
Counts the missing values in your dataset.
The output is a single number: the total count of missing values across the whole dataset. For a count per variable, which makes it easy to identify which columns have missing data, use miss_var_summary().
n_miss(data)
Visualizing Missing Values
gg_miss_var() plots the number (or percentage, with show_pct = TRUE) of missing values in each column.
Also part of the naniar package.
gg_miss_var(data, show_pct = TRUE)
miss_var_summary()
Provides summaries of missing data at the variable level, including the number and percentage of missing values.
miss_case_summary()
Gives summaries at the case (row) level, showing which cases have missing data and how many values are missing in each.
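To make the variable-level and case-level quantities concrete, here is a rough base R approximation. It is not naniar's implementation, and the data frame df and the output column names are assumptions chosen for illustration:

```r
# Hypothetical data frame used only for illustration
df <- data.frame(
  Age   = c(23, NA, 31, NA),
  Group = c("a", "b", NA, "b"),
  Score = c(1.5, 2.0, 3.5, 4.0)
)

na_matrix <- is.na(df)

# Variable level: count and percentage of missing values per column
var_summary <- data.frame(
  variable = names(df),
  n_miss   = unname(colSums(na_matrix)),
  pct_miss = unname(100 * colMeans(na_matrix))
)

# Case level: count and percentage of missing values per row
case_summary <- data.frame(
  case     = seq_len(nrow(df)),
  n_miss   = unname(rowSums(na_matrix)),
  pct_miss = unname(100 * rowMeans(na_matrix))
)
```

Sorting either table by n_miss in decreasing order puts the most incomplete variables or cases first.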
How to deal with missing data
Check with the data collection source
Missing completely at random (MCAR) - no systematic relationship between the missing data and any other values, observed or unobserved. Example: data entry errors when inputting data.
Missing at random (MAR) - a systematic relationship between the missing data and other observed values.
Missing not at random (MNAR) - a systematic relationship between the missing data and unobserved values, including the missing values themselves.
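The three mechanisms can be illustrated with a small simulation. This is a simplified sketch: the variables x and y, the sample size, and the cutoff of 1 are all assumptions chosen for illustration, and real missingness is rarely this clean.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)        # fully observed covariate
y <- x + rnorm(n)    # variable that will receive missing values

# MCAR: 200 values removed purely at random
y_mcar <- y
y_mcar[sample(n, 200)] <- NA

# MAR: missingness depends only on the observed covariate x
y_mar <- ifelse(x > 1, NA, y)

# MNAR: missingness depends on the value of y itself, which is then unobserved
y_mnar <- ifelse(y > 1, NA, y)

# Compare the mean of the observed values with the mean of the full data
mean_full <- mean(y)
mean_mcar <- mean(y_mcar, na.rm = TRUE)
mean_mnar <- mean(y_mnar, na.rm = TRUE)
```

Under MNAR the largest values of y are the ones removed, so the mean of the observed values falls below the mean of the full data, while under MCAR the observed mean stays close to it.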
Drop the missing values
If the missing data is minimal and doesn't introduce significant bias, you can remove rows or columns containing missing values.
drop_na()
This function is part of the tidyr package and is used to drop rows with missing values in a data frame.
Replace the missing values
Replace with the mean, median, or mode
Replace with zero
Replace it based on other methods, e.g. regression imputation, k-nearest neighbors imputation, or interpolation.
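The simpler replacement options can be sketched in base R as follows. The vector y is made up for illustration, and linear interpolation with approx() is just one possible interpolation method, assuming the values are ordered (for example, in time):

```r
# Toy ordered series with two missing values
y <- c(1, NA, 3, 4, NA, 6)
idx <- seq_along(y)

# Replace with the mean, the median, or zero
y_mean   <- ifelse(is.na(y), mean(y, na.rm = TRUE), y)
y_median <- ifelse(is.na(y), median(y, na.rm = TRUE), y)
y_zero   <- ifelse(is.na(y), 0, y)

# Linear interpolation: fit on the observed points, evaluate at every position
observed <- !is.na(y)
y_interp <- approx(idx[observed], y[observed], xout = idx)$y
```

Mean and median replacement ignore the ordering of the data, whereas interpolation uses the neighbouring observed values, so the results can differ noticeably.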
replace_na()
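A short sketch of replace_na() from the tidyr package, assuming tidyr is installed. The data frame df and the replacement values 0 and "unknown" are hypothetical choices for illustration:

```r
library(tidyr)

# Hypothetical data frame used only for illustration
df <- data.frame(
  Age   = c(23, NA, 31),
  Group = c("a", NA, "b"),
  stringsAsFactors = FALSE
)

# Data frame input: a named list gives one replacement value per column
df_filled <- replace_na(df, list(Age = 0, Group = "unknown"))

# Vector input: a single replacement value
age_filled <- replace_na(df$Age, 0)
```

Columns not named in the list are left unchanged.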
Conditional replacement
# Load the dplyr package
library(dplyr)
# Assuming you have a dataset called 'df' with columns 'Age' and 'Group'
# Replace missing age values with the median age for each group
df <- df %>%
  group_by(Group) %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup()
Excluding Missing Values
na.rm = TRUE
This argument is often used in functions like mean(), sum(), and others to exclude missing values when performing calculations on numeric vectors. It helps avoid issues like getting NA as a result.
mean(my_vector, na.rm = TRUE)
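A small illustration of the difference na.rm makes, using a made-up vector:

```r
my_vector <- c(2, 4, NA, 6)

mean_with_na    <- mean(my_vector)                # NA, because one value is missing
mean_without_na <- mean(my_vector, na.rm = TRUE)  # mean of 2, 4 and 6
sum_without_na  <- sum(my_vector, na.rm = TRUE)   # sum of 2, 4 and 6
```

Note that na.rm = TRUE changes the number of values the statistic is based on, here three instead of four.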
na.omit()
This function is used to remove rows with missing values. It returns the object with listwise deletion of missing values, effectively removing rows where any variable has a missing value.
na.omit(data)
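The sketch below shows listwise deletion on a hypothetical data frame, together with an equivalent base R formulation using complete.cases():

```r
# Hypothetical data frame used only for illustration
data <- data.frame(
  Age   = c(23, NA, 31, 40),
  Group = c("a", "b", NA, "b"),
  stringsAsFactors = FALSE
)

# Rows 2 and 3 each contain an NA, so only rows 1 and 4 remain
complete <- na.omit(data)

# na.omit() records which rows were dropped in the "na.action" attribute
dropped_rows <- as.integer(attr(complete, "na.action"))

# Equivalent selection with complete.cases()
complete_alt <- data[complete.cases(data), ]
```

Checking how many rows were dropped is worthwhile, since a few NAs spread over many columns can remove a large share of the data.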
Handling Missing Categorical Values
A new category can be created, e.g. "Unknown" or "Missing".
Impute using the most frequent value (mode).
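Both options can be sketched in base R. The vector group and the label "Unknown" are hypothetical, and base R has no built-in function for the statistical mode, so it is computed here from a frequency table:

```r
# Toy categorical variable with two missing values
group <- c("a", "b", NA, "b", NA, "c")

# Option 1: treat missingness as its own category
group_unknown <- ifelse(is.na(group), "Unknown", group)

# Option 2: replace with the most frequent value (mode)
counts <- table(group)                          # table() ignores NA by default
mode_value <- names(counts)[which.max(counts)]  # on ties, the first level wins
group_mode <- ifelse(is.na(group), mode_value, group)
```

A separate category keeps the information that a value was missing, while mode imputation inflates the share of the most common category.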