Regular expressions

A regular expression is a sequence of characters and metacharacters that defines a search pattern.

It is used for pattern matching in strings.

Uses:

  • Data extraction

  • Cleaning

  • Data analysis

  • Data validation

  • Text mining

  • Parsing

Pattern       Meaning
[abc]         a, b, or c
[^abc]        any character except a, b, or c
[a-z]         a to z
[A-Z]         A to Z
[a-zA-Z]      a to z or A to Z
[0-9]         0 to 9
[ ]?          occurs 0 or 1 time
[ ]+          occurs 1 or more times
[ ]*          occurs 0 or more times
[ ]{n}        occurs exactly n times
[ ]{n,}       occurs n or more times
[ ]{y,z}      occurs at least y times but no more than z times
[:alnum:]     any alphanumeric character
[:digit:]     any numeric digit
[:alpha:]     any letter (upper or lowercase)
[:upper:]     any uppercase letter
[:lower:]     any lowercase letter

In the quantifier rows, [ ] stands for whatever character class or item is being repeated. The POSIX classes are written inside a bracket expression, for example [[:digit:]].
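
The classes and quantifiers above can be tried out with grepl(). This is a minimal sketch on made-up input; the vector and variable names (x, only_lower, and so on) are hypothetical and chosen only for illustration.

```r
# Hypothetical sample data, chosen only for illustration.
x <- c("cat", "a1", "2024", "")

# [a-z]+ between ^ and $: one or more lowercase letters and nothing else.
only_lower <- grepl("^[a-z]+$", x)        # TRUE FALSE FALSE FALSE

# [0-9]{4}: exactly four digits.
four_digits <- grepl("^[0-9]{4}$", x)     # FALSE FALSE TRUE FALSE

# POSIX classes go inside an outer bracket expression: [[:alpha:]].
has_letter <- grepl("[[:alpha:]]", x)     # TRUE TRUE FALSE FALSE

# ? makes the preceding item optional (0 or 1 time).
optional_u <- grepl("^colou?r$", c("color", "colour", "colouur"))  # TRUE TRUE FALSE

# {2,3}: at least 2 and at most 3 repetitions (the upper bound is inclusive).
two_or_three <- grepl("^a{2,3}$", c("a", "aa", "aaa", "aaaa"))     # FALSE TRUE TRUE FALSE
```

Anchoring with ^ and $ makes each pattern describe the whole string rather than any substring of it.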

Regex Metacharacters

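\d     a digit, same as [0-9]
\D     a non-digit, same as [^0-9]
\w     a word character, same as [A-Za-z0-9_]
\W     a non-word character, same as [^A-Za-z0-9_]
\s     a whitespace character (space, tab, newline); written "\\s" in an R string
^      anchors the pattern to the beginning of a string
$      anchors the pattern to the end of a string
*      matches the preceding item zero or more times; .* matches any character zero or more times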
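
A short sketch of these metacharacters in R follows. Inside an R string each backslash must be doubled, so \d is typed as "\\d". The sample vector and variable names are hypothetical.

```r
# Hypothetical sample data; the last element contains a tab character.
x <- c("room 101", "nospace", "42 apples", "tab\there")

# ^ anchors to the start, $ to the end; \\d is a digit.
starts_with_digit <- grepl("^\\d", x)   # FALSE FALSE TRUE FALSE
ends_with_digit   <- grepl("\\d$", x)   # TRUE FALSE FALSE FALSE

# \\s matches any whitespace character, including the tab.
has_whitespace <- grepl("\\s", x)       # TRUE FALSE TRUE TRUE

# \\W matches anything that is not a letter, digit, or underscore.
word_chars_only <- gsub("\\W", "", "a-b_c!")   # "ab_c"
```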

grepl()

Searches for a pattern within a character vector or list of character strings.

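The name is grep plus l for "logical": grep comes from "global regular expression print", and grepl() returns a logical result instead of the matches themselves.

Returns a logical vector, one value per element of x, indicating whether a match was found for the pattern.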

Syntax:

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

  • pattern: The regular expression pattern you want to search for.

  • x: The character vector or string in which you want to search for the pattern.

  • ignore.case: An optional logical argument that specifies whether the pattern matching should be case-insensitive (TRUE) or case-sensitive (FALSE, the default).

  • In sub(), \1 in the replacement argument (written "\\1" in an R string) refers to the text captured by the first parenthesised group in the pattern, for example ([0-9]+).

Example
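
A minimal sketch of grepl() on a made-up vector. The name fruits and its contents are hypothetical, not part of the original notes.

```r
# Hypothetical sample data.
fruits <- c("Apple", "banana", "Pineapple", "cherry")

# Case-sensitive by default: "Apple" does not contain lowercase "apple".
has_apple <- grepl("apple", fruits)                         # FALSE FALSE TRUE FALSE

# ignore.case = TRUE matches regardless of case.
has_apple_ci <- grepl("apple", fruits, ignore.case = TRUE)  # TRUE FALSE TRUE FALSE

# fixed = TRUE treats the pattern as a literal string, so "." means a dot.
has_dot <- grepl(".", c("a.b", "ab"), fixed = TRUE)         # TRUE FALSE

# The logical vector can be used directly to subset the input.
apple_like <- fruits[has_apple_ci]                          # "Apple" "Pineapple"
```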

grep()

Returns an integer vector containing the indices (positions) of the elements of the input vector that match the pattern.

Syntax

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
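
Example

A small sketch of grep() on hypothetical data. It also uses value and invert, two further arguments of grep(): value = TRUE returns the matching elements instead of their indices, and invert = TRUE returns the non-matching ones.

```r
# Hypothetical sample data.
fruits <- c("Apple", "banana", "Pineapple", "cherry")

# Indices of the elements that contain "an".
idx <- grep("an", fruits)                          # 2

# value = TRUE returns the matching elements themselves.
matched <- grep("an", fruits, value = TRUE)        # "banana"

# Elements ending in "e", and with invert = TRUE the ones that do not.
ends_e     <- grep("e$", fruits)                   # 1 3
not_ends_e <- grep("e$", fruits, invert = TRUE)    # 2 4
```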

sub()

The function is used for pattern substitution within character strings.

It replaces the first occurrence of a specified pattern (regular expression) in a character vector with a replacement string and returns the modified character vector.

Syntax

sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE)

  • pattern: The regular expression pattern you want to search for.

  • replacement: The string that will replace the first occurrence of the pattern in each element of x.

  • x: The character vector or string in which you want to search for the pattern.

  • ignore.case: An optional logical argument that specifies whether the pattern matching should be case-insensitive (TRUE) or case-sensitive (FALSE, the default).

  • perl: An optional logical argument that indicates whether the pattern should be treated as a Perl-compatible regular expression (TRUE) or a POSIX extended regular expression (FALSE, the default).

  • fixed: An optional logical argument that specifies whether pattern should be treated as a fixed string (TRUE) or as a regular expression (FALSE, the default).

Note: If you want to replace all occurrences, you can use the gsub() function.
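
Example

A minimal sketch contrasting sub() and gsub(), and showing a backreference. The input vector is made up for illustration.

```r
# Hypothetical sample data.
x <- c("item 12 of 30", "no numbers")

# sub() replaces only the first run of digits in each element.
first_only <- sub("[0-9]+", "#", x)         # "item # of 30" "no numbers"

# gsub() replaces every run of digits.
all_runs <- gsub("[0-9]+", "#", x)          # "item # of #" "no numbers"

# Parentheses capture the match; "\\1" in the replacement inserts it back.
wrapped <- sub("([0-9]+)", "<\\1>", x)      # "item <12> of 30" "no numbers"
```

Elements without a match are returned unchanged.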

regexpr()

regexpr() finds the starting position of the first match of a pattern (regular expression) within each element of a character vector.

For an element with no match, it returns -1. The length of each match is returned in the "match.length" attribute.

Syntax

regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE)
  • pattern: This is the regular expression pattern you want to search for within the character vector text.

  • text: This is the character vector or list of character strings in which you want to find the pattern.

  • ignore.case: An optional logical argument that specifies whether the pattern matching should be case-insensitive (TRUE) or case-sensitive (FALSE, the default).

  • perl: An optional logical argument that indicates whether the pattern should be treated as a Perl-compatible regular expression (TRUE) or a POSIX extended regular expression (FALSE, the default).

  • fixed: An optional logical argument that specifies whether pattern should be treated as a fixed string (TRUE) or as a regular expression (FALSE, the default).

  • regexpr() is useful when you specifically need to know the starting position of the first occurrence of a pattern within each string in text.

Note: If you want the positions of all occurrences of the pattern within each element, use gregexpr(), which returns the position of every match. grep(), by contrast, returns the indices of the elements that contain a match, not positions within a string.

Example
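
A short sketch of regexpr() on hypothetical input, with gregexpr() for comparison. regmatches() is the base R helper that extracts the matched text from a regexpr() result.

```r
# Hypothetical sample data.
x <- c("abc123", "no match", "12ab34")

# Position of the first run of digits in each element; -1 means no match.
m <- regexpr("[0-9]+", x)
starts  <- as.integer(m)              # 4 -1 1
lengths <- attr(m, "match.length")    # 3 -1 2

# Extract the matched text; elements without a match are dropped.
first_numbers <- regmatches(x, m)     # "123" "12"

# gregexpr() returns every match position within a string.
all_starts <- as.integer(gregexpr("[0-9]+", "12ab34")[[1]])   # 1 5
```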

Further Reading

https://bookdown.org/rdpeng/RProgDA/text-processing-and-regular-expressions.html