Regular expressions
A sequence of ordinary characters and metacharacters that describes a pattern.
Used for pattern matching (string matching).
Uses:
Data extraction
Cleaning
Data analysis
Data validation
Text mining
Parsing
[abc] | a, b, or c
[^abc] | any character except a, b, or c
[a-z] | any lowercase letter from a to z
[A-Z] | any uppercase letter from A to Z
[a-zA-Z] | any letter from a to z or A to Z
[0-9] | any digit from 0 to 9
[ ]? | occurs 0 or 1 time
[ ]+ | occurs 1 or more times
[ ]* | occurs 0 or more times
[ ]{n} | occurs exactly n times
[ ]{n,} | occurs n or more times
[ ]{y,z} | occurs at least y times but no more than z times
[:alnum:] | any alphanumeric character
[:digit:] | any numeric digit
[:alpha:] | any letter (upper or lowercase)
[:upper:] | any uppercase letter
[:lower:] | any lowercase letter
In the quantifier rows, [ ] stands for whatever item the quantifier follows. No spaces are allowed inside a range or inside the braces.
The POSIX classes such as [:digit:] must be written inside a bracket expression, for example [[:digit:]].
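As a rough illustration of the character classes and quantifiers listed above, the sketch below uses grepl(), which is covered later in these notes. The sample vectors are made up for the example, and the ^ and $ anchors are added only so that the whole string has to match.

```r
# Made-up sample data for illustration only.
words <- c("apple", "Banana", "cherry7", "42")

has_digit     <- grepl("[0-9]", words)           # contains at least one digit
has_non_digit <- grepl("[^0-9]", words)          # contains at least one non-digit
has_upper     <- grepl("[[:upper:]]", words)     # POSIX class inside a bracket expression
all_digits    <- grepl("^[[:digit:]]+$", words)  # one or more digits and nothing else

# Quantifiers apply to the item just before them (here the letter "a").
y <- c("ct", "cat", "caat", "caaat")
zero_or_one  <- grepl("^ca?t$", y)      # "a" 0 or 1 time
one_or_more  <- grepl("^ca+t$", y)      # "a" 1 or more times
zero_or_more <- grepl("^ca*t$", y)      # "a" 0 or more times
exactly_two  <- grepl("^ca{2}t$", y)    # "a" exactly 2 times
two_or_more  <- grepl("^ca{2,}t$", y)   # "a" 2 or more times
one_to_two   <- grepl("^ca{1,2}t$", y)  # "a" at least 1 and at most 2 times

print(data.frame(y, zero_or_one, one_or_more, exactly_two, one_to_two))
```

Note that {1,2} accepts both "cat" and "caat", which shows that the upper bound is inclusive.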
Regex Metacharacters
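\d | any digit, [0-9]
\D | any non-digit, [^0-9]
\w | any word character, [a-zA-Z0-9_]
\W | any non-word character, [^a-zA-Z0-9_]
\s | any whitespace character (space, tab, newline)
^ | Anchors the pattern to the beginning of a string.
$ | Anchors the pattern to the end of a string.
* | Matches the preceding item zero or more times; .* matches any character zero or more times.
In an R string literal each backslash must be doubled, so \s is written "\\s" and \d is written "\\d".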
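The metacharacters above can be tried out with a short sketch like the following. It relies on grepl(), described in the next section, and on a made-up sample vector; every backslash is doubled because the patterns are R string literals.

```r
# Made-up sample strings for illustration only.
s <- c("abc 123", "no_digits", "2024-01-15")

any_digit     <- grepl("\\d", s)    # contains a digit anywhere
starts_digit  <- grepl("^\\d", s)   # first character is a digit
ends_digit    <- grepl("\\d$", s)   # last character is a digit
has_space     <- grepl("\\s", s)    # contains a whitespace character
has_non_word  <- grepl("\\W", s)    # contains something other than letters, digits, "_"
starts_ends_d <- grepl("^\\d.*\\d$", s)  # digit, then anything, then digit

print(data.frame(s, any_digit, starts_digit, ends_digit, has_space, has_non_word))
```

The underscore in "no_digits" counts as a word character, so \W does not match it.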
grepl()
Searches for a pattern within a character vector or list of character strings.
The name combines grep ("globally search for a regular expression and print") with "l" for the logical return value.
Returns logical vector indicating whether a match was found for the pattern.
Syntax:
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
pattern
regular expression pattern you want to search
x
the character vector or string which you want to search for the pattern
ignore.case
optional logical argument that specifies whether matching should be case-insensitive (TRUE) or case-sensitive (FALSE, the default)
- Note on sub(): in its replacement argument, \1 (written "\\1" in R) is set to the string captured by the first parenthesized group of the pattern, for example ([0-9]+).
Example
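A minimal sketch of grepl() in use, assuming a made-up vector of fruit names; the variable names are illustrative and not part of the original notes.

```r
# Made-up sample data for illustration only.
fruits <- c("Apple", "banana", "pineapple", "cherry")

# Case-sensitive by default: "Apple" does not match the lowercase pattern.
has_apple <- grepl("apple", fruits)

# ignore.case = TRUE makes the match case-insensitive.
has_apple_ci <- grepl("apple", fruits, ignore.case = TRUE)

# The logical vector can be used directly to subset the input.
apple_like <- fruits[has_apple_ci]

# fixed = TRUE treats the pattern as a literal string, so "." is a plain dot.
has_dot <- grepl(".", c("a.b", "ab"), fixed = TRUE)

print(has_apple)
print(apple_like)
```

The result always has the same length as x, one TRUE or FALSE per element.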
grep()
Returns an integer vector with the indices (positions) of the elements of the input vector that match the pattern. With value = TRUE it returns the matching elements themselves instead of their positions.
Syntax
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
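The difference between positions and values can be seen in a small sketch like this one. The fruit vector is made up for the example; value and invert are standard grep() arguments.

```r
# Made-up sample data for illustration only.
fruits <- c("Apple", "banana", "pineapple", "cherry")

# Indices of the elements that contain "an".
idx <- grep("an", fruits)

# value = TRUE returns the matching elements rather than their positions.
apples <- grep("apple", fruits, ignore.case = TRUE, value = TRUE)

# invert = TRUE returns the elements that do NOT match (here: not ending in "e").
not_ending_e <- grep("e$", fruits, invert = TRUE, value = TRUE)

# When nothing matches, the result is an empty integer vector.
none <- grep("kiwi", fruits)

print(idx)
print(apples)
print(not_ending_e)
```

Unlike grepl(), the result can be shorter than x, and it is empty when nothing matches.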
sub()
The function is used for pattern substitution within character strings.
It replaces the first occurrence of a specified pattern (regular expression) in a character vector with a replacement string and returns the modified character vector.
Syntax
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
pattern
regular expression pattern you want to search
replacement
the string that will replace the first occurrence of the pattern in each element of x
x
the character vector or string which you want to search for the pattern
ignore.case
optional logical argument that specifies whether matching should be case-insensitive (TRUE) or case-sensitive (FALSE, the default)
perl
optional logical argument that indicates whether the pattern should be treated as a Perl-compatible regular expression (TRUE) or as an extended regular expression (FALSE, the default)
fixed
optional logical argument that specifies whether pattern should be treated as a fixed string (TRUE) or as a regular expression (FALSE, the default)
Note: If you want to replace all occurrences, you can use the gsub() function.
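The following sketch contrasts sub() and gsub() and shows a backreference in the replacement. The input strings are made up for illustration.

```r
# Made-up sample data for illustration only.
x <- c("order 66 shipped 2 items", "no numbers here")

# sub() replaces only the first run of digits in each element.
first_only <- sub("[0-9]+", "N", x)

# gsub() replaces every run of digits.
all_runs <- gsub("[0-9]+", "N", x)

# Parentheses capture the matched digits; "\\1" inserts them into the replacement.
bracketed <- sub("([0-9]+)", "<\\1>", x)

print(first_only)
print(all_runs)
print(bracketed)
```

Elements with no match are returned unchanged.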
regexpr()
Finds the starting position of the first match of a pattern (regular expression) in each element of a character vector.
If no match is found, it returns -1.
Syntax
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
pattern
the regular expression pattern you want to search for within the character vector text
text
the character vector in which you want to find the pattern
ignore.case
optional logical argument that specifies whether matching should be case-insensitive (TRUE) or case-sensitive (FALSE, the default)
perl
optional logical argument that indicates whether the pattern should be treated as a Perl-compatible regular expression (TRUE) or as an extended regular expression (FALSE, the default)
fixed
optional logical argument that specifies whether pattern should be treated as a fixed string (TRUE) or as a regular expression (FALSE, the default)
regexpr() is useful when you specifically need to know the starting position of the first occurrence of a pattern within each string in text.
Note: If you want to find the positions of all occurrences of the pattern within each element, use gregexpr(), which returns the positions of every match. grep() only reports which elements contain a match, not where the match starts.
Example
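A minimal sketch of regexpr() next to gregexpr(), assuming a made-up input vector. regmatches() is a base R helper that extracts the matched text from the result.

```r
# Made-up sample data for illustration only.
text <- c("banana", "cherry", "an")

# Start position of the FIRST match in each element; -1 means no match.
m <- regexpr("an", text)
first_pos <- as.integer(m)
first_len <- attr(m, "match.length")  # length of each match, -1 where none

# Extract the matched substrings (elements without a match are dropped).
matched <- regmatches(text, m)

# gregexpr() returns a list with the positions of ALL matches per element.
all_m <- gregexpr("an", text)
banana_pos <- as.integer(all_m[[1]])

print(first_pos)
print(matched)
print(banana_pos)
```

In "banana" the pattern occurs at positions 2 and 4, but regexpr() reports only the first one.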
Further Reading
https://bookdown.org/rdpeng/RProgDA/text-processing-and-regular-expressions.html