Regular Expressions I

Jo Hardin

2024-01-11

Regular Expressions

A regular expression … is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

Tools for characterizing a regular expression

Escape sequences

Just to scratch the surface, here are a few special characters that cannot be directly coded. Therefore, they are escaped with a backslash, \.

  • \': single quote.
  • \": double quote.
  • \n: newline.
  • \r: carriage return.
  • \t: tab character.

Quantifiers

Quantifiers specify how many repetitions of the pattern.

  • *: matches at least 0 times.
  • +: matches at least 1 times.
  • ?: matches at most 1 times.
  • {n}: matches exactly n times.
  • {n,}: matches at least n times.
  • {n,m}: matches between n and m times.

Examples of quantifiers

strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
grep("ac*b", strings, value = TRUE)
grep("ac*b", strings, value = FALSE)
grep("ac+b", strings, value = TRUE)
grep("ac?b", strings, value = TRUE)
grep("ac{2}b", strings, value = TRUE)
grep("ac{2,}b", strings, value = TRUE)
grep("ac{2,3}b", strings, value = TRUE)

grep() stands for “global regular expression print”. grep() returns a character vector containing the selected elements, grepl() returns a logical vector of TRUE/FALSE for whether or not there was a match.

Examples of quantifiers - solution

strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
grep("ac*b", strings, value = TRUE)
[1] "ab"     "acb"    "accb"   "acccb"  "accccb"
grep("ac*b", strings, value = FALSE)
[1] 2 3 4 5 6
grep("ac+b", strings, value = TRUE)
[1] "acb"    "accb"   "acccb"  "accccb"
grep("ac?b", strings, value = TRUE)
[1] "ab"  "acb"
grep("ac{2}b", strings, value = TRUE)
[1] "accb"
grep("ac{2,}b", strings, value = TRUE)
[1] "accb"   "acccb"  "accccb"
grep("ac{2,3}b", strings, value = TRUE)
[1] "accb"  "acccb"

grep() stands for “global regular expression print”. grep() returns a character vector containing the selected elements, grepl() returns a logical vector of TRUE/FALSE for whether or not there was a match.

Position of pattern within the string

  • ^: matches the start of the string.
  • $: matches the end of the string.
  • \b: matches the boundary of a word. Don’t confuse it with ^ $ which marks the edge of a string.
  • \B: matches the empty string provided it is not at an edge of a word.

Examples of positions

strings <- c("abcd", "cdab", "cabd", "c abd")
grep("ab", strings, value = TRUE)
grep("^ab", strings, value = TRUE)
grep("ab$", strings, value = TRUE)
grep("\\bab", strings, value = TRUE)

Examples of positions - solution

strings <- c("abcd", "cdab", "cabd", "c abd")
grep("ab", strings, value = TRUE)
[1] "abcd"  "cdab"  "cabd"  "c abd"
grep("^ab", strings, value = TRUE)
[1] "abcd"
grep("ab$", strings, value = TRUE)
[1] "cdab"
grep("\\bab", strings, value = TRUE)
[1] "abcd"  "c abd"

Operators

  • .: matches any single character,
  • [...]: a character list, matches any one of the characters inside the square brackets. A - inside the brackets specifies a range of characters.
  • [^...]: an inverted character list, similar to [...], but matches any characters except those inside the square brackets.
  • \: suppress the special meaning of metacharacters in regular expression, i.e. $ * + . ? [ ] ^ { } | ( ) \. Since \ itself needs to be escaped in R, we need to escape metacharacters with double backslash like \\$.
  • |: an “or” operator, matches patterns on either side of the |.
  • (...): grouping in regular expressions. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them for building up a new string.

Examples of operators

strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12", "a|b")
grep("ab.", strings, value = TRUE)
grep("ab[c-e]", strings, value = TRUE)
grep("ab[^c]", strings, value = TRUE)
grep("^ab", strings, value = TRUE)
grep("\\^ab", strings, value = TRUE)
grep("abc|abd", strings, value = TRUE)
grep("a[b|c]", strings, value = TRUE)
str_extract(strings, "a[b|c]")

Examples of operators - solution

strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12", "a|b")
grep("ab.", strings, value = TRUE)
[1] "abc"   "abd"   "abe"   "ab 12"
grep("ab[c-e]", strings, value = TRUE)
[1] "abc" "abd" "abe"
grep("ab[^c]", strings, value = TRUE)
[1] "abd"   "abe"   "ab 12"
grep("^ab", strings, value = TRUE)
[1] "ab"    "abc"   "abd"   "abe"   "ab 12"
grep("\\^ab", strings, value = TRUE)
[1] "^ab"
grep("abc|abd", strings, value = TRUE)
[1] "abc" "abd"
grep("a[b|c]", strings, value = TRUE)
[1] "^ab"   "ab"    "abc"   "abd"   "abe"   "ab 12" "a|b"  
str_extract(strings, "a[b|c]")
[1] "ab" "ab" "ab" "ab" "ab" "ab" "a|"

Character classes

Character classes allow specifying entire classes of characters, such as numbers, letters, etc. There are two flavors of character classes, one uses [: and :] around a predefined name inside square brackets and the other uses \ and a special character. They are sometimes interchangeable.

  • (?i) before the string indicates that the match should be case insensitive.
  • [:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9].
  • \D: non-digits, equivalent to [^0-9].
  • [:lower:]: lower-case letters, equivalent to [a-z].
  • [:upper:]: upper-case letters, equivalent to [A-Z].
  • [:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-z].
  • [:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-z0-9].
  • \w: word characters, equivalent to [[:alnum:]_] or [A-z0-9_] (letter, number, or underscore).
  • \W: not word, equivalent to [^A-z0-9_].
  • [:blank:]: blank characters, i.e. space and tab.
  • [:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space.
  • \s: space, .
  • \S: not space.
  • [:punct:]: punctuation characters, ! ” # $ % & ’ ( ) * + , - . / : ; < = > ? @ [  ] ^ _ ` { | } ~.
  • [:graph:]: graphical (human readable) characters: equivalent to [[:alnum:][:punct:]].
  • [:print:]: printable characters, equivalent to [[:alnum:][:punct:]\\s].

Some examples

More examples for practice!

Case insenstive

  • Match only the word meter in “The cemetery is 1 meter from the stop sign.” Also match Meter in “The cemetery is 1 Meter from the stop sign.”

Case insenstive

  • Match only the word meter in “The cemetery is 1 meter from the stop sign.” Also match Meter in “The cemetery is 1 Meter from the stop sign.”
string <- c("The cemetery is 1 meter from the stop sign.", 
            "The cemetery is 1 Meter from the stop sign.")

str_extract(string, "(?i)\\bmeter\\b")
[1] "meter" "Meter"

Proper times and dates

  • Match dates like 01/15/24 and also like 01.15.24 and like 01-15-24.
string <- c("01/15/24", "01.15.24", "01-15-24", "011524", 
            "January 15, 2024")

Proper times and dates

  • Match dates like 01/15/24 and also like 01.15.24 and like 01-15-24.
string <- c("01/15/24", "01.15.24", "01-15-24", "01 15 24", 
            "011524", "January 15, 2024")

str_extract(string, "\\d\\d.\\d\\d.\\d\\d")
[1] "01/15/24" "01.15.24" "01-15-24" "01 15 24" NA         NA        
str_extract(string, "\\d\\d[/.\\-]\\d\\d[/.\\-]\\d\\d")
[1] "01/15/24" "01.15.24" "01-15-24" NA         NA         NA        
str_extract(string, "\\d{2}[/.\\-]\\d{2}[/.\\-]\\d{2}")
[1] "01/15/24" "01.15.24" "01-15-24" NA         NA         NA        

Proper times and dates

  • Match a time of day such as “9:17 am” or “12:30 pm”. Require that the time be a valid time (not “99:99 pm”). Assume no leading zeros (i.e., “09:17 am”).
string <- c("9:17 am", "12:30 pm", "99:99 pm", "09:17 am")

Proper times and dates

  • Match a time of day such as “9:17 am” or “12:30 pm”. Require that the time be a valid time (not “99:99 pm”). Assume no leading zeros (i.e., “09:17 am”).

^(1[012]|[1-9]):[0-5][0-9] (am|pm)$

string <- c("9:17 am", "12:30 pm", "99:99 pm", "09:17 am")

str_extract(string, "(1[012]|[1-9]):[0-5][0-9] (am|pm)")
[1] "9:17 am"  "12:30 pm" NA         "9:17 am" 
str_extract(string, "^(1[012]|[1-9]):[0-5][0-9] (am|pm)$")
[1] "9:17 am"  "12:30 pm" NA         NA        

Alternation operator

The “or” operator, | has the lowest precedence and parentheses have the highest precedence, which means that parentheses get evaluated before “or”.

  • What is the difference between \bMary|Jane|Sue\b and \b(Mary|Jane|Sue)\b?
string <- c("Mary", "Mar", "Janet", "jane", "Susan", "Sue")

str_extract(string, "\\bMary|Jane|Sue\\b")
str_extract(string, "\\b(Mary|Jane|Sue)\\b")

Alternation operator

The “or” operator, | has the lowest precedence and parentheses have the highest precedence, which means that parentheses get evaluated before “or”.

  • What is the difference between \bMary|Jane|Sue\b and \b(Mary|Jane|Sue)\b?
string <- c("Mary", "Mar", "Janet", "jane", "Susan", "Sue")

str_extract(string, "\\bMary|Jane|Sue\\b")
[1] "Mary" NA     "Jane" NA     NA     "Sue" 
str_extract(string, "\\b(Mary|Jane|Sue)\\b")
[1] "Mary" NA     NA     NA     NA     "Sue"