library(tidyverse)
library(stringr)
Lab 4 - regular expressions
Today’s lab will provide practice working with regular expressions in R.
The goals for lab 4 include:
- practicing writing regular expressions
- understanding quantifiers
- using positioning of patterns
- learning special operators and character classes
Advice for turning in the assignment
Be sure to indicate (in the .qmd file) which problem is being answered with which code. A sentence or two with each response goes a long way toward your understanding!
Save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called SDS261 or something. That folder could live on your Desktop. Or maybe in your Dropbox.
The .qmd document should be saved in the R Project as lab4-sds261-yourlastname-yourfirstname.qmd.
Example: Let’s say that I want to test whether the string contains a US zip code (of the format: xxxxx or xxxxx-xxxx). I might want to test it against a particular string.
string_zip <- c("01063-6302", "91711", "6302", "01063", "zip 01063")
I would use the str_extract() function (in the stringr package) to test whether my regular expression is correct.
str_extract(string_zip, "^\\d{5}(-\\d{4})?$")
[1] "01063-6302" "91711" NA "01063" NA
Depending on how strict I was being, I might have kept the last one by leaving out the starting and ending positioning.
str_extract(string_zip, "\\d{5}(-\\d{4})?")
[1] "01063-6302" "91711" NA "01063" "01063"
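If only a yes/no answer is needed rather than the matched text, one option is str_detect(), also in the stringr package. The sketch below reuses the two patterns above on the same test vector; the names strict and loose are made up for illustration.

```r
library(stringr)

string_zip <- c("01063-6302", "91711", "6302", "01063", "zip 01063")

# Anchored with ^ and $: the whole string must be a zip code
strict <- str_detect(string_zip, "^\\d{5}(-\\d{4})?$")

# Unanchored: a zip code may appear anywhere inside the string
loose <- str_detect(string_zip, "\\d{5}(-\\d{4})?")

strict
loose
```

The two results should differ only for "zip 01063", which contains a zip code but is not one from start to end.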
Note that in R, \d needs to be escaped to \\d. That’s true with any metacharacter which uses a backslash.
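A small sketch of why the doubling is needed: R first reads the string literal, turning \\ into a single backslash, and only then hands the result to the regex engine. The variable names here are hypothetical, and the raw-string syntax in the last step assumes R 4.0.0 or later.

```r
library(stringr)

# The R string "\\d" holds two characters: a backslash and the letter d
writeLines("\\d")
n_chars <- nchar("\\d")

# The same rule applies when escaping a metacharacter such as the dot
literal_dot <- str_detect(c("3.14", "3114"), "3\\.14")  # dot must be a real "."
any_char    <- str_detect(c("3.14", "3114"), "3.14")    # dot matches any character

# Raw strings (R >= 4.0.0) avoid the doubling altogether
same_pattern <- identical(r"(\d{5})", "\\d{5}")

n_chars
literal_dot
any_char
same_pattern
```

Printing a pattern with writeLines() is a quick way to see what the regex engine actually receives.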
Assignment
Go through the lessons in https://regexone.com/. Nothing to turn in.
Catch all of the instances of the words color or colour, case insensitive. Test on the given string.
string <- c("color", "colour", "Color", "Colour", "Colr", "cols")
- Match any number (including zero) of o’s, as in: ggle, gogle, google, gooogle, …
- Match at least one o, as in: gogle, google, gooogle, …
Test on the given string.
string <- c("ggle", "gogle", "google", "gooogle", "goooogle", "gooooogle")
- Validate dates which are in the format mm/dd/yy or mm/dd/yyyy. Allow for any digits for the values (e.g., month could be 47). As an extra challenge, try to make the numerical values realistic (e.g., months only between 01 and 12). Test on the given string.
string_date <- c("01/11/2024", "1/11/2024", "1/1/24", "01/11/24", "24/01/4700")
- Check a command line response so that true, t, yes, y, okay, ok, and 1 are all accepted in any combination of uppercase and lowercase. Test on the given string.
str_affirm <- c("true", "t", "yes", "y", "okay", "ok", "1", "tRUe", "TRUE", "T",
                "YES!", "yeS", "okay...", "sure", "maybe")
- Match numbers that use the comma as the thousands separator and the dot as the decimal separator. Test on the given string.
string_number <- c("12345", "12,345", "123.45", "1,234,567.890", "12,345.")
- Determine whether a user entered a North American phone number in a common format, including the local area code. Common formats include 1234567890, 123-456-7890, 123.456.7890, 123 456 7890, (123) 456 7890, and all related combinations. Test on the given string.
string_phone <- c("1234567890", "1234", "456-7890", "123-456-7890", "123.456.7890", "123 456 7890", "(123) 456 7890", "+1 (123) 456 789")
- Find all words that occur inside an HTML emphasis tag (<em> and </em>). Test on the given string. (After Friday’s class.)
string_emph <- c("<p><strong>Pellentesque habitant morbi tristique</strong> senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper. <em>Aenean ultricies mi vitae est.</em> Mauris placerat eleifend leo. Quisque sit amet est et sapien ullamcorper pharetra. Vestibulum erat wisi, condimentum sed, <code>commodo vitae</code>, ornare sit amet, wisi. Aenean fermentum, elit eget tincidunt condimentum, eros ipsum rutrum orci, sagittis tempus lacus enim ac dui. <a href='#'>Donec non enim</a> in turpis pulvinar facilisis. Ut felis.</p>")