# A tibble: 10 × 6
Task Description episode contestant score series
<chr> <chr> <chr> <chr> <chr> <dbl>
1 1 Prize: Best thing you can carry, but on… Episod… Charlotte… 1 11
2 1 Prize: Best thing you can carry, but on… Episod… Jamali Ma… 2 11
3 1 Prize: Best thing you can carry, but on… Episod… Lee Mack 4 11
4 1 Prize: Best thing you can carry, but on… Episod… Mike Wozn… 5 11
5 1 Prize: Best thing you can carry, but on… Episod… Sarah Ken… 3 11
6 2 Do the most impressive thing under the … Episod… Charlotte… 2 11
# ℹ 4 more rows
more succinct results
Task Description episode contestant score series1 Prize: Best thing… Episode 1… Charlotte… 1111 Prize: Best thing… Episode 1… Jamali Ma… 2111 Prize: Best thing… Episode 1… Lee Mack 4111 Prize: Best thing… Episode 1… Mike Wozn… 5111 Prize: Best thing… Episode 1… Sarah Ken… 3112 Do the most… Episode 1… Charlotte… 2112 Do the most… Episode 1… Jamali Ma… 3[1] 112 Do the most… Episode 1… Lee Mack 3112 Do the most… Episode 1… Mike Wozn… 5112 Do the most… Episode 1… Sarah Ken… 411
Currently, the episode column contains entries like
Currently, the episode column contains entries like:
"Episode 2: The pie whisperer. (4 August 2015)"
How would I extract just the episode number?
Extracting episode information
Currently, the episode column contains entries like:
"Episode 2: The pie whisperer. (4 August 2015)"
How would I extract just the episode number?
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "\\d+")
[1] "2"
Extracting episode information
Currently, the episode column contains entries like:
"Episode 2: The pie whisperer. (4 August 2015)"
How would I extract the episode name?
Goal: find a pattern to match: anything that starts with a :, ends with a .
Let’s break down that task into pieces.
Extracting episode information
How can we find the period at the end of the sentence? What does each of these lines of code return?
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".")
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".+")
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "\\.")
Extracting episode information - solution
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".")
[1] "E"
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".+")
[1] "Episode 2: The pie whisperer. (4 August 2015)"
We use an escape character when we actually want to choose a period:
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "\\.")
[1] "."
Extracting episode information
Goal: find a pattern to match: anything that starts with a :, ends with a .
str_extract("Episode 2: The pie whisperer. (4 August 2015)",":.+\\.")
[1] ": The pie whisperer."
Lookaround (again)
Figure 4: Image credit: Stefan Judis https://www.stefanjudis.com/blog/a-regular-expression-lookahead-lookbehind-cheat-sheet/
Lookbehinds
(?<=) is a positive lookbehind. It is used to identify expressions which are preceded by a particular expression.
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "(?<=: ).+")
[1] "The pie whisperer. (4 August 2015)"
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "(?<=\\. ).+")
[1] "(4 August 2015)"
Lookaheads
(?=) is a positive lookahead. It is used to identify expressions which are followed by a particular expression.
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".+(?=\\.)")
[1] "Episode 2: The pie whisperer"
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".+(?=:)")
[1] "Episode 2"
Extracting episode information
Getting everything between the : and the .
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "(?<=: ).+(?=\\.)")
[1] "The pie whisperer"
Extracting air date
I want to extract just the air date. What pattern do I want to match?
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ...)
Extracting air date
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "(?<=\\().+(?=\\))")
[1] "4 August 2015"
Wrangling the episode info
Currently:
# A tibble: 270 × 1
episode
<chr>
1 Episode 1: It's not your fault. (18 March 2021)
2 Episode 1: It's not your fault. (18 March 2021)
3 Episode 1: It's not your fault. (18 March 2021)
4 Episode 1: It's not your fault. (18 March 2021)
5 Episode 1: It's not your fault. (18 March 2021)
6 Episode 1: It's not your fault. (18 March 2021)
# ℹ 264 more rows
# A tibble: 270 × 3
episode episode_name air_date
<chr> <chr> <chr>
1 1 It's not your fault 18 March 2021
2 1 It's not your fault 18 March 2021
3 1 It's not your fault 18 March 2021
4 1 It's not your fault 18 March 2021
5 1 It's not your fault 18 March 2021
6 1 It's not your fault 18 March 2021
# ℹ 264 more rows
# A tibble: 270 × 3
episode episode_name air_date
<chr> <chr> <chr>
1 1 It's not your fault 18 March 2021
2 1 It's not your fault 18 March 2021
3 1 It's not your fault 18 March 2021
4 1 It's not your fault 18 March 2021
5 1 It's not your fault 18 March 2021
6 1 It's not your fault 18 March 2021
# ℹ 264 more rows
Regular expressions and SQL
Back to the IMDb database…
SELECT production_year, titleFROM titleWHERE kind_id =1AND title REGEXP '(?i)star'LIMIT0, 20;
Displaying records 1 - 10
production_year
title
2005
"Dancing with the Stars" (I)
2005
"Dancing with the Stars" (II)
2005
"Dancing with the Stars" (III)
2017
"Girl Starter" (II)
2001
"Popstars" (I)
2001
"Popstars" (II)
2002
"Popstars" (I)
2000
"Popstars" (I)
1959
"Startime" (II)
1959
"Startime" (I)
Course project
Don’t forget, next week, each person will be working on their own mini project!
Using SQL queries and joins to wrangle complicated data tables.
Writing regular expressions to parse observations.
Creating a SQL database.
email jo.hardin@pomona.edu by Tuesday, Jan 16 with an idea of what you plan to do.
Question of interest that you hope to address.
Holistic description of the dataset(s) (a few sentences).
Description of the observational units and columns in each data table.