Lab 1 - working with Databases

Author

your name goes here

Published

January 8, 2024

library(tidyverse)
library(DBI)
library(RMariaDB)
library(dbplyr)
library(mdsr)

Today’s lab will provide practice working with the different tools we can use to implement SQL code. We haven’t covered much SQL syntax, so the focus will be more on the tools than on writing code. You should, however, be trying to understand the SQL code as you go along.

The goals for lab 1 include:

Advice for turning in the assignment

  • render early and often. In fact, go ahead and render your .qmd file right now. Maybe set a timer so that you render every 5 minutes. Do not wait until you are done with the assignment to render

  • save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called SDS261 or something. That folder could live on your Desktop. Or maybe in your Dropbox.

Assignment

Preliminary: You’ll need to create a quarto document. To do so click on: New file -> quarto document.

  • The document should be saved in the R Project as lab1-sds261-yourlastname-yourfirstname.qmd.
  • The yaml at the top of the file should look like this (change the author and date fields):
---
title: 'Lab 1 - working with Databases'
author: 'your name goes here'
date: 'due 10am Tuesday, January 9, 2024'
execute:
  echo: true
  warning: false
  message: false
---

For much of the assignment, Consider the task of figuring out how many flights came into or flew out of Bradley International Airport (BDL) last year.

  1. Set up a connection within an R chunk. See the README file.
con_air <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  dbname = "airlines",
  host = Sys.getenv("MDSR_HOST"),
  user = Sys.getenv("MDSR_USER"),
  password = Sys.getenv("MDSR_PWD")
)
  1. Pull in the flights table as a tbl (do not use collect()!). Use dplyr commands to calculate the number of flights that flew into or out of BDL in 2012. Then use dplyr::show_query() to output the SQL query. (All done inside an R chunk.)
flights <- tbl(con_air, "flights")
names(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "tailnum"        "flight"        
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "cancelled"      "diverted"       "hour"           "minute"        
[21] "time_hour"     
bdl_query <- flights |>
  filter(year == 2012 & (dest == "BDL" | origin == "BDL")) |>
  summarize(count = n())

show_query(bdl_query)
<SQL>
SELECT COUNT(*) AS `count`
FROM `flights`
WHERE (`year` = 2012.0 AND (`dest` = 'BDL' OR `origin` = 'BDL'))

There were 41,949 flights to or from BDL in 2012.

  1. Again, inside an R chunk, use DBI::dbGetQuery() to run the SQL code calculated in the previous question.
DBI::dbGetQuery(con_air, "SELECT COUNT(*) AS `count`
FROM `flights`
WHERE (`year` = 2012 AND (`dest` = 'BDL' OR `origin` = 'BDL'))")
  count
1 41949

There were 41,949 flights to or from BDL in 2012.

  1. Now, inside a SQL chunk, use the SQL commands directly (same task: how many flights in and out of BDL in 2012).
```{sql}
#| connection: con_air
#| unilur-solution: true

SELECT COUNT(*) AS `count`
FROM `flights`
WHERE (`year` = 2012 AND (`dest` = 'BDL' OR `origin` = 'BDL'))
```
1 records
count
41949

There were 41,949 flights to or from BDL in 2012.

  1. Rerun all three scenarios, using the tictoc R package to see which method calculates the desired number most quickly. Below is an example of how to use tictoc. For the SQL chunk, you’ll need to place tic() in an R chunk before the SQL chunk and toc() in an R chunk after the SQL chunk. Write 1-2 sentences describing the results.
library(tictoc)

tic()
rand_numbers <- rnorm(1000000)
quantile(rand_numbers)
           0%           25%           50%           75%          100% 
-4.7441954424 -0.6750814407  0.0006767307  0.6734974364  5.6445201338 
toc()
0.097 sec elapsed
  • Using R on the tbl.
tic()

flights <- tbl(con_air, "flights")

flights |>
  filter(year == 2012 & (dest == "BDL" | origin == "BDL")) |>
  summarize(count = n())
# Source:   SQL [1 x 1]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
    count
  <int64>
1   41949
toc()
5.526 sec elapsed
  • Using dbGetQuery().
tic()

DBI::dbGetQuery(con_air, "SELECT COUNT(*) AS `count`
FROM `flights`
WHERE (`year` = 2012 AND (`dest` = 'BDL' OR `origin` = 'BDL'))")
  count
1 41949
toc()
5.258 sec elapsed
  • Using SQL chunk.
tic()
```{sql}
#| connection: con_air
#| unilur-solution: true

SELECT COUNT(*) AS `count`
FROM `flights`
WHERE (`year` = 2012 AND (`dest` = 'BDL' OR `origin` = 'BDL'))
```
1 records
count
41949
toc()
5.515 sec elapsed

Running the SQL code (either directly through a SQL chunk or by using DBI::dbGetQuery()) is faster than when dbplyr is used to translate R code into SQL queries.

  1. Now create a tibble (not a tbl) that includes only flights in 2012 either to or from BDL (use collect()). After you have created the tibble, summarize the observations to get the count. Use tictoc to evaluate how long the process takes. Write 1-2 sentences describing the results.
tic()

flights_BDL <- flights |>
  filter(year == 2012 & (dest == "BDL" | origin == "BDL")) |>
  collect()

flights_BDL |>
  summarize(count = n())
# A tibble: 1 × 1
  count
  <int>
1 41949
toc()
5.538 sec elapsed

Pulling the table into R as a dataframe slows down the calculations quite a bit.

  1. DBeaver: open DBeaver from your applications. Following the instructions in the DBeaver online notes, set up a connection to the mdsr SQL server. Create a file called lab1-sds261-yourlastname-yourfirstname.sql and SAVE THE FILE in the lab1 R Project folder. Run the same SQL code as above. Did you get the same result? Were there any adjustments to the SQL code from above? Explain in a few sentences.

Yes, there were still 41,949 flights to and from BDL in 2012. Whew! It would have been very odd if there had been a different number because the database itself hasn’t changed at all, we are just sending the query using a different client.

I needed to remove the back ticks from the table and variable names. R uses backticks but DBeaver does not. Good to remember.

I had set up my connection to the entire mdsr database, so I needed to add the line USE airlines; to the top of my SQL query so that it would be able to find the flights table. I could have also specified where flights lives using: ... FROM airlines.flights....

  1. Reflect on at least 3 things that you learned during this lab. It could be about technical aspects. It could be about understanding databases. It could be about the difference between R and SQL. Anything you learned while doing the lab.
  • there are different ways to run the same SQL commands
  • some approaches are more efficient than other approaches
  • the RStudio IDE is flexible in that it allows both R and SQL code
  • DBeaver is a more typical SQL client in that many lines of SQL code can be written simultaneously, and the SQL connection says in tact for the entire set of queries.
  • the airlines database has a ginormous about of information which exists in four tables: airports, carriers, flights, and planes
  1. Always a good idea to terminate the SQL connection when you are done with it.
dbDisconnect(con_air, shutdown = TRUE)
  1. render - commit - push to GitHub. You should push three files, those with extensions .qmd, .html, and .sql. In a browser, look at your GitHub website to make sure that all three files were successfully pushed to the correct repo (yours for lab1).