2024-01-08
https://sds261-sql.netlify.app/
airlines
databaseConsider a database of US flights between 2010 and 2017. The flights are downloaded from the Bureau of Transportation Statistics, US Department of Transportation. The database is a superset of the nycflights13
R package that tracks only flights in and out of airports serving New York City in 2013.
To set up a SQL connection, you need the location of the server (host
) as well as a user
name and password
.
Hadley Wickham discusses how to use Sys.getenv: https://cran.r-project.org/web/packages/httr/vignettes/secrets.html
airlines
databasetbl
[1] NA 2
# Source: SQL [6 x 2]
# Database: mysql [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
carrier name
<chr> <chr>
1 02Q Titan Airways
2 04Q Tradewind Aviation
3 05Q Comlux Aviation, AG
4 06Q Master Top Linhas Aereas Ltd.
5 07Q Flair Airlines Ltd.
6 09Q Swift Air, LLC
tibble
The function collect()
copies a SQL table from it’s server location on disk to your local memory location in R.
carriers
take up?The data frame in R takes up 2 orders of magnitude of memory more than the table which just points to the object in SQL.
We will write SQL code in three distinct ways:
r
chunk.sql
chunk, we can write actual SQL code inside a quarto document.The function dbListTables()
in the DBI package will tell us what tables exist in the airlines database.
flights
data taken?Because flights
is not actually a data.frame
in R (but instead a tbl
in SQL), the work that was done above was actually performed in SQL. To see the SQL code, we can use the function show_query
.
LAX
and BOS
in 2012.dbplyr doesn’t translate every R command into SQL.
SQL is not a statistical software and doesn’t, for example, have a mechanism for creating data visualizations.
track which R commands are connected to SQL at the dbplyr reference sheet.
flights
data. year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2010 10 1 1 2100 181 159 2320
2 2010 10 1 1 1920 281 230 2214
3 2010 10 1 3 2355 8 339 334
4 2010 10 1 5 2200 125 41 2249
5 2010 10 1 7 2245 82 104 2347
6 2010 10 1 7 10 -3 451 500
7 2010 10 1 7 2150 137 139 2337
8 2010 10 1 8 15 -7 538 537
arr_delay carrier tailnum flight origin dest air_time distance cancelled
1 159 XE N11137 2558 EWR OMA 162 1133 0
2 256 B6 N659JB 562 FLL SWF 131 1119 0
3 5 B6 N563JB 701 JFK SJU 196 1597 0
4 112 XE N16559 5982 IAD BNA 82 542 0
5 77 OO N908SW 6433 LAX FAT 37 209 0
6 -9 AA N3FRAA 700 LAX DFW 150 1235 0
7 122 DL N347NW 1752 ATL IAD 70 533 0
8 1 CO N73283 1740 SMF IAH 193 1609 0
diverted hour minute time_hour
1 0 21 0 2010-10-01 21:00:00
2 0 19 20 2010-10-01 19:20:00
3 0 23 55 2010-10-01 23:55:00
4 0 22 0 2010-10-01 22:00:00
5 0 22 45 2010-10-01 22:45:00
6 0 0 10 2010-10-01 00:10:00
7 0 21 50 2010-10-01 21:50:00
8 0 0 15 2010-10-01 00:15:00
flights
table?sql
chunksSQL queries can be written directly inside a sql
chunk in RStudio.
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | cancelled | diverted | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010 | 10 | 1 | 1 | 2100 | 181 | 159 | 2320 | 159 | XE | N11137 | 2558 | EWR | OMA | 162 | 1133 | 0 | 0 | 21 | 0 | 2010-10-01 21:00:00 |
2010 | 10 | 1 | 1 | 1920 | 281 | 230 | 2214 | 256 | B6 | N659JB | 562 | FLL | SWF | 131 | 1119 | 0 | 0 | 19 | 20 | 2010-10-01 19:20:00 |
2010 | 10 | 1 | 3 | 2355 | 8 | 339 | 334 | 5 | B6 | N563JB | 701 | JFK | SJU | 196 | 1597 | 0 | 0 | 23 | 55 | 2010-10-01 23:55:00 |
2010 | 10 | 1 | 5 | 2200 | 125 | 41 | 2249 | 112 | XE | N16559 | 5982 | IAD | BNA | 82 | 542 | 0 | 0 | 22 | 0 | 2010-10-01 22:00:00 |
2010 | 10 | 1 | 7 | 2245 | 82 | 104 | 2347 | 77 | OO | N908SW | 6433 | LAX | FAT | 37 | 209 | 0 | 0 | 22 | 45 | 2010-10-01 22:45:00 |
2010 | 10 | 1 | 7 | 10 | -3 | 451 | 500 | -9 | AA | N3FRAA | 700 | LAX | DFW | 150 | 1235 | 0 | 0 | 0 | 10 | 2010-10-01 00:10:00 |
2010 | 10 | 1 | 7 | 2150 | 137 | 139 | 2337 | 122 | DL | N347NW | 1752 | ATL | IAD | 70 | 533 | 0 | 0 | 21 | 50 | 2010-10-01 21:50:00 |
2010 | 10 | 1 | 8 | 15 | -7 | 538 | 537 | 1 | CO | N73283 | 1740 | SMF | IAH | 193 | 1609 | 0 | 0 | 0 | 15 | 2010-10-01 00:15:00 |
sql
chunksSQL queries can be written directly inside a sql
chunk in RStudio.
Always a good idea to terminate the SQL connection when you are done with it.
DBeaver is a free SQL client that supports MySQL
writing SQL code in R has some benefits (e.g., piping results tables into ggplot2 for visualizations)
using a SQL client that is designed for SQL queries has benefits as well.
to use DBeaver, download the client onto your computer and open it from your Applications.
On GitHub (on the web) edit the README document and Commit it with a message describing what you did.
Then, in RStudio also edit the README document with a different change.
As you work in teams you are likely to run into merge conflicts, learning how to resolve them properly will be very important.