SQL Extras

Jo Hardin

2024-01-18

Back to the flights

The examples below use the airlines database, including the flights, carriers, airports, and planes tables.

Efficiencies

Reconsider this analogy:

Each library (database) has books (tables). Each book (table) has pages (rows). Each page (row) has a unique page number to identify it (key value); to find a particular page, you sort through the page numbers (key values). But it isn’t immediately obvious where the particular page of interest is, you might have to page through the book a little bit to find the page of interest. It would be easier if you had several bookmarks throughout the book to anchor some of the page numbers. For example, if you want page 1047 and you have a bookmark on page 1050, you only have to turn back three pages. The bookmark is an index, it helps you find the desired rows much more quickly.1

KEYs

Keys are unique identifiers for each row, used primarily for connecting tables. Keys are generally not helpful for efficiency, but they are important for data integrity and relationships between tables. A key is a pointer that identifies a record. In practice, a key is one or more columns that are earmarked to uniquely identify a record in a table. Keys serve two main purposes:

  1. They provide constraints on the column such as that it can’t store duplicate or null values.
  2. They are also used to generate relationships among different tables.

INDEXes

  • By indexing the rows, SQL is able to optimize sorting and joining tables.

  • The index is created in advance (when the table is created) and saved to disk, which can take up substantial space on the disk.

  • Sometimes more than one variable is used to index the table.

  • There are trade-offs to having a lot of indexes (disk space but fast wrangling) versus a few indexes (slow wrangling but less space).

  • A table may have more than one index but you shouldn’t add indexes to every column in a table, as these have to be updated for every addition/update/delete to the column.

  • Indexes should be added to columns that are frequently included in queries.

SHOW INDEXES

Notice that the planes table has a single PRIMARY key. That primary key is used to index the table. The flights table has no PRIMARY key, but it does have six different indexes: Year, Date, Origin, Dest, Carrier, and tailNum.

SHOW INDEXES FROM planes;
1 records
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment Index_comment
planes 0 PRIMARY 1 tailnum A 3322 BTREE
SHOW INDEXES FROM flights;
8 records
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment Index_comment
flights 1 Year 1 year A 7 YES BTREE
flights 1 Date 1 year A 7 YES BTREE
flights 1 Date 2 month A 89 YES BTREE
flights 1 Date 3 day A 2712 YES BTREE
flights 1 Origin 1 origin A 2267 BTREE
flights 1 Dest 1 dest A 2267 BTREE
flights 1 Carrier 1 carrier A 134 BTREE
flights 1 tailNum 1 tailnum A 37862 YES BTREE

SHOW INDEXES

The values output by SHOW INDEXES are:1

  • Table: The name of the table.

  • Non_unique: 0 if the index cannot contain duplicates, 1 if it can.

  • Key_name: The name of the index. If the index is the primary key, the name is always PRIMARY.

  • Seq_in_index: The column sequence number in the index, starting with 1.

  • Column_name: The column name. See also the description for the Expression column.

  • Collation: How the column is sorted in the index. This can have values A (ascending), D (descending), or NULL (not sorted).

  • Cardinality: An estimate of the number of unique values in the index.

  • Sub_part: The index prefix. That is, the number of indexed characters if the column is only partly indexed, NULL if the entire column is indexed.

  • Packed: Indicates how the key is packed. NULL if it is not.

  • Null: Contains YES if the column may contain NULL values and ’’ if not.

  • Index_type: The index method used (BTREE, FULLTEXT, HASH, RTREE).

  • Comment: Information about the index not described in its own column, such as disabled if the index is disabled.

  • Index_comment: Any comment provided for the index with a COMMENT attribute when the index was created.

Partitioning

Another way to speed up query retrievals is to partition the data tables. If, for example, the SNL queries were always done by year, then the episodes table could be partitioned such that they are stored as separate tables (one per year). The partitioning functions as an index on year. The user would not be able to tell the difference between the unpartitioned episodes table and the partitioned one. However, queries done by year would be faster. Queries done grouped in another way would be slower.

Querying quickly

Indexes are built to accommodate the specific queries that are most likely to be run. However, you might not know which queries are going to be run, so it isn’t always obviously how to index a table.

SHOW INDEXES FROM flights;
8 records
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment Index_comment
flights 1 Year 1 year A 7 YES BTREE
flights 1 Date 1 year A 7 YES BTREE
flights 1 Date 2 month A 89 YES BTREE
flights 1 Date 3 day A 2712 YES BTREE
flights 1 Origin 1 origin A 2267 BTREE
flights 1 Dest 1 dest A 2267 BTREE
flights 1 Carrier 1 carrier A 134 BTREE
flights 1 tailNum 1 tailnum A 37862 YES BTREE

EXPLAIN with distance

EXPLAIN communicates how onerous the query is, without actually running it—saving you the time of having to wait for it to execute.

EXPLAIN SELECT * FROM flights WHERE distance > 3000;
1 records
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE flights p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 ALL 47932811 33.3 Using where

If we were to run a query for long flights using the distance column the server will have to inspect each of the 48 million rows, because distance is not indexed. A query on a non-indexed variable is the slowest possible search and is often called a table scan. The 48 million number that you see in the rows column is an estimate of the number of rows that MySQL will have to consult in order to process your query. In general, more rows mean a slower query.

EXPLAIN with year

A search for recent flights using the year column, which has an index built on it, considers many fewer rows (about 6.3 million, those flights in 2013).

EXPLAIN SELECT * FROM flights WHERE year = 2013;
1 records
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE flights p27 ALL Year,Date 6369482 100 Using where

Cardinality x rows

The Cardinality from SHOW INDEXES times the rows from EXPLAIN is roughly the total number of rows in the dataframe.

EXPLAIN with year and month

In a search by year and month, SQL uses the Date index. Only 700,000 rows are searched, those in June of 2013.

EXPLAIN SELECT * FROM flights WHERE year = 2013 AND month = 6;
1 records
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE flights p27 ref Year,Date Date 6 const,const 714535 100

EXPLAIN with month

If we search for particular months across all years, the indexing does not help at all. The query results in a table scan.

EXPLAIN SELECT * FROM flights WHERE month = 6;
1 records
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE flights p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 ALL 47932811 10 Using where

Although month is part of the Date index, it is the second column in the index, and thus it doesn’t help us when we aren’t filtering on year. Thus, if it were common for our users to search on month without year, it would probably be worth building an index on month. Were we to actually run these queries, there would be a significant difference in computational time.

EXPLAIN with JOIN - faster

The cardinality of the index on tailnum is large (the number of rows in flights associated with each unique value of tailnum is small).

EXPLAIN 
  SELECT * FROM planes p 
  LEFT JOIN flights o ON p.tailnum = o.TailNum
  WHERE manufacturer = 'BOEING';
2 records
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE p ALL 3322 10 Using where
1 SIMPLE o p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 ref tailNum tailNum 9 airlines.p.tailnum 1266 100

EXPLAIN with JOIN - slower

  • The cardinality of the index on year is small (the number of rows in flights associated with each unique year is large).
EXPLAIN 
  SELECT * FROM planes p 
  LEFT JOIN flights o ON p.Year = o.Year
  WHERE manufacturer = 'BOEING';
2 records
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE p ALL 3322 10 Using where
1 SIMPLE o p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 ref Year,Date Year 3 airlines.p.year 6450117 100 Using where

SQL in dbplyr

As mentioned previously, dbplyr doesn’t translate every R command into SQL. After all, SQL is not a statistical software and doesn’t, for example, have a mechanism for creating data visualizations. To track which R commands are connected to SQL see the dbplyr reference sheet.

Median

Let’s start with an example, calculating the median altitude in the airports table.1

airports <- tbl(con_air, "airports")

head(airports)
# Source:   SQL [6 x 9]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
  faa   name                           lat   lon   alt    tz dst   city  country
  <chr> <chr>                        <dbl> <dbl> <int> <int> <chr> <chr> <chr>  
1 04G   Lansdowne Airport             41.1 -80.6  1044    -5 A     Youn… United…
2 06A   Moton Field Municipal Airpo…  32.5 -85.7   264    -6 A     Tusk… United…
3 06C   Schaumburg Regional           42.0 -88.1   801    -6 A     Scha… United…
4 06N   Randall Airport               41.4 -74.4   523    -5 A     Midd… United…
5 09J   Jekyll Island Airport         31.1 -81.4    11    -5 A     Jeky… United…
6 0A9   Elizabethton Municipal Airp…  36.4 -82.2  1593    -5 A     Eliz… United…

Median

It looks like show_query() is providing SQL code for calculating the median!

median_query <- airports |>
  summarize(med_alt = median(alt, na.rm = TRUE))

show_query(median_query)
<SQL>
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY `alt`) AS `med_alt`
FROM `airports`

Median

But when the SQL code is run, it doesn’t seem to work.

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY `alt`) AS `med_alt`
FROM `airports`;
Error: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'GROUP (ORDER BY `alt`) AS `med_alt`
FROM `airports`' at line 1 [1064]

Median

What happens when we computer the median on the tbl?

airports |>
  summarize(med_alt = median(alt, na.rm = TRUE))
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'GROUP (ORDER BY `alt`) AS `med_alt`
FROM `airports`
LIMIT 7' at line 1 [1064]

Median

How can the median be calculated in SQL?

SET @row_index := -1;
SELECT AVG(subquery.alt) AS median_value
FROM (
SELECT @row_index:=@row_index + 1 AS row_index, alt
  FROM airports
  ORDER BY alt
) AS subquery
WHERE subquery.row_index IN (FLOOR(@row_index / 2), CEIL(@row_index / 2));
1 records
median_value
476

Median

Let’s break down what the code is doing… First, set the row_index to -1 and iterate through by adding +1 for each row. Then concatenate the row_index information onto our table of interest. (Basically, create a column that gives a row value to the dataset, sorted by altitude.)

SET @row_index := -1;
SELECT @row_index:=@row_index + 1 AS row_index, alt
  FROM airports
  ORDER BY alt
  LIMIT 10;
Displaying records 1 - 10
row_index alt
0 -54
1 -42
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0

Median

Next, filter the data to include only the middle row or two rows.

SET @row_index := -1;
SELECT *
FROM (
SELECT @row_index:=@row_index + 1 AS row_index, alt
  FROM airports
  ORDER BY alt
) AS subquery
WHERE subquery.row_index IN (FLOOR(@row_index / 2), CEIL(@row_index / 2));
2 records
row_index alt
728 474
729 477

Median

The last step is to average the middle row(s). If only one row is pulled out in the previous query, then only one row will be averaged (which the computer does happily).

SET @row_index := -1;
SELECT AVG(subquery.alt) AS median_value
FROM (
SELECT @row_index:=@row_index + 1 AS row_index, alt
  FROM airports
  ORDER BY alt
) AS subquery
WHERE subquery.row_index IN (FLOOR(@row_index / 2), CEIL(@row_index / 2));
1 records
median_value
476

Median

Take-aways:

  1. dbplyr is not able to translate the median() function into SQL.

  2. The median is actually really hard to calculate! In particular, it is hard to calculate the median in one pass through the data.

CASE WHEN and ifelse()

Consider the various R functions that create new variables based on an original variable.

airports |>
  mutate(sea = ifelse(alt > 500, "above sea", "near sea")) |>
  head(5)
# Source:   SQL [5 x 10]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
  faa   name                     lat   lon   alt    tz dst   city  country sea  
  <chr> <chr>                  <dbl> <dbl> <int> <int> <chr> <chr> <chr>   <chr>
1 04G   Lansdowne Airport       41.1 -80.6  1044    -5 A     Youn… United… abov…
2 06A   Moton Field Municipal…  32.5 -85.7   264    -6 A     Tusk… United… near…
3 06C   Schaumburg Regional     42.0 -88.1   801    -6 A     Scha… United… abov…
4 06N   Randall Airport         41.4 -74.4   523    -5 A     Midd… United… abov…
5 09J   Jekyll Island Airport   31.1 -81.4    11    -5 A     Jeky… United… near…

CASE WHEN and ifelse()

if_query <- airports |>
  mutate(sea = ifelse(alt > 500, "above sea", "near sea"))

show_query(if_query)
<SQL>
SELECT
  *,
  CASE WHEN (`alt` > 500.0) THEN 'above sea' WHEN NOT (`alt` > 500.0) THEN 'near sea' END AS `sea`
FROM `airports`

CASE WHEN and ifelse()

SELECT *,
CASE WHEN (`alt` > 500.0) THEN 'above sea' WHEN NOT (`alt` > 500.0) THEN 'near sea' END AS `sea`
FROM `airports` 
LIMIT 5;
5 records
faa name lat lon alt tz dst city country sea
04G Lansdowne Airport 41.1 -80.6 1044 -5 A Youngstown United States above sea
06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A Tuskegee United States near sea
06C Schaumburg Regional 42.0 -88.1 801 -6 A Schaumburg United States above sea
06N Randall Airport 41.4 -74.4 523 -5 A Middletown United States above sea
09J Jekyll Island Airport 31.1 -81.4 11 -5 A Jekyll Island United States near sea

CASE WHEN and case_when()

airports |>
  mutate(sea = case_when(
    alt < 500 ~ "near sea",
    alt < 2000 ~ "low alt",
    alt < 3000 ~ "mod alt",
    alt < 5500 ~ "high alt",
    alt > 5500 ~ "extreme alt")) |>
  head(5)
# Source:   SQL [5 x 10]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
  faa   name                     lat   lon   alt    tz dst   city  country sea  
  <chr> <chr>                  <dbl> <dbl> <int> <int> <chr> <chr> <chr>   <chr>
1 04G   Lansdowne Airport       41.1 -80.6  1044    -5 A     Youn… United… low …
2 06A   Moton Field Municipal…  32.5 -85.7   264    -6 A     Tusk… United… near…
3 06C   Schaumburg Regional     42.0 -88.1   801    -6 A     Scha… United… low …
4 06N   Randall Airport         41.4 -74.4   523    -5 A     Midd… United… low …
5 09J   Jekyll Island Airport   31.1 -81.4    11    -5 A     Jeky… United… near…

CASE WHEN and case_when()

cw_query <- airports |>
  mutate(sea = case_when(
    alt < 500 ~ "near sea",
    alt < 2000 ~ "low alt",
    alt < 3000 ~ "mod alt",
    alt < 5500 ~ "high alt",
    alt > 5500 ~ "extreme alt"))

show_query(cw_query)
<SQL>
SELECT
  *,
  CASE
WHEN (`alt` < 500.0) THEN 'near sea'
WHEN (`alt` < 2000.0) THEN 'low alt'
WHEN (`alt` < 3000.0) THEN 'mod alt'
WHEN (`alt` < 5500.0) THEN 'high alt'
WHEN (`alt` > 5500.0) THEN 'extreme alt'
END AS `sea`
FROM `airports`

CASE WHEN and case_when()

SELECT
  *,
  CASE
WHEN (`alt` < 500.0) THEN 'near sea'
WHEN (`alt` < 2000.0) THEN 'low alt'
WHEN (`alt` < 3000.0) THEN 'mod alt'
WHEN (`alt` < 5500.0) THEN 'high alt'
WHEN (`alt` > 5500.0) THEN 'extreme alt'
END AS `sea`
FROM `airports`
LIMIT 5;
5 records
faa name lat lon alt tz dst city country sea
04G Lansdowne Airport 41.1 -80.6 1044 -5 A Youngstown United States low alt
06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A Tuskegee United States near sea
06C Schaumburg Regional 42.0 -88.1 801 -6 A Schaumburg United States low alt
06N Randall Airport 41.4 -74.4 523 -5 A Middletown United States low alt
09J Jekyll Island Airport 31.1 -81.4 11 -5 A Jekyll Island United States near sea

CASE WHEN and cut()

airports |>
  mutate(sea = cut(
    alt,
    breaks = c(-Inf, 500, 2000, 3000, 5500, Inf),
    labels = c("near sea", "low alt", "mod alt", "high alt", "extreme alt")
  )
)|>
  head(5)
# Source:   SQL [5 x 10]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
  faa   name                     lat   lon   alt    tz dst   city  country sea  
  <chr> <chr>                  <dbl> <dbl> <int> <int> <chr> <chr> <chr>   <chr>
1 04G   Lansdowne Airport       41.1 -80.6  1044    -5 A     Youn… United… low …
2 06A   Moton Field Municipal…  32.5 -85.7   264    -6 A     Tusk… United… near…
3 06C   Schaumburg Regional     42.0 -88.1   801    -6 A     Scha… United… low …
4 06N   Randall Airport         41.4 -74.4   523    -5 A     Midd… United… low …
5 09J   Jekyll Island Airport   31.1 -81.4    11    -5 A     Jeky… United… near…

CASE WHEN and cut()

cw_query <- airports |>
  mutate(sea = cut(
    alt,
    breaks = c(-Inf, 500, 2000, 3000, 5500, Inf),
    labels = c("near sea", "low alt", "mod alt", "high alt", "extreme alt")
  )
)

show_query(cw_query)
<SQL>
SELECT
  *,
  CASE
WHEN (`alt` <= 500.0) THEN 'near sea'
WHEN (`alt` <= 2000.0) THEN 'low alt'
WHEN (`alt` <= 3000.0) THEN 'mod alt'
WHEN (`alt` <= 5500.0) THEN 'high alt'
WHEN (`alt` > 5500.0) THEN 'extreme alt'
END AS `sea`
FROM `airports`

CASE WHEN and cut()

SELECT
  *,
  CASE
WHEN (`alt` <= 500.0) THEN 'near sea'
WHEN (`alt` <= 2000.0) THEN 'low alt'
WHEN (`alt` <= 3000.0) THEN 'mod alt'
WHEN (`alt` <= 5500.0) THEN 'high alt'
WHEN (`alt` > 5500.0) THEN 'extreme alt'
END AS `sea`
FROM `airports`
LIMIT 5;
5 records
faa name lat lon alt tz dst city country sea
04G Lansdowne Airport 41.1 -80.6 1044 -5 A Youngstown United States low alt
06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A Tuskegee United States near sea
06C Schaumburg Regional 42.0 -88.1 801 -6 A Schaumburg United States low alt
06N Randall Airport 41.4 -74.4 523 -5 A Middletown United States low alt
09J Jekyll Island Airport 31.1 -81.4 11 -5 A Jekyll Island United States near sea

distinct()

How many distinct time zones are there in the airports table?

airports |>
  select(tz) |>
  distinct()
# Source:   SQL [?? x 1]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
     tz
  <int>
1    -5
2    -6
3    -8
4    -7
5    -9
6   -10
# ℹ more rows

distinct()

How many distinct time zones are there in the airports table?

dist_query <- airports |>
  select(tz) |>
  distinct()

show_query(dist_query)
<SQL>
SELECT DISTINCT `tz`
FROM `airports`

distinct()

How many distinct time zones are there in the airports table?

SELECT DISTINCT `tz`
FROM `airports`;
7 records
tz
-5
-6
-8
-7
-9
-10
8

LIMIT as head()

airports |>
  head(5)
# Source:   SQL [5 x 9]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
  faa   name                           lat   lon   alt    tz dst   city  country
  <chr> <chr>                        <dbl> <dbl> <int> <int> <chr> <chr> <chr>  
1 04G   Lansdowne Airport             41.1 -80.6  1044    -5 A     Youn… United…
2 06A   Moton Field Municipal Airpo…  32.5 -85.7   264    -6 A     Tusk… United…
3 06C   Schaumburg Regional           42.0 -88.1   801    -6 A     Scha… United…
4 06N   Randall Airport               41.4 -74.4   523    -5 A     Midd… United…
5 09J   Jekyll Island Airport         31.1 -81.4    11    -5 A     Jeky… United…

LIMIT as head()

head_query <- airports |>
  head(5)

show_query(head_query)
<SQL>
SELECT *
FROM `airports`
LIMIT 5

LIMIT as head()

SELECT *
FROM `airports`
LIMIT 5;
5 records
faa name lat lon alt tz dst city country
04G Lansdowne Airport 41.1 -80.6 1044 -5 A Youngstown United States
06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A Tuskegee United States
06C Schaumburg Regional 42.0 -88.1 801 -6 A Schaumburg United States
06N Randall Airport 41.4 -74.4 523 -5 A Middletown United States
09J Jekyll Island Airport 31.1 -81.4 11 -5 A Jekyll Island United States

ggplot()

airports |>
  ggplot(aes(x = lon, y = lat)) +
  geom_point()

ggplot()

airports |>
  filter(lon < 0) |>
  ggplot(aes(x = lon, y = lat)) +
  geom_point()

ggplot()

gg_query <- airports |>
  filter(lon < 0) |>
  ggplot(aes(x = lon, y = lat)) +
  geom_point()

show_query(gg_query)
Error in UseMethod("show_query"): no applicable method for 'show_query' applied to an object of class "c('gg', 'ggplot')"

Take-away message

  • dbplyr is awesome and can often be helpful in figuring out SQL syntax

  • sometimes dbplyr will provide SQL syntax that does not work (see the median example)

  • sometimes there is no SQL syntax to match the R task of interest (see ggplot)

Best practice

It is always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_air, shutdown = TRUE)

47

Pomona’s number is 47, and I wanted to have a 47th slide.

icon that says STATS, DATA, MATH but has a 47 replacing every instance of AT in the words.'.

Figure 1: image credit: Jo Hardin