SQL clauses

Jo Hardin

2024-01-09

Today’s example

NYC Taxis

Consider a database of taxi rides from the Yellow Cab company in NYC in March of 2014.

Image of yellow taxis in downtown New York City.

Figure 1: image credit: Mariordo (Mario Roberto Durán Ortiz), CC BY-SA 3.0

Establishing a SQL connection

con_taxi <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  dbname = "nyctaxi",
  host = Sys.getenv("MDSR_HOST"),
  user = Sys.getenv("MDSR_USER"),
  password = Sys.getenv("MDSR_PWD")
)

SHOW TABLES

There is only one table in the nyctaxi database. It is called yellow_old.

```{sql}
#| connection: con_taxi

SHOW TABLES;
```
1 records
Tables_in_nyctaxi
yellow_old

DESCRIBE tables

Still using a {sql} chunk. The DESCRIBE command shows the 18 field names (variables) in the yellow_old table. Some of the variables are characters (text) and some are numeric (either double or bigint)

DESCRIBE yellow_old;
Displaying records 1 - 10
Field Type Null Key Default Extra
vendor_id text YES
pickup_datetime text YES
dropoff_datetime text YES
passenger_count bigint(20) YES
trip_distance double YES
pickup_longitude double YES
pickup_latitude double YES
rate_code bigint(20) YES
store_and_fwd_flag text YES
dropoff_longitude double YES

SQL clauses

Order of clauses

Queries in SQL start with the SELECT keyword and consist of several clauses, which must be written in the following order:

  • SELECT
  • FROM
  • JOIN
  • WHERE
  • GROUP BY
  • HAVING
  • ORDER BY
  • LIMIT

The clauses are similar to data wrangling verbs in R.

SELECT … FROM

  • start with a SELECT, include a corresponding FROM
  • columns may be specified or * returns every column

The shortest SQL query is the following SELECT command. Do not run this command!!! The yellow_old table has 15 million rows, and we do not want to look at them simultaneously.

DO NOT RUN:  SELECT * FROM yellow_old;

SELECT … FROM

Much better for big tables:

SELECT * FROM yellow_old LIMIT 0, 14;
Displaying records 1 - 10
vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
CMT 2014-03-01 01:07:38 2014-03-01 01:16:26 1 2.0 -74.0 40.7 1 N -73.9 40.7 CRD 9.0 0.5 0.5 2.0 0 12.0
CMT 2014-03-01 01:08:03 2014-03-01 01:12:51 2 1.2 -74.0 40.7 1 N -74.0 40.7 CRD 6.0 0.5 0.5 1.0 0 8.0
CMT 2014-03-01 01:08:51 2014-03-01 01:13:18 3 0.5 -73.9 40.7 1 N -74.0 40.7 CRD 5.0 0.5 0.5 1.2 0 7.2
CMT 2014-03-01 01:09:20 2014-03-01 01:24:18 3 3.5 -74.0 40.7 1 N -74.0 40.8 CRD 14.0 0.5 0.5 3.0 0 18.0
CMT 2014-03-01 01:09:46 2014-03-01 01:22:34 1 1.8 -74.0 40.7 1 N -74.0 40.7 CRD 10.5 0.5 0.5 1.0 0 12.5
CMT 2014-03-01 01:12:41 2014-03-01 01:15:38 1 0.5 -74.0 40.7 1 N -74.0 40.7 CRD 4.0 0.5 0.5 0.5 0 5.5
CMT 2014-03-01 01:12:11 2014-03-01 01:27:38 2 3.7 -74.0 40.8 1 N -74.0 40.7 CRD 14.5 0.5 0.5 3.1 0 18.6
CMT 2014-03-01 01:13:55 2014-03-01 01:34:54 1 5.4 -74.0 40.8 1 N -74.0 40.7 CRD 20.0 0.5 0.5 3.0 0 24.0
CMT 2014-03-01 01:14:06 2014-03-01 01:28:25 1 3.5 -74.0 40.7 1 N -74.0 40.8 CRD 13.5 0.5 0.5 2.9 0 17.4

SELECT … FROM

How do we know how many taxi rides are recorded? Two different ways of counting the rows:

SELECT COUNT(*), SUM(1) FROM yellow_old LIMIT 0, 14;
1 records
COUNT(*) SUM(1)
15428128 15428128

Yikes, more than 15 million taxi rides!

SELECT … FROM

What is the length of a taxi ride (in time)? … unfortunately, pickup_datetime and dropoff_datetime are saved as character strings instead of in DateTime format.

SELECT
      pickup_datetime, dropoff_datetime,
      STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS pickup,
      STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T") AS dropoff
   FROM yellow_old
   LIMIT 0, 10;
Displaying records 1 - 10
pickup_datetime dropoff_datetime pickup dropoff
2014-03-01 01:07:38 2014-03-01 01:16:26 2014-03-01 01:07:38 2014-03-01 01:16:26
2014-03-01 01:08:03 2014-03-01 01:12:51 2014-03-01 01:08:03 2014-03-01 01:12:51
2014-03-01 01:08:51 2014-03-01 01:13:18 2014-03-01 01:08:51 2014-03-01 01:13:18
2014-03-01 01:09:20 2014-03-01 01:24:18 2014-03-01 01:09:20 2014-03-01 01:24:18
2014-03-01 01:09:46 2014-03-01 01:22:34 2014-03-01 01:09:46 2014-03-01 01:22:34
2014-03-01 01:12:41 2014-03-01 01:15:38 2014-03-01 01:12:41 2014-03-01 01:15:38
2014-03-01 01:12:11 2014-03-01 01:27:38 2014-03-01 01:12:11 2014-03-01 01:27:38
2014-03-01 01:13:55 2014-03-01 01:34:54 2014-03-01 01:13:55 2014-03-01 01:34:54
2014-03-01 01:14:06 2014-03-01 01:28:25 2014-03-01 01:14:06 2014-03-01 01:28:25

SELECT … FROM

Why can’t we find the difference between the new two time variables?

SELECT
      pickup_datetime, dropoff_datetime,
      STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS pickup,
      STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T") AS dropoff,
      TIMEDIFF(pickup, dropoff) AS length_time
   FROM yellow_old
   LIMIT 0, 10;
Error: Unknown column 'pickup' in 'field list' [1054]

SELECT … FROM

Solution 1: two layers of SELECT

  • first SELECT (i.e., inside) layer creates the new variables
  • second SELECT (i.e., outside) layer subtracts the two times
SELECT 
   pickup,
   dropoff, 
   TIMEDIFF(pickup, dropoff) AS length_time 
FROM (
   SELECT
      STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS pickup,
      STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T") AS dropoff
   FROM yellow_old)
   AS subquery_table
LIMIT 0, 20;
Displaying records 1 - 10
pickup dropoff length_time
2014-03-01 01:07:38 2014-03-01 01:16:26 00:08:48
2014-03-01 01:08:03 2014-03-01 01:12:51 00:04:48
2014-03-01 01:08:51 2014-03-01 01:13:18 00:04:27
2014-03-01 01:09:20 2014-03-01 01:24:18 00:14:58
2014-03-01 01:09:46 2014-03-01 01:22:34 00:12:48
2014-03-01 01:12:41 2014-03-01 01:15:38 00:02:57
2014-03-01 01:12:11 2014-03-01 01:27:38 00:15:27
2014-03-01 01:13:55 2014-03-01 01:34:54 00:20:59
2014-03-01 01:14:06 2014-03-01 01:28:25 00:14:19

SELECT … FROM

Solution 2: apply the STR_TO_DATE() function inside the TIMEDIFF() function

SELECT 
   pickup_datetime,
   dropoff_datetime, 
   TIMEDIFF(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T"), 
            STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T")) AS length_time 
FROM yellow_old
LIMIT 0, 20;
Displaying records 1 - 10
pickup_datetime dropoff_datetime length_time
2014-03-01 01:07:38 2014-03-01 01:16:26 00:08:48
2014-03-01 01:08:03 2014-03-01 01:12:51 00:04:48
2014-03-01 01:08:51 2014-03-01 01:13:18 00:04:27
2014-03-01 01:09:20 2014-03-01 01:24:18 00:14:58
2014-03-01 01:09:46 2014-03-01 01:22:34 00:12:48
2014-03-01 01:12:41 2014-03-01 01:15:38 00:02:57
2014-03-01 01:12:11 2014-03-01 01:27:38 00:15:27
2014-03-01 01:13:55 2014-03-01 01:34:54 00:20:59
2014-03-01 01:14:06 2014-03-01 01:28:25 00:14:19

Important note:

There is a distinction between clauses that operate on the variables of the original table versus those that operate on the variables of the results set.

pickup_datetime and dropoff_datetime are columns in the original table - they are written to disk on the SQL server.

pickup, dropoff, and length_time exist only in the results set, which is passed from the server (SQL server) to the client (e.g., RStudio or DBeaver) and is not written to disk.

SELECT DISTINCT

Returns only unique rows.

SELECT DISTINCT payment_type
FROM yellow_old
LIMIT 0, 20;
6 records
payment_type
CRD
CSH
NOC
DIS
UNK
SELECT DISTINCT vendor_id, payment_type
FROM yellow_old
LIMIT 0, 20;
8 records
vendor_id payment_type
CMT CRD
CMT CSH
CMT NOC
CMT DIS
VTS CRD
VTS CSH
VTS UNK

In case you are curious:

  • VTS is Verifone Transportation Systems and CMT is Mobile Knowledge Systems Inc.
  • CRD is credit card; CSH is cash; NOC is no charge; DIS is dispute

WHERE

The WHERE clause is analogous to the filter() function in dplyr. However, keep in mind that there are two SQL commands that resemble the dplyr filter() function. WHERE operates on the original data in the table and HAVING operates on the result set.

WHERE

What was the fare for those taxi rides where the tip_amount was more than $10 and the person used cash?

Note that in SQL the equality logical is = and in R the equality logical is ==.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount > 10
   AND payment_type = "CSH"
LIMIT 0, 10;
Displaying records 1 - 10
payment_type fare_amount tip_amount total_amount
CSH 65.5 15.3 91.8
CSH 52.0 11.6 69.4
CSH 52.0 11.6 69.4
CSH 55.0 16.2 81.2
CSH 71.5 20.0 103.5
CSH 70.0 16.2 97.1
CSH 95.0 21.9 131.2
CSH 62.5 15.5 93.0
CSH 66.0 15.0 90.0
CSH 65.0 13.2 79.2

WHERE

BETWEEN can be used to specify a range of values for a numeric value. BETWEEN is inclusive.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 12
   AND payment_type = "CSH"
LIMIT 0, 10;
Displaying records 1 - 10
payment_type fare_amount tip_amount total_amount
CSH 52.0 11.6 69.4
CSH 52.0 11.6 69.4
CSH 88.0 10.0 107.0
CSH 72.0 10.0 94.0
CSH 64.5 10.0 85.5
CSH 66.0 12.0 93.0
CSH 52.0 11.6 69.4
CSH 69.0 10.0 88.0
CSH 90.0 10.0 100.0
CSH 52.0 11.6 69.4

WHERE

IN is similar to the dplyr %in% function which specifies distinct values for the variable.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount IN (10, 12)
   AND payment_type = "CSH"
LIMIT 0, 10;
Displaying records 1 - 10
payment_type fare_amount tip_amount total_amount
CSH 88.0 10 107.0
CSH 72.0 10 94.0
CSH 64.5 10 85.5
CSH 66.0 12 93.0
CSH 69.0 10 88.0
CSH 90.0 10 100.0
CSH 74.5 10 90.3
CSH 89.0 10 118.1
CSH 52.0 10 67.8
CSH 66.0 12 90.0

WHERE

AND takes precedent over OR in the order of operations, when there are no parentheses.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 12 OR 
      total_amount BETWEEN 100 and 112 AND 
      payment_type = "CSH"
LIMIT 0, 10;
Displaying records 1 - 10
payment_type fare_amount tip_amount total_amount
CRD 52.0 10.5 63.0
CRD 35.0 10.2 51.0
CRD 52.0 11.6 69.4
CRD 30.5 10.8 47.2
CRD 52.0 10.5 63.0
CRD 52.0 10.5 63.0
CRD 52.0 11.6 69.4
CRD 52.0 11.6 69.4
CRD 52.0 11.6 69.4
CRD 52.0 11.6 69.4

WHERE

Parentheses take precedent over AND and OR.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE (tip_amount BETWEEN 10 and 12 OR 
      total_amount BETWEEN 100 and 112 ) AND 
      payment_type = "CSH"
LIMIT 0, 10;
Displaying records 1 - 10
payment_type fare_amount tip_amount total_amount
CSH 107.0 0.0 108.0
CSH 92.5 0.0 103.5
CSH 99.5 0.0 105.3
CSH 92.0 0.0 106.3
CSH 103.0 0.0 109.3
CSH 107.0 0.0 107.0
CSH 104.5 0.0 105.5
CSH 112.0 0.0 112.0
CSH 52.0 11.6 69.4
CSH 52.0 11.6 69.4

WHERE

IS NULL not = NULL (because NULL indicates unknown)

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE payment_type IS NULL
LIMIT 0, 10;
1 records
payment_type fare_amount tip_amount total_amount
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE payment_type = NULL
LIMIT 0, 10;
0 records
payment_type fare_amount tip_amount total_amount

A NULL example

The logic of NULL:1

  • If you do anything with NULL, you’ll just get NULL. For instance if \(x\) is NULL, then \(x > 3\), \(1 = x\), and \(x + 4\) all evaluate to NULL. Even \(x =\) NULL evaluates to NULL! if you want to check whether \(x\) is NULL, use x IS NULL or x IS NOT NULL.
  • NULL short-circuits with boolean operators. That means a boolean expression involving NULL will evaluate to:
    • TRUE, if it’d evaluate to TRUE regardless of whether the NULL value is really TRUE or FALSE.
    • FALSE, if it’d evaluate to FALSE regardless of whether the NULL value is really TRUE or FALSE.
    • Or NULL, if it depends on the NULL value.

A NULL example

Consider the following table and SQL query to evaluate WHERE age <= 20 OR num_dogs = 3:

SELECT * FROM (
   SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
   UNION
   SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs   
   UNION
   SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
   UNION
   SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
   ) AS temptable;
4 records
name age num_dogs
Ace 20 4
Ada 3
Ben
Cho 27

A NULL example

Where does the WHERE clause do? It tells us that we only want to keep the rows satisfying the age <= 20 OR num_dogs = 3. Let’s consider each row one at a time:

  • For Ace, age <= 20 evaluates to TRUE so the claim is satisfied.
  • For Ada, age <= 20 evaluates to NULL but num_dogs = 3 evaluates to TRUE so the claim is satisfied.
  • For Ben, age <= 20 evaluates to NULL and num_dogs = 3 evaluates to NULL so the overall expression is NULL which has a FALSE value.
  • For Cho, age <= 20 evaluates to FALSE and num_dogs = 3 evaluates to NULL so the overall expression evaluates to NULL (because it depends on the value of the NULL).

Thus we keep only Ace and Ada.

A NULL example

SELECT * FROM (
   SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
   UNION
   SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs   
   UNION
   SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
   UNION
   SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
   ) AS temptable
WHERE age <= 20 OR num_dogs = 3;
2 records
name age num_dogs
Ace 20 4
Ada 3

GROUP BY

The GROUP BY clause will direct SQL to carry out the query separately for each category in the grouped variable.

  • aggregate functions include COUNT(), SUM(), MAX(), MIN(), and AVG().
SELECT COUNT(*) AS num_transactions, 
       SUM(1) AS num_transactions_also,
       SUM(2) AS double_transactions,
       payment_type 
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 20
GROUP BY payment_type;
5 records
num_transactions num_transactions_also double_transactions payment_type
213872 213872 427744 CRD
78 78 156 CSH
3 3 6 DIS
7 7 14 NOC
609 609 1218 UNK

GROUP BY

For those people who tipped between $10 and $20, what was the lowest and highest fare for each of the types of payments?

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type 
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 20
GROUP BY payment_type;
5 records
num_transactions lowest_fare highest_fare payment_type
213872 0.0 370.0 CRD
78 52.0 102.0 CSH
3 52.0 79.5 DIS
7 58.0 94.0 NOC
609 4.5 147.0 UNK

GROUP BY

GROUP BY will work applied to multiple columns.

What is wday?

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday;
Displaying records 1 - 10
num_transactions lowest_fare highest_fare payment_type wday
1
1247397 2.5 477 CRD Friday
1278362 -612.4 500 CRD Monday
1533796 2.5 420 CRD Saturday
1324394 2.5 480 CRD Sunday
1258098 2.5 500 CRD Thursday
1121081 2.5 500 CRD Tuesday
1192892 2.5 400 CRD Wednesday
860920 2.5 444 CSH Friday
918653 0.0 873 CSH Monday

ORDER BY

ORDER BY allows us to look at interesting aspects of the data by sorting the data.

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
ORDER BY lowest_fare ASC;
Displaying records 1 - 10
num_transactions lowest_fare highest_fare payment_type wday
1
1278362 -612.4 500 CRD Monday
918653 0.0 873 CSH Monday
5440 0.0 950 NOC Monday
1537 0.0 102 DIS Monday
1533796 2.5 420 CRD Saturday
1121081 2.5 500 CRD Tuesday
1192892 2.5 400 CRD Wednesday
1258098 2.5 500 CRD Thursday
1247397 2.5 477 CRD Friday

WHAT?!?!! How in the world was one of the fares -$612.40?

ORDER BY

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
ORDER BY highest_fare DESC;
Displaying records 1 - 10
num_transactions lowest_fare highest_fare payment_type wday
5440 0.0 950 NOC Monday
918653 0.0 873 CSH Monday
1278362 -612.4 500 CRD Monday
1121081 2.5 500 CRD Tuesday
1258098 2.5 500 CRD Thursday
1324394 2.5 480 CRD Sunday
1247397 2.5 477 CRD Friday
1222 2.5 475 DIS Tuesday
813813 2.5 475 CSH Thursday
860920 2.5 444 CSH Friday

$950 is a lot to pay for a cab ride! But in NYC, I’d believe it.

ORDER BY

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
ORDER BY wday, payment_type;
Displaying records 1 - 10
num_transactions lowest_fare highest_fare payment_type wday
1
1247397 2.5 477 CRD Friday
860920 2.5 444 CSH Friday
1592 2.5 255 DIS Friday
5252 2.5 229 NOC Friday
10131 2.5 130 UNK Friday
1278362 -612.4 500 CRD Monday
918653 0.0 873 CSH Monday
1537 0.0 102 DIS Monday
5440 0.0 950 NOC Monday

ORDER BY

Note that both GROUP BY and ORDER BY evaluate the data after it has been retrieved. Therefore, the functions operate on the results set, not the original rows of the data.

We are able to GROUP BY and ORDER BY on the new variables we had created, wday.

HAVING

Recall that WHERE acts only on the original data. If we are interested in rides that took place on Friday, we need to use the derived variable wday instead of the raw variable pickup_datetime. Fortunately, HAVING works on the results set.

HAVING

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
HAVING wday = 'Friday';
5 records
num_transactions lowest_fare highest_fare payment_type wday
1247397 2.5 477 CRD Friday
860920 2.5 444 CSH Friday
1592 2.5 255 DIS Friday
5252 2.5 229 NOC Friday
10131 2.5 130 UNK Friday

HAVING

While it worked out quite well for us that HAVING was able to filter the data based on the results set, the use of HAVING was quite onerous because the entire data set was considered before the filter was applied. That is, if the filter can be done on the original data using WHERE, the query will be much faster and more efficient.

Note: HAVING requires a GROUP BY clause. And the variable(s) used in HAVING must also be part of the GROUP BY clause.

LIMIT

LIMIT truncates the query to specified rows. The first number is the offset (i.e., the number of rows to skip), the second number is the (maximum) number of rows to return. Here, we return rows 15428119 through 15428128.

The first number is optional.

SELECT * FROM yellow_old LIMIT 15428118, 10;
Displaying records 1 - 10
vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
CMT 2014-03-18 14:35:21 2014-03-18 14:52:01 1 2.7 0 0.0 1 N 0 0.0 CRD 13.0 0 0.5 2.50 0.00 16.0
CMT 2014-03-18 14:08:23 2014-03-18 14:19:29 2 1.3 0 0.0 1 N 0 0.0 CRD 9.0 0 0.5 1.90 0.00 11.4
CMT 2014-03-18 09:18:38 2014-03-18 09:19:41 1 0.2 -74 40.8 1 N -74 40.8 CRD 3.0 0 0.5 1.00 0.00 4.5
CMT 2014-03-18 06:28:12 2014-03-18 06:49:49 1 9.9 0 0.0 1 N 0 0.0 CRD 30.0 0 0.5 7.16 5.33 43.0
CMT 2014-03-18 17:39:28 2014-03-18 17:53:01 1 4.9 -74 40.8 1 N -74 40.7 CRD 16.5 1 0.5 3.00 0.00 21.0
CMT 2014-03-18 18:14:19 2014-03-18 18:27:22 1 0.3 -74 40.7 1 N -74 40.7 CRD 3.5 1 0.5 6.00 0.00 11.0
CMT 2014-03-18 10:12:33 2014-03-18 10:28:09 1 3.3 -74 40.8 1 N -74 40.8 CRD 13.5 0 0.5 4.20 0.00 18.2
CMT 2014-03-18 09:02:37 2014-03-18 09:16:29 1 6.2 -74 40.8 1 N -74 40.7 CRD 19.5 0 0.5 1.00 0.00 21.0
CMT 2014-03-18 10:10:19 2014-03-18 10:19:25 1 1.7 -74 40.8 1 N -74 40.8 CRD 8.5 0 0.5 1.80 0.00 10.8
CMT 2014-03-18 15:24:53 2014-03-18 15:42:42 1 1.7 -74 40.8 1 N -74 40.7 CRD 12.5 0 0.5 2.60 0.00 15.6

Saving SQL queries as R objects

If you are working in R to run SQL commands, you may want to use the query output for further analysis or visualizations.

  • use #|output.var: "name_of_variable" inside the {sql} chunk.
  • name_of_variable will then be available to be used in the R environment.
```{sql}
#| connection: con_taxi
#| label: new-table
#| output.var: "new_table"

SELECT *, DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old 
LIMIT 0, 1000;
```

Saving SQL queries as R objects

```{r}
new_table |>
  drop_na(wday) |>
  ggplot(aes(x = fare_amount, y = tip_amount, color = wday)) + 
  geom_point() 
```

Good practice

Always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_taxi, shutdown = TRUE)