SQL clauses

Jo Hardin

2024-01-09

Today’s example

NYC Taxis

Consider a database of taxi rides from the Yellow Cab company in NYC in March of 2014.

Image of yellow taxis in downtown New York City.

Figure 1: image credit: Mariordo (Mario Roberto Durán Ortiz), CC BY-SA 3.0

Establishing a SQL connection

con_taxi <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  dbname = "nyctaxi",
  host = Sys.getenv("MDSR_HOST"),
  user = Sys.getenv("MDSR_USER"),
  password = Sys.getenv("MDSR_PWD")
)

SHOW TABLES

There is only one table in the nyctaxi database. It is called yellow_old.

```{sql}
#| connection: con_taxi

SHOW TABLES;
```

1 records
Tables_in_nyctaxi
yellow_old

DESCRIBE tables

Still using a {sql} chunk. The DESCRIBE command shows the 18 field names (variables) in the yellow_old table. Some of the variables are characters (text) and some are numeric (either double or bigint)

DESCRIBE yellow_old;

Displaying records 1 - 10
Field	Type	Null
vendor_id	text	YES
pickup_datetime	text	YES
dropoff_datetime	text	YES
passenger_count	bigint(20)	YES
trip_distance	double	YES
pickup_longitude	double	YES
pickup_latitude	double	YES
rate_code	bigint(20)	YES
store_and_fwd_flag	text	YES
dropoff_longitude	double	YES

SQL clauses

Order of clauses

Queries in SQL start with the SELECT keyword and consist of several clauses, which must be written in the following order:

SELECT
FROM
JOIN
WHERE

GROUP BY
HAVING
ORDER BY
LIMIT

The clauses are similar to data wrangling verbs in R.

SELECT … FROM

start with a SELECT, include a corresponding FROM
columns may be specified or * returns every column

The shortest SQL query is the following SELECT command. Do not run this command!!! The yellow_old table has 15 million rows, and we do not want to look at them simultaneously.

DO NOT RUN:  SELECT * FROM yellow_old;

SELECT … FROM

Much better for big tables:

SELECT * FROM yellow_old LIMIT 0, 14;

Displaying records 1 - 10
vendor_id	pickup_datetime	dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	rate_code	store_and_fwd_flag	dropoff_longitude	dropoff_latitude	payment_type	fare_amount	surcharge	mta_tax	tip_amount	tolls_amount	total_amount

CMT	2014-03-01 01:07:38	2014-03-01 01:16:26	1	2.0	-74.0	40.7	1	N	-73.9	40.7	CRD	9.0	0.5	0.5	2.0	0	12.0
CMT	2014-03-01 01:08:03	2014-03-01 01:12:51	2	1.2	-74.0	40.7	1	N	-74.0	40.7	CRD	6.0	0.5	0.5	1.0	0	8.0
CMT	2014-03-01 01:08:51	2014-03-01 01:13:18	3	0.5	-73.9	40.7	1	N	-74.0	40.7	CRD	5.0	0.5	0.5	1.2	0	7.2
CMT	2014-03-01 01:09:20	2014-03-01 01:24:18	3	3.5	-74.0	40.7	1	N	-74.0	40.8	CRD	14.0	0.5	0.5	3.0	0	18.0
CMT	2014-03-01 01:09:46	2014-03-01 01:22:34	1	1.8	-74.0	40.7	1	N	-74.0	40.7	CRD	10.5	0.5	0.5	1.0	0	12.5
CMT	2014-03-01 01:12:41	2014-03-01 01:15:38	1	0.5	-74.0	40.7	1	N	-74.0	40.7	CRD	4.0	0.5	0.5	0.5	0	5.5
CMT	2014-03-01 01:12:11	2014-03-01 01:27:38	2	3.7	-74.0	40.8	1	N	-74.0	40.7	CRD	14.5	0.5	0.5	3.1	0	18.6
CMT	2014-03-01 01:13:55	2014-03-01 01:34:54	1	5.4	-74.0	40.8	1	N	-74.0	40.7	CRD	20.0	0.5	0.5	3.0	0	24.0
CMT	2014-03-01 01:14:06	2014-03-01 01:28:25	1	3.5	-74.0	40.7	1	N	-74.0	40.8	CRD	13.5	0.5	0.5	2.9	0	17.4

SELECT … FROM

How do we know how many taxi rides are recorded? Two different ways of counting the rows:

SELECT COUNT(*), SUM(1) FROM yellow_old LIMIT 0, 14;

1 records
COUNT(*)	SUM(1)
15428128	15428128

Yikes, more than 15 million taxi rides!

SELECT … FROM

What is the length of a taxi ride (in time)? … unfortunately, pickup_datetime and dropoff_datetime are saved as character strings instead of in DateTime format.

SELECT
      pickup_datetime, dropoff_datetime,
      STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS pickup,
      STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T") AS dropoff
   FROM yellow_old
   LIMIT 0, 10;

Displaying records 1 - 10
pickup_datetime	dropoff_datetime	pickup	dropoff

2014-03-01 01:07:38	2014-03-01 01:16:26	2014-03-01 01:07:38	2014-03-01 01:16:26
2014-03-01 01:08:03	2014-03-01 01:12:51	2014-03-01 01:08:03	2014-03-01 01:12:51
2014-03-01 01:08:51	2014-03-01 01:13:18	2014-03-01 01:08:51	2014-03-01 01:13:18
2014-03-01 01:09:20	2014-03-01 01:24:18	2014-03-01 01:09:20	2014-03-01 01:24:18
2014-03-01 01:09:46	2014-03-01 01:22:34	2014-03-01 01:09:46	2014-03-01 01:22:34
2014-03-01 01:12:41	2014-03-01 01:15:38	2014-03-01 01:12:41	2014-03-01 01:15:38
2014-03-01 01:12:11	2014-03-01 01:27:38	2014-03-01 01:12:11	2014-03-01 01:27:38
2014-03-01 01:13:55	2014-03-01 01:34:54	2014-03-01 01:13:55	2014-03-01 01:34:54
2014-03-01 01:14:06	2014-03-01 01:28:25	2014-03-01 01:14:06	2014-03-01 01:28:25

SELECT … FROM

Why can’t we find the difference between the new two time variables?

SELECT
      pickup_datetime, dropoff_datetime,
      STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS pickup,
      STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T") AS dropoff,
      TIMEDIFF(pickup, dropoff) AS length_time
   FROM yellow_old
   LIMIT 0, 10;

Error: Unknown column 'pickup' in 'field list' [1054]

SELECT … FROM

Solution 1: two layers of SELECT

first SELECT (i.e., inside) layer creates the new variables
second SELECT (i.e., outside) layer subtracts the two times

SELECT 
   pickup,
   dropoff, 
   TIMEDIFF(pickup, dropoff) AS length_time 
FROM (
   SELECT
      STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS pickup,
      STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T") AS dropoff
   FROM yellow_old)
   AS subquery_table
LIMIT 0, 20;

Displaying records 1 - 10
pickup	dropoff	length_time

2014-03-01 01:07:38	2014-03-01 01:16:26	00:08:48
2014-03-01 01:08:03	2014-03-01 01:12:51	00:04:48
2014-03-01 01:08:51	2014-03-01 01:13:18	00:04:27
2014-03-01 01:09:20	2014-03-01 01:24:18	00:14:58
2014-03-01 01:09:46	2014-03-01 01:22:34	00:12:48
2014-03-01 01:12:41	2014-03-01 01:15:38	00:02:57
2014-03-01 01:12:11	2014-03-01 01:27:38	00:15:27
2014-03-01 01:13:55	2014-03-01 01:34:54	00:20:59
2014-03-01 01:14:06	2014-03-01 01:28:25	00:14:19

SELECT … FROM

Solution 2: apply the STR_TO_DATE() function inside the TIMEDIFF() function

SELECT 
   pickup_datetime,
   dropoff_datetime, 
   TIMEDIFF(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T"), 
            STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T")) AS length_time 
FROM yellow_old
LIMIT 0, 20;

Displaying records 1 - 10
pickup_datetime	dropoff_datetime	length_time

2014-03-01 01:07:38	2014-03-01 01:16:26	00:08:48
2014-03-01 01:08:03	2014-03-01 01:12:51	00:04:48
2014-03-01 01:08:51	2014-03-01 01:13:18	00:04:27
2014-03-01 01:09:20	2014-03-01 01:24:18	00:14:58
2014-03-01 01:09:46	2014-03-01 01:22:34	00:12:48
2014-03-01 01:12:41	2014-03-01 01:15:38	00:02:57
2014-03-01 01:12:11	2014-03-01 01:27:38	00:15:27
2014-03-01 01:13:55	2014-03-01 01:34:54	00:20:59
2014-03-01 01:14:06	2014-03-01 01:28:25	00:14:19

Important note:

There is a distinction between clauses that operate on the variables of the original table versus those that operate on the variables of the results set.

pickup_datetime and dropoff_datetime are columns in the original table - they are written to disk on the SQL server.

pickup, dropoff, and length_time exist only in the results set, which is passed from the server (SQL server) to the client (e.g., RStudio or DBeaver) and is not written to disk.

SELECT DISTINCT

Returns only unique rows.

SELECT DISTINCT payment_type
FROM yellow_old
LIMIT 0, 20;

6 records
payment_type

CRD
CSH
NOC
DIS
UNK

SELECT DISTINCT vendor_id, payment_type
FROM yellow_old
LIMIT 0, 20;

8 records
vendor_id	payment_type

CMT	CRD
CMT	CSH
CMT	NOC
CMT	DIS
VTS	CRD
VTS	CSH
VTS	UNK

In case you are curious:

VTS is Verifone Transportation Systems and CMT is Mobile Knowledge Systems Inc.
CRD is credit card; CSH is cash; NOC is no charge; DIS is dispute

WHERE

The WHERE clause is analogous to the filter() function in dplyr. However, keep in mind that there are two SQL commands that resemble the dplyr filter() function. WHERE operates on the original data in the table and HAVING operates on the result set.

WHERE

What was the fare for those taxi rides where the tip_amount was more than $10 and the person used cash?

Note that in SQL the equality logical is = and in R the equality logical is ==.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount > 10
   AND payment_type = "CSH"
LIMIT 0, 10;

Displaying records 1 - 10
payment_type	fare_amount	tip_amount	total_amount
CSH	65.5	15.3	91.8
CSH	52.0	11.6	69.4
CSH	52.0	11.6	69.4
CSH	55.0	16.2	81.2
CSH	71.5	20.0	103.5
CSH	70.0	16.2	97.1
CSH	95.0	21.9	131.2
CSH	62.5	15.5	93.0
CSH	66.0	15.0	90.0
CSH	65.0	13.2	79.2

WHERE

BETWEEN can be used to specify a range of values for a numeric value. BETWEEN is inclusive.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 12
   AND payment_type = "CSH"
LIMIT 0, 10;

Displaying records 1 - 10
payment_type	fare_amount	tip_amount	total_amount
CSH	52.0	11.6	69.4
CSH	52.0	11.6	69.4
CSH	88.0	10.0	107.0
CSH	72.0	10.0	94.0
CSH	64.5	10.0	85.5
CSH	66.0	12.0	93.0
CSH	52.0	11.6	69.4
CSH	69.0	10.0	88.0
CSH	90.0	10.0	100.0
CSH	52.0	11.6	69.4

WHERE

IN is similar to the dplyr %in% function which specifies distinct values for the variable.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount IN (10, 12)
   AND payment_type = "CSH"
LIMIT 0, 10;

Displaying records 1 - 10
payment_type	fare_amount	tip_amount	total_amount
CSH	88.0	10	107.0
CSH	72.0	10	94.0
CSH	64.5	10	85.5
CSH	66.0	12	93.0
CSH	69.0	10	88.0
CSH	90.0	10	100.0
CSH	74.5	10	90.3
CSH	89.0	10	118.1
CSH	52.0	10	67.8
CSH	66.0	12	90.0

WHERE

AND takes precedent over OR in the order of operations, when there are no parentheses.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 12 OR 
      total_amount BETWEEN 100 and 112 AND 
      payment_type = "CSH"
LIMIT 0, 10;

Displaying records 1 - 10
payment_type	fare_amount	tip_amount	total_amount
CRD	52.0	10.5	63.0
CRD	35.0	10.2	51.0
CRD	52.0	11.6	69.4
CRD	30.5	10.8	47.2
CRD	52.0	10.5	63.0
CRD	52.0	10.5	63.0
CRD	52.0	11.6	69.4
CRD	52.0	11.6	69.4
CRD	52.0	11.6	69.4
CRD	52.0	11.6	69.4

WHERE

Parentheses take precedent over AND and OR.

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE (tip_amount BETWEEN 10 and 12 OR 
      total_amount BETWEEN 100 and 112 ) AND 
      payment_type = "CSH"
LIMIT 0, 10;

Displaying records 1 - 10
payment_type	fare_amount	tip_amount	total_amount
CSH	107.0	0.0	108.0
CSH	92.5	0.0	103.5
CSH	99.5	0.0	105.3
CSH	92.0	0.0	106.3
CSH	103.0	0.0	109.3
CSH	107.0	0.0	107.0
CSH	104.5	0.0	105.5
CSH	112.0	0.0	112.0
CSH	52.0	11.6	69.4
CSH	52.0	11.6	69.4

WHERE

IS NULL not = NULL (because NULL indicates unknown)

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE payment_type IS NULL
LIMIT 0, 10;

1 records
payment_type	fare_amount	tip_amount	total_amount

SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE payment_type = NULL
LIMIT 0, 10;

0 records
payment_type	fare_amount	tip_amount	total_amount

A `NULL` example

The logic of NULL:¹

If you do anything with NULL, you’ll just get NULL. For instance if $x$ is NULL, then $x > 3$, $1 = x$, and $x + 4$ all evaluate to NULL. Even $x =$ NULL evaluates to NULL! if you want to check whether $x$ is NULL, use x IS NULL or x IS NOT NULL.
NULL short-circuits with boolean operators. That means a boolean expression involving NULL will evaluate to:
- TRUE, if it’d evaluate to TRUE regardless of whether the NULL value is really TRUE or FALSE.
- FALSE, if it’d evaluate to FALSE regardless of whether the NULL value is really TRUE or FALSE.
- Or NULL, if it depends on the NULL value.

A `NULL` example

Consider the following table and SQL query to evaluate WHERE age <= 20 OR num_dogs = 3:

SELECT * FROM (
   SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
   UNION
   SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs   
   UNION
   SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
   UNION
   SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
   ) AS temptable;

4 records
name	age	num_dogs
Ace	20	4
Ada		3
Ben
Cho	27

A `NULL` example

Where does the WHERE clause do? It tells us that we only want to keep the rows satisfying the age <= 20 OR num_dogs = 3. Let’s consider each row one at a time:

For Ace, age <= 20 evaluates to TRUE so the claim is satisfied.
For Ada, age <= 20 evaluates to NULL but num_dogs = 3 evaluates to TRUE so the claim is satisfied.
For Ben, age <= 20 evaluates to NULL and num_dogs = 3 evaluates to NULL so the overall expression is NULL which has a FALSE value.
For Cho, age <= 20 evaluates to FALSE and num_dogs = 3 evaluates to NULL so the overall expression evaluates to NULL (because it depends on the value of the NULL).

Thus we keep only Ace and Ada.

A `NULL` example

SELECT * FROM (
   SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
   UNION
   SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs   
   UNION
   SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
   UNION
   SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
   ) AS temptable
WHERE age <= 20 OR num_dogs = 3;

2 records
name	age	num_dogs
Ace	20	4
Ada		3

GROUP BY

The GROUP BY clause will direct SQL to carry out the query separately for each category in the grouped variable.

aggregate functions include COUNT(), SUM(), MAX(), MIN(), and AVG().

SELECT COUNT(*) AS num_transactions, 
       SUM(1) AS num_transactions_also,
       SUM(2) AS double_transactions,
       payment_type 
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 20
GROUP BY payment_type;

5 records
num_transactions	num_transactions_also	double_transactions	payment_type
213872	213872	427744	CRD
78	78	156	CSH
3	3	6	DIS
7	7	14	NOC
609	609	1218	UNK

GROUP BY

For those people who tipped between $10 and $20, what was the lowest and highest fare for each of the types of payments?

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type 
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 20
GROUP BY payment_type;

5 records
num_transactions	lowest_fare	highest_fare	payment_type
213872	0.0	370.0	CRD
78	52.0	102.0	CSH
3	52.0	79.5	DIS
7	58.0	94.0	NOC
609	4.5	147.0	UNK

GROUP BY

GROUP BY will work applied to multiple columns.

What is wday?

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday;

Displaying records 1 - 10
num_transactions	lowest_fare	highest_fare	payment_type	wday
1
1247397	2.5	477	CRD	Friday
1278362	-612.4	500	CRD	Monday
1533796	2.5	420	CRD	Saturday
1324394	2.5	480	CRD	Sunday
1258098	2.5	500	CRD	Thursday
1121081	2.5	500	CRD	Tuesday
1192892	2.5	400	CRD	Wednesday
860920	2.5	444	CSH	Friday
918653	0.0	873	CSH	Monday

ORDER BY

ORDER BY allows us to look at interesting aspects of the data by sorting the data.

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
ORDER BY lowest_fare ASC;

Displaying records 1 - 10
num_transactions	lowest_fare	highest_fare	payment_type	wday
1
1278362	-612.4	500	CRD	Monday
918653	0.0	873	CSH	Monday
5440	0.0	950	NOC	Monday
1537	0.0	102	DIS	Monday
1533796	2.5	420	CRD	Saturday
1121081	2.5	500	CRD	Tuesday
1192892	2.5	400	CRD	Wednesday
1258098	2.5	500	CRD	Thursday
1247397	2.5	477	CRD	Friday

WHAT?!?!! How in the world was one of the fares -$612.40?

ORDER BY

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
ORDER BY highest_fare DESC;

Displaying records 1 - 10
num_transactions	lowest_fare	highest_fare	payment_type	wday
5440	0.0	950	NOC	Monday
918653	0.0	873	CSH	Monday
1278362	-612.4	500	CRD	Monday
1121081	2.5	500	CRD	Tuesday
1258098	2.5	500	CRD	Thursday
1324394	2.5	480	CRD	Sunday
1247397	2.5	477	CRD	Friday
1222	2.5	475	DIS	Tuesday
813813	2.5	475	CSH	Thursday
860920	2.5	444	CSH	Friday

$950 is a lot to pay for a cab ride! But in NYC, I’d believe it.

ORDER BY

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
ORDER BY wday, payment_type;

Displaying records 1 - 10
num_transactions	lowest_fare	highest_fare	payment_type	wday
1
1247397	2.5	477	CRD	Friday
860920	2.5	444	CSH	Friday
1592	2.5	255	DIS	Friday
5252	2.5	229	NOC	Friday
10131	2.5	130	UNK	Friday
1278362	-612.4	500	CRD	Monday
918653	0.0	873	CSH	Monday
1537	0.0	102	DIS	Monday
5440	0.0	950	NOC	Monday

ORDER BY

Note that both GROUP BY and ORDER BY evaluate the data after it has been retrieved. Therefore, the functions operate on the results set, not the original rows of the data.

We are able to GROUP BY and ORDER BY on the new variables we had created, wday.

HAVING

Recall that WHERE acts only on the original data. If we are interested in rides that took place on Friday, we need to use the derived variable wday instead of the raw variable pickup_datetime. Fortunately, HAVING works on the results set.

HAVING

SELECT COUNT(*) AS num_transactions, 
       MIN(fare_amount) AS lowest_fare,
       MAX(fare_amount) AS highest_fare,
       payment_type,
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
HAVING wday = 'Friday';

5 records
num_transactions	lowest_fare	highest_fare	payment_type	wday
1247397	2.5	477	CRD	Friday
860920	2.5	444	CSH	Friday
1592	2.5	255	DIS	Friday
5252	2.5	229	NOC	Friday
10131	2.5	130	UNK	Friday

HAVING

While it worked out quite well for us that HAVING was able to filter the data based on the results set, the use of HAVING was quite onerous because the entire data set was considered before the filter was applied. That is, if the filter can be done on the original data using WHERE, the query will be much faster and more efficient.

Note: HAVING requires a GROUP BY clause. And the variable(s) used in HAVING must also be part of the GROUP BY clause.

LIMIT

LIMIT truncates the query to specified rows. The first number is the offset (i.e., the number of rows to skip), the second number is the (maximum) number of rows to return. Here, we return rows 15428119 through 15428128.

The first number is optional.

SELECT * FROM yellow_old LIMIT 15428118, 10;

Displaying records 1 - 10
vendor_id	pickup_datetime	dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	rate_code	store_and_fwd_flag	dropoff_longitude	dropoff_latitude	payment_type	fare_amount	surcharge	mta_tax	tip_amount	tolls_amount	total_amount
CMT	2014-03-18 14:35:21	2014-03-18 14:52:01	1	2.7	0	0.0	1	N	0	0.0	CRD	13.0	0	0.5	2.50	0.00	16.0
CMT	2014-03-18 14:08:23	2014-03-18 14:19:29	2	1.3	0	0.0	1	N	0	0.0	CRD	9.0	0	0.5	1.90	0.00	11.4
CMT	2014-03-18 09:18:38	2014-03-18 09:19:41	1	0.2	-74	40.8	1	N	-74	40.8	CRD	3.0	0	0.5	1.00	0.00	4.5
CMT	2014-03-18 06:28:12	2014-03-18 06:49:49	1	9.9	0	0.0	1	N	0	0.0	CRD	30.0	0	0.5	7.16	5.33	43.0
CMT	2014-03-18 17:39:28	2014-03-18 17:53:01	1	4.9	-74	40.8	1	N	-74	40.7	CRD	16.5	1	0.5	3.00	0.00	21.0
CMT	2014-03-18 18:14:19	2014-03-18 18:27:22	1	0.3	-74	40.7	1	N	-74	40.7	CRD	3.5	1	0.5	6.00	0.00	11.0
CMT	2014-03-18 10:12:33	2014-03-18 10:28:09	1	3.3	-74	40.8	1	N	-74	40.8	CRD	13.5	0	0.5	4.20	0.00	18.2
CMT	2014-03-18 09:02:37	2014-03-18 09:16:29	1	6.2	-74	40.8	1	N	-74	40.7	CRD	19.5	0	0.5	1.00	0.00	21.0
CMT	2014-03-18 10:10:19	2014-03-18 10:19:25	1	1.7	-74	40.8	1	N	-74	40.8	CRD	8.5	0	0.5	1.80	0.00	10.8
CMT	2014-03-18 15:24:53	2014-03-18 15:42:42	1	1.7	-74	40.8	1	N	-74	40.7	CRD	12.5	0	0.5	2.60	0.00	15.6

Saving SQL queries as R objects

If you are working in R to run SQL commands, you may want to use the query output for further analysis or visualizations.

use #|output.var: "name_of_variable" inside the {sql} chunk.
name_of_variable will then be available to be used in the R environment.

```{sql}
#| connection: con_taxi
#| label: new-table
#| output.var: "new_table"

SELECT *, DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old 
LIMIT 0, 1000;
```

Saving SQL queries as R objects

```{r}
new_table |>
  drop_na(wday) |>
  ggplot(aes(x = fare_amount, y = tip_amount, color = wday)) + 
  geom_point() 
```

Good practice

Always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_taxi, shutdown = TRUE)

SQL clauses

Today’s example

NYC Taxis

Establishing a SQL connection

SHOW TABLES

DESCRIBE tables

SQL clauses

Order of clauses

SELECT … FROM

SELECT … FROM

SELECT … FROM

SELECT … FROM

SELECT … FROM

SELECT … FROM

SELECT … FROM

Important note:

SELECT DISTINCT

WHERE

WHERE

WHERE

WHERE

WHERE

WHERE

WHERE

A NULL example

A NULL example

A NULL example

A NULL example

GROUP BY

GROUP BY

GROUP BY

ORDER BY

ORDER BY

ORDER BY

ORDER BY

HAVING

HAVING

HAVING

LIMIT

Saving SQL queries as R objects

Saving SQL queries as R objects

Good practice

A `NULL` example

A `NULL` example

A `NULL` example

A `NULL` example