2024-01-09
Consider a database of taxi rides from the Yellow Cab company in NYC in March of 2014.

Figure 1: image credit: Mariordo (Mario Roberto Durán Ortiz), CC BY-SA 3.0
There is only one table in the nyctaxi database. It is called yellow_old.
Still using a {sql} chunk. The DESCRIBE command shows the 18 field names (variables) in the yellow_old table. Some of the variables are characters (text) and some are numeric (either double or bigint)
| Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|
| vendor_id | text | YES | |||
| pickup_datetime | text | YES | |||
| dropoff_datetime | text | YES | |||
| passenger_count | bigint(20) | YES | |||
| trip_distance | double | YES | |||
| pickup_longitude | double | YES | |||
| pickup_latitude | double | YES | |||
| rate_code | bigint(20) | YES | |||
| store_and_fwd_flag | text | YES | |||
| dropoff_longitude | double | YES |
Queries in SQL start with the SELECT keyword and consist of several clauses, which must be written in the following order:
SELECTFROMJOINWHEREGROUP BYHAVINGORDER BYLIMITThe clauses are similar to data wrangling verbs in R.
SELECT, include a corresponding FROM* returns every columnThe shortest SQL query is the following SELECT command. Do not run this command!!! The yellow_old table has 15 million rows, and we do not want to look at them simultaneously.
Much better for big tables:
| vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CMT | 2014-03-01 01:07:38 | 2014-03-01 01:16:26 | 1 | 2.0 | -74.0 | 40.7 | 1 | N | -73.9 | 40.7 | CRD | 9.0 | 0.5 | 0.5 | 2.0 | 0 | 12.0 |
| CMT | 2014-03-01 01:08:03 | 2014-03-01 01:12:51 | 2 | 1.2 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 6.0 | 0.5 | 0.5 | 1.0 | 0 | 8.0 |
| CMT | 2014-03-01 01:08:51 | 2014-03-01 01:13:18 | 3 | 0.5 | -73.9 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 5.0 | 0.5 | 0.5 | 1.2 | 0 | 7.2 |
| CMT | 2014-03-01 01:09:20 | 2014-03-01 01:24:18 | 3 | 3.5 | -74.0 | 40.7 | 1 | N | -74.0 | 40.8 | CRD | 14.0 | 0.5 | 0.5 | 3.0 | 0 | 18.0 |
| CMT | 2014-03-01 01:09:46 | 2014-03-01 01:22:34 | 1 | 1.8 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 10.5 | 0.5 | 0.5 | 1.0 | 0 | 12.5 |
| CMT | 2014-03-01 01:12:41 | 2014-03-01 01:15:38 | 1 | 0.5 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 4.0 | 0.5 | 0.5 | 0.5 | 0 | 5.5 |
| CMT | 2014-03-01 01:12:11 | 2014-03-01 01:27:38 | 2 | 3.7 | -74.0 | 40.8 | 1 | N | -74.0 | 40.7 | CRD | 14.5 | 0.5 | 0.5 | 3.1 | 0 | 18.6 |
| CMT | 2014-03-01 01:13:55 | 2014-03-01 01:34:54 | 1 | 5.4 | -74.0 | 40.8 | 1 | N | -74.0 | 40.7 | CRD | 20.0 | 0.5 | 0.5 | 3.0 | 0 | 24.0 |
| CMT | 2014-03-01 01:14:06 | 2014-03-01 01:28:25 | 1 | 3.5 | -74.0 | 40.7 | 1 | N | -74.0 | 40.8 | CRD | 13.5 | 0.5 | 0.5 | 2.9 | 0 | 17.4 |
How do we know how many taxi rides are recorded? Two different ways of counting the rows:
Yikes, more than 15 million taxi rides!
What is the length of a taxi ride (in time)? … unfortunately, pickup_datetime and dropoff_datetime are saved as character strings instead of in DateTime format.
SELECT
pickup_datetime, dropoff_datetime,
STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS pickup,
STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T") AS dropoff
FROM yellow_old
LIMIT 0, 10;| pickup_datetime | dropoff_datetime | pickup | dropoff |
|---|---|---|---|
| 2014-03-01 01:07:38 | 2014-03-01 01:16:26 | 2014-03-01 01:07:38 | 2014-03-01 01:16:26 |
| 2014-03-01 01:08:03 | 2014-03-01 01:12:51 | 2014-03-01 01:08:03 | 2014-03-01 01:12:51 |
| 2014-03-01 01:08:51 | 2014-03-01 01:13:18 | 2014-03-01 01:08:51 | 2014-03-01 01:13:18 |
| 2014-03-01 01:09:20 | 2014-03-01 01:24:18 | 2014-03-01 01:09:20 | 2014-03-01 01:24:18 |
| 2014-03-01 01:09:46 | 2014-03-01 01:22:34 | 2014-03-01 01:09:46 | 2014-03-01 01:22:34 |
| 2014-03-01 01:12:41 | 2014-03-01 01:15:38 | 2014-03-01 01:12:41 | 2014-03-01 01:15:38 |
| 2014-03-01 01:12:11 | 2014-03-01 01:27:38 | 2014-03-01 01:12:11 | 2014-03-01 01:27:38 |
| 2014-03-01 01:13:55 | 2014-03-01 01:34:54 | 2014-03-01 01:13:55 | 2014-03-01 01:34:54 |
| 2014-03-01 01:14:06 | 2014-03-01 01:28:25 | 2014-03-01 01:14:06 | 2014-03-01 01:28:25 |
Why can’t we find the difference between the new two time variables?
Solution 1: two layers of SELECT
SELECT (i.e., inside) layer creates the new variablesSELECT (i.e., outside) layer subtracts the two timesSELECT
pickup,
dropoff,
TIMEDIFF(pickup, dropoff) AS length_time
FROM (
SELECT
STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS pickup,
STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T") AS dropoff
FROM yellow_old)
AS subquery_table
LIMIT 0, 20;| pickup | dropoff | length_time |
|---|---|---|
| 2014-03-01 01:07:38 | 2014-03-01 01:16:26 | 00:08:48 |
| 2014-03-01 01:08:03 | 2014-03-01 01:12:51 | 00:04:48 |
| 2014-03-01 01:08:51 | 2014-03-01 01:13:18 | 00:04:27 |
| 2014-03-01 01:09:20 | 2014-03-01 01:24:18 | 00:14:58 |
| 2014-03-01 01:09:46 | 2014-03-01 01:22:34 | 00:12:48 |
| 2014-03-01 01:12:41 | 2014-03-01 01:15:38 | 00:02:57 |
| 2014-03-01 01:12:11 | 2014-03-01 01:27:38 | 00:15:27 |
| 2014-03-01 01:13:55 | 2014-03-01 01:34:54 | 00:20:59 |
| 2014-03-01 01:14:06 | 2014-03-01 01:28:25 | 00:14:19 |
Solution 2: apply the STR_TO_DATE() function inside the TIMEDIFF() function
SELECT
pickup_datetime,
dropoff_datetime,
TIMEDIFF(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T"),
STR_TO_DATE(dropoff_datetime, "%Y-%m-%d %T")) AS length_time
FROM yellow_old
LIMIT 0, 20;| pickup_datetime | dropoff_datetime | length_time |
|---|---|---|
| 2014-03-01 01:07:38 | 2014-03-01 01:16:26 | 00:08:48 |
| 2014-03-01 01:08:03 | 2014-03-01 01:12:51 | 00:04:48 |
| 2014-03-01 01:08:51 | 2014-03-01 01:13:18 | 00:04:27 |
| 2014-03-01 01:09:20 | 2014-03-01 01:24:18 | 00:14:58 |
| 2014-03-01 01:09:46 | 2014-03-01 01:22:34 | 00:12:48 |
| 2014-03-01 01:12:41 | 2014-03-01 01:15:38 | 00:02:57 |
| 2014-03-01 01:12:11 | 2014-03-01 01:27:38 | 00:15:27 |
| 2014-03-01 01:13:55 | 2014-03-01 01:34:54 | 00:20:59 |
| 2014-03-01 01:14:06 | 2014-03-01 01:28:25 | 00:14:19 |
There is a distinction between clauses that operate on the variables of the original table versus those that operate on the variables of the results set.
pickup_datetime and dropoff_datetime are columns in the original table - they are written to disk on the SQL server.
pickup, dropoff, and length_time exist only in the results set, which is passed from the server (SQL server) to the client (e.g., RStudio or DBeaver) and is not written to disk.
Returns only unique rows.
| payment_type |
|---|
| CRD |
| CSH |
| NOC |
| DIS |
| UNK |
| vendor_id | payment_type |
|---|---|
| CMT | CRD |
| CMT | CSH |
| CMT | NOC |
| CMT | DIS |
| VTS | CRD |
| VTS | CSH |
| VTS | UNK |
In case you are curious:
VTS is Verifone Transportation Systems and CMT is Mobile Knowledge Systems Inc.CRD is credit card; CSH is cash; NOC is no charge; DIS is disputeThe WHERE clause is analogous to the filter() function in dplyr. However, keep in mind that there are two SQL commands that resemble the dplyr filter() function. WHERE operates on the original data in the table and HAVING operates on the result set.
What was the fare for those taxi rides where the tip_amount was more than $10 and the person used cash?
Note that in SQL the equality logical is = and in R the equality logical is ==.
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount > 10
AND payment_type = "CSH"
LIMIT 0, 10;| payment_type | fare_amount | tip_amount | total_amount |
|---|---|---|---|
| CSH | 65.5 | 15.3 | 91.8 |
| CSH | 52.0 | 11.6 | 69.4 |
| CSH | 52.0 | 11.6 | 69.4 |
| CSH | 55.0 | 16.2 | 81.2 |
| CSH | 71.5 | 20.0 | 103.5 |
| CSH | 70.0 | 16.2 | 97.1 |
| CSH | 95.0 | 21.9 | 131.2 |
| CSH | 62.5 | 15.5 | 93.0 |
| CSH | 66.0 | 15.0 | 90.0 |
| CSH | 65.0 | 13.2 | 79.2 |
BETWEEN can be used to specify a range of values for a numeric value. BETWEEN is inclusive.
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 12
AND payment_type = "CSH"
LIMIT 0, 10;| payment_type | fare_amount | tip_amount | total_amount |
|---|---|---|---|
| CSH | 52.0 | 11.6 | 69.4 |
| CSH | 52.0 | 11.6 | 69.4 |
| CSH | 88.0 | 10.0 | 107.0 |
| CSH | 72.0 | 10.0 | 94.0 |
| CSH | 64.5 | 10.0 | 85.5 |
| CSH | 66.0 | 12.0 | 93.0 |
| CSH | 52.0 | 11.6 | 69.4 |
| CSH | 69.0 | 10.0 | 88.0 |
| CSH | 90.0 | 10.0 | 100.0 |
| CSH | 52.0 | 11.6 | 69.4 |
IN is similar to the dplyr %in% function which specifies distinct values for the variable.
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount IN (10, 12)
AND payment_type = "CSH"
LIMIT 0, 10;| payment_type | fare_amount | tip_amount | total_amount |
|---|---|---|---|
| CSH | 88.0 | 10 | 107.0 |
| CSH | 72.0 | 10 | 94.0 |
| CSH | 64.5 | 10 | 85.5 |
| CSH | 66.0 | 12 | 93.0 |
| CSH | 69.0 | 10 | 88.0 |
| CSH | 90.0 | 10 | 100.0 |
| CSH | 74.5 | 10 | 90.3 |
| CSH | 89.0 | 10 | 118.1 |
| CSH | 52.0 | 10 | 67.8 |
| CSH | 66.0 | 12 | 90.0 |
AND takes precedent over OR in the order of operations, when there are no parentheses.
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 12 OR
total_amount BETWEEN 100 and 112 AND
payment_type = "CSH"
LIMIT 0, 10;| payment_type | fare_amount | tip_amount | total_amount |
|---|---|---|---|
| CRD | 52.0 | 10.5 | 63.0 |
| CRD | 35.0 | 10.2 | 51.0 |
| CRD | 52.0 | 11.6 | 69.4 |
| CRD | 30.5 | 10.8 | 47.2 |
| CRD | 52.0 | 10.5 | 63.0 |
| CRD | 52.0 | 10.5 | 63.0 |
| CRD | 52.0 | 11.6 | 69.4 |
| CRD | 52.0 | 11.6 | 69.4 |
| CRD | 52.0 | 11.6 | 69.4 |
| CRD | 52.0 | 11.6 | 69.4 |
Parentheses take precedent over AND and OR.
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE (tip_amount BETWEEN 10 and 12 OR
total_amount BETWEEN 100 and 112 ) AND
payment_type = "CSH"
LIMIT 0, 10;| payment_type | fare_amount | tip_amount | total_amount |
|---|---|---|---|
| CSH | 107.0 | 0.0 | 108.0 |
| CSH | 92.5 | 0.0 | 103.5 |
| CSH | 99.5 | 0.0 | 105.3 |
| CSH | 92.0 | 0.0 | 106.3 |
| CSH | 103.0 | 0.0 | 109.3 |
| CSH | 107.0 | 0.0 | 107.0 |
| CSH | 104.5 | 0.0 | 105.5 |
| CSH | 112.0 | 0.0 | 112.0 |
| CSH | 52.0 | 11.6 | 69.4 |
| CSH | 52.0 | 11.6 | 69.4 |
IS NULL not = NULL (because NULL indicates unknown)
NULL exampleThe logic of NULL:1
NULL, you’ll just get NULL. For instance if x is NULL, then x>3, 1=x, and x+4 all evaluate to NULL. Even x= NULL evaluates to NULL! if you want to check whether x is NULL, use x IS NULL or x IS NOT NULL.NULL short-circuits with boolean operators. That means a boolean expression involving NULL will evaluate to:
NULL value is really TRUE or FALSE.NULL value is really TRUE or FALSE.NULL, if it depends on the NULL value.NULL exampleConsider the following table and SQL query to evaluate WHERE age <= 20 OR num_dogs = 3:
SELECT * FROM (
SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
UNION
SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs
UNION
SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
UNION
SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
) AS temptable;| name | age | num_dogs |
|---|---|---|
| Ace | 20 | 4 |
| Ada | 3 | |
| Ben | ||
| Cho | 27 |
NULL exampleWhere does the WHERE clause do? It tells us that we only want to keep the rows satisfying the age <= 20 OR num_dogs = 3. Let’s consider each row one at a time:
age <= 20 evaluates to TRUE so the claim is satisfied.age <= 20 evaluates to NULL but num_dogs = 3 evaluates to TRUE so the claim is satisfied.age <= 20 evaluates to NULL and num_dogs = 3 evaluates to NULL so the overall expression is NULL which has a FALSE value.age <= 20 evaluates to FALSE and num_dogs = 3 evaluates to NULL so the overall expression evaluates to NULL (because it depends on the value of the NULL).Thus we keep only Ace and Ada.
NULL exampleSELECT * FROM (
SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
UNION
SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs
UNION
SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
UNION
SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
) AS temptable
WHERE age <= 20 OR num_dogs = 3;| name | age | num_dogs |
|---|---|---|
| Ace | 20 | 4 |
| Ada | 3 |
The GROUP BY clause will direct SQL to carry out the query separately for each category in the grouped variable.
COUNT(), SUM(), MAX(), MIN(), and AVG().SELECT COUNT(*) AS num_transactions,
SUM(1) AS num_transactions_also,
SUM(2) AS double_transactions,
payment_type
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 20
GROUP BY payment_type;| num_transactions | num_transactions_also | double_transactions | payment_type |
|---|---|---|---|
| 213872 | 213872 | 427744 | CRD |
| 78 | 78 | 156 | CSH |
| 3 | 3 | 6 | DIS |
| 7 | 7 | 14 | NOC |
| 609 | 609 | 1218 | UNK |
For those people who tipped between $10 and $20, what was the lowest and highest fare for each of the types of payments?
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 20
GROUP BY payment_type;| num_transactions | lowest_fare | highest_fare | payment_type |
|---|---|---|---|
| 213872 | 0.0 | 370.0 | CRD |
| 78 | 52.0 | 102.0 | CSH |
| 3 | 52.0 | 79.5 | DIS |
| 7 | 58.0 | 94.0 | NOC |
| 609 | 4.5 | 147.0 | UNK |
GROUP BY will work applied to multiple columns.
What is wday?
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,
DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday;| num_transactions | lowest_fare | highest_fare | payment_type | wday |
|---|---|---|---|---|
| 1 | ||||
| 1247397 | 2.5 | 477 | CRD | Friday |
| 1278362 | -612.4 | 500 | CRD | Monday |
| 1533796 | 2.5 | 420 | CRD | Saturday |
| 1324394 | 2.5 | 480 | CRD | Sunday |
| 1258098 | 2.5 | 500 | CRD | Thursday |
| 1121081 | 2.5 | 500 | CRD | Tuesday |
| 1192892 | 2.5 | 400 | CRD | Wednesday |
| 860920 | 2.5 | 444 | CSH | Friday |
| 918653 | 0.0 | 873 | CSH | Monday |
ORDER BY allows us to look at interesting aspects of the data by sorting the data.
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,
DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
ORDER BY lowest_fare ASC;| num_transactions | lowest_fare | highest_fare | payment_type | wday |
|---|---|---|---|---|
| 1 | ||||
| 1278362 | -612.4 | 500 | CRD | Monday |
| 918653 | 0.0 | 873 | CSH | Monday |
| 5440 | 0.0 | 950 | NOC | Monday |
| 1537 | 0.0 | 102 | DIS | Monday |
| 1533796 | 2.5 | 420 | CRD | Saturday |
| 1121081 | 2.5 | 500 | CRD | Tuesday |
| 1192892 | 2.5 | 400 | CRD | Wednesday |
| 1258098 | 2.5 | 500 | CRD | Thursday |
| 1247397 | 2.5 | 477 | CRD | Friday |
WHAT?!?!! How in the world was one of the fares -$612.40?
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,
DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
ORDER BY highest_fare DESC;| num_transactions | lowest_fare | highest_fare | payment_type | wday |
|---|---|---|---|---|
| 5440 | 0.0 | 950 | NOC | Monday |
| 918653 | 0.0 | 873 | CSH | Monday |
| 1278362 | -612.4 | 500 | CRD | Monday |
| 1121081 | 2.5 | 500 | CRD | Tuesday |
| 1258098 | 2.5 | 500 | CRD | Thursday |
| 1324394 | 2.5 | 480 | CRD | Sunday |
| 1247397 | 2.5 | 477 | CRD | Friday |
| 1222 | 2.5 | 475 | DIS | Tuesday |
| 813813 | 2.5 | 475 | CSH | Thursday |
| 860920 | 2.5 | 444 | CSH | Friday |
$950 is a lot to pay for a cab ride! But in NYC, I’d believe it.
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,
DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
ORDER BY wday, payment_type;| num_transactions | lowest_fare | highest_fare | payment_type | wday |
|---|---|---|---|---|
| 1 | ||||
| 1247397 | 2.5 | 477 | CRD | Friday |
| 860920 | 2.5 | 444 | CSH | Friday |
| 1592 | 2.5 | 255 | DIS | Friday |
| 5252 | 2.5 | 229 | NOC | Friday |
| 10131 | 2.5 | 130 | UNK | Friday |
| 1278362 | -612.4 | 500 | CRD | Monday |
| 918653 | 0.0 | 873 | CSH | Monday |
| 1537 | 0.0 | 102 | DIS | Monday |
| 5440 | 0.0 | 950 | NOC | Monday |
Note that both GROUP BY and ORDER BY evaluate the data after it has been retrieved. Therefore, the functions operate on the results set, not the original rows of the data.
We are able to GROUP BY and ORDER BY on the new variables we had created, wday.
Recall that WHERE acts only on the original data. If we are interested in rides that took place on Friday, we need to use the derived variable wday instead of the raw variable pickup_datetime. Fortunately, HAVING works on the results set.
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,
DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY payment_type, wday
HAVING wday = 'Friday';| num_transactions | lowest_fare | highest_fare | payment_type | wday |
|---|---|---|---|---|
| 1247397 | 2.5 | 477 | CRD | Friday |
| 860920 | 2.5 | 444 | CSH | Friday |
| 1592 | 2.5 | 255 | DIS | Friday |
| 5252 | 2.5 | 229 | NOC | Friday |
| 10131 | 2.5 | 130 | UNK | Friday |
While it worked out quite well for us that HAVING was able to filter the data based on the results set, the use of HAVING was quite onerous because the entire data set was considered before the filter was applied. That is, if the filter can be done on the original data using WHERE, the query will be much faster and more efficient.
Note: HAVING requires a GROUP BY clause. And the variable(s) used in HAVING must also be part of the GROUP BY clause.
LIMIT truncates the query to specified rows. The first number is the offset (i.e., the number of rows to skip), the second number is the (maximum) number of rows to return. Here, we return rows 15428119 through 15428128.
The first number is optional.
| vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CMT | 2014-03-18 14:35:21 | 2014-03-18 14:52:01 | 1 | 2.7 | 0 | 0.0 | 1 | N | 0 | 0.0 | CRD | 13.0 | 0 | 0.5 | 2.50 | 0.00 | 16.0 |
| CMT | 2014-03-18 14:08:23 | 2014-03-18 14:19:29 | 2 | 1.3 | 0 | 0.0 | 1 | N | 0 | 0.0 | CRD | 9.0 | 0 | 0.5 | 1.90 | 0.00 | 11.4 |
| CMT | 2014-03-18 09:18:38 | 2014-03-18 09:19:41 | 1 | 0.2 | -74 | 40.8 | 1 | N | -74 | 40.8 | CRD | 3.0 | 0 | 0.5 | 1.00 | 0.00 | 4.5 |
| CMT | 2014-03-18 06:28:12 | 2014-03-18 06:49:49 | 1 | 9.9 | 0 | 0.0 | 1 | N | 0 | 0.0 | CRD | 30.0 | 0 | 0.5 | 7.16 | 5.33 | 43.0 |
| CMT | 2014-03-18 17:39:28 | 2014-03-18 17:53:01 | 1 | 4.9 | -74 | 40.8 | 1 | N | -74 | 40.7 | CRD | 16.5 | 1 | 0.5 | 3.00 | 0.00 | 21.0 |
| CMT | 2014-03-18 18:14:19 | 2014-03-18 18:27:22 | 1 | 0.3 | -74 | 40.7 | 1 | N | -74 | 40.7 | CRD | 3.5 | 1 | 0.5 | 6.00 | 0.00 | 11.0 |
| CMT | 2014-03-18 10:12:33 | 2014-03-18 10:28:09 | 1 | 3.3 | -74 | 40.8 | 1 | N | -74 | 40.8 | CRD | 13.5 | 0 | 0.5 | 4.20 | 0.00 | 18.2 |
| CMT | 2014-03-18 09:02:37 | 2014-03-18 09:16:29 | 1 | 6.2 | -74 | 40.8 | 1 | N | -74 | 40.7 | CRD | 19.5 | 0 | 0.5 | 1.00 | 0.00 | 21.0 |
| CMT | 2014-03-18 10:10:19 | 2014-03-18 10:19:25 | 1 | 1.7 | -74 | 40.8 | 1 | N | -74 | 40.8 | CRD | 8.5 | 0 | 0.5 | 1.80 | 0.00 | 10.8 |
| CMT | 2014-03-18 15:24:53 | 2014-03-18 15:42:42 | 1 | 1.7 | -74 | 40.8 | 1 | N | -74 | 40.7 | CRD | 12.5 | 0 | 0.5 | 2.60 | 0.00 | 15.6 |
If you are working in R to run SQL commands, you may want to use the query output for further analysis or visualizations.
#|output.var: "name_of_variable" inside the {sql} chunk.name_of_variable will then be available to be used in the R environment.
Always a good idea to terminate the SQL connection when you are done with it.