Lab 2 - SQL clauses

Author

your name goes here

Published

Invalid Date

library(tidyverse)
library(DBI)
library(RMariaDB)
library(dbplyr)
library(mdsr)

Today’s lab will provide practice working with SQL clauses in DBeaver. Don’t forget the importance of the order of the SQL clauses.

The goals for lab 2 include:

Advice for turning in the assignment

  • Be sure to indicate (in the .sql file) which problem is being answered with which SQL code. Use the following syntax to comment within a .sql file: /* here is where comments go */. Indeed, feel free to copy the question into the .sql file so that you have it for your own records.

  • save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called SDS261 or something. That folder could live on your Desktop. Or maybe in your Dropbox.

  • The SQL document should be saved in the R Project as lab2-sds261-yourlastname-yourfirstname.sql. You will have to navigate to the R Project to save the DBeaver file in the correct place.

  • Connect to the nyctaxi database, which contains the yellow_old table. See README file (or lab 1) for connection details.

Assignment

  1. Select the trip_distance and total_amount columns from the yellow_old table. For safety (i.e., not crashing your computer), only return the first 5 rows.
SELECT trip_distance, total_amount
FROM yellow_old
LIMIT 0, 5;
5 records
trip_distance total_amount
NA NA
2.0 12.0
1.2 8.0
0.5 7.2
3.5 18.0
  1. Using the AVG() function, find the average total_amount paid in the yellow_old table.
SELECT AVG(total_amount)
FROM yellow_old
LIMIT 0, 5;
1 records
AVG(total_amount)
14.76179

The average total fare was $14.76

  1. Which type of payment_type had the highest total_amount? How much was it?
SELECT MAX(total_amount), payment_type
FROM yellow_old
GROUP BY payment_type
LIMIT 0, 5;
5 records
MAX(total_amount) payment_type
NA NA
540.00 CRD
1007.51 CSH
484.04 DIS
950.30 NOC

The highest total amount was a cash payment for $1007.51. Yikes, that’s a big bill for a taxi ride. Note that the second highest payment was $950.30 for a no charge ride. How does that happen?

  1. How many taxi trips happened on each day of the week? Sort your results so that the day of the week with the fewest rides is at the top of the output.
SELECT COUNT(*) AS num_transactions, 
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY wday
ORDER BY num_transactions;
8 records
num_transactions wday
1 NA
1887838 Tuesday
1984031 Wednesday
2088305 Thursday
2125292 Friday
2215255 Monday
2364039 Sunday
2763367 Saturday

Unsurprisingly, the most rides were on the weekends. Monday is a close third, possibly a reflection of how hard it is to get out of bed on Monday mornings.

  1. Which day of the week had the longest trip_distance? Sort the results to have the day of the week with the longest rides at the top of the output.
SELECT MAX(trip_distance) AS long_trip, 
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY wday
ORDER BY long_trip DESC;
8 records
long_trip wday
5005013.0 Monday
100.0 Wednesday
100.0 Friday
100.0 Sunday
97.4 Tuesday
96.1 Thursday
92.9 Saturday
NA NA

Holy cow, one of the trips is 5,005,013 miles! That can’t be right!??!

  1. How many different rate_codes are given in the dataset?
SELECT DISTINCT rate_code
FROM yellow_old
LIMIT 0, 20;
Displaying records 1 - 10
rate_code
NA
1
2
5
3
4
0
6
210
8
SELECT COUNT(*) AS num_code, rate_code
FROM yellow_old
GROUP BY rate_code
ORDER BY num_code DESC
LIMIT 0, 20;
Displaying records 1 - 10
num_code rate_code
15088481 1
268161 2
42092 5
22935 3
4676 4
1579 0
179 6
14 210
4 8
4 7

There are 11 different rate codes. Rate codes are different because the taxi charges you a different rate if, for example, you go to JFK or go very far away.

The second solution not only gave rate codes, but also the number of transactions for each rate code.

  1. Start with the following lines of query:
SELECT AVG(trip_distance) AS avg_trip, 
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
  • Can you use WHERE to subset only Saturday and Sunday to find the average trip distance across the weekend? Why or why not?
  • Can you use HAVING to subset only Saturday and Sunday to find the average trip distance across the weekend? Why or why not?
SELECT AVG(trip_distance) AS avg_trip, 
       DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
GROUP BY wday
HAVING wday = "Saturday" | wday = "Sunday"
LIMIT 0,8;
0 records
avg_trip wday

Neither WHERE nor HAVING can be used with this query to find the average trip distance across the weekend days.

WHERE cannot be used because wday is not in the original dataset. WHERE works on the original data.

HAVING cannot be used because the average has already been taken across the groups, so there isn’t any way to go back and re-calculate the average across Saturday and Sunday. HAVING works on the results set.