Today’s lab will provide practice working with SQL clauses in DBeaver. Don’t forget the importance of the order of the SQL clauses.
The goals for lab 2 include:
working with clauses in SQL.
understanding each of SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT.
practicing running SQL queries in a SQL client, DBeaver.
Advice for turning in the assignment
Be sure to indicate (in the .sql file) which problem is being answered with which SQL code. Use the following syntax to comment within a .sql file: /* here is where comments go */. Indeed, feel free to copy the question into the .sql file so that you have it for your own records.
save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called SDS261 or something. That folder could live on your Desktop. Or maybe in your Dropbox.
The SQL document should be saved in the R Project as lab2-sds261-yourlastname-yourfirstname.sql. You will have to navigate to the R Project to save the DBeaver file in the correct place.
Connect to the nyctaxi database, which contains the yellow_old table. See README file (or lab 1) for connection details.
Assignment
Select the trip_distance and total_amount columns from the yellow_old table. For safety (i.e., not crashing your computer), only return the first 5 rows.
The highest total amount was a cash payment for $1007.51. Yikes, that’s a big bill for a taxi ride. Note that the second highest payment was $950.30 for a no charge ride. How does that happen?
How many taxi trips happened on each day of the week? Sort your results so that the day of the week with the fewest rides is at the top of the output.
Solution
SELECTCOUNT(*) AS num_transactions, DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wdayFROM yellow_oldGROUPBY wdayORDERBY num_transactions;
8 records
num_transactions
wday
1
NA
1887838
Tuesday
1984031
Wednesday
2088305
Thursday
2125292
Friday
2215255
Monday
2364039
Sunday
2763367
Saturday
Solution
Unsurprisingly, the most rides were on the weekends. Monday is a close third, possibly a reflection of how hard it is to get out of bed on Monday mornings.
Which day of the week had the longest trip_distance? Sort the results to have the day of the week with the longest rides at the top of the output.
Solution
SELECTMAX(trip_distance) AS long_trip, DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wdayFROM yellow_oldGROUPBY wdayORDERBY long_trip DESC;
8 records
long_trip
wday
5005013.0
Monday
100.0
Wednesday
100.0
Friday
100.0
Sunday
97.4
Tuesday
96.1
Thursday
92.9
Saturday
NA
NA
Solution
Holy cow, one of the trips is 5,005,013 miles! That can’t be right!??!
How many different rate_codes are given in the dataset?
SELECTCOUNT(*) AS num_code, rate_codeFROM yellow_oldGROUPBY rate_codeORDERBY num_code DESCLIMIT0, 20;
Displaying records 1 - 10
num_code
rate_code
15088481
1
268161
2
42092
5
22935
3
4676
4
1579
0
179
6
14
210
4
8
4
7
Solution
There are 11 different rate codes. Rate codes are different because the taxi charges you a different rate if, for example, you go to JFK or go very far away.
The second solution not only gave rate codes, but also the number of transactions for each rate code.
Start with the following lines of query:
SELECTAVG(trip_distance) AS avg_trip, DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
Can you use WHERE to subset only Saturday and Sunday to find the average trip distance across the weekend? Why or why not?
Can you use HAVING to subset only Saturday and Sunday to find the average trip distance across the weekend? Why or why not?
Solution
SELECTAVG(trip_distance) AS avg_trip, DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wdayFROM yellow_oldGROUPBY wdayHAVING wday ="Saturday" | wday ="Sunday"LIMIT0,8;
0 records
avg_trip
wday
Solution
Neither WHERE nor HAVING can be used with this query to find the average trip distance across the weekend days.
WHERE cannot be used because wday is not in the original dataset. WHERE works on the original data.
HAVING cannot be used because the average has already been taken across the groups, so there isn’t any way to go back and re-calculate the average across Saturday and Sunday. HAVING works on the results set.