Lab 3 - SQL joins

Author

your name goes here

Published

Invalid Date

library(tidyverse)
library(DBI)
library(RMariaDB)
library(dbplyr)
library(mdsr)

Today’s lab will provide practice working with SQL clauses in DBeaver. Don’t forget the importance of the order of the SQL clauses.

The goals for lab 3 include:

Advice for turning in the assignment

  • Be sure to indicate (in the .sql file) which problem is being answered with which SQL code. Use the following syntax to comment within a .sql file: /* here is where comments go */. Indeed, feel free to copy the question into the .sql file so that you have it for your own records.

  • save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called SDS261 or something. That folder could live on your Desktop. Or maybe in your Dropbox.

  • The SQL document should be saved in the R Project as lab3-sds261-yourlastname-yourfirstname.sql. You will have to navigate to the R Project to save the DBeaver file in the correct place.

  • Connect to the yelp database, which contains the businesses, reviews, and users tables. See README file for connection details.

SHOW TABLES;
3 records
Tables_in_yelp
businesses
reviews
users

Assignment

  1. For each of the three tables in the yelp database, identify the number of records and the variables. (That means, write at least a sentence for each table.)
SELECT * FROM businesses LIMIT 0, 5;
5 records
id name stars open review_count city longitude latitude
rncjoVoEFUJGCUoC1JgnUA Peoria Income Tax Service 5 TRUE 3 Peoria -112.2416 33.58187
0FNFSzCFP_rGUoJx8W7tJg Bike Doctor 5 TRUE 5 Phoenix -112.1059 33.60405
3f_lyB6vFK48ukH6ScvLHg Valley Permaculture Alliance 5 TRUE 4 Phoenix -112.0739 33.46053
usAsSV36QmUej8–yvN-dg Food City 4 TRUE 5 Phoenix -112.0854 33.39221
PzOqRohWw7F7YEPBz6AubA Hot Bagels & Deli 4 TRUE 14 Glendale Az -112.2003 33.71280
SELECT * FROM reviews LIMIT 0, 5;
5 records
business_id user_id stars
9yKzy9PApeiPPOUJEtnvkg rLtl8ZkDX5vH5nAx9C3q5Q 5
ZRJwVLyzEJq1VAihDhYiow 0a2KyEL0d3Yb1V6aivbIuQ 5
6oRAC4uyJCsJl1X0WZpVSA 0hT2KtfLiobPvh6cDC8JQg 4
_1QQZuf4zZOyFCvXc0o6Vg uZetl9T0NcROGOyFfughhg 5
6ozycU1RpktNG2-1BroVtw vYmM4KTsC8ZfQBg-j5MWkw 5
SELECT * FROM users LIMIT 0, 5;
5 records
user_id name average_stars review_count
CR2y7yEm4X035ZMzrTtN9Q Jim 5 6
_9GXoHhdxc30ujPaQwh6Ew Kelle 1 2
8mM-nqxjg6pT04kwcjMbsw Stephanie 5 2
Ch6CdTR2IVaVANr-RglMOg T 5 2
NZrLmHRyiHmyT1JrfzkCOA Beth 1 1
SELECT COUNT(*) FROM businesses;
1 records
COUNT(*)
11537
SELECT COUNT(*) FROM reviews;
1 records
COUNT(*)
229907
SELECT COUNT(*) FROM users;
1 records
COUNT(*)
43873

businesses: 11,537 records with variables of id, name, stars, open, review_count, city, longitude, latitude

reviews: 229,907 records with variables of business_id, user_id, stars

users: 43,873 records with variables of user_id, name, average_stars, review_count

  1. Find the user with the most reviews. What is the person’s name, and how many reviews did they make?
SELECT user_id, name, average_stars, review_count
FROM users
ORDER BY review_count DESC
LIMIT 0, 5;
5 records
user_id name average_stars review_count
AIVQg9enGug5woxehjmlGg Kim 3.77 5807
6HBnx7fTfFlpWyez_P55xA Karen 3.63 2848
7FuLnS_-b79GG-33mwLaMg Andrew 3.64 2810
lxZSVeJz6KEBW1nlA3JKJg Shiho 3.84 2760
qbfQRHLvZk5WSkKY0l_lMw Stephy 3.87 2587

Kim has the most reviews (5,807 reviews!) with an average rating of 3.77 stars.

  1. What cities are represented in the businesses table? Find out by querying the number of businesses per city. Use LIMIT to start, just in case there are hundreds of cities. Write down a few of the city names. Do you notice anything interesting?
SELECT COUNT(*), city
FROM businesses
GROUP BY city
ORDER BY city
LIMIT 0, 100;
Displaying records 1 - 10
COUNT(*) city
4 Ahwatukee
34 Anthem
46 Apache Junction
129 Avondale
31 Buckeye
20 Carefree
48 Casa Grande
65 Cave Creek
865 Chandler
1 Charleston

There appear to be cities in Phoenix as well as in Pheonix (??). And a few of the city names include the AZ to indicate that they are in Arizona. Before really working with the data, we would want to clean up the city names.

  1. Find the businesses in “Carefree” (a city in Arizona). Write down the names of a few of the businesses.
SELECT id, name, stars, open, review_count, city
FROM businesses
WHERE city = "Carefree"
LIMIT 0, 50;
Displaying records 1 - 10
id name stars open review_count city
K0v87iFgB3sXSzw6e-BJ5w AZ Wine Company 5 TRUE 8 Carefree
nZ0mUQEdez-CHrykD2nNLA Carefree Resort & Conference Center 3 TRUE 30 Carefree
c4FLMLP7hMnAbSiW2o_o3A Saguaro Grille 3 FALSE 5 Carefree
QygMZmPO8A6OaJYO2a0pnw Studio C Hair Salon 5 TRUE 3 Carefree
Z85BGr-jnEWvoI4xAnoh3g Lowe’s Home Improvement Warehouse of Phoenix 4 TRUE 6 Carefree
SahvCnFp3OvUVZIAYCyckw 34 Easy St 4 FALSE 3 Carefree
jdStuaC_1leN_DNGcQ0yEw Latilla 5 FALSE 5 Carefree
AwUMl0PT3mdkPS7huzEUYQ Balloon Festival 1 TRUE 4 Carefree
MHPXDXVE_cUYodfPhkadeQ Black Mountain Coffee Shop 4 TRUE 11 Carefree
OvlAlAkiyyCILkq-TylBWg The Sundial Cafe 3 TRUE 7 Carefree
  1. Count the number of records resulting when the query connects the users with the reviews using a JOIN. How many records are there? What does that tell you about the two tables?
SELECT COUNT(*)
FROM users
JOIN reviews ON users.user_id = reviews.user_id
LIMIT 0, 100;
1 records
COUNT(*)
215879

Because there are fewer records in the join, we know that some reviews were done by users that are not in the users table.

  1. How many users wrote reviews in the reviews table but do not exist in the users table. Hint: use a RIGHT JOIN to keep all the reviews, and then look for the rows where the user_id from the users table IS NULL.
SELECT COUNT(DISTINCT reviews.user_id)
FROM users
RIGHT JOIN reviews ON users.user_id = reviews.user_id
WHERE users.user_id IS NULL;
1 records
COUNT(DISTINCT reviews.user_id)
2108

There are 2,108 users who wrote 14,028 reviews and who are not in the users table.

  1. Write a query to ask a question that you think is interesting and uses at least one join. Provide both the question (in words) as well as the SQL code and results.