Lab 5 - creating databases

Author

your name goes here

Published

Invalid Date

library(tidyverse)
library(DBI)
library(RMariaDB)
library(dbplyr)
library(mdsr)

Today’s lab will provide practice working with creating SQL databases using DuckDB.

The goals for lab 5 include:

Advice for turning in the assignment

  • Be sure to indicate (in the .qmd file) which problem is being answered with which code. A sentence or two with each response goes a long way toward your understanding!

  • save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called SDS261 or something. That folder could live on your Desktop. Or maybe in your Dropbox.

  • The .qmd document should be saved in the R Project as lab5-sds261-yourlastname-yourfirstname.qmd.

  • Set up a connection to a database of your naming using DuckDB.

con_college <- DBI::dbConnect(duckdb::duckdb(),
                             dbdir = "you_name_the_db")

Assignment

con_college <- DBI::dbConnect(duckdb::duckdb(),
                             dbdir = "collegeDB")

The data we will work with today is on college tuition, pay, and diversity. The original data source is the US Department of Education, but it has been compiled as part of TidyTuesday on March 10, 2020. Tuition and fees are for 2018-19. Diversity is for 2014.

  1. Download the data onto your own computer. In the interest of time, let’s only use three tables: tuition_cost and salary_potential and diversity_school.
tuition_cost <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_cost.csv')

salary_potential <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/salary_potential.csv')

diversity_school <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/diversity_school.csv')
  1. Follow the first two steps in the notes: USE and DROP TABLE.
USE collegeDB;
DROP TABLE IF EXISTS salary;
DROP TABLE IF EXISTS diversity;
DROP TABLE IF EXISTS tuition;
  1. For each of the three tables, in the CREATE TABLE operation, carefully define variable types and identify any key variables. Unfortunately, FOREIGN KEYs don’t work here because there is not referential integrity across the tables.
CREATE TABLE tuition (
  name VARCHAR(255) NOT NULL DEFAULT '',
  state VARCHAR(255),
  state_code VARCHAR(255),
  type VARCHAR(255),
  degree_length VARCHAR(255),
  room_and_board VARCHAR(255),
  in_state_tuition DOUBLE,
  in_state_total DOUBLE,
  out_of_state_tuition DOUBLE,
  out_of_state_total DOUBLE,
  PRIMARY KEY (name, state)
);
CREATE TABLE salary (
  rank DOUBLE,
  name VARCHAR(255),
  state_name VARCHAR(255),
  early_career_pay DOUBLE,
  mid_career_pay DOUBLE,
  make_world_better_percent VARCHAR(255),
  stem_percent DOUBLE
);
CREATE TABLE diversity (
  name VARCHAR(255),
  total_enrollment DOUBLE,
  state VARCHAR(255),
  category VARCHAR(255),
  enrollment DOUBLE
);
  1. Use COPY to load the data from your computer. (Hint: if there is an NA value, SQL doesn’t know that it is actually NULL and wants the variable to be loaded as a character string.)
COPY tuition FROM 'data/tuition_cost.csv' HEADER;
COPY salary FROM 'data/salary_potential.csv' HEADER;
COPY diversity FROM 'data/diversity_school.csv' HEADER;
  1. Use SELECT * (with a LIMIT!) to make sure all three tables loaded correctly.
SELECT * FROM tuition LIMIT 10;
Displaying records 1 - 10
name state state_code type degree_length room_and_board in_state_tuition in_state_total out_of_state_tuition out_of_state_total
Aaniiih Nakoda College Montana MT Public 2 Year NA 2380 2380 2380 2380
Abilene Christian University Texas TX Private 4 Year 10350 34850 45200 34850 45200
Abraham Baldwin Agricultural College Georgia GA Public 2 Year 8474 4128 12602 12550 21024
Academy College Minnesota MN For Profit 2 Year NA 17661 17661 17661 17661
Academy of Art University California CA For Profit 4 Year 16648 27810 44458 27810 44458
Adams State University Colorado CO Public 4 Year 8782 9440 18222 20456 29238
Adelphi University New York NY Private 4 Year 16030 38660 54690 38660 54690
Adirondack Community College New York NY Public 2 Year 11660 5375 17035 9935 21595
Adrian College Michigan MI Private 4 Year 11318 37087 48405 37087 48405
Advanced Technology Institute Virginia VA For Profit 2 Year NA 13680 13680 13680 13680
SELECT * FROM salary LIMIT 10;
Displaying records 1 - 10
rank name state_name early_career_pay mid_career_pay make_world_better_percent stem_percent
1 Auburn University Alabama 54400 104500 51 31
2 University of Alabama in Huntsville Alabama 57500 103900 59 45
3 The University of Alabama Alabama 52300 97400 50 15
4 Tuskegee University Alabama 54500 93500 61 30
5 Samford University Alabama 48400 90500 52 3
6 Spring Hill College Alabama 46600 89100 53 12
7 Birmingham Southern College Alabama 49100 88300 48 27
8 University of Alabama at Birmingham Alabama 48600 87200 57 17
9 University of South Alabama Alabama 47700 86400 56 17
10 Alabama A&M University Alabama 48700 83500 58 20
SELECT * FROM diversity LIMIT 10;
Displaying records 1 - 10
name total_enrollment state category enrollment
University of Phoenix-Arizona 195059 Arizona Women 134722
University of Phoenix-Arizona 195059 Arizona American Indian / Alaska Native 876
University of Phoenix-Arizona 195059 Arizona Asian 1959
University of Phoenix-Arizona 195059 Arizona Black 31455
University of Phoenix-Arizona 195059 Arizona Hispanic 13984
University of Phoenix-Arizona 195059 Arizona Native Hawaiian / Pacific Islander 1019
University of Phoenix-Arizona 195059 Arizona White 58209
University of Phoenix-Arizona 195059 Arizona Two Or More Races 19039
University of Phoenix-Arizona 195059 Arizona Unknown 65163
University of Phoenix-Arizona 195059 Arizona Non-Resident Foreign 3355
  1. Add an INDEX to the tuition table on state. We won’t discuss until Thursday, but state is better than state_code because state allows you to join with other tables. state is better than type or degree_length because state has a much higher cardinality.
CREATE INDEX state_idx ON tuition (state);

Disconnect from the server