Lab 5 - creating databases

Author

your name goes here

Published

Invalid Date

Today’s lab will provide practice working with creating SQL databases using DuckDB.

The goals for lab 5 include:

Advice for turning in the assignment

  • Be sure to indicate (in the .qmd file) which problem is being answered with which code. A sentence or two with each response goes a long way toward your understanding!

  • save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called SDS261 or something. That folder could live on your Desktop. Or maybe in your Dropbox.

  • The .qmd document should be saved in the R Project as lab5-sds261-yourlastname-yourfirstname.qmd.

  • Set up a connection to a database of your naming using DuckDB.

con_college <- DBI::dbConnect(duckdb::duckdb(),
                             dbdir = "you_name_the_db")

Assignment

The data we will work with today is on college tuition, pay, and diversity. The original data source is the US Department of Education, but it has been compiled as part of TidyTuesday on March 10, 2020. Tuition and fees are for 2018-19. Diversity is for 2014.

  1. Download the data onto your own computer. In the interest of time, let’s only use three tables: tuition_cost and salary_potential and diversity_school.
  1. Follow the first two steps in the notes: USE and DROP TABLE.
  1. For each of the three tables, in the CREATE TABLE operation, carefully define variable types and identify any key variables. Unfortunately, FOREIGN KEYs don’t work here because there is not referential integrity across the tables.
  1. Use COPY to load the data from your computer. (Hint: if there is an NA value, SQL doesn’t know that it is actually NULL and wants the variable to be loaded as a character string.)
  1. Use SELECT * (with a LIMIT!) to make sure all three tables loaded correctly.
  1. Add an INDEX to the tuition table on state. We won’t discuss until Thursday, but state is better than state_code because state allows you to join with other tables. state is better than type or degree_length because state has a much higher cardinality.

Disconnect from the server