con_college <- DBI::dbConnect(duckdb::duckdb(),
dbdir = "you_name_the_db")Lab 5 - creating databases
Today’s lab will provide practice working with creating SQL databases using DuckDB.
The goals for lab 5 include:
- copying data into a local folder.
- identifying keys for identifying unique records.
- practice with the loading steps:
USE,CREATE TABLE, andCOPY. - running SQL queries on the existing data to confirm correct loading.
CREATE INDEXfor a variable of interest.
Advice for turning in the assignment
Be sure to indicate (in the .qmd file) which problem is being answered with which code. A sentence or two with each response goes a long way toward your understanding!
save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called
SDS261or something. That folder could live on your Desktop. Or maybe in your Dropbox.The .qmd document should be saved in the R Project as
lab5-sds261-yourlastname-yourfirstname.qmd.Set up a connection to a database of your naming using DuckDB.
Assignment
The data we will work with today is on college tuition, pay, and diversity. The original data source is the US Department of Education, but it has been compiled as part of TidyTuesday on March 10, 2020. Tuition and fees are for 2018-19. Diversity is for 2014.
- Download the data onto your own computer. In the interest of time, let’s only use three tables:
tuition_costandsalary_potentialanddiversity_school.
- Follow the first two steps in the notes:
USEandDROP TABLE.
- For each of the three tables, in the
CREATE TABLEoperation, carefully define variable types and identify any key variables. Unfortunately,FOREIGN KEYs don’t work here because there is not referential integrity across the tables.
- Use
COPYto load the data from your computer. (Hint: if there is anNAvalue, SQL doesn’t know that it is actuallyNULLand wants the variable to be loaded as a character string.)
- Use
SELECT *(with aLIMIT!) to make sure all three tables loaded correctly.
- Add an
INDEXto thetuitiontable onstate. We won’t discuss until Thursday, butstateis better thanstate_codebecausestateallows you to join with other tables.stateis better thantypeordegree_lengthbecausestatehas a much higher cardinality.