<- DBI::dbConnect(duckdb::duckdb(),
con_college dbdir = "you_name_the_db")
Lab 5 - creating databases
Today’s lab will provide practice working with creating SQL databases using DuckDB.
The goals for lab 5 include:
- copying data into a local folder.
- identifying keys for identifying unique records.
- practice with the loading steps:
USE
,CREATE TABLE
, andCOPY
. - running SQL queries on the existing data to confirm correct loading.
CREATE INDEX
for a variable of interest.
Advice for turning in the assignment
Be sure to indicate (in the .qmd file) which problem is being answered with which code. A sentence or two with each response goes a long way toward your understanding!
save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called
SDS261
or something. That folder could live on your Desktop. Or maybe in your Dropbox.The .qmd document should be saved in the R Project as
lab5-sds261-yourlastname-yourfirstname.qmd
.Set up a connection to a database of your naming using DuckDB.
Assignment
The data we will work with today is on college tuition, pay, and diversity. The original data source is the US Department of Education, but it has been compiled as part of TidyTuesday on March 10, 2020. Tuition and fees are for 2018-19. Diversity is for 2014.
- Download the data onto your own computer. In the interest of time, let’s only use three tables:
tuition_cost
andsalary_potential
anddiversity_school
.
- Follow the first two steps in the notes:
USE
andDROP TABLE
.
- For each of the three tables, in the
CREATE TABLE
operation, carefully define variable types and identify any key variables. Unfortunately,FOREIGN KEY
s don’t work here because there is not referential integrity across the tables.
- Use
COPY
to load the data from your computer. (Hint: if there is anNA
value, SQL doesn’t know that it is actuallyNULL
and wants the variable to be loaded as a character string.)
- Use
SELECT *
(with aLIMIT
!) to make sure all three tables loaded correctly.
- Add an
INDEX
to thetuition
table onstate
. We won’t discuss until Thursday, butstate
is better thanstate_code
becausestate
allows you to join with other tables.state
is better thantype
ordegree_length
becausestate
has a much higher cardinality.