Today’s lab will provide practice working with creating SQL databases using DuckDB.
The goals for lab 5 include:
copying data into a local folder.
identifying keys for identifying unique records.
practice with the loading steps: USE, CREATE TABLE, and COPY.
running SQL queries on the existing data to confirm correct loading.
CREATE INDEX for a variable of interest.
Advice for turning in the assignment
Be sure to indicate (in the .qmd file) which problem is being answered with which code. A sentence or two with each response goes a long way toward your understanding!
save the .Rproj file somewhere you can find it. Don’t keep everything in your downloads folder. Maybe make a folder called SDS261 or something. That folder could live on your Desktop. Or maybe in your Dropbox.
The .qmd document should be saved in the R Project as lab5-sds261-yourlastname-yourfirstname.qmd.
Set up a connection to a database of your naming using DuckDB.
The data we will work with today is on college tuition, pay, and diversity. The original data source is the US Department of Education, but it has been compiled as part of TidyTuesday on March 10, 2020. Tuition and fees are for 2018-19. Diversity is for 2014.
Download the data onto your own computer. In the interest of time, let’s only use three tables: tuition_cost and salary_potential and diversity_school.
Follow the first two steps in the notes: USE and DROP TABLE.
Solution
USE collegeDB;
Solution
DROPTABLEIFEXISTS salary;
Solution
DROPTABLEIFEXISTS diversity;
Solution
DROPTABLEIFEXISTS tuition;
For each of the three tables, in the CREATE TABLE operation, carefully define variable types and identify any key variables. Unfortunately, FOREIGN KEYs don’t work here because there is not referential integrity across the tables.
CREATETABLE diversity ( name VARCHAR(255), total_enrollment DOUBLE, state VARCHAR(255),categoryVARCHAR(255), enrollment DOUBLE);
Use COPY to load the data from your computer. (Hint: if there is an NA value, SQL doesn’t know that it is actually NULL and wants the variable to be loaded as a character string.)
Use SELECT * (with a LIMIT!) to make sure all three tables loaded correctly.
Solution
SELECT*FROM tuition LIMIT10;
Displaying records 1 - 10
name
state
state_code
type
degree_length
room_and_board
in_state_tuition
in_state_total
out_of_state_tuition
out_of_state_total
Aaniiih Nakoda College
Montana
MT
Public
2 Year
NA
2380
2380
2380
2380
Abilene Christian University
Texas
TX
Private
4 Year
10350
34850
45200
34850
45200
Abraham Baldwin Agricultural College
Georgia
GA
Public
2 Year
8474
4128
12602
12550
21024
Academy College
Minnesota
MN
For Profit
2 Year
NA
17661
17661
17661
17661
Academy of Art University
California
CA
For Profit
4 Year
16648
27810
44458
27810
44458
Adams State University
Colorado
CO
Public
4 Year
8782
9440
18222
20456
29238
Adelphi University
New York
NY
Private
4 Year
16030
38660
54690
38660
54690
Adirondack Community College
New York
NY
Public
2 Year
11660
5375
17035
9935
21595
Adrian College
Michigan
MI
Private
4 Year
11318
37087
48405
37087
48405
Advanced Technology Institute
Virginia
VA
For Profit
2 Year
NA
13680
13680
13680
13680
Solution
SELECT*FROM salary LIMIT10;
Displaying records 1 - 10
rank
name
state_name
early_career_pay
mid_career_pay
make_world_better_percent
stem_percent
1
Auburn University
Alabama
54400
104500
51
31
2
University of Alabama in Huntsville
Alabama
57500
103900
59
45
3
The University of Alabama
Alabama
52300
97400
50
15
4
Tuskegee University
Alabama
54500
93500
61
30
5
Samford University
Alabama
48400
90500
52
3
6
Spring Hill College
Alabama
46600
89100
53
12
7
Birmingham Southern College
Alabama
49100
88300
48
27
8
University of Alabama at Birmingham
Alabama
48600
87200
57
17
9
University of South Alabama
Alabama
47700
86400
56
17
10
Alabama A&M University
Alabama
48700
83500
58
20
Solution
SELECT*FROM diversity LIMIT10;
Displaying records 1 - 10
name
total_enrollment
state
category
enrollment
University of Phoenix-Arizona
195059
Arizona
Women
134722
University of Phoenix-Arizona
195059
Arizona
American Indian / Alaska Native
876
University of Phoenix-Arizona
195059
Arizona
Asian
1959
University of Phoenix-Arizona
195059
Arizona
Black
31455
University of Phoenix-Arizona
195059
Arizona
Hispanic
13984
University of Phoenix-Arizona
195059
Arizona
Native Hawaiian / Pacific Islander
1019
University of Phoenix-Arizona
195059
Arizona
White
58209
University of Phoenix-Arizona
195059
Arizona
Two Or More Races
19039
University of Phoenix-Arizona
195059
Arizona
Unknown
65163
University of Phoenix-Arizona
195059
Arizona
Non-Resident Foreign
3355
Add an INDEX to the tuition table on state. We won’t discuss until Thursday, but state is better than state_code because state allows you to join with other tables. state is better than type or degree_length because state has a much higher cardinality.