syllabus
Data Science, the SQL
SDS 261, January 2024
Class: daily 10-11:30am
Lab: daily (not Fridays) 1:30-3pm
Office hours: daily 11:30am-1:30pm
The course
Data Science, the SQL is a continuation of ideas learned in Foundations of Data Science. The course develops abilities for using SQL databases within the data science pipeline. The core of the course will focus on the why and the how of writing SELECT queries in SQL. Additional topics will include subqueries, indexes, keys, and regular expressions. Students will learn how to run SQL queries both from the RStudio IDE and from a relational database management system client like MySQL Workbench or DuckDB.
Student Learning Outcomes.
By the end of the term, students will:
Database Concepts: be able to explain basic database concepts such as tables, records, fields, and relationships.
Introduction to SQL: gain a fundamental understanding of Structured Query Language (SQL), including its history, purpose, and key components.
SQL Querying:
- Writing SQL Queries: learn how to write basic SQL queries to retrieve data from a single table.
- Filtering and Sorting Data: be able to use SQL to filter and sort data based on specific criteria.
- Joining Tables: understand how to perform inner and outer joins to combine data from multiple tables.
Creating Tables: be able to create a SQL database with multiple tables that link to one another using DuckDB.
Inserting and Updating Data: be able to use SQL to insert new records into a table, update existing records, and delete records from a table.
Basics of Regular Expressions: understand the fundamental concepts of regular expressions. Identify and use basic metacharacters for pattern matching to write simple regular expressions for text search and matching.
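As a rough preview of what these outcomes look like in practice, here is a minimal sketch. It is not course material: it assumes Python's built-in sqlite3 module as a stand-in for DuckDB or MySQL (the course itself runs SQL from R), and the tables, column names, and values are hypothetical.

```python
import re
import sqlite3

# In-memory database; sqlite3 is only a stand-in for DuckDB or MySQL here.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Creating tables: two hypothetical tables linked through a key.
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute("""
    CREATE TABLE labs (
        lab_id     INTEGER PRIMARY KEY,
        student_id INTEGER REFERENCES students(id),
        score      REAL
    )
""")

# Inserting records (made-up data).
cur.executemany("INSERT INTO students (id, name) VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace"), (3, "Edgar")])
cur.executemany("INSERT INTO labs (lab_id, student_id, score) VALUES (?, ?, ?)",
                [(1, 1, 9.0), (2, 1, 7.5), (3, 2, 8.0)])

# Updating and deleting records.
cur.execute("UPDATE labs SET score = 8.5 WHERE lab_id = 2")
cur.execute("DELETE FROM labs WHERE lab_id = 3")

# Filtering and sorting data from a single table.
high_scores = cur.execute(
    "SELECT lab_id, score FROM labs WHERE score > 8 ORDER BY score DESC"
).fetchall()

# Joining tables: a LEFT JOIN keeps students who have no labs.
lab_counts = cur.execute("""
    SELECT s.name, COUNT(l.lab_id) AS n_labs
    FROM students AS s
    LEFT JOIN labs AS l ON l.student_id = s.id
    GROUP BY s.id, s.name
    ORDER BY s.name
""").fetchall()

# Basic regular expression: ^ anchors the match at the start of the string,
# and [AEIOU] matches any one of the listed characters.
vowel_start = re.compile(r"^[AEIOU]")
vowel_names = [name for name, _ in lab_counts if vowel_start.search(name)]

print(high_scores)
print(lab_counts)
print(vowel_names)
con.close()
```

The LEFT JOIN is used rather than an inner join so that students with zero labs still appear in the result, which is the kind of choice the joining outcome asks you to reason about.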
Inclusion Goals [1]
In an ideal world, science would be objective. However, much of science is subjective and is historically built on a small subset of privileged voices. In this class, we will make an effort to recognize how science (and data science!) has played a role both in understanding diversity and in promoting systems of power and privilege. I acknowledge that there may be both overt and covert biases in the material due to the lens with which it was written, even though the material is primarily of a scientific nature. Integrating a diverse set of experiences is important for a more comprehensive understanding of science. I would like to discuss issues of diversity in statistics as part of the course from time to time.
Please contact me if you have any suggestions to improve the quality of the course materials.
Furthermore, I would like to create a learning environment for my students that supports a diversity of thoughts, perspectives, and experiences, and honors your identities (including race, gender, class, sexuality, religion, ability, etc.). To help accomplish this:
- If you have a name and/or set of pronouns that differ from those that appear in your official records, please let me know!
- If you feel like your performance in the class is being impacted by your experiences outside of class, please don’t hesitate to come and talk with me. You can also relay information to me via your mentors. I want to be a resource for you.
I (like many people) am still in the process of learning about diverse perspectives and identities. If something was said in class (by anyone) that made you feel uncomfortable, please talk to me about it. As a participant in course discussions, you should also strive to honor the diversity of your classmates.
Technical Details
Text:
Modern Data Science with R, 3rd edition by Baumer, Kaplan, and Horton.
R for Data Science, 2nd edition by Wickham, Çetinkaya-Rundel, and Grolemund.
R links:
- Enough R
- R tutorial
- Great tutorials through the Coding Club
- A true beginner’s introduction to the tidyverse, the introverse.
- for a good start to R in general
- A fantastic ggplot2 tutorial
- Google for R
- some R ideas that I wrote up
- Incredibly helpful cheatsheets from RStudio.
SQL links
- W3 schools Introduction to SQL
- W3 schools SQL Exercises, Practice, Solution
- R packages for working with databases
- Introduction to dbplyr
Regular expression links
- stringr vignette
- stringr package
- Jenny Bryan et al.’s STAT 545 notes
- Hadley Wickham’s book R for Data Science
- regexpal
- RegExr
- RegexOne
Using R (through the RStudio IDE)
R will be used for many assignments. You can use R on the Smith server: https://rstudio.smith.edu/.
Alternatively, feel free to download both R and RStudio onto your own computer. R is freely available at http://www.r-project.org/. RStudio is also free, available at http://rstudio.org/, and lets you prepare all R assignments using Quarto.
GitHub
Assignments will be turned in using GitHub. See instructions for using GitHub on the course website.
Important Features
Prerequisites:
The prerequisite for this class is SDS 192, Introduction to Data Science.
Labs:
Labs will take place on most days with the lab write-up due just before the following class period. See instructions for using GitHub on the course website for how to turn in assignments.
Grading:
The class expectations are that you show up for class and labs and turn in a final project. A successful final project is required to pass the class. Additionally, you should not miss more than 1 or 2 classes nor should you miss turning in more than 1 or 2 labs.
Footnotes
1. Adapted from Monica Linden, Brown University