2024-01-16
DuckDB
in-process database management system that runs entirely on your own computer.
the data live in your storage (instead of your memory).
you don’t have to transfer queries or results over the internet.
the SQL dialect used in DuckDB is slightly different from MySQL
write SELECT * FROM table 10; instead of SELECT * FROM table 0, 10;
lots of different dialects, depending on the SQL server. Always be aware of the dialect you are using.
the database has been stored to a database directory called duck_datab which lives in the current R project.
can’t open it like a standard folder, but it is where DuckDB stores the database files.
Unlike R, when creating a new data table, SQL requires that you communicate each future variable (column) and that variable’s type. Variable types are not automatically generated!
Figure 1: image credit: NBC
Consider the Saturday Night Live datasets available on the snldb GitHub repo.
casts tableUse R to understand the data from casts.csv.
Rows: 614
Columns: 8
$ aid <chr> "A. Whitney Brown", "A. Whitney Brown", "A. Whitney Br…
$ sid <dbl> 11, 12, 13, 14, 15, 16, 5, 39, 40, 41, 42, 45, 46, 21,…
$ featured <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ first_epid <dbl> 19860222, NA, NA, NA, NA, NA, 19800409, 20140118, NA, …
$ last_epid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ update_anchor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ n_episodes <dbl> 8, 20, 13, 20, 20, 20, 5, 11, 21, 21, 21, 18, 17, 20, …
$ season_fraction <dbl> 0.444, 1.000, 1.000, 1.000, 1.000, 1.000, 0.250, 0.524…
INTEGER, SMALLINT, NUMERIC, DECIMAL, DOUBLE(precision, scale) precision = # sig digits, scale = # digits the follow decimal.CHAR, VARCHAR, BINARY, TEXTDATE, TIME, DATETIME, TIMESTAMP, YEARSMALLINT(1) in MySQL, BOOLEAN in DuckDBCHECK constraintsViolation will result in an error.
Each library (
database) has books (tables). Each book (table) has pages (rows). Each page (row) has a unique page number to identify it (keyvalue); to find a particular page, you sort through the page numbers (keyvalues). But it isn’t immediately obvious where the particular page of interest is, you might have to page through the book a little bit to find the page of interest. It would be easier if you had several bookmarks throughout the book to anchor some of the page numbers. For example, if you want page 1047 and you have a bookmark on page 1050, you only have to turn back three pages. The bookmark is anindex, it helps you find the desired rows much more quickly.1
Keys are unique identifiers for each row, used primarily for connecting tables. Keys are generally not helpful for efficiency, but they are important for data integrity and relationships between tables.
PRIMARY KEY is a column or set of columns that uniquely identify each row. Primary keys cannot be NULL.FOREIGN KEY is a column or set of columns that reference a primary key in a different table. A foreign key can be NULL.KEYsEither or both of the KEYs could be multiple columns.
INDEXesIndexes can be created on one or more variable. A table does not need to have an INDEX (or a KEY).
Importing .csv files as tables, a series of steps:1
USE statement that ensures we are in the right schema/database.DROP TABLE statements that drop any old tables with the same names as the ones we are going to create.CREATE TABLE statements that specify the table structures.COPY statements that read the data from the .csv files into the appropriate tables.USEUse the (local) database that we’ve called duck_datab.
Make sure to “refresh” the table, in case it already exists. However, be very careful with the DROP TABLE statement, as it removes the casts table.
CREATE TABLECarefully define the variable types, whether or not they allow missing values, and what a default value is for that variable. Additionally, identify the key for accessing information.
CREATE TABLE casts (
aid VARCHAR(255) NOT NULL DEFAULT '',
sid INTEGER NOT NULL DEFAULT 0,
featured BOOLEAN NOT NULL DEFAULT 'false',
first_epid INTEGER DEFAULT 0,
last_epid INTEGER DEFAULT 0,
update_anchor BOOLEAN NOT NULL DEFAULT 0,
n_episodes INTEGER NOT NULL DEFAULT 0,
season_fraction DECIMAL(21,20) NOT NULL DEFAULT 0,
PRIMARY KEY (sid, aid)
);COPYThe .csv file lives on my computer, so I load it in directly. [n.b., the statement to load in data is different in MySQL.]
SELECT| aid | sid | featured | first_epid | last_epid | update_anchor | n_episodes | season_fraction |
|---|---|---|---|---|---|---|---|
| A. Whitney Brown | 11 | TRUE | 19860222 | FALSE | 8 | 0.444 | |
| A. Whitney Brown | 12 | TRUE | FALSE | 20 | 1.000 | ||
| A. Whitney Brown | 13 | TRUE | FALSE | 13 | 1.000 | ||
| A. Whitney Brown | 14 | TRUE | FALSE | 20 | 1.000 | ||
| A. Whitney Brown | 15 | TRUE | FALSE | 20 | 1.000 | ||
| A. Whitney Brown | 16 | TRUE | FALSE | 20 | 1.000 | ||
| Alan Zweibel | 5 | TRUE | 19800409 | FALSE | 5 | 0.250 | |
| Sasheer Zamata | 39 | TRUE | 20140118 | FALSE | 11 | 0.524 |
Let’s make sure that the database exists and that the table in the database exists.
Let’s make sure that the database exists and that the table in the database exists.
Let’s make sure that the database exists and that the table in the database exists.
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| aid | VARCHAR | NO | PRI | '' | |
| sid | INTEGER | NO | PRI | 0 | |
| featured | BOOLEAN | NO | 'false' | ||
| first_epid | INTEGER | YES | 0 | ||
| last_epid | INTEGER | YES | 0 | ||
| update_anchor | BOOLEAN | NO | 0 | ||
| n_episodes | INTEGER | NO | 0 | ||
| season_fraction | DECIMAL(21,20) | NO | 0 |
It is always a good idea to terminate the SQL connection when you are done with it.