2024-01-16
DuckDB
in-process database management system that runs entirely on your own computer.
the data live in your storage (instead of your memory).
you don’t have to transfer queries or results over the internet.
the SQL dialect used in DuckDB is slightly different from MySQL
write SELECT * FROM table 10;
instead of SELECT * FROM table 0, 10;
lots of different dialects, depending on the SQL server. Always be aware of the dialect you are using.
the database has been stored to a database directory called duck_datab
which lives in the current R project.
can’t open it like a standard folder, but it is where DuckDB stores the database files.
Unlike R, when creating a new data table, SQL requires that you communicate each future variable (column) and that variable’s type. Variable types are not automatically generated!
Consider the Saturday Night Live datasets available on the snldb GitHub repo.
casts
tableUse R to understand the data from casts.csv
.
Rows: 614
Columns: 8
$ aid <chr> "A. Whitney Brown", "A. Whitney Brown", "A. Whitney Br…
$ sid <dbl> 11, 12, 13, 14, 15, 16, 5, 39, 40, 41, 42, 45, 46, 21,…
$ featured <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ first_epid <dbl> 19860222, NA, NA, NA, NA, NA, 19800409, 20140118, NA, …
$ last_epid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ update_anchor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ n_episodes <dbl> 8, 20, 13, 20, 20, 20, 5, 11, 21, 21, 21, 18, 17, 20, …
$ season_fraction <dbl> 0.444, 1.000, 1.000, 1.000, 1.000, 1.000, 0.250, 0.524…
INTEGER
, SMALLINT
, NUMERIC
, DECIMAL
, DOUBLE(precision, scale)
precision = # sig digits, scale = # digits the follow decimal.CHAR
, VARCHAR
, BINARY
, TEXT
DATE
, TIME
, DATETIME
, TIMESTAMP
, YEAR
SMALLINT(1)
in MySQL, BOOLEAN
in DuckDBCHECK
constraintsViolation will result in an error.
Each library (
database
) has books (table
s). Each book (table
) has pages (rows). Each page (row) has a unique page number to identify it (key
value); to find a particular page, you sort through the page numbers (key
values). But it isn’t immediately obvious where the particular page of interest is, you might have to page through the book a little bit to find the page of interest. It would be easier if you had several bookmarks throughout the book to anchor some of the page numbers. For example, if you want page 1047 and you have a bookmark on page 1050, you only have to turn back three pages. The bookmark is anindex
, it helps you find the desired rows much more quickly.1
Keys are unique identifiers for each row, used primarily for connecting tables. Keys are generally not helpful for efficiency, but they are important for data integrity and relationships between tables.
PRIMARY KEY
is a column or set of columns that uniquely identify each row. Primary keys cannot be NULL
.FOREIGN KEY
is a column or set of columns that reference a primary key in a different table. A foreign key can be NULL
.KEY
sEither or both of the KEY
s could be multiple columns.
INDEX
esIndexes can be created on one or more variable. A table does not need to have an INDEX
(or a KEY
).
Importing .csv files as tables, a series of steps:1
USE
statement that ensures we are in the right schema/database.DROP TABLE
statements that drop any old tables with the same names as the ones we are going to create.CREATE TABLE
statements that specify the table structures.COPY
statements that read the data from the .csv files into the appropriate tables.USE
Use the (local) database that we’ve called duck_datab
.
Make sure to “refresh” the table, in case it already exists. However, be very careful with the DROP TABLE
statement, as it removes the casts
table.
CREATE TABLE
Carefully define the variable types, whether or not they allow missing values, and what a default value is for that variable. Additionally, identify the key for accessing information.
CREATE TABLE casts (
aid VARCHAR(255) NOT NULL DEFAULT '',
sid INTEGER NOT NULL DEFAULT 0,
featured BOOLEAN NOT NULL DEFAULT 'false',
first_epid INTEGER DEFAULT 0,
last_epid INTEGER DEFAULT 0,
update_anchor BOOLEAN NOT NULL DEFAULT 0,
n_episodes INTEGER NOT NULL DEFAULT 0,
season_fraction DECIMAL(21,20) NOT NULL DEFAULT 0,
PRIMARY KEY (sid, aid)
);
COPY
The .csv file lives on my computer, so I load it in directly. [n.b., the statement to load in data is different in MySQL.]
SELECT
aid | sid | featured | first_epid | last_epid | update_anchor | n_episodes | season_fraction |
---|---|---|---|---|---|---|---|
A. Whitney Brown | 11 | TRUE | 19860222 | FALSE | 8 | 0.444 | |
A. Whitney Brown | 12 | TRUE | FALSE | 20 | 1.000 | ||
A. Whitney Brown | 13 | TRUE | FALSE | 13 | 1.000 | ||
A. Whitney Brown | 14 | TRUE | FALSE | 20 | 1.000 | ||
A. Whitney Brown | 15 | TRUE | FALSE | 20 | 1.000 | ||
A. Whitney Brown | 16 | TRUE | FALSE | 20 | 1.000 | ||
Alan Zweibel | 5 | TRUE | 19800409 | FALSE | 5 | 0.250 | |
Sasheer Zamata | 39 | TRUE | 20140118 | FALSE | 11 | 0.524 |
Let’s make sure that the database exists and that the table in the database exists.
Let’s make sure that the database exists and that the table in the database exists.
Let’s make sure that the database exists and that the table in the database exists.
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
aid | VARCHAR | NO | PRI | '' | |
sid | INTEGER | NO | PRI | 0 | |
featured | BOOLEAN | NO | 'false' | ||
first_epid | INTEGER | YES | 0 | ||
last_epid | INTEGER | YES | 0 | ||
update_anchor | BOOLEAN | NO | 0 | ||
n_episodes | INTEGER | NO | 0 | ||
season_fraction | DECIMAL(21,20) | NO | 0 |
It is always a good idea to terminate the SQL connection when you are done with it.