Creating Databases

Jo Hardin

2024-01-16

DuckDB

DuckDB

in-process database management system that runs entirely on your own computer.
the data live in your storage (instead of your memory).
you don’t have to transfer queries or results over the internet.

DuckDB caveat

the SQL dialect used in DuckDB is slightly different from MySQL
write SELECT * FROM table 10; instead of SELECT * FROM table 0, 10;
lots of different dialects, depending on the SQL server. Always be aware of the dialect you are using.

DuckDB via R

install.packages("duckdb")  # only once, in the Console, not in the .qmd or .Rmd file
library(duckdb)             # at the top of the .qmd or .Rmd file

library(DBI)                # we also still need the DBI package

con_duckdb <- DBI::dbConnect(duckdb::duckdb(),
                             dbdir = "duck_datab")

the database has been stored to a database directory called duck_datab which lives in the current R project.
can’t open it like a standard folder, but it is where DuckDB stores the database files.

Preparing to load data

Unlike R, when creating a new data table, SQL requires that you communicate each future variable (column) and that variable’s type. Variable types are not automatically generated!

Today’s example

Saturday Night Live

Figure 1: image credit: NBC

Consider the Saturday Night Live datasets available on the snldb GitHub repo.

`casts` table

Use R to understand the data from casts.csv.

glimpse(casts)

Rows: 614
Columns: 8
$ aid             <chr> "A. Whitney Brown", "A. Whitney Brown", "A. Whitney Br…
$ sid             <dbl> 11, 12, 13, 14, 15, 16, 5, 39, 40, 41, 42, 45, 46, 21,…
$ featured        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ first_epid      <dbl> 19860222, NA, NA, NA, NA, NA, 19800409, 20140118, NA, …
$ last_epid       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ update_anchor   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ n_episodes      <dbl> 8, 20, 13, 20, 20, 20, 5, 11, 21, 21, 21, 18, 17, 20, …
$ season_fraction <dbl> 0.444, 1.000, 1.000, 1.000, 1.000, 1.000, 0.250, 0.524…

Data types

Numbers: INTEGER, SMALLINT, NUMERIC, DECIMAL, DOUBLE(precision, scale) precision = # sig digits, scale = # digits the follow decimal.
String: CHAR, VARCHAR, BINARY, TEXT
Date: DATE, TIME, DATETIME, TIMESTAMP, YEAR
Boolean: SMALLINT(1) in MySQL, BOOLEAN in DuckDB

`CHECK` constraints

Violation will result in an error.

CREATE TABLE CountryListCensus (
    Id INT,
    CountryName VARCHAR(255) NOT NULL,
    CountryPopulation INT CHECK(CountryPopulation > 0),
    LastCensus DATE,
    NextCensus DATE,
    CHECK(LastCensus<NextCensus),
    PRIMARY KEY (Id)
);

Efficiencies

Each library (database) has books (tables). Each book (table) has pages (rows). Each page (row) has a unique page number to identify it (key value); to find a particular page, you sort through the page numbers (key values). But it isn’t immediately obvious where the particular page of interest is, you might have to page through the book a little bit to find the page of interest. It would be easier if you had several bookmarks throughout the book to anchor some of the page numbers. For example, if you want page 1047 and you have a bookmark on page 1050, you only have to turn back three pages. The bookmark is an index, it helps you find the desired rows much more quickly.¹

Key

Keys are unique identifiers for each row, used primarily for connecting tables. Keys are generally not helpful for efficiency, but they are important for data integrity and relationships between tables.

PRIMARY KEY is a column or set of columns that uniquely identify each row. Primary keys cannot be NULL.
FOREIGN KEY is a column or set of columns that reference a primary key in a different table. A foreign key can be NULL.

Creating `KEY`s

CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY col1,
  FOREIGN KEY col2 REFERENCES table2(table2col1)
);

Either or both of the KEYs could be multiple columns.

CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY (col1, col3),
  FOREIGN KEY (col1, col2) REFERENCES table2(table2col1, table2col4)
);

Creating `INDEX`es

Indexes can be created on one or more variable. A table does not need to have an INDEX (or a KEY).

CREATE INDEX name_of_index ON table (col1);

CREATE INDEX name_of_index ON table (col1, col2);

Loading data

Importing .csv files as tables, a series of steps:¹

a USE statement that ensures we are in the right schema/database.
a series of DROP TABLE statements that drop any old tables with the same names as the ones we are going to create.
a series of CREATE TABLE statements that specify the table structures.
a series of COPY statements that read the data from the .csv files into the appropriate tables.

Loading step 1, `USE`

Use the (local) database that we’ve called duck_datab.

```{sql}
#| connection: con_duckdb

USE duck_datab;
```

Loading step 2, refresh

Make sure to “refresh” the table, in case it already exists. However, be very careful with the DROP TABLE statement, as it removes the casts table.

DROP TABLE IF EXISTS casts;

Loading step 3, `CREATE TABLE`

Carefully define the variable types, whether or not they allow missing values, and what a default value is for that variable. Additionally, identify the key for accessing information.

CREATE TABLE casts (
  aid VARCHAR(255) NOT NULL DEFAULT '',
  sid INTEGER NOT NULL DEFAULT 0,
  featured BOOLEAN NOT NULL DEFAULT 'false',
  first_epid INTEGER DEFAULT 0,
  last_epid INTEGER DEFAULT 0,
  update_anchor BOOLEAN NOT NULL DEFAULT 0,
  n_episodes INTEGER NOT NULL DEFAULT 0,
  season_fraction DECIMAL(21,20) NOT NULL DEFAULT 0,
  PRIMARY KEY (sid, aid)
);

Loading step 4, `COPY`

The .csv file lives on my computer, so I load it in directly. [n.b., the statement to load in data is different in MySQL.]

COPY casts FROM 'data/casts.csv' HEADER;

Checking the loading, `SELECT`

SELECT * FROM casts LIMIT 8;

Table 1: After `CREATE TABLE` where variable types are set, the `COPY` command pulls the data into the table. `SELECT` shows us that the table is as expected.
aid	sid	featured	first_epid	update_anchor	n_episodes	season_fraction
A. Whitney Brown	11	TRUE	19860222	FALSE	8	0.444
A. Whitney Brown	12	TRUE		FALSE	20	1.000
A. Whitney Brown	13	TRUE		FALSE	13	1.000
A. Whitney Brown	14	TRUE		FALSE	20	1.000
A. Whitney Brown	15	TRUE		FALSE	20	1.000
A. Whitney Brown	16	TRUE		FALSE	20	1.000
Alan Zweibel	5	TRUE	19800409	FALSE	5	0.250
Sasheer Zamata	39	TRUE	20140118	FALSE	11	0.524

Check the database

Let’s make sure that the database exists and that the table in the database exists.

SHOW DATABASES;

1 records
database_name
duck_datab

Check the database

Let’s make sure that the database exists and that the table in the database exists.

SHOW TABLES;

1 records
name
casts

Check the database

Let’s make sure that the database exists and that the table in the database exists.

DESCRIBE casts;

Table 2: DESCRIBE variables in the casts table.
column_name	column_type	null	key	default
aid	VARCHAR	NO	PRI	''
sid	INTEGER	NO	PRI	0
featured	BOOLEAN	NO		'false'
first_epid	INTEGER	YES		0
last_epid	INTEGER	YES		0
update_anchor	BOOLEAN	NO		0
n_episodes	INTEGER	NO		0
season_fraction	DECIMAL(21,20)	NO		0

Best practice

It is always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_duckdb, shutdown = TRUE)

Creating Databases

DuckDB

DuckDB caveat

DuckDB via R

Preparing to load data

Today’s example

Saturday Night Live

casts table

Data types

CHECK constraints

Efficiencies

Key

Creating KEYs

Creating INDEXes

Loading data

Loading step 1, USE

Loading step 2, refresh

Loading step 3, CREATE TABLE

Loading step 4, COPY

Checking the loading, SELECT

Check the database

Check the database

Check the database

Best practice

`casts` table

`CHECK` constraints

Creating `KEY`s

Creating `INDEX`es

Loading step 1, `USE`

Loading step 3, `CREATE TABLE`

Loading step 4, `COPY`

Checking the loading, `SELECT`