Creating Databases

Jo Hardin

2024-01-16

DuckDB

DuckDB

  • in-process database management system that runs entirely on your own computer.

  • the data live in your storage (instead of your memory).

  • you don’t have to transfer queries or results over the internet.

DuckDB caveat

  • the SQL dialect used in DuckDB is slightly different from MySQL

  • write SELECT * FROM table 10; instead of SELECT * FROM table 0, 10;

  • lots of different dialects, depending on the SQL server. Always be aware of the dialect you are using.

DuckDB via R

install.packages("duckdb")  # only once, in the Console, not in the .qmd or .Rmd file
library(duckdb)             # at the top of the .qmd or .Rmd file

library(DBI)                # we also still need the DBI package
con_duckdb <- DBI::dbConnect(duckdb::duckdb(),
                             dbdir = "duck_datab")
  • the database has been stored to a database directory called duck_datab which lives in the current R project.

  • can’t open it like a standard folder, but it is where DuckDB stores the database files.

Preparing to load data

Unlike R, when creating a new data table, SQL requires that you communicate each future variable (column) and that variable’s type. Variable types are not automatically generated!

Today’s example

Saturday Night Live

Image says 'Saturday Night Live'.

Figure 1: image credit: NBC

Consider the Saturday Night Live datasets available on the snldb GitHub repo.

casts table

Use R to understand the data from casts.csv.

glimpse(casts)
Rows: 614
Columns: 8
$ aid             <chr> "A. Whitney Brown", "A. Whitney Brown", "A. Whitney Br…
$ sid             <dbl> 11, 12, 13, 14, 15, 16, 5, 39, 40, 41, 42, 45, 46, 21,…
$ featured        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ first_epid      <dbl> 19860222, NA, NA, NA, NA, NA, 19800409, 20140118, NA, …
$ last_epid       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ update_anchor   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ n_episodes      <dbl> 8, 20, 13, 20, 20, 20, 5, 11, 21, 21, 21, 18, 17, 20, …
$ season_fraction <dbl> 0.444, 1.000, 1.000, 1.000, 1.000, 1.000, 0.250, 0.524…

Data types

  • Numbers: INTEGER, SMALLINT, NUMERIC, DECIMAL, DOUBLE(precision, scale) precision = # sig digits, scale = # digits the follow decimal.
  • String: CHAR, VARCHAR, BINARY, TEXT
  • Date: DATE, TIME, DATETIME, TIMESTAMP, YEAR
  • Boolean: SMALLINT(1) in MySQL, BOOLEAN in DuckDB

CHECK constraints

Violation will result in an error.

CREATE TABLE CountryListCensus (
    Id INT,
    CountryName VARCHAR(255) NOT NULL,
    CountryPopulation INT CHECK(CountryPopulation > 0),
    LastCensus DATE,
    NextCensus DATE,
    CHECK(LastCensus<NextCensus),
    PRIMARY KEY (Id)
);

Efficiencies

Each library (database) has books (tables). Each book (table) has pages (rows). Each page (row) has a unique page number to identify it (key value); to find a particular page, you sort through the page numbers (key values). But it isn’t immediately obvious where the particular page of interest is, you might have to page through the book a little bit to find the page of interest. It would be easier if you had several bookmarks throughout the book to anchor some of the page numbers. For example, if you want page 1047 and you have a bookmark on page 1050, you only have to turn back three pages. The bookmark is an index, it helps you find the desired rows much more quickly.1

Key

Keys are unique identifiers for each row, used primarily for connecting tables. Keys are generally not helpful for efficiency, but they are important for data integrity and relationships between tables.

  • PRIMARY KEY is a column or set of columns that uniquely identify each row. Primary keys cannot be NULL.
  • FOREIGN KEY is a column or set of columns that reference a primary key in a different table. A foreign key can be NULL.

Creating KEYs

CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY col1,
  FOREIGN KEY col2 REFERENCES table2(table2col1)
);

Either or both of the KEYs could be multiple columns.

CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY (col1, col3),
  FOREIGN KEY (col1, col2) REFERENCES table2(table2col1, table2col4)
);

Creating INDEXes

Indexes can be created on one or more variable. A table does not need to have an INDEX (or a KEY).

CREATE INDEX name_of_index ON table (col1);
CREATE INDEX name_of_index ON table (col1, col2);

Loading data

Importing .csv files as tables, a series of steps:1

  1. a USE statement that ensures we are in the right schema/database.
  2. a series of DROP TABLE statements that drop any old tables with the same names as the ones we are going to create.
  3. a series of CREATE TABLE statements that specify the table structures.
  4. a series of COPY statements that read the data from the .csv files into the appropriate tables.

Loading step 1, USE

Use the (local) database that we’ve called duck_datab.

```{sql}
#| connection: con_duckdb

USE duck_datab;
```

Loading step 2, refresh

Make sure to “refresh” the table, in case it already exists. However, be very careful with the DROP TABLE statement, as it removes the casts table.

DROP TABLE IF EXISTS casts;

Loading step 3, CREATE TABLE

Carefully define the variable types, whether or not they allow missing values, and what a default value is for that variable. Additionally, identify the key for accessing information.

CREATE TABLE casts (
  aid VARCHAR(255) NOT NULL DEFAULT '',
  sid INTEGER NOT NULL DEFAULT 0,
  featured BOOLEAN NOT NULL DEFAULT 'false',
  first_epid INTEGER DEFAULT 0,
  last_epid INTEGER DEFAULT 0,
  update_anchor BOOLEAN NOT NULL DEFAULT 0,
  n_episodes INTEGER NOT NULL DEFAULT 0,
  season_fraction DECIMAL(21,20) NOT NULL DEFAULT 0,
  PRIMARY KEY (sid, aid)
);

Loading step 4, COPY

The .csv file lives on my computer, so I load it in directly. [n.b., the statement to load in data is different in MySQL.]

COPY casts FROM 'data/casts.csv' HEADER;

Checking the loading, SELECT

SELECT * FROM casts LIMIT 8;
Table 1: After CREATE TABLE where variable types are set, the COPY command pulls the data into the table. SELECT shows us that the table is as expected.
aid sid featured first_epid last_epid update_anchor n_episodes season_fraction
A. Whitney Brown 11 TRUE 19860222 FALSE 8 0.444
A. Whitney Brown 12 TRUE FALSE 20 1.000
A. Whitney Brown 13 TRUE FALSE 13 1.000
A. Whitney Brown 14 TRUE FALSE 20 1.000
A. Whitney Brown 15 TRUE FALSE 20 1.000
A. Whitney Brown 16 TRUE FALSE 20 1.000
Alan Zweibel 5 TRUE 19800409 FALSE 5 0.250
Sasheer Zamata 39 TRUE 20140118 FALSE 11 0.524

Check the database

Let’s make sure that the database exists and that the table in the database exists.

SHOW DATABASES;
1 records
database_name
duck_datab

Check the database

Let’s make sure that the database exists and that the table in the database exists.

SHOW TABLES;
1 records
name
casts

Check the database

Let’s make sure that the database exists and that the table in the database exists.

DESCRIBE casts;
Table 2: DESCRIBE variables in the casts table.
column_name column_type null key default extra
aid VARCHAR NO PRI ''
sid INTEGER NO PRI 0
featured BOOLEAN NO 'false'
first_epid INTEGER YES 0
last_epid INTEGER YES 0
update_anchor BOOLEAN NO 0
n_episodes INTEGER NO 0
season_fraction DECIMAL(21,20) NO 0

Best practice

It is always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_duckdb, shutdown = TRUE)