SQL joins

Jo Hardin

2024-01-10

Today’s example

IMDb

Consider a database of information from IMDb.

IMDb

21 tables in the imdb database! Lots of details on movies, TV series, video games, and more. Today interest is in movies (and details within, like actors and ratings).

SHOW TABLES;
Displaying records 1 - 10
Tables_in_imdb
aka_name
aka_title
cast_info
char_name
comp_cast_type
company_name
company_type
complete_cast
info_type
keyword

IMDb - glance through tables

SELECT * FROM kind_type
LIMIT 0, 10;
7 records
id kind
1 movie
2 tv series
3 tv movie
4 video movie
5 tv mini series
6 video game
7 episode

All the JOINs

  • JOIN (aka INNER JOIN): include all of the rows that exist in both tables
  • LEFT JOIN: include all of the rows in the first table.
  • RIGHT JOIN: include all of the rows in the second table.
  • FULL OUTER JOIN: include all rows in either table. (The functionality doesn’t exist in MySQL but can be created using joins and UNION.)
  • CROSS JOIN: match each row of the first table with each row in the second table.

All the JOINs

Venn diagrams of the different types of joins.

Venn diagrams describing different types of JOINs.

Figure 1: Venn diagrams describing different JOINs, image credit: phoenixNAP https://phoenixnap.com/kb/mysql-join

All the JOINs

Mini data tables of the different types of JOIN. (In SQL the missing values will be labeled as NULL (not NA).)

Mini data tables describing different types of JOINs.

Figure 2: Mini data tables describing different JOINs, image credit: Statistics Globe blog, https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti

All the JOINs

Each JOIN clause needs four specific pieces of information:

  1. The name of the first table you want to JOIN.
  2. The type of JOIN being used.
  3. The name of the second table you want to JOIN.
  4. The condition(s) under which you want the records in the first table to match records in the second table.

A toy example

Rock bands from the 60s.

band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

The function sqldf() in the sqldf R package allows for SQL commands on R objects.

Rock bands - INNER JOIN

An inner join combines two datasets returning only the observations that exist in both of the original datasets.

sqldf::sqldf("SELECT star.name,
                     star.band,
                     inst.plays
              FROM band_members AS star
              JOIN band_instruments AS inst ON star.name = inst.name")
  name    band  plays
1 John Beatles guitar
2 Paul Beatles   bass

Rock bands - FULL JOIN

A full join combines two datasets returning every observation that exists in either one of the original datasets.

band_members |>
  full_join(band_instruments)
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

The full_join() function does not have an equivalent in MySQL. See notes for using JOINs and UNIONs to produce a full join.

Rock bands - LEFT JOIN

A left join combines two datasets returning every observation that exists in the left (or first) original dataset.

sqldf::sqldf("SELECT star.name,
                     star.band,
                      inst.plays
              FROM band_members AS star
              LEFT JOIN band_instruments AS inst 
              ON star.name = inst.name")
  name    band  plays
1 Mick  Stones   <NA>
2 John Beatles guitar
3 Paul Beatles   bass

Rock bands - RIGHT JOIN

A right join combines two datasets returning every observation that exists in the right (or second) original dataset.

sqldf::sqldf("SELECT inst.name,
                     star.band,
                      inst.plays
              FROM band_members AS star
              RIGHT JOIN band_instruments AS inst 
              ON star.name = inst.name")
   name    band  plays
1  John Beatles guitar
2  Paul Beatles   bass
3 Keith    <NA> guitar

Rock bands - CROSS JOIN

A right join combines two datasets returning every observation that exists in the right (or second) original dataset.

sqldf::sqldf("SELECT *
              FROM band_members AS star
              CROSS JOIN band_instruments AS inst")
  name    band  name  plays
1 Mick  Stones  John guitar
2 Mick  Stones  Paul   bass
3 Mick  Stones Keith guitar
4 John Beatles  John guitar
5 John Beatles  Paul   bass
6 John Beatles Keith guitar
7 Paul Beatles  John guitar
8 Paul Beatles  Paul   bass
9 Paul Beatles Keith guitar

JOIN IMDb title + votes

In the imdb database, the title table includes information about the 4,626,322 titles in the database, including the id, title, kind_id (indicator for the kind of ID it is), and production_year. It does not, however, include the review of the title.

SELECT * FROM title LIMIT 0, 10;
Displaying records 1 - 10
id title imdb_index kind_id production_year imdb_id phonetic_code episode_of_id season_nr episode_nr series_years md5sum
78460 Adults Recat to the Simpsons (30th Anniversary) 7 2017 A3432 78406 2ae09eed7d576cc2c24774fed5b18168
70273 (2016-05-18) 7 2016 68058 511dfc14cfff7589d29a95abb30cd66a
60105 (2014-04-11) 7 2014 59138 c6cdce7e667e07713e431805c407feed
32120 (2008-05-01) 7 2008 32060 100df65742caf5afd092b2e0ead67d8e
97554 Schmölders Traum 7 2001 S2543 97302 10 1 46862a2f96f9fb2d59e8c9a11ecfdd28
57966 (#1.1) 7 2013 57965 1 1 409c37703766c4b24f8a86162fd9cf85
76391 Anniversary 7 1971 A5162 76385 4 9 5e12ce73fac1d1dcf94136b6e9acd8f8
11952 Angus Black/Lester Barrie/DC Curry 7 2009 A5214 11937 4 7 9c38b9e5601dc154444b73b518034aa1
1554 New Orleans 7 2003 N6452 1508 2 11 621bea735740a547e862e4a3226f35d2
58442 Kiss Me Kate 7 2011 K2523 58436 1 10 293e8c75c7f35a4035abf617962be5a9

JOIN IMDb title + votes

The movie_info_idx table does not contain much information about each particular film. It does, however, have an indicator for the movie ID (given by movie_id) as well as the number of votes (given by info where type_id = 100).

SELECT * FROM movie_info_idx LIMIT 0, 6;
6 records
id movie_id info_type_id info note
1 1 99 31.2.1..2.
2 1 100 9
3 1 101 4.1
4 2 99 1000000102
5 2 100 61
6 2 101 6.4

JOIN IMDb title + votes

  • combine the titles with the number of votes so that each title with user votes is included (INNER JOIN)
SELECT title.id,
       title.title,
       movie_info_idx.info
FROM title
JOIN movie_info_idx ON title.id = movie_info_idx.movie_id 
WHERE title.production_year = 2015 
    AND title.kind_id = 1                  # movies only
    AND movie_info_idx.info_type_id = 100  # info_type is votes
    AND movie_info_idx.info > 150000       # at least 150,000 votes
ORDER BY movie_info_idx.info DESC
LIMIT 0, 20;
Displaying records 1 - 10
id title info
4260166 Star Wars: Episode VII - The Force Awakens 691691
3915213 Mad Max: Fury Road 666484
4389619 The Martian 583987
3313672 Avengers: Age of Ultron 540606
4414139 The Revenant 526189
3787790 Jurassic World 471237
3752999 Inside Out 443051
3292159 Ant-Man 390965
4364483 The Hateful Eight 363199
4251736 Spectre 319875

Some aspects of the query are worth pointing out:
* The variables in the output are given in the SELECT clause. The id and title (both from the title table) and the info from the movie_info_idx which represents the number of IMDb votes. * The variables are preceded by the table from which they came. While not always necessary, it is good practice so as to avoid confusion. * The JOIN happens by linking the id variable in the title table with the movie_id variable in the movie_info_idx table. * The LIMIT wasn’t necessary (there are only 12 observations), but it’s good practice so that we don’t end up with unwieldy query results. * The WHERE clause happens before the JOIN action, despite being written after. * In the WHERE clause, we keep only movies, only 2015 production year, and only at least 150,000 votes.

JOIN IMDb title + actress

cast_info contains the person_id and the movie_id

SELECT * FROM cast_info LIMIT 0, 10;
Displaying records 1 - 10
id person_id movie_id person_role_id note nr_order role_id
1 1 3432997 1 31 1
2 2 1901690 2 1
3 3 4027567 2 25 1
4 3 4282876 3 22 1
5 4 3542672 12 1
6 5 3331520 4 (as $hutter Boy) 10 1
7 5 4027191 2 (as $hutter Boy) 1 1
8 5 4195731 5 (uncredited) 1
9 5 4263956 6 (uncredited) 1
10 5 4267787 7 (uncredited) 1

person_role_id is 1 if actor and 2 if actress

JOIN IMDb title + actress

aka_name contains person_id and name (of actor)

SELECT * FROM aka_name LIMIT 0, 10;
Displaying records 1 - 10
id person_id name imdb_index name_pcode_cf name_pcode_nf surname_pcode md5sum
1 6188450 Smith, Jessica Noel S5325 J2542 S53 25c9d464e3ff2957533546aa92b397ed
2 5125059 Pain, L. $ham P545 L515 P5 569b1e885ccb51211c01753f0dad9b2c
3 5 Boy, $hutter B36 H361 B 35092b5604ce378fc48c8a6fc0038a49
4 4152053 Dollasign, Ty D4253 T3425 D425 0f565a2d8027cfb8ed6c5f4bba719fcd
5 4152053 Sign, Ty Dolla S2534 T3425 S25 2eded1b021b96333b4b74e0fec959650
6 6 Moore, Brandon M6165 B6535 M6 193a6f5adf4756320f622162d2475608
7 8 $torm, Country T6525 C5363 T65 1654400b707d34323ea392b87060e6cc
8 19 ‘Hooper’, Simon P.J. Kelly H1625 S5124 H16 3fd8885372c23f8c74e583da91d1fd05
9 19 Hooper H16 24ddc68ab605ee95857ad45b65ffa2d8
10 19 Kelly, Simon P.J. K4251 S5124 K4 33d976f22e276b73c61513bc5f6e72a6

JOIN IMDb title + actress

title contains id and title (of the movie)

SELECT * FROM title LIMIT 0, 10;
Displaying records 1 - 10
id title imdb_index kind_id production_year imdb_id phonetic_code episode_of_id season_nr episode_nr series_years md5sum
78460 Adults Recat to the Simpsons (30th Anniversary) 7 2017 A3432 78406 2ae09eed7d576cc2c24774fed5b18168
70273 (2016-05-18) 7 2016 68058 511dfc14cfff7589d29a95abb30cd66a
60105 (2014-04-11) 7 2014 59138 c6cdce7e667e07713e431805c407feed
32120 (2008-05-01) 7 2008 32060 100df65742caf5afd092b2e0ead67d8e
97554 Schmölders Traum 7 2001 S2543 97302 10 1 46862a2f96f9fb2d59e8c9a11ecfdd28
57966 (#1.1) 7 2013 57965 1 1 409c37703766c4b24f8a86162fd9cf85
76391 Anniversary 7 1971 A5162 76385 4 9 5e12ce73fac1d1dcf94136b6e9acd8f8
11952 Angus Black/Lester Barrie/DC Curry 7 2009 A5214 11937 4 7 9c38b9e5601dc154444b73b518034aa1
1554 New Orleans 7 2003 N6452 1508 2 11 621bea735740a547e862e4a3226f35d2
58442 Kiss Me Kate 7 2011 K2523 58436 1 10 293e8c75c7f35a4035abf617962be5a9

JOIN IMDb title + actress

Goal: identify the actresses in those movies with the highest number of votes.

SELECT t.title,
       idx.info,
       a.person_id,
       n.name
FROM title AS t
JOIN movie_info_idx AS idx ON t.id = idx.movie_id 
JOIN cast_info AS a ON idx.movie_id = a.movie_id
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE t.production_year = 2015 
    AND t.kind_id = 1           # movies only
    AND idx.info_type_id = 100  # info_type is votes
    AND idx.info > 150000       # at least 150,000 votes
    AND a.role_id = 2           # actresses only
ORDER BY a.person_id
LIMIT 0, 50;
Displaying records 1 - 10
title info person_id name
Mission: Impossible - Rogue Nation 266759 2674172 Sofz
Mission: Impossible - Rogue Nation 266759 2674172 Sof
Mission: Impossible - Rogue Nation 266759 2674172 Sofz
Mission: Impossible - Rogue Nation 266759 2674172 Sof
Focus 172680 2678594 Labouisse, Kate
Mad Max: Fury Road 666484 2681098 Michelle, Debra
The Hunger Games: Mockingjay - Part 2 214569 2685496 Harris, Ahnie
The Hunger Games: Mockingjay - Part 2 214569 2685496 Harris, Jasmine
Creed 183904 2686218 Kareema, Kiana A.
Ant-Man 390965 2687271 Akana, Anna Kay

Connecting the most popular movies of 2015 with the actresses in those movies requires a series of JOINs. Note that to make the code less onerous, the title table has been aliased by t, the movie_info_idx table has been aliased by idx, the cast_info table has been aliased by a, and the aka_name table has been aliased by n.

There is a lot of data cleaning to do as some of the person_id values are one to many!! That is, the person_id matches multiple names in the aka_name database.

Other JOINs

Let’s look at RIGHT JOIN and LEFT JOIN using two new smaller tables.

  1. The first has seven movies in it (from 2015 with at least 400,000 IMDb votes).

  2. The second consists of almost 3 million actresses (person_role_id = 2). In order to find a subset of actresses, the person_id > 3900000 was set arbitrarily (in order to have a smaller group with which to work).

Using subqueries, we can JOIN the two datasets using different JOIN techniques.

movies:

SELECT t.id,
       t.title,
       idx.info,
       (SELECT COUNT(*)
       FROM title AS t
       JOIN movie_info_idx AS idx ON idx.movie_id = t.id
       WHERE t.production_year = 2015  
             AND t.kind_id = 1
             AND idx.info_type_id = 100
             AND idx.info > 400000) AS row_count
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1             # movies only
    AND idx.info_type_id = 100    # info_type is votes
    AND idx.info > 400000         # at least 400,000 votes
ORDER BY idx.info DESC;
7 records
id title info row_count
4260166 Star Wars: Episode VII - The Force Awakens 691691 7
3915213 Mad Max: Fury Road 666484 7
4389619 The Martian 583987 7
3313672 Avengers: Age of Ultron 540606 7
4414139 The Revenant 526189 7
3787790 Jurassic World 471237 7
3752999 Inside Out 443051 7

actresses:


SELECT a.person_id,
       a.movie_id,
       n.name,
       (SELECT COUNT(*)
       FROM cast_info AS a
       JOIN aka_name AS n ON a.person_id = n.person_id
       WHERE a.person_role_id = 2  
             AND a.person_id > 390000) AS row_count
FROM cast_info AS a
JOIN aka_name AS n ON a.person_id = n.person_id
       WHERE a.person_role_id = 2  
             AND a.person_id > 3900000
LIMIT 0, 20;
Displaying records 1 - 10
person_id movie_id name row_count
3900141 759802 Simons, Rita Joanne 2904759
3902258 4365829 Singer, Rabbi Tovia 2904759
3902699 3109788 Singh, Sabine Erika 2904759
3903035 3215866 Val 2904759
3904831 2468067 Masha 2904759
3904928 3654347 Fei, Siu Yin 2904759
3904928 3654347 Hsiao, Yen-fei 2904759
3904928 3654347 Siu, Yinfei 2904759
3904928 3654347 Xiao, Yanfei 2904759
3904928 3654347 Yin-Fai, Siu 2904759

Inner JOIN

With an inner JOIN, there are 32 rows corresponding to all the actresses in the seven 2015 films with the most votes. Because the JOIN is an intersection of the two tables, only the actresses with person_id above 3900000 are included.

SELECT * FROM
(SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes     
INNER JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
LIMIT 0, 300;
Displaying records 1 - 10
id title person_id movie_id name
3313672 Avengers: Age of Ultron 3916648 3313672 Smulders, Jacoba Francisca Maria
3752999 Inside Out 4122876 3752999 Kuzniar, Lennon Wynn
3787790 Jurassic World 3938423 3787790 Norvell, Ingrid
3787790 Jurassic World 3938423 3787790 Norvell, Ingrid
3787790 Jurassic World 3950111 3787790 Sallaway, Seannon Jane
3787790 Jurassic World 4079047 3787790 Washington, Kelly Lynn
3787790 Jurassic World 4079047 3787790 Washington, Kelly Lynn
3787790 Jurassic World 4084626 3787790 Weeks, Jency
3787790 Jurassic World 4084626 3787790 Jenc
3787790 Jurassic World 4084626 3787790 Little J

RIGHT JOIN

With a RIGHT JOIN, there are more than 300 rows (the LIMIT clause keeps us from knowing how many rows, but there are a LOT!) corresponding to all the actresses whose person_id above 3900000 are included. Those actresses who acted in one of the seven top 2015 films are also included in the full results table, but they don’t happen to be in the truncated output here.

SELECT * FROM
(SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes     
RIGHT JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
LIMIT 0, 300;
Displaying records 1 - 10
id title person_id movie_id name
3900001 3355298 Simonis, Heidi
3900001 3509490 Simonis, Heidi
3900001 3739780 Simonis, Heidi
3900001 3943329 Simonis, Heidi
3900001 165963 Simonis, Heidi
3900001 268249 Simonis, Heidi
3900001 268370 Simonis, Heidi
3900001 268406 Simonis, Heidi
3900001 268475 Simonis, Heidi
3900001 279082 Simonis, Heidi

LEFT JOIN

With a LEFT JOIN, there are 33 rows corresponding to the actresses in the seven top 2015 movies. Only The Revenant did not have any actresses whose person_id is greater than 3900000.

SELECT * FROM
(SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes     
LEFT JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
LIMIT 0, 300;
Displaying records 1 - 10
id title person_id movie_id name
3313672 Avengers: Age of Ultron 3916648 3313672 Smulders, Jacoba Francisca Maria
3752999 Inside Out 4122876 3752999 Kuzniar, Lennon Wynn
3787790 Jurassic World 3938423 3787790 Norvell, Ingrid
3787790 Jurassic World 3938423 3787790 Norvell, Ingrid
3787790 Jurassic World 3950111 3787790 Sallaway, Seannon Jane
3787790 Jurassic World 4079047 3787790 Washington, Kelly Lynn
3787790 Jurassic World 4079047 3787790 Washington, Kelly Lynn
3787790 Jurassic World 4084626 3787790 Weeks, Jency
3787790 Jurassic World 4084626 3787790 Jenc
3787790 Jurassic World 4084626 3787790 Little J

Counting repeat actresses

We might, for example, want to know how many names / spellings of a name with a specific person_id (above 3900000) exist for each person_id in each of the top voted seven films of 2015.

SELECT acts.person_id, 
       COUNT(*) AS num_repeat_names
FROM (SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes
JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
GROUP BY acts.person_id;
Displaying records 1 - 10
person_id num_repeat_names
3916648 1
4122876 1
3938423 2
3950111 1
4079047 2
4084626 3
4099458 1
3958614 1
3990819 2
4081131 2

Counting number of actresses per film

We might, for example, want to know how many actresses with a specific person_id (above 3900000) are in each of the top voted seven films of 2015.

SELECT movs.id, 
       movs.title,
       COUNT(*) AS num_actress
FROM (SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes
JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
GROUP BY movs.id;
6 records
id title num_actress
3313672 Avengers: Age of Ultron 1
3752999 Inside Out 1
3787790 Jurassic World 9
3915213 Mad Max: Fury Road 5
4260166 Star Wars: Episode VII - The Force Awakens 15
4389619 The Martian 1

UNIONing

In SQL a UNION clause combines two different tables by their rows (whereas JOIN combines two tables by columns). Think about UNION similarly to the bind_rows() command in R.

Mini data tables describing how UNION binds rows and JOIN appends columns.

Figure 3: UNION binds rows while JOIN appends columns, image credit: Jane Williams https://blog.devart.com/mysql-union-tutorial-html.html

UNIONs

Silly example where the column names are ignored.

SELECT 
    1 AS bar,
    2 AS foo

UNION

SELECT 
    10 AS foo,
    20 AS bar;
2 records
bar foo
1 2
10 20

UNION

UNION is specifically designed to bind rows from two different SELECT queries where the variables have been selected in the same order.

Combine the top voted movies from 2015 with the top voted movies from 2019.

However, to account for time, we require the movies from 2015 to have more votes (400,000) than the movies from 2017 (200,000).

That is, the WHERE clause is different for the two subqueries.

UNION

(SELECT t.title, 
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 400000)

UNION

(SELECT t.title, 
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2017  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 200000)
LIMIT 0, 100;
Displaying records 1 - 10
title production_year num_votes
Avengers: Age of Ultron 2015 540606
Inside Out 2015 443051
Jurassic World 2015 471237
Mad Max: Fury Road 2015 666484
Star Wars: Episode VII - The Force Awakens 2015 691691
The Martian 2015 583987
The Revenant 2015 526189
Dunkirk 2017 229089
Guardians of the Galaxy Vol. 2 2017 281845
Logan 2017 397056

UNION ALL

If the goal is to include duplicates across two tables, use UNION ALL instead of UNION.

Let’s say that the first table is all movies with production year after 2012 and number of votes greater than 500,000. The second table is movies with production year equal to 2015 and number of votes greater than 400,000.

The Martian would be in both tables.

UNION ALL

With just UNION

(SELECT t.title,
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year > 2012  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 500000)

UNION

(SELECT t.title, 
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 400000)
ORDER BY production_year DESC, num_votes;
Displaying records 1 - 10
title production_year num_votes
Batman v Superman: Dawn of Justice 2016 500037
Deadpool 2016 673887
Inside Out 2015 443051
Jurassic World 2015 471237
The Revenant 2015 526189
Avengers: Age of Ultron 2015 540606
The Martian 2015 583987
Mad Max: Fury Road 2015 666484
Star Wars: Episode VII - The Force Awakens 2015 691691
Interstellar 2014 1102826

UNION ALL

When UNION ALL is applied, The Martian is listed twice in the results table.

(SELECT t.title,
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year > 2012  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 500000)

UNION ALL

(SELECT t.title, 
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 400000)
ORDER BY production_year DESC, num_votes;
Displaying records 1 - 10
title production_year num_votes
Batman v Superman: Dawn of Justice 2016 500037
Deadpool 2016 673887
Inside Out 2015 443051
Jurassic World 2015 471237
The Revenant 2015 526189
The Revenant 2015 526189
Avengers: Age of Ultron 2015 540606
Avengers: Age of Ultron 2015 540606
The Martian 2015 583987
The Martian 2015 583987

Best practice

It is always a good idea to terminate the SQL connection when you are done with it.

DBI::dbDisconnect(con_imdb, shutdown = TRUE)