Creating a Movie Dataset with R

Last updated on May 27, 2020 23 min read

In this post, I’ll show how I used R to create a movies dataset. I worked with IMDb (Internet Movie Database) data files available for download and information gathered through web scraping their site.

Load Packages

The following packages need to be installed and loaded in order to run the code written in this post.

library(tidyverse)
library(parallel)
library(rvest)
library(knitr)
library(kableExtra)

IMDb Datasets

IMDb (Internet Movie Database) is an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews (Wikipedia contributors 2020).

Subsets of IMDb data are available for access to customers for personal and non-commercial use.

Download

I’ll download the following datasets:

title.basics.tsv.gz
title.ratings.tsv.gz
title.crew.tsv.gz
name.basics.tsv.gz

I’ll provide more information about them in the following sections.

Details

The following code creates a folder named data if it doesn’t already exist where I’ll place the downloaded files. For the purpose of this post, the download of each file will proceed if the file hasn’t been downloaded yet. You should take into account that the downloaded files may change depending on when they are downloaded (as new information is available every day).

# Create "data" folder if does not exist
if (!file.exists("data")) {
  dir.create("data")
}

# Download the files from IMDb if they haven't been downloaded
if (!file.exists("data/title.basics.tsv.gz")) {
download.file("https://datasets.imdbws.com/title.basics.tsv.gz",
              destfile = "data/title.basics.tsv.gz")
}

if (!file.exists("data/title.ratings.tsv.gz")) {
download.file("https://datasets.imdbws.com/title.ratings.tsv.gz",
              destfile = "data/title.ratings.tsv.gz")
}

if (!file.exists("data/title.crew.tsv.gz")) {
download.file("https://datasets.imdbws.com/title.crew.tsv.gz",
              destfile = "data/title.crew.tsv.gz")
}

if (!file.exists("data/name.basics.tsv.gz")) {
download.file("https://datasets.imdbws.com/name.basics.tsv.gz",
              destfile = "data/name.basics.tsv.gz")
}

Description, Load and Inspection

In this section I’ll describe what variables are included in each dataset, load them into R and take a look at their first few rows.

I created the following function to set the column width of a variable.

# Function to set kable column width
col_width <- function(kable_input, data, col, width_min) {
  column_spec(kable_input = kable_input, 
              column = which( colnames(data) == enexpr(col) ), 
              width_min = width_min)
}

`title.basics.tsv.gz`

This dataset contains the following information:

tconst (string): alphanumeric unique identifier of the title
titleType (string): the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
primaryTitle (string): the more popular title / the title used by the filmmakers on promotional materials at the point of release
originalTitle (string): original title, in the original language
isAdult (boolean):
- 0: non-adult title;
- 1: adult title
startYear (YYYY): represents the release year of a title. In the case of TV Series, it is the series start year
endYear (YYYY): TV Series end year. ‘\N’ for all other title types
runtimeMinutes: primary runtime of the title, in minutes
genres (string array): includes up to three genres associated with the title

I load the data.

# Load title.basics
title.basics <- gzfile("data/title.basics.tsv.gz") %>%
      read_delim(
            "\t", 
            escape_double = FALSE, 
            na = "\\N", 
            trim_ws = TRUE, 
            quote='',
            col_types = cols(
                  tconst = col_character(), 
                  titleType = col_character(),
                  primaryTitle = col_character(),
                  originalTitle = col_character(),
                  isAdult = col_logical(),
                  startYear = col_integer(),
                  endYear = col_integer(),                 
                  runtimeMinutes = col_integer(), 
                  genres = col_character()
                  )
            )

This is how this dataset looks like.

title.basics %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(title.basics, tconst, "3em") %>%
  col_width(title.basics, titleType, "3em") %>%
  col_width(title.basics, primaryTitle, "18em") %>%
  col_width(title.basics, originalTitle, "18em") %>%
  col_width(title.basics, isAdult, "3em") %>%
  col_width(title.basics, startYear, "3em") %>%
  col_width(title.basics, endYear, "3em") %>%
  col_width(title.basics, runtimeMinutes, "4em") %>%
  col_width(title.basics, genres, "5em")

tconst	titleType	primaryTitle	originalTitle	isAdult	startYear	endYear	runtimeMinutes	genres
tt0000001	short	Carmencita	Carmencita	FALSE	1894	NA	1	Documentary,Short
tt0000002	short	Le clown et ses chiens	Le clown et ses chiens	FALSE	1892	NA	5	Animation,Short
tt0000003	short	Pauvre Pierrot	Pauvre Pierrot	FALSE	1892	NA	4	Animation,Comedy,Romance
tt0000004	short	Un bon bock	Un bon bock	FALSE	1892	NA	12	Animation,Short
tt0000005	short	Blacksmith Scene	Blacksmith Scene	FALSE	1893	NA	1	Comedy,Short
tt0000006	short	Chinese Opium Den	Chinese Opium Den	FALSE	1894	NA	1	Short

`title.ratings.tsv.gz`

This dataset contains the following information:

tconst (string): alphanumeric unique identifier of the title
averageRating: weighted average of all the individual user ratings
numVotes - number of votes the title has received

I load the data.

# Load title.ratings
title.ratings <- gzfile("data/title.ratings.tsv.gz") %>%
      read_delim("\t", 
                 escape_double = FALSE, 
                 na = "\\N", 
                 trim_ws = TRUE, 
                 quote='',
                 col_types = cols(
                       tconst = col_character(), 
                       averageRating = col_double(),
                       numVotes = col_integer()
                       )
                 )

This is how this dataset looks like.

title.ratings %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(title.ratings, tconst, "3em") %>%
  col_width(title.ratings, averageRating, "3em") %>%
  col_width(title.ratings, numVotes, "3em")

tconst	averageRating	numVotes
tt0000001	5.6	1618
tt0000002	6.0	198
tt0000003	6.5	1298
tt0000004	6.1	122
tt0000005	6.1	2077
tt0000006	5.1	112

`title.crew.tsv.gz`

This dataset contains the following information:

tconst (string): alphanumeric unique identifier of the title
directors (array of nconsts): director(s) of the given title
writers (array of nconsts): writer(s) of the given title

In this post, I’m not interested in writers information, so I won’t be loading that column.

I load the data.

# Load title.crew, skipping writers column
title.crew <- gzfile("data/title.crew.tsv.gz") %>%
      read_delim("\t", 
              escape_double = FALSE, 
              na = "\\N", 
              trim_ws = TRUE, 
              quote='',
              col_types = cols(
                 tconst = col_character(),
                 directors = col_character(),
                 writers = col_skip()
                 )
              )

This is how this dataset looks like.

title.crew %>%
  head(8) %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(title.crew, directors, "14em")

tconst	directors
tt0000001	nm0005690
tt0000002	nm0721526
tt0000003	nm0721526
tt0000004	nm0721526
tt0000005	nm0005690
tt0000006	nm0005690
tt0000007	nm0374658,nm0005690
tt0000008	nm0005690

`name.basics.tsv.gz`

This dataset contains the following information:

nconst (string) - alphanumeric unique identifier of the name/person
primaryName (string): name by which the person is most often credited
birthYear: in YYYY format
deathYear: in YYYY format if applicable, else ‘\N’
primaryProfession (array of strings): the top-3 professions of the person
knownForTitles (array of tconsts): titles the person is known for

From this dataset, I’ll only use the first two columns (nconst and primaryName)

I load the data.

# Load name.basics
name.basics <- gzfile("data/name.basics.tsv.gz") %>%
      read_delim("\t", 
              escape_double = FALSE, 
              na = "\\N", 
              trim_ws = TRUE, 
              quote='',
              col_types = cols(
                 nconst = col_character(),
                 primaryName = col_character(),
                 birthYear = col_skip(),
                 deathYear = col_skip(),
                 primaryProfession = col_skip(),
                 knownForTitles = col_skip()
                 )
              )

This is how this dataset looks like.

name.basics %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(name.basics, nconst, "3em") %>%
  col_width(name.basics, primaryName, "10em")

nconst	primaryName
nm0000001	Fred Astaire
nm0000002	Lauren Bacall
nm0000003	Brigitte Bardot
nm0000004	John Belushi
nm0000005	Ingmar Bergman
nm0000006	Ingrid Bergman

I’ll use this dataset to replace the name codes found in title.crew.

Movies Dataset - Part I

Let’s start the creation of our movies dataset.

In Part I, I’ll work with the datasets downloaded in the previous section to create a movies dataset that will contain information about movies with the following characteristics:

have a release year (startYear) greater than or equal to 1970
have a runtime (runtimeMinutes) between 45m and 3h30min (210 min)
have an average rating (averageRating) greater than or equal to 4 on IMDb
at least 2500 votes (numVotes) on IMDb

The following code takes title.basics, applies some filters, adds title.ratings information, applies some filters again and finally, creates a new column, linkTitle, which contains the html code necessary to link a movie to its IMDb website.

Details

This is what the code does:

Begin with title.basics dataset.
Filter titles with:
- titleType == "movie",
- startYear >= 1970
- runtimeMinutes >= 45 & runtimeMinutes <= 210.
Add ratings information from title.ratings.
Delete duplicate titles.
Filter titles with:
- averageRating >= 4,
- numVotes >= 2500
Create linkTitle column.

If you want to change any of the parameters I defined in the filters, you can do it!

# Creation of the movies dataset
movies <- title.basics %>%
  
  # Apply filters
      filter(
            titleType == "movie",
            startYear >= 1970,
            runtimeMinutes >= 45 & runtimeMinutes <= 210
            ) %>%
  
  # Add ratings and votes
  left_join(title.ratings, by = "tconst") %>%
  
  # Remove duplicates
      distinct(tconst, .keep_all = TRUE) %>%
  
  # Apply filters
      filter(
            averageRating >= 4,
            numVotes >= 2500
      ) %>%
  # Create linkTitle
   mutate(
     linkTitle = str_c(
       "<a href='https://www.imdb.com/title/", tconst, "/'>", originalTitle,"</a>")
     )

Let’s take a look at our dataset so far.

movies %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(movies, tconst, "3em") %>%
  col_width(movies, titleType, "3em") %>%
  col_width(movies, primaryTitle, "18em") %>%
  col_width(movies, originalTitle, "18em") %>%
  col_width(movies, isAdult, "3em") %>%
  col_width(movies, startYear, "3em") %>%
  col_width(movies, endYear, "3em") %>%
  col_width(movies, runtimeMinutes, "4em") %>%
  col_width(movies, genres, "5em") %>%
  col_width(movies, averageRating, "3em") %>%
  col_width(movies, numVotes, "3em") %>%
  col_width(movies, linkTitle, "45em")

tconst	titleType	primaryTitle	originalTitle	isAdult	startYear	endYear	runtimeMinutes	genres	averageRating	numVotes	linkTitle
tt0035423	movie	Kate & Leopold	Kate & Leopold	FALSE	2001	NA	118	Comedy,Fantasy,Romance	6.4	77127	<a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a>
tt0064106	movie	Le Boucher	Le boucher	FALSE	1970	NA	93	Drama,Thriller	7.5	7348	<a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a>
tt0064165	movie	The Things of Life	Les choses de la vie	FALSE	1970	NA	89	Drama,Romance	7.6	3664	<a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a>
tt0064177	movie	Colossus: The Forbin Project	Colossus: The Forbin Project	FALSE	1970	NA	100	Sci-Fi,Thriller	7.1	7860	<a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a>
tt0064285	movie	The Wild Child	L’enfant sauvage	FALSE	1970	NA	83	Drama	7.5	7238	<a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a>
tt0064437	movie	The Honeymoon Killers	The Honeymoon Killers	FALSE	1970	NA	107	Crime,Drama,Romance	7.0	4229	<a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a>

To finish Part I, I’ll add directors information to our movies dataset.

This would have been pretty straightforward if title.crew had directors’ names instead of codes. As that’s not the case, some preprocessing is going to be necessary. Also, the fact that when a title has more than one director they are comma-separated in the same cell adds certain complexity to the task of replacing codes with names. But be not afraid, because the tidyverse is going to help us with this task!

The following code applies the desired change to title.crew.

Details

This is what the code does:

Begin with title.crew dataset.
Keep crew that corresponds to titles in movies to speed up calculations
Go from a dataset where we have one row per title with comma-separated directors codes, to a dataset where we have one row per director code, indicating the title they directed. This step creates a longer dataset which will have duplicate tconst values for titles with more than one director.
Add the name by directors’ code.
Return to a dataset with one row per title and comma-separated named directors.

title.crew <- title.crew %>%
  
  # Keep crew from titles in movies
   semi_join(movies, "tconst") %>%
  
  # One row for each director
   separate_rows(directors, sep = ",") %>%
  
  # Add name
   left_join(name.basics, by = c("directors" = "nconst")) %>%
  
  # Return to semicolon separated values
   group_by(tconst) %>%
   summarise(directors = str_c(primaryName, collapse = ";")) %>%
   ungroup()

## `summarise()` ungrouping output (override with `.groups` argument)

Let’s see how title.crew looks like now.

title.crew %>%
  head(8) %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(title.crew, directors, "18em")

tconst	directors
tt0035423	James Mangold
tt0064106	Claude Chabrol
tt0064165	Claude Sautet
tt0064177	Joseph Sargent
tt0064285	François Truffaut
tt0064437	Donald Volkman;Leonard Kastle
tt0064451	King Hu
tt0064791	René Clément

Now we can add this to movies.

# Add the column to movies
movies <- movies %>%
   left_join(title.crew, by = "tconst")

This is how movies looks like in the end of Part I.

movies %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(movies, tconst, "3em") %>%
  col_width(movies, titleType, "3em") %>%
  col_width(movies, primaryTitle, "18em") %>%
  col_width(movies, originalTitle, "18em") %>%
  col_width(movies, isAdult, "3em") %>%
  col_width(movies, startYear, "3em") %>%
  col_width(movies, endYear, "3em") %>%
  col_width(movies, runtimeMinutes, "4em") %>%
  col_width(movies, genres, "5em") %>%
  col_width(movies, averageRating, "3em") %>%
  col_width(movies, numVotes, "3em") %>%
  col_width(movies, linkTitle, "45em") %>%
  col_width(movies, directors, "18em")

tconst	titleType	primaryTitle	originalTitle	isAdult	startYear	endYear	runtimeMinutes	genres	averageRating	numVotes	linkTitle	directors
tt0035423	movie	Kate & Leopold	Kate & Leopold	FALSE	2001	NA	118	Comedy,Fantasy,Romance	6.4	77127	<a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a>	James Mangold
tt0064106	movie	Le Boucher	Le boucher	FALSE	1970	NA	93	Drama,Thriller	7.5	7348	<a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a>	Claude Chabrol
tt0064165	movie	The Things of Life	Les choses de la vie	FALSE	1970	NA	89	Drama,Romance	7.6	3664	<a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a>	Claude Sautet
tt0064177	movie	Colossus: The Forbin Project	Colossus: The Forbin Project	FALSE	1970	NA	100	Sci-Fi,Thriller	7.1	7860	<a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a>	Joseph Sargent
tt0064285	movie	The Wild Child	L’enfant sauvage	FALSE	1970	NA	83	Drama	7.5	7238	<a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a>	François Truffaut
tt0064437	movie	The Honeymoon Killers	The Honeymoon Killers	FALSE	1970	NA	107	Crime,Drama,Romance	7.0	4229	<a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a>	Donald Volkman;Leonard Kastle

Movies Dataset - Part II

So far so good. We used the downloaded files to create a movies dataset with a lot of information about movies with certain characteristics. But what if we want even more information?

Web scraping IMDb

If we take a look at any movie website in IMDb (for example, this one), we see that there is a lot more information related to that movie. I’m particularly interested in the following:

Metascore
The place or places where the production companies for that title are based, and therefore where the financing originated (Country)
Estimated production cost of the movie (Budget)
How much the movie took at the box office in the USA in its first weekend of release (Opening Weekend USA)
How much the movie made at the box office in USA (Gross USA)
How much the movie made at the box office all over the world, including the USA (Worldwide Gross)
Awards information
Full cast

How can we add that information to our dataset? Now is when web scraping joins the party. I’ll use the rvest package to extract the information mentioned above for each movie (when available). Also, as we have to do this for 16191 movies, I’ll be using the parallel package to speed up the work. Even with this approach, the code took around one hour and a half to run.

The following code creates an object named scraping that contains the information we want for each movie. Some preprocessing is going to be necessary before adding new columns to movies, but I’ll talk about it later.

Details

I had to retrieve information for a lot of movies, and look for it in two different websites (because full cast data wasn’t in the same title page as the other information). For that reason, I decided to run my code in parallel.

The first part of the code sets up clusters to run the work in parallel. I defined the number of cores to be used equal to the number of cores in my computer (it has eight cores) minus one, because that’s what is recommended. So I created seven clusters. Remember to stop the clusters when you finish the work.

I load the necessary libraries to run the code in each cluster by using the clusterEvalQ function. I assigned the result of that function to junk to prevent the print of some output.

Then we have the extraction function itself. The difference is that instead of using apply, we use parLapply.

The object scraping consists of a list of elements with length equal to the number of movies scraped, and each element of this list is a list of elements (the returned my_list object for each movie).

The scraping part is somewhat straightforward. We read the html, we select nodes and extract the text.

There is one final consideration regarding the scraping of box office information. Based on how the information is structured on IMDb, in this step I extracted more data than necessary. For each movie, the scrap_boxOffice object is a character vector whose length may vary from movie to movie (because not all movies have all data available) and each element of that vector contains different data. I’ll manage to keep the information I want and assign it to the corresponding vector in the next section.

Even though I tried to be as clear as possible, this might be a little confusing, but I hope that by inspecting the code you’ll get what is doing!

# Web scraping more information about movies

# Set the cluster to work in parallel
n_cores <- detectCores() - 1
clust <- makeCluster(n_cores)

# Load packages on each cluster
# Obs: library() returns a character vector of all the attached packages, but
# it does so invisibly. When executed via clusterEvalQ, it isn't invisible.
# To make it "invisible" again, I assigned the result of clusterEvalQ to junk. 

junk <- clusterEvalQ(clust, {
  library(dplyr)
  library(stringr)
  library(rvest)
})

# Create scraping object by running a parallel apply
scraping <- parLapply(clust, movies$tconst, function(x){
  
  # Read title page
  title_page <- read_html(str_c("https://www.imdb.com/title/", x)) 
  
  # Scrape metascore
  scrap_metascore <- title_page %>%
    html_nodes(".metacriticScore span") %>%
    html_text
  
  # Scrape box office (and country)
  scrap_boxOffice <- title_page %>%
    html_nodes(".txt-block") %>%
    html_text
  
  # Scrape awards
  scrap_awards <- title_page %>%
    html_nodes(".awards-blurb b") %>%
    html_text
  
  
  # Read cast page
  cast_page <- read_html(
    str_c("https://www.imdb.com/title/", x, "/fullcredits?ref_=tt_cl_sm#cast")
    )
  
  # Scrape cast
  scrap_cast <- cast_page %>%
    html_nodes(".primary_photo+ td a") %>%
    html_text %>%
    str_c(collapse = ";") %>%
    str_replace_all("\\n", "")
  
  # Output for each movie
  my_list <- list("scrap_cast" = scrap_cast,
                  "scrap_metascore" = scrap_metascore,
                  "scrap_boxOffice" = scrap_boxOffice,
                  "scrap_awards" = scrap_awards)
  return(my_list)
})

# Stop the cluster
stopCluster(clust)

Vector Assignment

The following code uses the scraping object to create the vectors of data that will be used to add columns to movies.

Details

This is what the code does:

Preallocates the corresponding vectors
Creates no_lenght_zero function to solve an issue realted to how data is scraped when there is no information
Fills preallocated vectors

boxOffice_list is a character vector where each element refers to a different type of data. It is somewhat structured though, so I used a combination of keep and str_detect functions to assign each target data to the corresponding vector.

# Number of movies
n_movies <- nrow(movies)

# Preallocate vectors
scrap_cast <- vector("character", length = n_movies)

scrap_metascore <- vector("character", length = n_movies)

scrap_country <- vector("character", length = n_movies)
scrap_budget <- vector("character", length = n_movies)
scrap_open_usa <- vector("character", length = n_movies)
scrap_gross_usa <- vector("character", length = n_movies)
scrap_gross_world <- vector("character", length = n_movies)

scrap_awards <- vector("character", length = n_movies)

# Function to fix length zero problem
no_lenght_zero <- function(x){
  ifelse(identical(x, character(0)), NA_character_, x)
}

# Assign extracted info
for (i in seq_along(scraping)){
  
  scrap_cast[[i]] <- scraping[[i]][["scrap_cast"]] %>%
    no_lenght_zero
  
  scrap_metascore[[i]] <- scraping[[i]][["scrap_metascore"]] %>%
    no_lenght_zero
  
  boxOffice_list <- scraping[[i]][["scrap_boxOffice"]]
  
  scrap_country[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Country:")) %>%
    no_lenght_zero
  scrap_budget[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Budget:")) %>%
    no_lenght_zero
  scrap_open_usa[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Opening Weekend USA:")) %>%
    no_lenght_zero
  scrap_gross_usa[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Gross USA:")) %>%
    no_lenght_zero
  scrap_gross_world[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Cumulative Worldwide Gross:")) %>%
    no_lenght_zero
  
  scrap_awards[[i]] <- scraping[[i]][["scrap_awards"]] %>%
    no_lenght_zero
}

Add columns

Finally, we can use mutate to add the new columns to movies.

# Add scraped columns to data
movies <- movies %>%
  mutate(cast = scrap_cast,
         metascore_raw = scrap_metascore,
         country_raw = scrap_country,
         budget_raw = scrap_budget,
         open_usa_raw = scrap_open_usa,
         gross_usa_raw = scrap_gross_usa,
         gross_world_raw = scrap_gross_world,
         awards_raw = scrap_awards)

This is how movies looks like in the end of Part II.

Obs: I have taken the first 30 characters of cast and replace the rest with ... to make the table look better.

movies %>%
  head %>%
  mutate(cast = cast %>%
           str_sub(1, 30) %>%
           str_c("...")) %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(movies, tconst, "3em") %>%
  col_width(movies, titleType, "3em") %>%
  col_width(movies, primaryTitle, "18em") %>%
  col_width(movies, originalTitle, "18em") %>%
  col_width(movies, isAdult, "3em") %>%
  col_width(movies, startYear, "3em") %>%
  col_width(movies, endYear, "3em") %>%
  col_width(movies, runtimeMinutes, "4em") %>%
  col_width(movies, genres, "5em") %>%
  col_width(movies, averageRating, "3em") %>%
  col_width(movies, numVotes, "3em") %>%
  col_width(movies, linkTitle, "45em") %>%
  col_width(movies, directors, "18em") %>%
  col_width(movies, cast, "18em") %>%
  col_width(movies, metascore_raw, "3em") %>%
  col_width(movies, country_raw, "20em") %>%
  col_width(movies, budget_raw, "18em") %>%
  col_width(movies, open_usa_raw, "28em") %>%
  col_width(movies, gross_usa_raw, "18em") %>%
  col_width(movies, gross_world_raw, "25em") %>%
  col_width(movies, awards_raw, "18em")

tconst	titleType	primaryTitle	originalTitle	isAdult	startYear	endYear	runtimeMinutes	genres	averageRating	numVotes	linkTitle	directors	cast	metascore_raw	country_raw	budget_raw	open_usa_raw	gross_usa_raw	gross_world_raw	awards_raw
tt0035423	movie	Kate & Leopold	Kate & Leopold	FALSE	2001	NA	118	Comedy,Fantasy,Romance	6.4	77127	<a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a>	James Mangold	Meg Ryan; Hugh Jackman; Liev …	44	Country: USA	Budget:$48,000,000 (estimated)	Opening Weekend USA: $9,725,408, 30 December 2001	Gross USA: $47,121,859	Cumulative Worldwide Gross: $76,019,048	Nominated for 1 Oscar.
tt0064106	movie	Le Boucher	Le boucher	FALSE	1970	NA	93	Drama,Thriller	7.5	7348	<a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a>	Claude Chabrol	Stéphane Audran; Jean Yanne; …	NA	Country: France \| Italy	NA	NA	Gross USA: $474,458	NA	Nominated for 1 BAFTA Film Award.
tt0064165	movie	The Things of Life	Les choses de la vie	FALSE	1970	NA	89	Drama,Romance	7.6	3664	<a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a>	Claude Sautet	Michel Piccoli; Romy Schneide…	NA	Country: Switzerland \| France \| Italy	NA	NA	Gross USA: $5,063	Cumulative Worldwide Gross: $5,063	NA
tt0064177	movie	Colossus: The Forbin Project	Colossus: The Forbin Project	FALSE	1970	NA	100	Sci-Fi,Thriller	7.1	7860	<a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a>	Joseph Sargent	Eric Braeden; Susan Clark; Go…	NA	Country: USA	NA	NA	NA	NA	NA
tt0064285	movie	The Wild Child	L’enfant sauvage	FALSE	1970	NA	83	Drama	7.5	7238	<a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a>	François Truffaut	Jean-Pierre Cargol; François …	94	Country: France	NA	Opening Weekend USA: $11,206, 25 April 1999	Gross USA: $65,560	Cumulative Worldwide Gross: $65,560	NA
tt0064437	movie	The Honeymoon Killers	The Honeymoon Killers	FALSE	1970	NA	107	Crime,Drama,Romance	7.0	4229	<a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a>	Donald Volkman;Leonard Kastle	Shirley Stoler; Tony Lo Bianc…	NA	Country: USA	Budget:$150,000 (estimated)	NA	NA	NA	NA

Movies Dataset - Part III

In Part III, I’ll clean the columns created in Part II.

Metascore

metascore_raw is a character column. I used parse_number to create metascore, which is numeric.

# metascore to number
movies <- movies %>%
  mutate(metascore = metascore_raw %>%
           parse_number)

Country

country_raw needs some cleaning that can be done with stringr::str_replace_all function.

movies <- movies %>%
  mutate(country = country_raw %>%
           str_replace_all("Country:", "") %>%
           str_replace_all("\\n", "") %>%
           str_replace_all(" ", "") %>%
           str_replace_all("\\|", ";")
)

Box Office Information

By taking a fast look at this “money” variables, it seemed that the currency was dollars ($). Just to be sure, I created the function check_currency with the following logic: if the sum of values where I find the dollar sign ($) and NA values of a column is equal to the total number of rows, then all currencies are the same (and dollars).

check_currency <- function(data, col) {
  n_movies <- nrow(data)
  missing <- data %>%
    pull(enexpr(col)) %>%
    is.na %>%
    sum
  
  dollar_sign <- data %>%
    pull(enexpr(col)) %>%
    str_detect("\\$") %>%
    sum(na.rm = TRUE)
  
  identical(missing + dollar_sign, n_movies)
}

After applying this function to the corresponding columns, we see that not all currencies are the same. Later I checked the dataset and saw in fact that this was the case.

check_currency(movies, budget_raw)

## [1] FALSE

check_currency(movies, open_usa_raw)

## [1] TRUE

check_currency(movies, gross_usa_raw)

## [1] FALSE

check_currency(movies, gross_world_raw)

## [1] FALSE

I decided to keep only those values express in dollars, assigning NA otherwise. For that matter, I created the keep_dollar function.

keep_dollar <- function(col) {
  ifelse(str_detect(col, "\\$"), col, NA_real_)
}

Now we can create our new variables, through parse_number. This function works just fine, even with columns where there were more numbers present (for example, take a look at open_usa_raw and open_usa)

movies <- movies %>%
  mutate(budget = keep_dollar(budget_raw) %>%
           parse_number,
         open_usa = keep_dollar(open_usa_raw) %>%
           parse_number,
         gross_usa = keep_dollar(gross_usa_raw) %>%
           parse_number,
         gross_world = keep_dollar(gross_world_raw) %>%
           parse_number)

Awards information

Using the awards_raw column I created a few variables.

For every movie:

won_oscar, an indicator variable that takes the value 1 if the movie won any Oscar and 0 otherwise
oscars_won, the number of Oscars won by a movie

For movies that didn’t win an Oscar:

nominated_oscar, an indicator variable that takes the value 1 if the movie was nominated for any Oscar and 0 otherwise
oscars_nominations, the number of Oscar nominations

Obs: If a movie won at least one Oscar, we don’t know how many nominations it had. This has to do with the way I scraped this data.

movies <- movies %>%
  mutate(
    
    won_oscar = ifelse(
      test = str_detect(awards_raw, "Won") & 
        str_detect(awards_raw, "Oscar") &
        !is.na(awards_raw), 
      yes = 1, 
      no = 0
      ),
    
    oscars_won = ifelse(
      test = won_oscar == 1, 
      yes = parse_number(awards_raw),
      no = 0
    ),
    
    nominated_oscar = ifelse(
      test = str_detect(awards_raw, "Nominated") & 
        str_detect(awards_raw, "Oscar") &
        !is.na(awards_raw), 
      yes = 1, 
      no = 0
    ),
    
    oscars_nominations = ifelse(
      test = nominated_oscar == 1, 
      yes = parse_number(awards_raw),
      no = 0
    )
)

The final dataset looks like this. I kept the raw columns for you to see the differences with the cleaned ones.

Obs: I have taken the first 30 characters of cast and replace the rest with ... to make the table look better.

movies %>%
  head %>%
  mutate(cast = cast %>%
           str_sub(1, 30) %>%
           str_c("...")) %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(movies, tconst, "3em") %>%
  col_width(movies, titleType, "3em") %>%
  col_width(movies, primaryTitle, "18em") %>%
  col_width(movies, originalTitle, "18em") %>%
  col_width(movies, isAdult, "3em") %>%
  col_width(movies, startYear, "3em") %>%
  col_width(movies, endYear, "3em") %>%
  col_width(movies, runtimeMinutes, "4em") %>%
  col_width(movies, genres, "5em") %>%
  col_width(movies, averageRating, "3em") %>%
  col_width(movies, numVotes, "3em") %>%
  col_width(movies, linkTitle, "45em") %>%
  col_width(movies, directors, "18em") %>%
  col_width(movies, cast, "18em") %>%
  col_width(movies, metascore_raw, "3em") %>%
  col_width(movies, country_raw, "20em") %>%
  col_width(movies, budget_raw, "18em") %>%
  col_width(movies, open_usa_raw, "28em") %>%
  col_width(movies, gross_usa_raw, "18em") %>%
  col_width(movies, gross_world_raw, "25em") %>%
  col_width(movies, awards_raw, "18em") %>%
  col_width(movies, metascore, "3em") %>%
  col_width(movies, country, "9em") %>%
  col_width(movies, budget, "7em") %>%
  col_width(movies, open_usa, "7em") %>%
  col_width(movies, gross_usa, "7em") %>%
  col_width(movies, gross_world, "7em") %>%
  col_width(movies, won_oscar, "10em") %>%
  col_width(movies, oscars_won, "10em") %>%
  col_width(movies, nominated_oscar, "10em") %>%
  col_width(movies, oscars_nominations, "10em")

tconst	titleType	primaryTitle	originalTitle	isAdult	startYear	endYear	runtimeMinutes	genres	averageRating	numVotes	linkTitle	directors	cast	metascore_raw	country_raw	budget_raw	open_usa_raw	gross_usa_raw	gross_world_raw	awards_raw	metascore	country	budget	open_usa	gross_usa	gross_world	nominated_oscar	oscars_nominations
tt0035423	movie	Kate & Leopold	Kate & Leopold	FALSE	2001	NA	118	Comedy,Fantasy,Romance	6.4	77127	<a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a>	James Mangold	Meg Ryan; Hugh Jackman; Liev …	44	Country: USA	Budget:$48,000,000 (estimated)	Opening Weekend USA: $9,725,408, 30 December 2001	Gross USA: $47,121,859	Cumulative Worldwide Gross: $76,019,048	Nominated for 1 Oscar.	44	USA	4.8e+07	9725408	47121859	76019048	1	1
tt0064106	movie	Le Boucher	Le boucher	FALSE	1970	NA	93	Drama,Thriller	7.5	7348	<a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a>	Claude Chabrol	Stéphane Audran; Jean Yanne; …	NA	Country: France \| Italy	NA	NA	Gross USA: $474,458	NA	Nominated for 1 BAFTA Film Award.	NA	France;Italy	NA	NA	474458	NA	0	0
tt0064165	movie	The Things of Life	Les choses de la vie	FALSE	1970	NA	89	Drama,Romance	7.6	3664	<a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a>	Claude Sautet	Michel Piccoli; Romy Schneide…	NA	Country: Switzerland \| France \| Italy	NA	NA	Gross USA: $5,063	Cumulative Worldwide Gross: $5,063	NA	NA	Switzerland;France;Italy	NA	NA	5063	5063	0	0
tt0064177	movie	Colossus: The Forbin Project	Colossus: The Forbin Project	FALSE	1970	NA	100	Sci-Fi,Thriller	7.1	7860	<a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a>	Joseph Sargent	Eric Braeden; Susan Clark; Go…	NA	Country: USA	NA	NA	NA	NA	NA	NA	USA	NA	NA	NA	NA	0	0
tt0064285	movie	The Wild Child	L’enfant sauvage	FALSE	1970	NA	83	Drama	7.5	7238	<a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a>	François Truffaut	Jean-Pierre Cargol; François …	94	Country: France	NA	Opening Weekend USA: $11,206, 25 April 1999	Gross USA: $65,560	Cumulative Worldwide Gross: $65,560	NA	94	France	NA	11206	65560	65560	0	0
tt0064437	movie	The Honeymoon Killers	The Honeymoon Killers	FALSE	1970	NA	107	Crime,Drama,Romance	7.0	4229	<a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a>	Donald Volkman;Leonard Kastle	Shirley Stoler; Tony Lo Bianc…	NA	Country: USA	Budget:$150,000 (estimated)	NA	NA	NA	NA	NA	USA	1.5e+05	NA	NA	NA	0	0

Subproducts

With a few lines of codes, some useful datasets can be created.

List of different genres found in movies

list_genres <- movies %>%
  select(genres) %>%
  separate_rows(genres, sep = ",") %>%
  unique %>%
  arrange(genres)

List of different directors found in movies, sorted by name

list_directors <- movies %>%
  select(directors) %>%
  separate_rows(directors, sep = ";") %>%
  unique %>%
  arrange(directors)

List of different actors and actresses found in movies, sorted by name

list_cast <- movies %>%
  select(cast) %>%
  separate_rows(cast, sep = ";") %>%
  unique %>%
  arrange(cast)

Save Datasets

Finally, I’ll save the movies dataset and the subproducts list_genres, list_directors and list_cast in a folder named datasets.

# Create "datasets" folder if does not exist
if (!file.exists("datasets")) {
  dir.create("datasets")
}

# Save files
saveRDS(movies, "datasets/movies.rds")
saveRDS(list_genres, "datasets/list_genres.rds")
saveRDS(list_directors, "datasets/list_directors.rds")
saveRDS(list_cast, "datasets/list_cast.rds")

Download Created Datasets

You can download the created datasets. They are in .rds format.

Wikipedia contributors. 2020. “IMDb — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=IMDb&oldid=959154715.

Creating a Movie Dataset with R

Load Packages

IMDb Datasets

Download

Details

Description, Load and Inspection

`title.basics.tsv.gz`

`title.ratings.tsv.gz`

`title.crew.tsv.gz`

`name.basics.tsv.gz`

Movies Dataset - Part I

Details

Details

Movies Dataset - Part II

Web scraping IMDb

Details

Vector Assignment

Details

Add columns

Movies Dataset - Part III

Metascore

Country

Box Office Information

Awards information

Subproducts

Save Datasets

Download Created Datasets

Ivan Millanes

Analytics & AI Consultant