Creating a Movie Dataset with R

In this post, I’ll show how I used R to create a movies dataset. I worked with IMDb (Internet Movie Database) data files available for download and information gathered through web scraping their site.

Load Packages

The following packages need to be installed and loaded in order to run the code written in this post.

library(tidyverse)
library(parallel)
library(rvest)
library(knitr)
library(kableExtra)

IMDb Datasets

IMDb (Internet Movie Database) is an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews (Wikipedia contributors 2020).

Subsets of IMDb data are available for access to customers for personal and non-commercial use.

Download

I’ll download the following datasets:

  • title.basics.tsv.gz
  • title.ratings.tsv.gz
  • title.crew.tsv.gz
  • name.basics.tsv.gz

I’ll provide more information about them in the following sections.

Details

The following code creates a folder named data if it doesn’t already exist where I’ll place the downloaded files. For the purpose of this post, the download of each file will proceed if the file hasn’t been downloaded yet. You should take into account that the downloaded files may change depending on when they are downloaded (as new information is available every day).

# Create "data" folder if does not exist
if (!file.exists("data")) {
  dir.create("data")
}

# Download the files from IMDb if they haven't been downloaded
if (!file.exists("data/title.basics.tsv.gz")) {
download.file("https://datasets.imdbws.com/title.basics.tsv.gz",
              destfile = "data/title.basics.tsv.gz")
}

if (!file.exists("data/title.ratings.tsv.gz")) {
download.file("https://datasets.imdbws.com/title.ratings.tsv.gz",
              destfile = "data/title.ratings.tsv.gz")
}

if (!file.exists("data/title.crew.tsv.gz")) {
download.file("https://datasets.imdbws.com/title.crew.tsv.gz",
              destfile = "data/title.crew.tsv.gz")
}

if (!file.exists("data/name.basics.tsv.gz")) {
download.file("https://datasets.imdbws.com/name.basics.tsv.gz",
              destfile = "data/name.basics.tsv.gz")
}

Description, Load and Inspection

In this section I’ll describe what variables are included in each dataset, load them into R and take a look at their first few rows.

I created the following function to set the column width of a variable.

# Function to set kable column width
col_width <- function(kable_input, data, col, width_min) {
  column_spec(kable_input = kable_input, 
              column = which( colnames(data) == enexpr(col) ), 
              width_min = width_min)
}

title.basics.tsv.gz

This dataset contains the following information:

  • tconst (string): alphanumeric unique identifier of the title
  • titleType (string): the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
  • primaryTitle (string): the more popular title / the title used by the filmmakers on promotional materials at the point of release
  • originalTitle (string): original title, in the original language
  • isAdult (boolean):
    • 0: non-adult title;
    • 1: adult title
  • startYear (YYYY): represents the release year of a title. In the case of TV Series, it is the series start year
  • endYear (YYYY): TV Series end year. ‘\N’ for all other title types
  • runtimeMinutes: primary runtime of the title, in minutes
  • genres (string array): includes up to three genres associated with the title

I load the data.

# Load title.basics
title.basics <- gzfile("data/title.basics.tsv.gz") %>%
      read_delim(
            "\t", 
            escape_double = FALSE, 
            na = "\\N", 
            trim_ws = TRUE, 
            quote='',
            col_types = cols(
                  tconst = col_character(), 
                  titleType = col_character(),
                  primaryTitle = col_character(),
                  originalTitle = col_character(),
                  isAdult = col_logical(),
                  startYear = col_integer(),
                  endYear = col_integer(),                 
                  runtimeMinutes = col_integer(), 
                  genres = col_character()
                  )
            )

This is how this dataset looks like.

title.basics %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(title.basics, tconst, "3em") %>%
  col_width(title.basics, titleType, "3em") %>%
  col_width(title.basics, primaryTitle, "18em") %>%
  col_width(title.basics, originalTitle, "18em") %>%
  col_width(title.basics, isAdult, "3em") %>%
  col_width(title.basics, startYear, "3em") %>%
  col_width(title.basics, endYear, "3em") %>%
  col_width(title.basics, runtimeMinutes, "4em") %>%
  col_width(title.basics, genres, "5em")
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
tt0000001 short Carmencita Carmencita FALSE 1894 NA 1 Documentary,Short
tt0000002 short Le clown et ses chiens Le clown et ses chiens FALSE 1892 NA 5 Animation,Short
tt0000003 short Pauvre Pierrot Pauvre Pierrot FALSE 1892 NA 4 Animation,Comedy,Romance
tt0000004 short Un bon bock Un bon bock FALSE 1892 NA 12 Animation,Short
tt0000005 short Blacksmith Scene Blacksmith Scene FALSE 1893 NA 1 Comedy,Short
tt0000006 short Chinese Opium Den Chinese Opium Den FALSE 1894 NA 1 Short

title.ratings.tsv.gz

This dataset contains the following information:

  • tconst (string): alphanumeric unique identifier of the title
  • averageRating: weighted average of all the individual user ratings
  • numVotes - number of votes the title has received

I load the data.

# Load title.ratings
title.ratings <- gzfile("data/title.ratings.tsv.gz") %>%
      read_delim("\t", 
                 escape_double = FALSE, 
                 na = "\\N", 
                 trim_ws = TRUE, 
                 quote='',
                 col_types = cols(
                       tconst = col_character(), 
                       averageRating = col_double(),
                       numVotes = col_integer()
                       )
                 )

This is how this dataset looks like.

title.ratings %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(title.ratings, tconst, "3em") %>%
  col_width(title.ratings, averageRating, "3em") %>%
  col_width(title.ratings, numVotes, "3em")
tconst averageRating numVotes
tt0000001 5.6 1618
tt0000002 6.0 198
tt0000003 6.5 1298
tt0000004 6.1 122
tt0000005 6.1 2077
tt0000006 5.1 112

title.crew.tsv.gz

This dataset contains the following information:

  • tconst (string): alphanumeric unique identifier of the title
  • directors (array of nconsts): director(s) of the given title
  • writers (array of nconsts): writer(s) of the given title

In this post, I’m not interested in writers information, so I won’t be loading that column.

I load the data.

# Load title.crew, skipping writers column
title.crew <- gzfile("data/title.crew.tsv.gz") %>%
      read_delim("\t", 
              escape_double = FALSE, 
              na = "\\N", 
              trim_ws = TRUE, 
              quote='',
              col_types = cols(
                 tconst = col_character(),
                 directors = col_character(),
                 writers = col_skip()
                 )
              )

This is how this dataset looks like.

title.crew %>%
  head(8) %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(title.crew, directors, "14em")
tconst directors
tt0000001 nm0005690
tt0000002 nm0721526
tt0000003 nm0721526
tt0000004 nm0721526
tt0000005 nm0005690
tt0000006 nm0005690
tt0000007 nm0374658,nm0005690
tt0000008 nm0005690

name.basics.tsv.gz

This dataset contains the following information:

  • nconst (string) - alphanumeric unique identifier of the name/person
  • primaryName (string): name by which the person is most often credited
  • birthYear: in YYYY format
  • deathYear: in YYYY format if applicable, else ‘\N’
  • primaryProfession (array of strings): the top-3 professions of the person
  • knownForTitles (array of tconsts): titles the person is known for

From this dataset, I’ll only use the first two columns (nconst and primaryName)

I load the data.

# Load name.basics
name.basics <- gzfile("data/name.basics.tsv.gz") %>%
      read_delim("\t", 
              escape_double = FALSE, 
              na = "\\N", 
              trim_ws = TRUE, 
              quote='',
              col_types = cols(
                 nconst = col_character(),
                 primaryName = col_character(),
                 birthYear = col_skip(),
                 deathYear = col_skip(),
                 primaryProfession = col_skip(),
                 knownForTitles = col_skip()
                 )
              )

This is how this dataset looks like.

name.basics %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(name.basics, nconst, "3em") %>%
  col_width(name.basics, primaryName, "10em")
nconst primaryName
nm0000001 Fred Astaire
nm0000002 Lauren Bacall
nm0000003 Brigitte Bardot
nm0000004 John Belushi
nm0000005 Ingmar Bergman
nm0000006 Ingrid Bergman

I’ll use this dataset to replace the name codes found in title.crew.

Movies Dataset - Part I

Let’s start the creation of our movies dataset.

In Part I, I’ll work with the datasets downloaded in the previous section to create a movies dataset that will contain information about movies with the following characteristics:

  • have a release year (startYear) greater than or equal to 1970
  • have a runtime (runtimeMinutes) between 45m and 3h30min (210 min)
  • have an average rating (averageRating) greater than or equal to 4 on IMDb
  • at least 2500 votes (numVotes) on IMDb

The following code takes title.basics, applies some filters, adds title.ratings information, applies some filters again and finally, creates a new column, linkTitle, which contains the html code necessary to link a movie to its IMDb website.

Details

This is what the code does:

  • Begin with title.basics dataset.
  • Filter titles with:
    • titleType == "movie",
    • startYear >= 1970
    • runtimeMinutes >= 45 & runtimeMinutes <= 210.
  • Add ratings information from title.ratings.
  • Delete duplicate titles.
  • Filter titles with:
    • averageRating >= 4,
    • numVotes >= 2500
  • Create linkTitle column.

If you want to change any of the parameters I defined in the filters, you can do it!

# Creation of the movies dataset
movies <- title.basics %>%
  
  # Apply filters
      filter(
            titleType == "movie",
            startYear >= 1970,
            runtimeMinutes >= 45 & runtimeMinutes <= 210
            ) %>%
  
  # Add ratings and votes
  left_join(title.ratings, by = "tconst") %>%
  
  # Remove duplicates
      distinct(tconst, .keep_all = TRUE) %>%
  
  # Apply filters
      filter(
            averageRating >= 4,
            numVotes >= 2500
      ) %>%
  # Create linkTitle
   mutate(
     linkTitle = str_c(
       "<a href='https://www.imdb.com/title/", tconst, "/'>", originalTitle,"</a>")
     )

Let’s take a look at our dataset so far.

movies %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(movies, tconst, "3em") %>%
  col_width(movies, titleType, "3em") %>%
  col_width(movies, primaryTitle, "18em") %>%
  col_width(movies, originalTitle, "18em") %>%
  col_width(movies, isAdult, "3em") %>%
  col_width(movies, startYear, "3em") %>%
  col_width(movies, endYear, "3em") %>%
  col_width(movies, runtimeMinutes, "4em") %>%
  col_width(movies, genres, "5em") %>%
  col_width(movies, averageRating, "3em") %>%
  col_width(movies, numVotes, "3em") %>%
  col_width(movies, linkTitle, "45em")
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes linkTitle
tt0035423 movie Kate & Leopold Kate & Leopold FALSE 2001 NA 118 Comedy,Fantasy,Romance 6.4 77127 <a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a>
tt0064106 movie Le Boucher Le boucher FALSE 1970 NA 93 Drama,Thriller 7.5 7348 <a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a>
tt0064165 movie The Things of Life Les choses de la vie FALSE 1970 NA 89 Drama,Romance 7.6 3664 <a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a>
tt0064177 movie Colossus: The Forbin Project Colossus: The Forbin Project FALSE 1970 NA 100 Sci-Fi,Thriller 7.1 7860 <a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a>
tt0064285 movie The Wild Child L’enfant sauvage FALSE 1970 NA 83 Drama 7.5 7238 <a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a>
tt0064437 movie The Honeymoon Killers The Honeymoon Killers FALSE 1970 NA 107 Crime,Drama,Romance 7.0 4229 <a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a>

To finish Part I, I’ll add directors information to our movies dataset.

This would have been pretty straightforward if title.crew had directors’ names instead of codes. As that’s not the case, some preprocessing is going to be necessary. Also, the fact that when a title has more than one director they are comma-separated in the same cell adds certain complexity to the task of replacing codes with names. But be not afraid, because the tidyverse is going to help us with this task!

The following code applies the desired change to title.crew.

Details

This is what the code does:

  • Begin with title.crew dataset.
  • Keep crew that corresponds to titles in movies to speed up calculations
  • Go from a dataset where we have one row per title with comma-separated directors codes, to a dataset where we have one row per director code, indicating the title they directed. This step creates a longer dataset which will have duplicate tconst values for titles with more than one director.
  • Add the name by directors’ code.
  • Return to a dataset with one row per title and comma-separated named directors.
title.crew <- title.crew %>%
  
  # Keep crew from titles in movies
   semi_join(movies, "tconst") %>%
  
  # One row for each director
   separate_rows(directors, sep = ",") %>%
  
  # Add name
   left_join(name.basics, by = c("directors" = "nconst")) %>%
  
  # Return to semicolon separated values
   group_by(tconst) %>%
   summarise(directors = str_c(primaryName, collapse = ";")) %>%
   ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)

Let’s see how title.crew looks like now.

title.crew %>%
  head(8) %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(title.crew, directors, "18em")
tconst directors
tt0035423 James Mangold
tt0064106 Claude Chabrol
tt0064165 Claude Sautet
tt0064177 Joseph Sargent
tt0064285 François Truffaut
tt0064437 Donald Volkman;Leonard Kastle
tt0064451 King Hu
tt0064791 René Clément

Now we can add this to movies.

# Add the column to movies
movies <- movies %>%
   left_join(title.crew, by = "tconst")

This is how movies looks like in the end of Part I.

movies %>%
  head %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(movies, tconst, "3em") %>%
  col_width(movies, titleType, "3em") %>%
  col_width(movies, primaryTitle, "18em") %>%
  col_width(movies, originalTitle, "18em") %>%
  col_width(movies, isAdult, "3em") %>%
  col_width(movies, startYear, "3em") %>%
  col_width(movies, endYear, "3em") %>%
  col_width(movies, runtimeMinutes, "4em") %>%
  col_width(movies, genres, "5em") %>%
  col_width(movies, averageRating, "3em") %>%
  col_width(movies, numVotes, "3em") %>%
  col_width(movies, linkTitle, "45em") %>%
  col_width(movies, directors, "18em")
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes linkTitle directors
tt0035423 movie Kate & Leopold Kate & Leopold FALSE 2001 NA 118 Comedy,Fantasy,Romance 6.4 77127 <a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a> James Mangold
tt0064106 movie Le Boucher Le boucher FALSE 1970 NA 93 Drama,Thriller 7.5 7348 <a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a> Claude Chabrol
tt0064165 movie The Things of Life Les choses de la vie FALSE 1970 NA 89 Drama,Romance 7.6 3664 <a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a> Claude Sautet
tt0064177 movie Colossus: The Forbin Project Colossus: The Forbin Project FALSE 1970 NA 100 Sci-Fi,Thriller 7.1 7860 <a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a> Joseph Sargent
tt0064285 movie The Wild Child L’enfant sauvage FALSE 1970 NA 83 Drama 7.5 7238 <a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a> François Truffaut
tt0064437 movie The Honeymoon Killers The Honeymoon Killers FALSE 1970 NA 107 Crime,Drama,Romance 7.0 4229 <a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a> Donald Volkman;Leonard Kastle

Movies Dataset - Part II

So far so good. We used the downloaded files to create a movies dataset with a lot of information about movies with certain characteristics. But what if we want even more information?

Web scraping IMDb

If we take a look at any movie website in IMDb (for example, this one), we see that there is a lot more information related to that movie. I’m particularly interested in the following:

  • Metascore
  • The place or places where the production companies for that title are based, and therefore where the financing originated (Country)
  • Estimated production cost of the movie (Budget)
  • How much the movie took at the box office in the USA in its first weekend of release (Opening Weekend USA)
  • How much the movie made at the box office in USA (Gross USA)
  • How much the movie made at the box office all over the world, including the USA (Worldwide Gross)
  • Awards information
  • Full cast

How can we add that information to our dataset? Now is when web scraping joins the party. I’ll use the rvest package to extract the information mentioned above for each movie (when available). Also, as we have to do this for 16191 movies, I’ll be using the parallel package to speed up the work. Even with this approach, the code took around one hour and a half to run.

The following code creates an object named scraping that contains the information we want for each movie. Some preprocessing is going to be necessary before adding new columns to movies, but I’ll talk about it later.

Details

I had to retrieve information for a lot of movies, and look for it in two different websites (because full cast data wasn’t in the same title page as the other information). For that reason, I decided to run my code in parallel.

The first part of the code sets up clusters to run the work in parallel. I defined the number of cores to be used equal to the number of cores in my computer (it has eight cores) minus one, because that’s what is recommended. So I created seven clusters. Remember to stop the clusters when you finish the work.

I load the necessary libraries to run the code in each cluster by using the clusterEvalQ function. I assigned the result of that function to junk to prevent the print of some output.

Then we have the extraction function itself. The difference is that instead of using apply, we use parLapply.

The object scraping consists of a list of elements with length equal to the number of movies scraped, and each element of this list is a list of elements (the returned my_list object for each movie).

The scraping part is somewhat straightforward. We read the html, we select nodes and extract the text.

There is one final consideration regarding the scraping of box office information. Based on how the information is structured on IMDb, in this step I extracted more data than necessary. For each movie, the scrap_boxOffice object is a character vector whose length may vary from movie to movie (because not all movies have all data available) and each element of that vector contains different data. I’ll manage to keep the information I want and assign it to the corresponding vector in the next section.

Even though I tried to be as clear as possible, this might be a little confusing, but I hope that by inspecting the code you’ll get what is doing!

# Web scraping more information about movies

# Set the cluster to work in parallel
n_cores <- detectCores() - 1
clust <- makeCluster(n_cores)

# Load packages on each cluster
# Obs: library() returns a character vector of all the attached packages, but
# it does so invisibly. When executed via clusterEvalQ, it isn't invisible.
# To make it "invisible" again, I assigned the result of clusterEvalQ to junk. 

junk <- clusterEvalQ(clust, {
  library(dplyr)
  library(stringr)
  library(rvest)
})

# Create scraping object by running a parallel apply
scraping <- parLapply(clust, movies$tconst, function(x){
  
  # Read title page
  title_page <- read_html(str_c("https://www.imdb.com/title/", x)) 
  
  # Scrape metascore
  scrap_metascore <- title_page %>%
    html_nodes(".metacriticScore span") %>%
    html_text
  
  # Scrape box office (and country)
  scrap_boxOffice <- title_page %>%
    html_nodes(".txt-block") %>%
    html_text
  
  # Scrape awards
  scrap_awards <- title_page %>%
    html_nodes(".awards-blurb b") %>%
    html_text
  
  
  # Read cast page
  cast_page <- read_html(
    str_c("https://www.imdb.com/title/", x, "/fullcredits?ref_=tt_cl_sm#cast")
    )
  
  # Scrape cast
  scrap_cast <- cast_page %>%
    html_nodes(".primary_photo+ td a") %>%
    html_text %>%
    str_c(collapse = ";") %>%
    str_replace_all("\\n", "")
  
  # Output for each movie
  my_list <- list("scrap_cast" = scrap_cast,
                  "scrap_metascore" = scrap_metascore,
                  "scrap_boxOffice" = scrap_boxOffice,
                  "scrap_awards" = scrap_awards)
  return(my_list)
})

# Stop the cluster
stopCluster(clust)

Vector Assignment

The following code uses the scraping object to create the vectors of data that will be used to add columns to movies.

Details

This is what the code does:

  • Preallocates the corresponding vectors
  • Creates no_lenght_zero function to solve an issue realted to how data is scraped when there is no information
  • Fills preallocated vectors

boxOffice_list is a character vector where each element refers to a different type of data. It is somewhat structured though, so I used a combination of keep and str_detect functions to assign each target data to the corresponding vector.

# Number of movies
n_movies <- nrow(movies)

# Preallocate vectors
scrap_cast <- vector("character", length = n_movies)

scrap_metascore <- vector("character", length = n_movies)

scrap_country <- vector("character", length = n_movies)
scrap_budget <- vector("character", length = n_movies)
scrap_open_usa <- vector("character", length = n_movies)
scrap_gross_usa <- vector("character", length = n_movies)
scrap_gross_world <- vector("character", length = n_movies)

scrap_awards <- vector("character", length = n_movies)

# Function to fix length zero problem
no_lenght_zero <- function(x){
  ifelse(identical(x, character(0)), NA_character_, x)
}

# Assign extracted info
for (i in seq_along(scraping)){
  
  scrap_cast[[i]] <- scraping[[i]][["scrap_cast"]] %>%
    no_lenght_zero
  
  scrap_metascore[[i]] <- scraping[[i]][["scrap_metascore"]] %>%
    no_lenght_zero
  
  boxOffice_list <- scraping[[i]][["scrap_boxOffice"]]
  
  scrap_country[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Country:")) %>%
    no_lenght_zero
  scrap_budget[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Budget:")) %>%
    no_lenght_zero
  scrap_open_usa[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Opening Weekend USA:")) %>%
    no_lenght_zero
  scrap_gross_usa[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Gross USA:")) %>%
    no_lenght_zero
  scrap_gross_world[[i]] <- boxOffice_list %>%
    keep(~ str_detect(.x, "Cumulative Worldwide Gross:")) %>%
    no_lenght_zero
  
  scrap_awards[[i]] <- scraping[[i]][["scrap_awards"]] %>%
    no_lenght_zero
}

Add columns

Finally, we can use mutate to add the new columns to movies.

# Add scraped columns to data
movies <- movies %>%
  mutate(cast = scrap_cast,
         metascore_raw = scrap_metascore,
         country_raw = scrap_country,
         budget_raw = scrap_budget,
         open_usa_raw = scrap_open_usa,
         gross_usa_raw = scrap_gross_usa,
         gross_world_raw = scrap_gross_world,
         awards_raw = scrap_awards)

This is how movies looks like in the end of Part II.

Obs: I have taken the first 30 characters of cast and replace the rest with ... to make the table look better.

movies %>%
  head %>%
  mutate(cast = cast %>%
           str_sub(1, 30) %>%
           str_c("...")) %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(movies, tconst, "3em") %>%
  col_width(movies, titleType, "3em") %>%
  col_width(movies, primaryTitle, "18em") %>%
  col_width(movies, originalTitle, "18em") %>%
  col_width(movies, isAdult, "3em") %>%
  col_width(movies, startYear, "3em") %>%
  col_width(movies, endYear, "3em") %>%
  col_width(movies, runtimeMinutes, "4em") %>%
  col_width(movies, genres, "5em") %>%
  col_width(movies, averageRating, "3em") %>%
  col_width(movies, numVotes, "3em") %>%
  col_width(movies, linkTitle, "45em") %>%
  col_width(movies, directors, "18em") %>%
  col_width(movies, cast, "18em") %>%
  col_width(movies, metascore_raw, "3em") %>%
  col_width(movies, country_raw, "20em") %>%
  col_width(movies, budget_raw, "18em") %>%
  col_width(movies, open_usa_raw, "28em") %>%
  col_width(movies, gross_usa_raw, "18em") %>%
  col_width(movies, gross_world_raw, "25em") %>%
  col_width(movies, awards_raw, "18em")
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes linkTitle directors cast metascore_raw country_raw budget_raw open_usa_raw gross_usa_raw gross_world_raw awards_raw
tt0035423 movie Kate & Leopold Kate & Leopold FALSE 2001 NA 118 Comedy,Fantasy,Romance 6.4 77127 <a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a> James Mangold Meg Ryan; Hugh Jackman; Liev … 44 Country: USA Budget:$48,000,000 (estimated) Opening Weekend USA: $9,725,408, 30 December 2001 Gross USA: $47,121,859 Cumulative Worldwide Gross: $76,019,048 Nominated for 1 Oscar.
tt0064106 movie Le Boucher Le boucher FALSE 1970 NA 93 Drama,Thriller 7.5 7348 <a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a> Claude Chabrol Stéphane Audran; Jean Yanne; … NA Country: France | Italy NA NA Gross USA: $474,458 NA Nominated for 1 BAFTA Film Award.
tt0064165 movie The Things of Life Les choses de la vie FALSE 1970 NA 89 Drama,Romance 7.6 3664 <a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a> Claude Sautet Michel Piccoli; Romy Schneide… NA Country: Switzerland | France | Italy NA NA Gross USA: $5,063 Cumulative Worldwide Gross: $5,063 NA
tt0064177 movie Colossus: The Forbin Project Colossus: The Forbin Project FALSE 1970 NA 100 Sci-Fi,Thriller 7.1 7860 <a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a> Joseph Sargent Eric Braeden; Susan Clark; Go… NA Country: USA NA NA NA NA NA
tt0064285 movie The Wild Child L’enfant sauvage FALSE 1970 NA 83 Drama 7.5 7238 <a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a> François Truffaut Jean-Pierre Cargol; François … 94 Country: France NA Opening Weekend USA: $11,206, 25 April 1999 Gross USA: $65,560 Cumulative Worldwide Gross: $65,560 NA
tt0064437 movie The Honeymoon Killers The Honeymoon Killers FALSE 1970 NA 107 Crime,Drama,Romance 7.0 4229 <a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a> Donald Volkman;Leonard Kastle Shirley Stoler; Tony Lo Bianc… NA Country: USA Budget:$150,000 (estimated) NA NA NA NA

Movies Dataset - Part III

In Part III, I’ll clean the columns created in Part II.

Metascore

metascore_raw is a character column. I used parse_number to create metascore, which is numeric.

# metascore to number
movies <- movies %>%
  mutate(metascore = metascore_raw %>%
           parse_number)

Country

country_raw needs some cleaning that can be done with stringr::str_replace_all function.

movies <- movies %>%
  mutate(country = country_raw %>%
           str_replace_all("Country:", "") %>%
           str_replace_all("\\n", "") %>%
           str_replace_all(" ", "") %>%
           str_replace_all("\\|", ";")
)

Box Office Information

By taking a fast look at this “money” variables, it seemed that the currency was dollars ($). Just to be sure, I created the function check_currency with the following logic: if the sum of values where I find the dollar sign ($) and NA values of a column is equal to the total number of rows, then all currencies are the same (and dollars).

check_currency <- function(data, col) {
  n_movies <- nrow(data)
  missing <- data %>%
    pull(enexpr(col)) %>%
    is.na %>%
    sum
  
  dollar_sign <- data %>%
    pull(enexpr(col)) %>%
    str_detect("\\$") %>%
    sum(na.rm = TRUE)
  
  identical(missing + dollar_sign, n_movies)
}

After applying this function to the corresponding columns, we see that not all currencies are the same. Later I checked the dataset and saw in fact that this was the case.

check_currency(movies, budget_raw)
## [1] FALSE
check_currency(movies, open_usa_raw)
## [1] TRUE
check_currency(movies, gross_usa_raw)
## [1] FALSE
check_currency(movies, gross_world_raw)
## [1] FALSE

I decided to keep only those values express in dollars, assigning NA otherwise. For that matter, I created the keep_dollar function.

keep_dollar <- function(col) {
  ifelse(str_detect(col, "\\$"), col, NA_real_)
}

Now we can create our new variables, through parse_number. This function works just fine, even with columns where there were more numbers present (for example, take a look at open_usa_raw and open_usa)

movies <- movies %>%
  mutate(budget = keep_dollar(budget_raw) %>%
           parse_number,
         open_usa = keep_dollar(open_usa_raw) %>%
           parse_number,
         gross_usa = keep_dollar(gross_usa_raw) %>%
           parse_number,
         gross_world = keep_dollar(gross_world_raw) %>%
           parse_number)

Awards information

Using the awards_raw column I created a few variables.

For every movie:

  • won_oscar, an indicator variable that takes the value 1 if the movie won any Oscar and 0 otherwise
  • oscars_won, the number of Oscars won by a movie

For movies that didn’t win an Oscar:

  • nominated_oscar, an indicator variable that takes the value 1 if the movie was nominated for any Oscar and 0 otherwise
  • oscars_nominations, the number of Oscar nominations

Obs: If a movie won at least one Oscar, we don’t know how many nominations it had. This has to do with the way I scraped this data.

movies <- movies %>%
  mutate(
    
    won_oscar = ifelse(
      test = str_detect(awards_raw, "Won") & 
        str_detect(awards_raw, "Oscar") &
        !is.na(awards_raw), 
      yes = 1, 
      no = 0
      ),
    
    oscars_won = ifelse(
      test = won_oscar == 1, 
      yes = parse_number(awards_raw),
      no = 0
    ),
    
    nominated_oscar = ifelse(
      test = str_detect(awards_raw, "Nominated") & 
        str_detect(awards_raw, "Oscar") &
        !is.na(awards_raw), 
      yes = 1, 
      no = 0
    ),
    
    oscars_nominations = ifelse(
      test = nominated_oscar == 1, 
      yes = parse_number(awards_raw),
      no = 0
    )
)

The final dataset looks like this. I kept the raw columns for you to see the differences with the cleaned ones.

Obs: I have taken the first 30 characters of cast and replace the rest with ... to make the table look better.

movies %>%
  head %>%
  mutate(cast = cast %>%
           str_sub(1, 30) %>%
           str_c("...")) %>%
  kable(format = "html", table.attr = "style = \"color: white;\"") %>%
  col_width(movies, tconst, "3em") %>%
  col_width(movies, titleType, "3em") %>%
  col_width(movies, primaryTitle, "18em") %>%
  col_width(movies, originalTitle, "18em") %>%
  col_width(movies, isAdult, "3em") %>%
  col_width(movies, startYear, "3em") %>%
  col_width(movies, endYear, "3em") %>%
  col_width(movies, runtimeMinutes, "4em") %>%
  col_width(movies, genres, "5em") %>%
  col_width(movies, averageRating, "3em") %>%
  col_width(movies, numVotes, "3em") %>%
  col_width(movies, linkTitle, "45em") %>%
  col_width(movies, directors, "18em") %>%
  col_width(movies, cast, "18em") %>%
  col_width(movies, metascore_raw, "3em") %>%
  col_width(movies, country_raw, "20em") %>%
  col_width(movies, budget_raw, "18em") %>%
  col_width(movies, open_usa_raw, "28em") %>%
  col_width(movies, gross_usa_raw, "18em") %>%
  col_width(movies, gross_world_raw, "25em") %>%
  col_width(movies, awards_raw, "18em") %>%
  col_width(movies, metascore, "3em") %>%
  col_width(movies, country, "9em") %>%
  col_width(movies, budget, "7em") %>%
  col_width(movies, open_usa, "7em") %>%
  col_width(movies, gross_usa, "7em") %>%
  col_width(movies, gross_world, "7em") %>%
  col_width(movies, won_oscar, "10em") %>%
  col_width(movies, oscars_won, "10em") %>%
  col_width(movies, nominated_oscar, "10em") %>%
  col_width(movies, oscars_nominations, "10em")
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes linkTitle directors cast metascore_raw country_raw budget_raw open_usa_raw gross_usa_raw gross_world_raw awards_raw metascore country budget open_usa gross_usa gross_world won_oscar oscars_won nominated_oscar oscars_nominations
tt0035423 movie Kate & Leopold Kate & Leopold FALSE 2001 NA 118 Comedy,Fantasy,Romance 6.4 77127 <a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a> James Mangold Meg Ryan; Hugh Jackman; Liev … 44 Country: USA Budget:$48,000,000 (estimated) Opening Weekend USA: $9,725,408, 30 December 2001 Gross USA: $47,121,859 Cumulative Worldwide Gross: $76,019,048 Nominated for 1 Oscar. 44 USA 4.8e+07 9725408 47121859 76019048 0 0 1 1
tt0064106 movie Le Boucher Le boucher FALSE 1970 NA 93 Drama,Thriller 7.5 7348 <a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a> Claude Chabrol Stéphane Audran; Jean Yanne; … NA Country: France | Italy NA NA Gross USA: $474,458 NA Nominated for 1 BAFTA Film Award. NA France;Italy NA NA 474458 NA 0 0 0 0
tt0064165 movie The Things of Life Les choses de la vie FALSE 1970 NA 89 Drama,Romance 7.6 3664 <a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a> Claude Sautet Michel Piccoli; Romy Schneide… NA Country: Switzerland | France | Italy NA NA Gross USA: $5,063 Cumulative Worldwide Gross: $5,063 NA NA Switzerland;France;Italy NA NA 5063 5063 0 0 0 0
tt0064177 movie Colossus: The Forbin Project Colossus: The Forbin Project FALSE 1970 NA 100 Sci-Fi,Thriller 7.1 7860 <a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a> Joseph Sargent Eric Braeden; Susan Clark; Go… NA Country: USA NA NA NA NA NA NA USA NA NA NA NA 0 0 0 0
tt0064285 movie The Wild Child L’enfant sauvage FALSE 1970 NA 83 Drama 7.5 7238 <a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a> François Truffaut Jean-Pierre Cargol; François … 94 Country: France NA Opening Weekend USA: $11,206, 25 April 1999 Gross USA: $65,560 Cumulative Worldwide Gross: $65,560 NA 94 France NA 11206 65560 65560 0 0 0 0
tt0064437 movie The Honeymoon Killers The Honeymoon Killers FALSE 1970 NA 107 Crime,Drama,Romance 7.0 4229 <a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a> Donald Volkman;Leonard Kastle Shirley Stoler; Tony Lo Bianc… NA Country: USA Budget:$150,000 (estimated) NA NA NA NA NA USA 1.5e+05 NA NA NA 0 0 0 0

Subproducts

With a few lines of codes, some useful datasets can be created.

  • List of different genres found in movies
list_genres <- movies %>%
  select(genres) %>%
  separate_rows(genres, sep = ",") %>%
  unique %>%
  arrange(genres)
  • List of different directors found in movies, sorted by name
list_directors <- movies %>%
  select(directors) %>%
  separate_rows(directors, sep = ";") %>%
  unique %>%
  arrange(directors)
  • List of different actors and actresses found in movies, sorted by name
list_cast <- movies %>%
  select(cast) %>%
  separate_rows(cast, sep = ";") %>%
  unique %>%
  arrange(cast)

Save Datasets

Finally, I’ll save the movies dataset and the subproducts list_genres, list_directors and list_cast in a folder named datasets.

# Create "datasets" folder if does not exist
if (!file.exists("datasets")) {
  dir.create("datasets")
}

# Save files
saveRDS(movies, "datasets/movies.rds")
saveRDS(list_genres, "datasets/list_genres.rds")
saveRDS(list_directors, "datasets/list_directors.rds")
saveRDS(list_cast, "datasets/list_cast.rds")

Download Created Datasets

You can download the created datasets. They are in .rds format.

Wikipedia contributors. 2020. “IMDb — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=IMDb&oldid=959154715.

Ivan Millanes
Ivan Millanes
Analytics & AI Consultant

Analytics & AI Consultant at OZ Digital Consulting.