Creating a Movie Dataset with R
In this post, I’ll show how I used R to create a movies dataset. I worked with IMDb (Internet Movie Database) data files available for download and information gathered through web scraping their site.
Load Packages
The following packages need to be installed and loaded in order to run the code written in this post.
library(tidyverse)
library(parallel)
library(rvest)
library(knitr)
library(kableExtra)
IMDb Datasets
IMDb (Internet Movie Database) is an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews (Wikipedia contributors 2020).
Subsets of IMDb data are available for access to customers for personal and non-commercial use.
Download
I’ll download the following datasets:
title.basics.tsv.gz
title.ratings.tsv.gz
title.crew.tsv.gz
name.basics.tsv.gz
I’ll provide more information about them in the following sections.
Details
The following code creates a folder named data
if it doesn’t already exist where I’ll place the downloaded files. For the purpose of this post, the download of each file will proceed if the file hasn’t been downloaded yet. You should take into account that the downloaded files may change depending on when they are downloaded (as new information is available every day).
# Create "data" folder if does not exist
if (!file.exists("data")) {
dir.create("data")
}
# Download the files from IMDb if they haven't been downloaded
if (!file.exists("data/title.basics.tsv.gz")) {
download.file("https://datasets.imdbws.com/title.basics.tsv.gz",
destfile = "data/title.basics.tsv.gz")
}
if (!file.exists("data/title.ratings.tsv.gz")) {
download.file("https://datasets.imdbws.com/title.ratings.tsv.gz",
destfile = "data/title.ratings.tsv.gz")
}
if (!file.exists("data/title.crew.tsv.gz")) {
download.file("https://datasets.imdbws.com/title.crew.tsv.gz",
destfile = "data/title.crew.tsv.gz")
}
if (!file.exists("data/name.basics.tsv.gz")) {
download.file("https://datasets.imdbws.com/name.basics.tsv.gz",
destfile = "data/name.basics.tsv.gz")
}
Description, Load and Inspection
In this section I’ll describe what variables are included in each dataset, load them into R and take a look at their first few rows.
I created the following function to set the column width of a variable.
# Function to set kable column width
col_width <- function(kable_input, data, col, width_min) {
column_spec(kable_input = kable_input,
column = which( colnames(data) == enexpr(col) ),
width_min = width_min)
}
title.basics.tsv.gz
This dataset contains the following information:
tconst
(string): alphanumeric unique identifier of the titletitleType
(string): the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)primaryTitle
(string): the more popular title / the title used by the filmmakers on promotional materials at the point of releaseoriginalTitle
(string): original title, in the original languageisAdult
(boolean):0
: non-adult title;1
: adult title
startYear
(YYYY): represents the release year of a title. In the case of TV Series, it is the series start yearendYear
(YYYY): TV Series end year. ‘\N’ for all other title typesruntimeMinutes
: primary runtime of the title, in minutesgenres
(string array): includes up to three genres associated with the title
I load the data.
# Load title.basics
title.basics <- gzfile("data/title.basics.tsv.gz") %>%
read_delim(
"\t",
escape_double = FALSE,
na = "\\N",
trim_ws = TRUE,
quote='',
col_types = cols(
tconst = col_character(),
titleType = col_character(),
primaryTitle = col_character(),
originalTitle = col_character(),
isAdult = col_logical(),
startYear = col_integer(),
endYear = col_integer(),
runtimeMinutes = col_integer(),
genres = col_character()
)
)
This is how this dataset looks like.
title.basics %>%
head %>%
kable(format = "html", table.attr = "style = \"color: white;\"") %>%
col_width(title.basics, tconst, "3em") %>%
col_width(title.basics, titleType, "3em") %>%
col_width(title.basics, primaryTitle, "18em") %>%
col_width(title.basics, originalTitle, "18em") %>%
col_width(title.basics, isAdult, "3em") %>%
col_width(title.basics, startYear, "3em") %>%
col_width(title.basics, endYear, "3em") %>%
col_width(title.basics, runtimeMinutes, "4em") %>%
col_width(title.basics, genres, "5em")
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
---|---|---|---|---|---|---|---|---|
tt0000001 | short | Carmencita | Carmencita | FALSE | 1894 | NA | 1 | Documentary,Short |
tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | FALSE | 1892 | NA | 5 | Animation,Short |
tt0000003 | short | Pauvre Pierrot | Pauvre Pierrot | FALSE | 1892 | NA | 4 | Animation,Comedy,Romance |
tt0000004 | short | Un bon bock | Un bon bock | FALSE | 1892 | NA | 12 | Animation,Short |
tt0000005 | short | Blacksmith Scene | Blacksmith Scene | FALSE | 1893 | NA | 1 | Comedy,Short |
tt0000006 | short | Chinese Opium Den | Chinese Opium Den | FALSE | 1894 | NA | 1 | Short |
title.ratings.tsv.gz
This dataset contains the following information:
tconst
(string): alphanumeric unique identifier of the titleaverageRating
: weighted average of all the individual user ratingsnumVotes
- number of votes the title has received
I load the data.
# Load title.ratings
title.ratings <- gzfile("data/title.ratings.tsv.gz") %>%
read_delim("\t",
escape_double = FALSE,
na = "\\N",
trim_ws = TRUE,
quote='',
col_types = cols(
tconst = col_character(),
averageRating = col_double(),
numVotes = col_integer()
)
)
This is how this dataset looks like.
title.ratings %>%
head %>%
kable(format = "html", table.attr = "style = \"color: white;\"") %>%
col_width(title.ratings, tconst, "3em") %>%
col_width(title.ratings, averageRating, "3em") %>%
col_width(title.ratings, numVotes, "3em")
tconst | averageRating | numVotes |
---|---|---|
tt0000001 | 5.6 | 1618 |
tt0000002 | 6.0 | 198 |
tt0000003 | 6.5 | 1298 |
tt0000004 | 6.1 | 122 |
tt0000005 | 6.1 | 2077 |
tt0000006 | 5.1 | 112 |
title.crew.tsv.gz
This dataset contains the following information:
tconst
(string): alphanumeric unique identifier of the titledirectors
(array of nconsts): director(s) of the given titlewriters
(array of nconsts): writer(s) of the given title
In this post, I’m not interested in writers
information, so I won’t be loading that column.
I load the data.
# Load title.crew, skipping writers column
title.crew <- gzfile("data/title.crew.tsv.gz") %>%
read_delim("\t",
escape_double = FALSE,
na = "\\N",
trim_ws = TRUE,
quote='',
col_types = cols(
tconst = col_character(),
directors = col_character(),
writers = col_skip()
)
)
This is how this dataset looks like.
title.crew %>%
head(8) %>%
kable(format = "html", table.attr = "style = \"color: white;\"") %>%
col_width(title.crew, directors, "14em")
tconst | directors |
---|---|
tt0000001 | nm0005690 |
tt0000002 | nm0721526 |
tt0000003 | nm0721526 |
tt0000004 | nm0721526 |
tt0000005 | nm0005690 |
tt0000006 | nm0005690 |
tt0000007 | nm0374658,nm0005690 |
tt0000008 | nm0005690 |
name.basics.tsv.gz
This dataset contains the following information:
nconst
(string) - alphanumeric unique identifier of the name/personprimaryName
(string): name by which the person is most often creditedbirthYear
: in YYYY formatdeathYear
: in YYYY format if applicable, else ‘\N’primaryProfession
(array of strings): the top-3 professions of the personknownForTitles
(array of tconsts): titles the person is known for
From this dataset, I’ll only use the first two columns (nconst
and primaryName
)
I load the data.
# Load name.basics
name.basics <- gzfile("data/name.basics.tsv.gz") %>%
read_delim("\t",
escape_double = FALSE,
na = "\\N",
trim_ws = TRUE,
quote='',
col_types = cols(
nconst = col_character(),
primaryName = col_character(),
birthYear = col_skip(),
deathYear = col_skip(),
primaryProfession = col_skip(),
knownForTitles = col_skip()
)
)
This is how this dataset looks like.
name.basics %>%
head %>%
kable(format = "html", table.attr = "style = \"color: white;\"") %>%
col_width(name.basics, nconst, "3em") %>%
col_width(name.basics, primaryName, "10em")
nconst | primaryName |
---|---|
nm0000001 | Fred Astaire |
nm0000002 | Lauren Bacall |
nm0000003 | Brigitte Bardot |
nm0000004 | John Belushi |
nm0000005 | Ingmar Bergman |
nm0000006 | Ingrid Bergman |
I’ll use this dataset to replace the name codes found in title.crew
.
Movies Dataset - Part I
Let’s start the creation of our movies
dataset.
In Part I, I’ll work with the datasets downloaded in the previous section to create a movies
dataset that will contain information about movies with the following characteristics:
- have a release year (
startYear
) greater than or equal to 1970 - have a runtime (
runtimeMinutes
) between 45m and 3h30min (210 min) - have an average rating (
averageRating
) greater than or equal to 4 on IMDb - at least 2500 votes (
numVotes
) on IMDb
The following code takes title.basics
, applies some filters, adds title.ratings
information, applies some filters again and finally, creates a new column, linkTitle
, which contains the html code necessary to link a movie to its IMDb website.
Details
This is what the code does:
- Begin with
title.basics
dataset. - Filter titles with:
titleType == "movie"
,startYear >= 1970
runtimeMinutes >= 45 & runtimeMinutes <= 210
.
- Add ratings information from
title.ratings
. - Delete duplicate titles.
- Filter titles with:
averageRating >= 4
,numVotes >= 2500
- Create
linkTitle
column.
If you want to change any of the parameters I defined in the filters, you can do it!
# Creation of the movies dataset
movies <- title.basics %>%
# Apply filters
filter(
titleType == "movie",
startYear >= 1970,
runtimeMinutes >= 45 & runtimeMinutes <= 210
) %>%
# Add ratings and votes
left_join(title.ratings, by = "tconst") %>%
# Remove duplicates
distinct(tconst, .keep_all = TRUE) %>%
# Apply filters
filter(
averageRating >= 4,
numVotes >= 2500
) %>%
# Create linkTitle
mutate(
linkTitle = str_c(
"<a href='https://www.imdb.com/title/", tconst, "/'>", originalTitle,"</a>")
)
Let’s take a look at our dataset so far.
movies %>%
head %>%
kable(format = "html", table.attr = "style = \"color: white;\"") %>%
col_width(movies, tconst, "3em") %>%
col_width(movies, titleType, "3em") %>%
col_width(movies, primaryTitle, "18em") %>%
col_width(movies, originalTitle, "18em") %>%
col_width(movies, isAdult, "3em") %>%
col_width(movies, startYear, "3em") %>%
col_width(movies, endYear, "3em") %>%
col_width(movies, runtimeMinutes, "4em") %>%
col_width(movies, genres, "5em") %>%
col_width(movies, averageRating, "3em") %>%
col_width(movies, numVotes, "3em") %>%
col_width(movies, linkTitle, "45em")
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | linkTitle |
---|---|---|---|---|---|---|---|---|---|---|---|
tt0035423 | movie | Kate & Leopold | Kate & Leopold | FALSE | 2001 | NA | 118 | Comedy,Fantasy,Romance | 6.4 | 77127 | <a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a> |
tt0064106 | movie | Le Boucher | Le boucher | FALSE | 1970 | NA | 93 | Drama,Thriller | 7.5 | 7348 | <a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a> |
tt0064165 | movie | The Things of Life | Les choses de la vie | FALSE | 1970 | NA | 89 | Drama,Romance | 7.6 | 3664 | <a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a> |
tt0064177 | movie | Colossus: The Forbin Project | Colossus: The Forbin Project | FALSE | 1970 | NA | 100 | Sci-Fi,Thriller | 7.1 | 7860 | <a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a> |
tt0064285 | movie | The Wild Child | L’enfant sauvage | FALSE | 1970 | NA | 83 | Drama | 7.5 | 7238 | <a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a> |
tt0064437 | movie | The Honeymoon Killers | The Honeymoon Killers | FALSE | 1970 | NA | 107 | Crime,Drama,Romance | 7.0 | 4229 | <a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a> |
To finish Part I, I’ll add directors
information to our movies
dataset.
This would have been pretty straightforward if title.crew
had directors’ names instead of codes. As that’s not the case, some preprocessing is going to be necessary. Also, the fact that when a title has more than one director they are comma-separated in the same cell adds certain complexity to the task of replacing codes with names. But be not afraid, because the tidyverse
is going to help us with this task!
The following code applies the desired change to title.crew
.
Details
This is what the code does:
- Begin with
title.crew
dataset. - Keep crew that corresponds to titles in
movies
to speed up calculations - Go from a dataset where we have one row per title with comma-separated directors codes, to a dataset where we have one row per director code, indicating the title they directed. This step creates a longer dataset which will have duplicate
tconst
values for titles with more than one director. - Add the name by directors’ code.
- Return to a dataset with one row per title and comma-separated named directors.
title.crew <- title.crew %>%
# Keep crew from titles in movies
semi_join(movies, "tconst") %>%
# One row for each director
separate_rows(directors, sep = ",") %>%
# Add name
left_join(name.basics, by = c("directors" = "nconst")) %>%
# Return to semicolon separated values
group_by(tconst) %>%
summarise(directors = str_c(primaryName, collapse = ";")) %>%
ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
Let’s see how title.crew
looks like now.
title.crew %>%
head(8) %>%
kable(format = "html", table.attr = "style = \"color: white;\"") %>%
col_width(title.crew, directors, "18em")
tconst | directors |
---|---|
tt0035423 | James Mangold |
tt0064106 | Claude Chabrol |
tt0064165 | Claude Sautet |
tt0064177 | Joseph Sargent |
tt0064285 | François Truffaut |
tt0064437 | Donald Volkman;Leonard Kastle |
tt0064451 | King Hu |
tt0064791 | René Clément |
Now we can add this to movies
.
# Add the column to movies
movies <- movies %>%
left_join(title.crew, by = "tconst")
This is how movies
looks like in the end of Part I.
movies %>%
head %>%
kable(format = "html", table.attr = "style = \"color: white;\"") %>%
col_width(movies, tconst, "3em") %>%
col_width(movies, titleType, "3em") %>%
col_width(movies, primaryTitle, "18em") %>%
col_width(movies, originalTitle, "18em") %>%
col_width(movies, isAdult, "3em") %>%
col_width(movies, startYear, "3em") %>%
col_width(movies, endYear, "3em") %>%
col_width(movies, runtimeMinutes, "4em") %>%
col_width(movies, genres, "5em") %>%
col_width(movies, averageRating, "3em") %>%
col_width(movies, numVotes, "3em") %>%
col_width(movies, linkTitle, "45em") %>%
col_width(movies, directors, "18em")
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | linkTitle | directors |
---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0035423 | movie | Kate & Leopold | Kate & Leopold | FALSE | 2001 | NA | 118 | Comedy,Fantasy,Romance | 6.4 | 77127 | <a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a> | James Mangold |
tt0064106 | movie | Le Boucher | Le boucher | FALSE | 1970 | NA | 93 | Drama,Thriller | 7.5 | 7348 | <a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a> | Claude Chabrol |
tt0064165 | movie | The Things of Life | Les choses de la vie | FALSE | 1970 | NA | 89 | Drama,Romance | 7.6 | 3664 | <a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a> | Claude Sautet |
tt0064177 | movie | Colossus: The Forbin Project | Colossus: The Forbin Project | FALSE | 1970 | NA | 100 | Sci-Fi,Thriller | 7.1 | 7860 | <a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a> | Joseph Sargent |
tt0064285 | movie | The Wild Child | L’enfant sauvage | FALSE | 1970 | NA | 83 | Drama | 7.5 | 7238 | <a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a> | François Truffaut |
tt0064437 | movie | The Honeymoon Killers | The Honeymoon Killers | FALSE | 1970 | NA | 107 | Crime,Drama,Romance | 7.0 | 4229 | <a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a> | Donald Volkman;Leonard Kastle |
Movies Dataset - Part II
So far so good. We used the downloaded files to create a movies
dataset with a lot of information about movies with certain characteristics. But what if we want even more information?
Web scraping IMDb
If we take a look at any movie website in IMDb (for example, this one), we see that there is a lot more information related to that movie. I’m particularly interested in the following:
- Metascore
- The place or places where the production companies for that title are based, and therefore where the financing originated (Country)
- Estimated production cost of the movie (Budget)
- How much the movie took at the box office in the USA in its first weekend of release (Opening Weekend USA)
- How much the movie made at the box office in USA (Gross USA)
- How much the movie made at the box office all over the world, including the USA (Worldwide Gross)
- Awards information
- Full cast
How can we add that information to our dataset? Now is when web scraping joins the party. I’ll use the rvest
package to extract the information mentioned above for each movie (when available). Also, as we have to do this for 16191 movies, I’ll be using the parallel
package to speed up the work. Even with this approach, the code took around one hour and a half to run.
The following code creates an object named scraping
that contains the information we want for each movie. Some preprocessing is going to be necessary before adding new columns to movies
, but I’ll talk about it later.
Details
I had to retrieve information for a lot of movies, and look for it in two different websites (because full cast data wasn’t in the same title page as the other information). For that reason, I decided to run my code in parallel
.
The first part of the code sets up clusters to run the work in parallel. I defined the number of cores to be used equal to the number of cores in my computer (it has eight cores) minus one, because that’s what is recommended. So I created seven clusters. Remember to stop the clusters when you finish the work.
I load the necessary libraries to run the code in each cluster by using the clusterEvalQ
function. I assigned the result of that function to junk
to prevent the print of some output.
Then we have the extraction function itself. The difference is that instead of using apply
, we use parLapply
.
The object scraping
consists of a list of elements with length equal to the number of movies scraped, and each element of this list is a list of elements (the returned my_list
object for each movie).
The scraping part is somewhat straightforward. We read the html, we select nodes and extract the text.
There is one final consideration regarding the scraping of box office information. Based on how the information is structured on IMDb, in this step I extracted more data than necessary. For each movie, the scrap_boxOffice
object is a character vector whose length may vary from movie to movie (because not all movies have all data available) and each element of that vector contains different data. I’ll manage to keep the information I want and assign it to the corresponding vector in the next section.
Even though I tried to be as clear as possible, this might be a little confusing, but I hope that by inspecting the code you’ll get what is doing!
# Web scraping more information about movies
# Set the cluster to work in parallel
n_cores <- detectCores() - 1
clust <- makeCluster(n_cores)
# Load packages on each cluster
# Obs: library() returns a character vector of all the attached packages, but
# it does so invisibly. When executed via clusterEvalQ, it isn't invisible.
# To make it "invisible" again, I assigned the result of clusterEvalQ to junk.
junk <- clusterEvalQ(clust, {
library(dplyr)
library(stringr)
library(rvest)
})
# Create scraping object by running a parallel apply
scraping <- parLapply(clust, movies$tconst, function(x){
# Read title page
title_page <- read_html(str_c("https://www.imdb.com/title/", x))
# Scrape metascore
scrap_metascore <- title_page %>%
html_nodes(".metacriticScore span") %>%
html_text
# Scrape box office (and country)
scrap_boxOffice <- title_page %>%
html_nodes(".txt-block") %>%
html_text
# Scrape awards
scrap_awards <- title_page %>%
html_nodes(".awards-blurb b") %>%
html_text
# Read cast page
cast_page <- read_html(
str_c("https://www.imdb.com/title/", x, "/fullcredits?ref_=tt_cl_sm#cast")
)
# Scrape cast
scrap_cast <- cast_page %>%
html_nodes(".primary_photo+ td a") %>%
html_text %>%
str_c(collapse = ";") %>%
str_replace_all("\\n", "")
# Output for each movie
my_list <- list("scrap_cast" = scrap_cast,
"scrap_metascore" = scrap_metascore,
"scrap_boxOffice" = scrap_boxOffice,
"scrap_awards" = scrap_awards)
return(my_list)
})
# Stop the cluster
stopCluster(clust)
Vector Assignment
The following code uses the scraping
object to create the vectors of data that will be used to add columns to movies
.
Details
This is what the code does:
- Preallocates the corresponding vectors
- Creates
no_lenght_zero
function to solve an issue realted to how data is scraped when there is no information - Fills preallocated vectors
boxOffice_list
is a character vector where each element refers to a different type of data. It is somewhat structured though, so I used a combination of keep
and str_detect
functions to assign each target data to the corresponding vector.
# Number of movies
n_movies <- nrow(movies)
# Preallocate vectors
scrap_cast <- vector("character", length = n_movies)
scrap_metascore <- vector("character", length = n_movies)
scrap_country <- vector("character", length = n_movies)
scrap_budget <- vector("character", length = n_movies)
scrap_open_usa <- vector("character", length = n_movies)
scrap_gross_usa <- vector("character", length = n_movies)
scrap_gross_world <- vector("character", length = n_movies)
scrap_awards <- vector("character", length = n_movies)
# Function to fix length zero problem
no_lenght_zero <- function(x){
ifelse(identical(x, character(0)), NA_character_, x)
}
# Assign extracted info
for (i in seq_along(scraping)){
scrap_cast[[i]] <- scraping[[i]][["scrap_cast"]] %>%
no_lenght_zero
scrap_metascore[[i]] <- scraping[[i]][["scrap_metascore"]] %>%
no_lenght_zero
boxOffice_list <- scraping[[i]][["scrap_boxOffice"]]
scrap_country[[i]] <- boxOffice_list %>%
keep(~ str_detect(.x, "Country:")) %>%
no_lenght_zero
scrap_budget[[i]] <- boxOffice_list %>%
keep(~ str_detect(.x, "Budget:")) %>%
no_lenght_zero
scrap_open_usa[[i]] <- boxOffice_list %>%
keep(~ str_detect(.x, "Opening Weekend USA:")) %>%
no_lenght_zero
scrap_gross_usa[[i]] <- boxOffice_list %>%
keep(~ str_detect(.x, "Gross USA:")) %>%
no_lenght_zero
scrap_gross_world[[i]] <- boxOffice_list %>%
keep(~ str_detect(.x, "Cumulative Worldwide Gross:")) %>%
no_lenght_zero
scrap_awards[[i]] <- scraping[[i]][["scrap_awards"]] %>%
no_lenght_zero
}
Add columns
Finally, we can use mutate
to add the new columns to movies
.
# Add scraped columns to data
movies <- movies %>%
mutate(cast = scrap_cast,
metascore_raw = scrap_metascore,
country_raw = scrap_country,
budget_raw = scrap_budget,
open_usa_raw = scrap_open_usa,
gross_usa_raw = scrap_gross_usa,
gross_world_raw = scrap_gross_world,
awards_raw = scrap_awards)
This is how movies
looks like in the end of Part II.
Obs: I have taken the first 30 characters of cast
and replace the rest with ...
to make the table look better.
movies %>%
head %>%
mutate(cast = cast %>%
str_sub(1, 30) %>%
str_c("...")) %>%
kable(format = "html", table.attr = "style = \"color: white;\"") %>%
col_width(movies, tconst, "3em") %>%
col_width(movies, titleType, "3em") %>%
col_width(movies, primaryTitle, "18em") %>%
col_width(movies, originalTitle, "18em") %>%
col_width(movies, isAdult, "3em") %>%
col_width(movies, startYear, "3em") %>%
col_width(movies, endYear, "3em") %>%
col_width(movies, runtimeMinutes, "4em") %>%
col_width(movies, genres, "5em") %>%
col_width(movies, averageRating, "3em") %>%
col_width(movies, numVotes, "3em") %>%
col_width(movies, linkTitle, "45em") %>%
col_width(movies, directors, "18em") %>%
col_width(movies, cast, "18em") %>%
col_width(movies, metascore_raw, "3em") %>%
col_width(movies, country_raw, "20em") %>%
col_width(movies, budget_raw, "18em") %>%
col_width(movies, open_usa_raw, "28em") %>%
col_width(movies, gross_usa_raw, "18em") %>%
col_width(movies, gross_world_raw, "25em") %>%
col_width(movies, awards_raw, "18em")
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | linkTitle | directors | cast | metascore_raw | country_raw | budget_raw | open_usa_raw | gross_usa_raw | gross_world_raw | awards_raw |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0035423 | movie | Kate & Leopold | Kate & Leopold | FALSE | 2001 | NA | 118 | Comedy,Fantasy,Romance | 6.4 | 77127 | <a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a> | James Mangold | Meg Ryan; Hugh Jackman; Liev … | 44 | Country: USA | Budget:$48,000,000 (estimated) | Opening Weekend USA: $9,725,408, 30 December 2001 | Gross USA: $47,121,859 | Cumulative Worldwide Gross: $76,019,048 | Nominated for 1 Oscar. |
tt0064106 | movie | Le Boucher | Le boucher | FALSE | 1970 | NA | 93 | Drama,Thriller | 7.5 | 7348 | <a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a> | Claude Chabrol | Stéphane Audran; Jean Yanne; … | NA | Country: France | Italy | NA | NA | Gross USA: $474,458 | NA | Nominated for 1 BAFTA Film Award. |
tt0064165 | movie | The Things of Life | Les choses de la vie | FALSE | 1970 | NA | 89 | Drama,Romance | 7.6 | 3664 | <a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a> | Claude Sautet | Michel Piccoli; Romy Schneide… | NA | Country: Switzerland | France | Italy | NA | NA | Gross USA: $5,063 | Cumulative Worldwide Gross: $5,063 | NA |
tt0064177 | movie | Colossus: The Forbin Project | Colossus: The Forbin Project | FALSE | 1970 | NA | 100 | Sci-Fi,Thriller | 7.1 | 7860 | <a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a> | Joseph Sargent | Eric Braeden; Susan Clark; Go… | NA | Country: USA | NA | NA | NA | NA | NA |
tt0064285 | movie | The Wild Child | L’enfant sauvage | FALSE | 1970 | NA | 83 | Drama | 7.5 | 7238 | <a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a> | François Truffaut | Jean-Pierre Cargol; François … | 94 | Country: France | NA | Opening Weekend USA: $11,206, 25 April 1999 | Gross USA: $65,560 | Cumulative Worldwide Gross: $65,560 | NA |
tt0064437 | movie | The Honeymoon Killers | The Honeymoon Killers | FALSE | 1970 | NA | 107 | Crime,Drama,Romance | 7.0 | 4229 | <a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a> | Donald Volkman;Leonard Kastle | Shirley Stoler; Tony Lo Bianc… | NA | Country: USA | Budget:$150,000 (estimated) | NA | NA | NA | NA |
Movies Dataset - Part III
In Part III, I’ll clean the columns created in Part II.
Metascore
metascore_raw
is a character column. I used parse_number
to create metascore
, which is numeric.
# metascore to number
movies <- movies %>%
mutate(metascore = metascore_raw %>%
parse_number)
Country
country_raw
needs some cleaning that can be done with stringr::str_replace_all
function.
movies <- movies %>%
mutate(country = country_raw %>%
str_replace_all("Country:", "") %>%
str_replace_all("\\n", "") %>%
str_replace_all(" ", "") %>%
str_replace_all("\\|", ";")
)
Box Office Information
By taking a fast look at this “money” variables, it seemed that the currency was dollars ($). Just to be sure, I created the function check_currency
with the following logic: if the sum of values where I find the dollar sign ($) and NA
values of a column is equal to the total number of rows, then all currencies are the same (and dollars).
check_currency <- function(data, col) {
n_movies <- nrow(data)
missing <- data %>%
pull(enexpr(col)) %>%
is.na %>%
sum
dollar_sign <- data %>%
pull(enexpr(col)) %>%
str_detect("\\$") %>%
sum(na.rm = TRUE)
identical(missing + dollar_sign, n_movies)
}
After applying this function to the corresponding columns, we see that not all currencies are the same. Later I checked the dataset and saw in fact that this was the case.
check_currency(movies, budget_raw)
## [1] FALSE
check_currency(movies, open_usa_raw)
## [1] TRUE
check_currency(movies, gross_usa_raw)
## [1] FALSE
check_currency(movies, gross_world_raw)
## [1] FALSE
I decided to keep only those values express in dollars, assigning NA
otherwise. For that matter, I created the keep_dollar
function.
keep_dollar <- function(col) {
ifelse(str_detect(col, "\\$"), col, NA_real_)
}
Now we can create our new variables, through parse_number
. This function works just fine, even with columns where there were more numbers present (for example, take a look at open_usa_raw
and open_usa
)
movies <- movies %>%
mutate(budget = keep_dollar(budget_raw) %>%
parse_number,
open_usa = keep_dollar(open_usa_raw) %>%
parse_number,
gross_usa = keep_dollar(gross_usa_raw) %>%
parse_number,
gross_world = keep_dollar(gross_world_raw) %>%
parse_number)
Awards information
Using the awards_raw
column I created a few variables.
For every movie:
won_oscar
, an indicator variable that takes the value 1 if the movie won any Oscar and 0 otherwiseoscars_won
, the number of Oscars won by a movie
For movies that didn’t win an Oscar:
nominated_oscar
, an indicator variable that takes the value 1 if the movie was nominated for any Oscar and 0 otherwiseoscars_nominations
, the number of Oscar nominations
Obs: If a movie won at least one Oscar, we don’t know how many nominations it had. This has to do with the way I scraped this data.
movies <- movies %>%
mutate(
won_oscar = ifelse(
test = str_detect(awards_raw, "Won") &
str_detect(awards_raw, "Oscar") &
!is.na(awards_raw),
yes = 1,
no = 0
),
oscars_won = ifelse(
test = won_oscar == 1,
yes = parse_number(awards_raw),
no = 0
),
nominated_oscar = ifelse(
test = str_detect(awards_raw, "Nominated") &
str_detect(awards_raw, "Oscar") &
!is.na(awards_raw),
yes = 1,
no = 0
),
oscars_nominations = ifelse(
test = nominated_oscar == 1,
yes = parse_number(awards_raw),
no = 0
)
)
The final dataset looks like this. I kept the raw columns for you to see the differences with the cleaned ones.
Obs: I have taken the first 30 characters of cast
and replace the rest with ...
to make the table look better.
movies %>%
head %>%
mutate(cast = cast %>%
str_sub(1, 30) %>%
str_c("...")) %>%
kable(format = "html", table.attr = "style = \"color: white;\"") %>%
col_width(movies, tconst, "3em") %>%
col_width(movies, titleType, "3em") %>%
col_width(movies, primaryTitle, "18em") %>%
col_width(movies, originalTitle, "18em") %>%
col_width(movies, isAdult, "3em") %>%
col_width(movies, startYear, "3em") %>%
col_width(movies, endYear, "3em") %>%
col_width(movies, runtimeMinutes, "4em") %>%
col_width(movies, genres, "5em") %>%
col_width(movies, averageRating, "3em") %>%
col_width(movies, numVotes, "3em") %>%
col_width(movies, linkTitle, "45em") %>%
col_width(movies, directors, "18em") %>%
col_width(movies, cast, "18em") %>%
col_width(movies, metascore_raw, "3em") %>%
col_width(movies, country_raw, "20em") %>%
col_width(movies, budget_raw, "18em") %>%
col_width(movies, open_usa_raw, "28em") %>%
col_width(movies, gross_usa_raw, "18em") %>%
col_width(movies, gross_world_raw, "25em") %>%
col_width(movies, awards_raw, "18em") %>%
col_width(movies, metascore, "3em") %>%
col_width(movies, country, "9em") %>%
col_width(movies, budget, "7em") %>%
col_width(movies, open_usa, "7em") %>%
col_width(movies, gross_usa, "7em") %>%
col_width(movies, gross_world, "7em") %>%
col_width(movies, won_oscar, "10em") %>%
col_width(movies, oscars_won, "10em") %>%
col_width(movies, nominated_oscar, "10em") %>%
col_width(movies, oscars_nominations, "10em")
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | linkTitle | directors | cast | metascore_raw | country_raw | budget_raw | open_usa_raw | gross_usa_raw | gross_world_raw | awards_raw | metascore | country | budget | open_usa | gross_usa | gross_world | won_oscar | oscars_won | nominated_oscar | oscars_nominations |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tt0035423 | movie | Kate & Leopold | Kate & Leopold | FALSE | 2001 | NA | 118 | Comedy,Fantasy,Romance | 6.4 | 77127 | <a href=’https://www.imdb.com/title/tt0035423/'>Kate & Leopold</a> | James Mangold | Meg Ryan; Hugh Jackman; Liev … | 44 | Country: USA | Budget:$48,000,000 (estimated) | Opening Weekend USA: $9,725,408, 30 December 2001 | Gross USA: $47,121,859 | Cumulative Worldwide Gross: $76,019,048 | Nominated for 1 Oscar. | 44 | USA | 4.8e+07 | 9725408 | 47121859 | 76019048 | 0 | 0 | 1 | 1 |
tt0064106 | movie | Le Boucher | Le boucher | FALSE | 1970 | NA | 93 | Drama,Thriller | 7.5 | 7348 | <a href=’https://www.imdb.com/title/tt0064106/'>Le boucher</a> | Claude Chabrol | Stéphane Audran; Jean Yanne; … | NA | Country: France | Italy | NA | NA | Gross USA: $474,458 | NA | Nominated for 1 BAFTA Film Award. | NA | France;Italy | NA | NA | 474458 | NA | 0 | 0 | 0 | 0 |
tt0064165 | movie | The Things of Life | Les choses de la vie | FALSE | 1970 | NA | 89 | Drama,Romance | 7.6 | 3664 | <a href=’https://www.imdb.com/title/tt0064165/'>Les choses de la vie</a> | Claude Sautet | Michel Piccoli; Romy Schneide… | NA | Country: Switzerland | France | Italy | NA | NA | Gross USA: $5,063 | Cumulative Worldwide Gross: $5,063 | NA | NA | Switzerland;France;Italy | NA | NA | 5063 | 5063 | 0 | 0 | 0 | 0 |
tt0064177 | movie | Colossus: The Forbin Project | Colossus: The Forbin Project | FALSE | 1970 | NA | 100 | Sci-Fi,Thriller | 7.1 | 7860 | <a href=’https://www.imdb.com/title/tt0064177/'>Colossus: The Forbin Project</a> | Joseph Sargent | Eric Braeden; Susan Clark; Go… | NA | Country: USA | NA | NA | NA | NA | NA | NA | USA | NA | NA | NA | NA | 0 | 0 | 0 | 0 |
tt0064285 | movie | The Wild Child | L’enfant sauvage | FALSE | 1970 | NA | 83 | Drama | 7.5 | 7238 | <a href=’https://www.imdb.com/title/tt0064285/'>L'enfant sauvage</a> | François Truffaut | Jean-Pierre Cargol; François … | 94 | Country: France | NA | Opening Weekend USA: $11,206, 25 April 1999 | Gross USA: $65,560 | Cumulative Worldwide Gross: $65,560 | NA | 94 | France | NA | 11206 | 65560 | 65560 | 0 | 0 | 0 | 0 |
tt0064437 | movie | The Honeymoon Killers | The Honeymoon Killers | FALSE | 1970 | NA | 107 | Crime,Drama,Romance | 7.0 | 4229 | <a href=’https://www.imdb.com/title/tt0064437/'>The Honeymoon Killers</a> | Donald Volkman;Leonard Kastle | Shirley Stoler; Tony Lo Bianc… | NA | Country: USA | Budget:$150,000 (estimated) | NA | NA | NA | NA | NA | USA | 1.5e+05 | NA | NA | NA | 0 | 0 | 0 | 0 |
Subproducts
With a few lines of codes, some useful datasets can be created.
- List of different genres found in
movies
list_genres <- movies %>%
select(genres) %>%
separate_rows(genres, sep = ",") %>%
unique %>%
arrange(genres)
- List of different directors found in
movies
, sorted by name
list_directors <- movies %>%
select(directors) %>%
separate_rows(directors, sep = ";") %>%
unique %>%
arrange(directors)
- List of different actors and actresses found in
movies
, sorted by name
list_cast <- movies %>%
select(cast) %>%
separate_rows(cast, sep = ";") %>%
unique %>%
arrange(cast)
Save Datasets
Finally, I’ll save the movies
dataset and the subproducts list_genres
, list_directors
and list_cast
in a folder named datasets
.
# Create "datasets" folder if does not exist
if (!file.exists("datasets")) {
dir.create("datasets")
}
# Save files
saveRDS(movies, "datasets/movies.rds")
saveRDS(list_genres, "datasets/list_genres.rds")
saveRDS(list_directors, "datasets/list_directors.rds")
saveRDS(list_cast, "datasets/list_cast.rds")
Download Created Datasets
You can download the created datasets. They are in .rds
format.
Wikipedia contributors. 2020. “IMDb — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=IMDb&oldid=959154715.