Lab: Analyzing IMDb Data

Armel Djangone
Published in Geek Culture
4 min read · Mar 22, 2021


In this lab, I completed a series of exercises exploring movie rating data from IMDb. I conducted basic exploratory data analysis on IMDb’s movie data to answer questions such as: What is the average rating per genre? How many different actors are in a movie?

Basic level

Let’s import the necessary libraries:
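The original import cell is not shown here. A minimal set that covers every step in this lab, assuming pandas and matplotlib are installed, might look like this:

```python
import pandas as pd               # DataFrames, CSV loading, grouping
import matplotlib.pyplot as plt   # histograms, box plots, bar and scatter charts
```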

Read in ‘imdb_1000.csv’ and store it in a DataFrame named ‘movies’.
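As a rough sketch, loading the file is a single `pd.read_csv` call. So that the snippet runs without the file on disk, it reads a tiny in-memory stand-in instead. The rows are made up, and the column names (`star_rating`, `title`, `content_rating`, `genre`, `duration`) are assumptions about the file's layout.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of imdb_1000.csv (made-up rows,
# assumed column names).
sample_csv = io.StringIO(
    "star_rating,title,content_rating,genre,duration\n"
    "8.9,Movie A,R,Crime,142\n"
    "8.1,Movie B,PG-13,Action,152\n"
    "7.6,Movie C,PG,Comedy,96\n"
)

# With the real file on disk this would be:
#   movies = pd.read_csv('imdb_1000.csv')
movies = pd.read_csv(sample_csv)
print(movies.head())
```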

Check the number of rows and columns.

Check the data type of each column.
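Both checks are attribute lookups on the DataFrame. The sketch below uses a small made-up frame in place of `movies`, with assumed column names:

```python
import pandas as pd

# Made-up sample standing in for the real `movies` DataFrame.
movies = pd.DataFrame({
    "star_rating": [8.9, 8.1, 7.6],
    "title": ["Movie A", "Movie B", "Movie C"],
    "duration": [142, 152, 96],
})

# .shape is a (rows, columns) tuple.
n_rows, n_cols = movies.shape
print(n_rows, n_cols)

# .dtypes lists the data type pandas inferred for each column.
print(movies.dtypes)
```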

Calculate the average movie duration.
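Assuming the running time is stored in minutes in a column named `duration` (an assumed name), the average is one call to `.mean()`. The values below are made up:

```python
import pandas as pd

# Made-up durations in minutes.
movies = pd.DataFrame({"duration": [142, 152, 96]})

avg_duration = movies["duration"].mean()
print(avg_duration)
```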

Sort the DataFrame by duration to find the shortest and longest movies.
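One way to do this, sketched on made-up rows with assumed column names, is to sort ascending and read the first and last rows:

```python
import pandas as pd

# Made-up sample data.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "duration": [142, 152, 96],
})

# Ascending sort: shortest movie first, longest movie last.
by_duration = movies.sort_values("duration")
shortest = by_duration.iloc[0]
longest = by_duration.iloc[-1]
print(shortest["title"], shortest["duration"])
print(longest["title"], longest["duration"])
```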

Create a histogram of duration, choosing an “appropriate” number of bins.
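What counts as an "appropriate" number of bins is a judgment call. The sketch below uses 5 bins because the made-up sample is tiny. On the full dataset a larger value would likely show more detail.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up durations in minutes.
movies = pd.DataFrame(
    {"duration": [64, 88, 95, 101, 110, 118, 126, 142, 175, 242]}
)

fig, ax = plt.subplots()
# bins=5 is an illustrative choice, not a recommendation for the real data.
movies["duration"].plot(kind="hist", bins=5, ax=ax)
ax.set_title("Distribution of movie duration")
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Number of movies")
# In a script, call plt.show() to display the figure.
```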

Use a box plot to display that same data.
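A box plot of the same column shows the median, the quartiles, and any outliers. Here is a minimal sketch on made-up durations:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up durations in minutes.
movies = pd.DataFrame(
    {"duration": [64, 88, 95, 101, 110, 118, 126, 142, 175, 242]}
)

fig, ax = plt.subplots()
movies["duration"].plot(kind="box", ax=ax)
ax.set_title("Movie duration")
ax.set_ylabel("Duration (minutes)")
# In a script, call plt.show() to display the figure.
```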

Intermediate level

Count how many movies have each of the content ratings.
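`value_counts()` tallies how often each distinct value appears. The sketch assumes the ratings live in a column named `content_rating` and uses made-up rows:

```python
import pandas as pd

# Made-up sample data.
movies = pd.DataFrame(
    {"content_rating": ["R", "R", "PG-13", "PG", "R", "PG-13"]}
)

# One row per distinct rating, most frequent first.
rating_counts = movies["content_rating"].value_counts()
print(rating_counts)
```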

Use a visualization to display that same data, including a title and x and y labels.
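A bar chart is a natural fit for counts per category. The column name and the data below are assumptions for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample data.
movies = pd.DataFrame(
    {"content_rating": ["R", "R", "PG-13", "PG", "R", "PG-13"]}
)

fig, ax = plt.subplots()
movies["content_rating"].value_counts().plot(kind="bar", ax=ax)
ax.set_title("Number of movies per content rating")
ax.set_xlabel("Content rating")
ax.set_ylabel("Number of movies")
# In a script, call plt.show() to display the figure.
```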

Convert the following content ratings to “UNRATED”: NOT RATED, APPROVED, PASSED, GP.

Convert the following content ratings to “NC-17”: X, TV-MA.
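Both conversions can be sketched with `Series.replace`, which maps a list of old labels to one new label. The column name `content_rating` is an assumption and the rows are made up:

```python
import pandas as pd

# Made-up sample covering every label mentioned above.
movies = pd.DataFrame({
    "content_rating": [
        "R", "NOT RATED", "APPROVED", "PASSED", "GP", "X", "TV-MA", "PG",
    ]
})

# Collapse the older or unofficial labels into "UNRATED".
movies["content_rating"] = movies["content_rating"].replace(
    ["NOT RATED", "APPROVED", "PASSED", "GP"], "UNRATED"
)

# Collapse the adult-only labels into "NC-17".
movies["content_rating"] = movies["content_rating"].replace(
    ["X", "TV-MA"], "NC-17"
)
print(movies["content_rating"].value_counts())
```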

Count the number of missing values in each column.

If there are missing values: examine them, then fill them in with “reasonable” values.
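The original fill choice is not shown. One option, sketched on made-up rows, is to count the gaps with `isnull().sum()`, inspect the affected rows, and then fill missing content ratings with "UNRATED". That fill value is an assumption about what counts as "reasonable".

```python
import pandas as pd

# Made-up sample with two missing content ratings.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "content_rating": ["R", None, "PG", None],
    "duration": [142, 152, 96, 110],
})

# Number of missing values in each column.
missing_per_column = movies.isnull().sum()
print(missing_per_column)

# Look at the affected rows before deciding how to fill them.
print(movies[movies["content_rating"].isnull()])

# Assumed "reasonable" value: treat a missing rating as unrated.
movies["content_rating"] = movies["content_rating"].fillna("UNRATED")
```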

Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.
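With duration in minutes, "2 hours or longer" becomes `duration >= 120`. A boolean mask then splits the data into the two groups. The column names are assumed, and the made-up values say nothing about the real result:

```python
import pandas as pd

# Made-up sample data.
movies = pd.DataFrame({
    "star_rating": [8.0, 9.0, 7.0, 7.5],
    "duration": [120, 150, 90, 119],
})

# True for movies of 120 minutes or more.
is_long = movies["duration"] >= 120

avg_long = movies.loc[is_long, "star_rating"].mean()
avg_short = movies.loc[~is_long, "star_rating"].mean()
print(avg_long, avg_short)
```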

Use a visualization to detect whether there is a relationship between duration and star rating.
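A scatter plot is one common way to look for such a relationship, and a correlation coefficient gives a rough numeric check. The sketch uses made-up data, so the pattern it shows is not a finding about the real dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample data.
movies = pd.DataFrame({
    "duration": [90, 100, 120, 150, 180],
    "star_rating": [7.4, 7.6, 7.9, 8.1, 8.6],
})

fig, ax = plt.subplots()
# alpha makes overlapping points easier to see on a larger dataset.
movies.plot(kind="scatter", x="duration", y="star_rating", alpha=0.5, ax=ax)
ax.set_title("Star rating vs. duration")
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Star rating")

# Pearson correlation as a rough numeric summary of the same relationship.
corr = movies["duration"].corr(movies["star_rating"])
print(corr)
```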

Calculate the average duration for each genre.
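Grouping by genre and averaging the duration handles this in one expression. The column names `genre` and `duration` are assumptions, and the rows are made up:

```python
import pandas as pd

# Made-up sample data.
movies = pd.DataFrame({
    "genre": ["Action", "Action", "Comedy", "Drama", "Drama"],
    "duration": [130, 150, 95, 120, 140],
})

# One mean duration per genre, indexed by genre name.
avg_duration_by_genre = movies.groupby("genre")["duration"].mean()
print(avg_duration_by_genre)
```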

Advanced level

Find the title of the movie with the highest star rating in each genre.
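One possible approach uses `groupby(...).idxmax()` to find the row label of the top-rated movie in each genre, then looks those rows up. When two movies tie, `idxmax` keeps the first one it meets. The data and column names below are made up for illustration:

```python
import pandas as pd

# Made-up sample data.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "genre": ["Action", "Action", "Comedy", "Drama", "Drama"],
    "star_rating": [8.0, 8.5, 7.9, 8.8, 8.2],
})

# Row label of the highest-rated movie within each genre.
top_idx = movies.groupby("genre")["star_rating"].idxmax()

# Look up those rows to get the titles.
top_per_genre = movies.loc[top_idx, ["genre", "title", "star_rating"]]
print(top_per_genre)
```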

Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.
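A shared title does not by itself mean a duplicate row; two different films can have the same name. The sketch below, on made-up rows, first lists every row whose title repeats, then keeps only rows that match in every column:

```python
import pandas as pd

# Made-up sample: "Movie A" appears twice with different details,
# "Movie B" appears twice with identical details.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie A", "Movie C", "Movie B"],
    "duration": [100, 120, 131, 90, 120],
    "star_rating": [7.5, 8.0, 7.8, 7.7, 8.0],
})

# Every row whose title occurs more than once.
same_title = movies[movies["title"].duplicated(keep=False)].sort_values("title")
print(same_title)

# Rows that are identical across all columns, i.e. true duplicates.
exact_duplicates = movies[movies.duplicated(keep=False)]
print(exact_duplicates)
```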

Calculate the average star rating for each genre, but only include genres with at least 10 movies.
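One way to sketch this is to compute the count and the mean per genre in a single aggregation, then keep only genres whose count is at least 10. The column names are assumed and the data is made up:

```python
import pandas as pd

# Made-up sample: 10 dramas and 3 comedies.
movies = pd.DataFrame({
    "genre": ["Drama"] * 10 + ["Comedy"] * 3,
    "star_rating": [8.0] * 10 + [7.5] * 3,
})

# Count and mean of star_rating for each genre.
genre_stats = movies.groupby("genre")["star_rating"].agg(["count", "mean"])

# Keep the mean only for genres with at least 10 movies.
avg_for_common_genres = genre_stats.loc[genre_stats["count"] >= 10, "mean"]
print(avg_for_common_genres)
```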
