In this lab, I completed a series of exercises exploring movie rating data from IMDb. I conducted basic exploratory data analysis on the data to answer questions such as: What is the average rating per genre? How many different actors are in a movie?
Basic level
Let’s import the necessary libraries:
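The import cell itself is not reproduced here. A minimal sketch, assuming the exercises rely on pandas for the data and matplotlib for the plots:

```python
import pandas as pd               # DataFrame loading and analysis
import matplotlib.pyplot as plt   # plotting (histograms, box plots, bar charts)
```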
Read in ‘imdb_1000.csv’ and store it in a DataFrame named ‘movies’.
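As a rough sketch of this step, the snippet below reads a tiny inline CSV instead of the real file so that it runs anywhere. The rows are made-up placeholders, and the column names are my assumption about what imdb_1000.csv contains.

```python
import io
import pandas as pd

# Placeholder stand-in for imdb_1000.csv (assumed column names, made-up rows).
sample_csv = io.StringIO(
    "star_rating,title,content_rating,genre,duration\n"
    "8.9,Movie A,R,Drama,142\n"
    "8.1,Movie B,PG-13,Comedy,95\n"
    "7.6,Movie C,PG,Drama,118\n"
)

movies = pd.read_csv(sample_csv)
# With the real file this would be: movies = pd.read_csv('imdb_1000.csv')
print(movies.head())
```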
Check the number of rows and columns.
Check the data type of each column.
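Both checks come down to two DataFrame attributes. A small sketch on placeholder data (the rows and column names are assumptions, not the real dataset):

```python
import pandas as pd

# Made-up rows standing in for the real 'movies' DataFrame.
movies = pd.DataFrame({
    "star_rating": [8.9, 8.1, 7.6],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genre": ["Drama", "Comedy", "Drama"],
    "duration": [142, 95, 118],
})

n_rows, n_cols = movies.shape   # shape is a (rows, columns) tuple
column_types = movies.dtypes    # Series with one dtype per column

print(n_rows, n_cols)
print(column_types)
```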
Calculate the average movie duration.
Sort the DataFrame by duration to find the shortest and longest movies.
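One way to approach the last two steps, sketched on placeholder rows and assuming the duration column is named `duration` and measured in minutes:

```python
import pandas as pd

# Made-up rows standing in for the real 'movies' DataFrame.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "duration": [142, 95, 118, 201],
})

mean_duration = movies["duration"].mean()

# Ascending sort: first row is the shortest movie, last row the longest.
by_duration = movies.sort_values("duration")
shortest = by_duration.iloc[0]
longest = by_duration.iloc[-1]

print(mean_duration, shortest["title"], longest["title"])
```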
Create a histogram of duration, choosing an “appropriate” number of bins.
Use a box plot to display that same data.
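The two plots can be drawn side by side. This is only a sketch on made-up durations; the bin count is a judgment call, and 5 is used here just because the sample is tiny.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up durations (minutes) standing in for the real column.
movies = pd.DataFrame({"duration": [64, 95, 101, 110, 118, 126, 142, 165, 201, 242]})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of durations across 5 equal-width bins.
movies["duration"].plot(kind="hist", bins=5, ax=ax_hist, title="Movie duration")
ax_hist.set_xlabel("Duration (minutes)")

# Box plot: same data, summarized as median, quartiles, and outliers.
movies["duration"].plot(kind="box", ax=ax_box, title="Movie duration")
```

In a notebook, `plt.show()` (or simply ending the cell) displays the figure.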
Intermediate level
Count how many movies have each of the content ratings.
Use a visualization to display that same data, including a title and x- and y-axis labels.
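A possible sketch of the count and the matching bar chart, assuming the column is called `content_rating` and using placeholder values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import pandas as pd

# Made-up ratings standing in for the real column.
movies = pd.DataFrame({"content_rating": ["R", "PG-13", "R", "PG", "R", "PG-13"]})

# value_counts() returns the counts sorted from most to least frequent.
rating_counts = movies["content_rating"].value_counts()

ax = rating_counts.plot(kind="bar", title="Number of movies per content rating")
ax.set_xlabel("Content rating")
ax.set_ylabel("Number of movies")
```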
Convert the following content ratings to “UNRATED”: NOT RATED, APPROVED, PASSED, GP.
Convert the following content ratings to “NC-17”: X, TV-MA.
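Both conversions can be handled with a single mapping passed to `Series.replace`. The sketch below uses placeholder ratings and assumes the column name `content_rating`:

```python
import pandas as pd

# Made-up ratings standing in for the real column.
movies = pd.DataFrame({"content_rating": ["R", "NOT RATED", "X", "GP", "TV-MA", "PG"]})

to_unrated = ["NOT RATED", "APPROVED", "PASSED", "GP"]
to_nc17 = ["X", "TV-MA"]

# Build one old -> new mapping covering both groups.
mapping = {old: "UNRATED" for old in to_unrated}
mapping.update({old: "NC-17" for old in to_nc17})

movies["content_rating"] = movies["content_rating"].replace(mapping)
print(movies["content_rating"].tolist())
```

Ratings that are not keys in the mapping (here R and PG) are left unchanged.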
Count the number of missing values in each column.
If there are missing values: examine them, then fill them in with “reasonable” values.
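Here is a sketch of the count, examine, and fill sequence on placeholder rows. Treating a missing content rating as "UNRATED" is my assumption about what counts as "reasonable"; other choices are defensible.

```python
import pandas as pd

# Made-up rows; None marks a missing content rating.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "content_rating": ["R", None, "PG", None],
    "duration": [142, 95, 118, 201],
})

# Number of missing values in each column.
missing_per_column = movies.isnull().sum()

# Examine the affected rows before deciding how to fill them.
rows_with_missing = movies[movies["content_rating"].isnull()]
print(rows_with_missing)

# Assumption: a movie without a rating is treated as UNRATED.
movies["content_rating"] = movies["content_rating"].fillna("UNRATED")
```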
Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.
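A boolean mask keeps the two groups consistent. This sketch assumes `duration` is in minutes (so 2 hours is 120) and uses made-up rows:

```python
import pandas as pd

# Made-up rows standing in for the real 'movies' DataFrame.
movies = pd.DataFrame({
    "star_rating": [9.0, 8.0, 8.0, 7.0],
    "duration": [142, 95, 120, 88],
})

is_long = movies["duration"] >= 120   # True for movies of 2 hours or more

avg_long = movies.loc[is_long, "star_rating"].mean()
avg_short = movies.loc[~is_long, "star_rating"].mean()

print(avg_long, avg_short)
```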
Use a visualization to detect whether there is a relationship between duration and star rating.
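A scatter plot is one natural choice here, and a correlation coefficient gives a number to go with the picture. The sketch below is on placeholder data with assumed column names:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import pandas as pd

# Made-up rows standing in for the real 'movies' DataFrame.
movies = pd.DataFrame({
    "duration": [90, 100, 120, 150, 180],
    "star_rating": [7.5, 7.8, 8.0, 8.4, 8.9],
})

# alpha < 1 makes overlapping points easier to see on the full dataset.
ax = movies.plot(kind="scatter", x="duration", y="star_rating", alpha=0.3,
                 title="Star rating vs. duration")
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Star rating")

# Pearson correlation as a rough numeric summary of the relationship.
correlation = movies["duration"].corr(movies["star_rating"])
print(correlation)
```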
Calculate the average duration for each genre.
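Per-genre averages are a one-line `groupby`. A sketch on placeholder rows, assuming columns named `genre` and `duration`:

```python
import pandas as pd

# Made-up rows standing in for the real 'movies' DataFrame.
movies = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Comedy"],
    "duration": [140, 90, 120, 100],
})

# One mean duration per genre, indexed by genre name.
avg_duration_by_genre = movies.groupby("genre")["duration"].mean()
print(avg_duration_by_genre)
```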
Advanced level
Find the title of the movie with the highest star rating in each genre.
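One possible approach uses `idxmax` inside a `groupby` to get the row label of the top-rated movie per genre. The data below is made up and the column names are assumed:

```python
import pandas as pd

# Made-up rows standing in for the real 'movies' DataFrame.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genre": ["Drama", "Comedy", "Drama", "Comedy"],
    "star_rating": [8.9, 8.1, 9.2, 7.9],
})

# idxmax gives, for each genre, the index label of the highest-rated row.
top_index = movies.groupby("genre")["star_rating"].idxmax()

# Look those rows up and keep a genre -> title mapping.
top_titles = movies.loc[top_index].set_index("genre")["title"]
print(top_titles)
```

If several movies in a genre tie for the highest rating, `idxmax` returns only the first of them.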
Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.
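A sketch of the two checks on placeholder rows: first find repeated titles, then test whether the whole rows are identical. Two movies that share a title but differ in their other columns are not true duplicates.

```python
import pandas as pd

# Made-up rows: "Movie A" appears twice but with different details.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie A", "Movie C"],
    "star_rating": [8.1, 7.9, 7.6, 8.3],
    "duration": [120, 95, 101, 110],
})

# keep=False flags every occurrence of a repeated title, not just the later ones.
same_title = movies[movies["title"].duplicated(keep=False)]

# Rows that are identical across all columns (true duplicates).
full_duplicates = movies[movies.duplicated(keep=False)]

print(same_title)
print(full_duplicates.empty)
```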
Calculate the average star rating for each genre, but only include genres with at least 10 movies.
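One way to do this is to compute the count and the mean together, then filter on the count. The sketch uses made-up rows (10 dramas, 3 comedies) and assumed column names:

```python
import pandas as pd

# Made-up rows: Drama has 10 movies, Comedy only 3.
movies = pd.DataFrame({
    "genre": ["Drama"] * 10 + ["Comedy"] * 3,
    "star_rating": [8.0, 8.2] * 5 + [7.5, 7.7, 7.9],
})

# Count and mean per genre in a single pass.
genre_stats = movies.groupby("genre")["star_rating"].agg(["count", "mean"])

# Keep the mean only for genres with at least 10 movies.
avg_by_genre = genre_stats.loc[genre_stats["count"] >= 10, "mean"]
print(avg_by_genre)
```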