Netflix Data Analysis using Python

Uzair Adamjee
7 min read · Apr 24, 2023
Photo by Juraj Gabriel on Unsplash

Data analysis is a powerful tool that helps businesses make informed decisions. In today’s blog, we will explore the Netflix dataset using Python and uncover some interesting insights.

Netflix is one of the most popular streaming services in the world, providing users with access to a vast collection of TV shows and movies. The platform has gained a massive following in recent years, and its popularity shows no signs of slowing down. In this blog, we’ll be using Python to perform exploratory data analysis (EDA) on a Netflix dataset that we’ve found on Kaggle. We’ll be using various Python libraries, including Pandas, Matplotlib, Seaborn, and Plotly, to visualize and analyze the data.

Notebook link: https://www.kaggle.com/code/uzairadamjee/netflix-eda-datacleaning-dataanalysis

Importing Libraries & Loading Data

import numpy as np  # for np.nan, the missing-value marker
import pandas as pd
import seaborn as sns  # statistical visualization library
import matplotlib.pyplot as plt

df = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
df.head()

We can see that the dataset has 12 columns, and that the first five rows are TV shows and movies added in September 2021. The type column tells us whether a title is a TV show or a movie.

Let’s explore the dataset further by cleaning data and creating some visualizations.

df.isnull().sum() #checking for null values

The following columns have null values that need to be cleaned:

  • director — 2,634 null values
  • cast — 825 null values
  • country — 831 null values
  • date_added — 10 null values
  • rating — 4 null values
  • duration — 3 null values
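To put these counts in proportion, it can help to view them next to the share of rows they represent. The following is a minimal sketch: `missing_summary` is a hypothetical helper (not part of the original notebook), and the small DataFrame is made-up sample data standing in for `df`.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages for columns with at least one null."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain nulls, largest first
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)

# Made-up sample data; in the notebook you would pass df instead
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [np.nan, "X", np.nan, np.nan],
    "country": ["India", np.nan, "Japan", "Spain"],
})
print(missing_summary(sample))
```

Calling `missing_summary(df)` on the real dataset would list the same six columns as above, with a percentage next to each count.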
We fill the missing values instead of dropping the affected columns or rows, so that the rest of each record is preserved. For the country column we use United States, since Netflix was founded in the USA; the commented-out line shows the 'Not Specified' alternative.

df['country'] = df['country'].fillna('United States')
# Alternative: df['country'] = df['country'].fillna('Not Specified')
df['director'] = df['director'].fillna('No Director')
df['cast'] = df['cast'].fillna('No Cast')
df.isnull().sum()

We have two options for addressing the “Country” column. Firstly, we could replace all the NaN values with “United States” since Netflix was created in the USA. Alternatively, we could replace the NaN values with “Not Specified” since the countries for these movies were not specified in the data. For columns like “Director” and “Cast”, we can replace missing values with “No Director” and “No Cast”, respectively. By replacing values in these columns instead of dropping them entirely, we can preserve our data.
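The placeholder strategy described here can also be written as a single call. This is a sketch on made-up sample data, assuming the "Not Specified" option for the country column; `placeholders` is an illustrative name, not something from the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for the Netflix DataFrame
sample = pd.DataFrame({
    "director": ["X", np.nan],
    "cast": [np.nan, "Y"],
    "country": [np.nan, "India"],
})

# One placeholder per column; swap "Not Specified" for "United States"
# if you prefer the first option.
placeholders = {
    "director": "No Director",
    "cast": "No Cast",
    "country": "Not Specified",
}

# fillna with a dict fills each listed column with its own value
# and returns a new DataFrame, leaving the original untouched
filled = sample.fillna(value=placeholders)
print(filled)
```

Keeping the placeholders in one dictionary makes it easy to see, and later change, which assumption was made for each column.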

For the remaining columns (date_added, rating, and duration), the null counts are so low that we simply drop the affected rows from the dataset.

df = df.dropna()
df.isnull().sum()
df['rating'].value_counts()

In the ‘rating’ column, the output shows that there are 14 unique values. The value ‘TV-MA’ appears the most frequently, occurring 2027 times, followed by ‘TV-14’ with 1698 occurrences. The least frequent values are ‘G’ with 37 occurrences, ‘UR’ with 7 occurrences, and ‘NC-17’ with only 2 occurrences.

df['listed_in'].value_counts()

In the ‘listed_in’ column, the output shows that there are 461 unique values. The value ‘Documentaries’ appears the most frequently, occurring 299 times, followed by ‘Stand-Up Comedy’ with 273 occurrences, and ‘Dramas, International Movies’ with 248 occurrences. The least frequent values have only one occurrence each.

The rating column had only 4 missing values, so we could either drop those rows or fill them. Since TV-MA is the most common rating, an alternative to dropping is to replace the NaN values with TV-MA. Note that the dropna() call above has already removed these rows, so the fill below changes nothing unless it is run before the drop.

df['rating'] = df['rating'].fillna('TV-MA')
df.isnull().sum()
Photo by Mollie Sivaram on Unsplash

Data Analysis

sns.countplot(x='type',data = df) #looking at number of Movies and TV shows

This count plot shows the number of Movies and TV shows in the dataset. There are more Movies than TV shows.

plt.figure(figsize = (12,8))
sns.countplot(x='rating',data = df)

The second output is a count plot of the ‘rating’ column of the ‘df’ DataFrame. It shows the number of occurrences of each unique value in the ‘rating’ column. The plot shows that the most frequent rating in the dataset is TV-MA, followed by TV-14 and TV-PG.

plt.figure(figsize = (12,8))
sns.countplot(x='rating',data = df,hue='type')

The third output is a count plot of the ‘rating’ column of the ‘df’ DataFrame, with the hue set to the ‘type’ column. This plot shows the number of occurrences of each unique value in the ‘rating’ column, broken down by the type of content (i.e., Movie or TV show). This plot allows us to see how the distribution of ratings differs between Movies and TV shows. We can see that the rating TV-MA is more common in TV shows than in Movies, while the rating PG-13 is more common in Movies than in TV shows.
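Because the dataset contains more Movies than TV shows, raw counts can make such comparisons hard to read. One way to check them numerically is a cross-tabulation, both as counts and as shares within each type. This is a sketch on made-up sample data, not the notebook's own code.

```python
import pandas as pd

# Made-up sample data standing in for the 'type' and 'rating' columns
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["TV-MA", "PG-13", "PG-13", "TV-MA", "TV-14"],
})

# Raw number of titles per rating and type
counts = pd.crosstab(sample["rating"], sample["type"])

# Share of each rating within Movies and within TV Shows
# (each column sums to 1)
shares = pd.crosstab(sample["rating"], sample["type"], normalize="columns")

print(counts)
print(shares.round(2))
```

With the real data, `pd.crosstab(df["rating"], df["type"], normalize="columns")` would show what fraction of Movies and of TV shows carries each rating, which is a fairer comparison than the bar heights alone.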

plt.figure(figsize=(12,6))
df[df["type"]=="Movie"]["release_year"].value_counts()[:20].plot(kind="bar",color="Red")
plt.title("Frequency of Movies which were released in different years and are available on Netflix")

This bar plot shows the 20 release years with the most Movies available on Netflix, with the number of Movies for each year on the y-axis.

plt.figure(figsize=(12,6))
df[df["type"]=="TV Show"]["release_year"].value_counts()[:20].plot(kind="bar",color="Blue")
plt.title("Frequency of TV shows which were released in different years and are available on Netflix")

This output is similar to the one above, but shows the frequency of TV shows released in different years and available on Netflix.
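One detail worth knowing about both plots: `value_counts()` sorts by frequency, so the bars appear in order of count rather than in order of year. If a chronological x-axis is preferred, the selected years can be re-sorted by index. A minimal sketch on made-up sample data:

```python
import pandas as pd

# Made-up sample data standing in for 'type' and 'release_year'
sample = pd.DataFrame({
    "type": ["Movie"] * 6 + ["TV Show"],
    "release_year": [2018, 2018, 2018, 2016, 2017, 2017, 2018],
})

movie_years = sample.loc[sample["type"] == "Movie", "release_year"]

# Most frequent years first (here only the top 2, for brevity)
top_years = movie_years.value_counts().head(2)

# Same years, reordered chronologically
chronological = top_years.sort_index()
print(chronological)
```

Calling `chronological.plot(kind="bar")` would then draw the same bars from the earliest year to the latest.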

plt.figure(figsize=(12,6))
df[df["type"]=="Movie"]["listed_in"].value_counts()[:10].plot(kind="barh",color="black")
plt.title("Top 10 Genres of Movies",size=18)
plt.figure(figsize=(12,6))
df[df["type"]=="TV Show"]["listed_in"].value_counts()[:10].plot(kind="barh",color="brown")
plt.title("Top 10 Genres of TV Shows",size=18)

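These horizontal bar plots show the top 10 genres of Movies and TV Shows in the dataset. Since listed_in holds comma-separated genre lists (for example 'Dramas, International Movies'), each bar is really a combination of genres rather than a single genre.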

Further Analysis

From the first plot, we can see the frequency of content added by Netflix from 2008 to 2021. The plot shows that there has been a steady increase in the number of titles added each year, with a notable jump in 2015. We can also see that the number of movies added has generally been higher than the number of TV shows added each year.
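The code behind this plot is not shown here. One way to compute the underlying counts is sketched below, assuming `date_added` holds strings such as "September 25, 2021" (possibly with stray whitespace). The sample data is made up, and `per_year` is an illustrative name.

```python
import pandas as pd

# Made-up sample data standing in for 'type' and 'date_added'
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "date_added": ["September 25, 2021", " August 4, 2017",
                   "January 1, 2020", "March 15, 2021", None],
})

# Parse the date strings; anything unparseable becomes NaT
added = pd.to_datetime(sample["date_added"].str.strip(),
                       format="%B %d, %Y", errors="coerce")

# Count titles per year added, split by type
per_year = (
    sample.assign(year_added=added.dt.year)
          .dropna(subset=["year_added"])
          .astype({"year_added": int})
          .groupby(["year_added", "type"])
          .size()
          .unstack(fill_value=0)
)
print(per_year)
```

The resulting table has one row per year and one column per type, so `per_year.plot(kind="bar")` would give grouped bars for Movies and TV shows.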

From the second plot, we can see the top 20 genres that have been added by Netflix from 2008 to 2021. The plot shows that the most common genre is “International Movies”, followed by “Dramas” and “Comedies”. We can also see that the majority of the top 20 genres are movie genres, with only a few TV show genres making the list. This suggests that Netflix has been focusing more on adding movies than TV shows to its platform.
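Counting individual genres such as "International Movies" requires splitting the comma-separated `listed_in` values first. The original code for this plot is not shown, so the following is only a sketch of one possible approach, on made-up sample data.

```python
import pandas as pd

# Made-up sample data standing in for the 'listed_in' column
sample = pd.DataFrame({
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, International Movies",
        "Dramas",
        "Documentaries",
    ],
})

genre_counts = (
    sample["listed_in"]
    .str.split(",")   # "A, B" -> ["A", " B"]
    .explode()        # one row per genre
    .str.strip()      # remove the leftover spaces
    .value_counts()
)

# Use head(20) on the real data to get the top 20
top_genres = genre_counts.head(3)
print(top_genres)
```

A title listed under several genres is counted once for each of them, so the totals add up to more than the number of titles.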

Top 5 Directors: The code identifies the five directors with the most movies. The list includes Rajiv Chilaka, Raúl Campos, Jan Suter, Suhas Kadav, and Marcus Raboy.

Top 5 Actors: The code also identifies the five actors with the most movies. The list includes Anupam Kher, Rupa Bhimani, Takahiro Sakurai, Julie Tejwani, and Om Puri.
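The code for these two rankings is not shown here. A possible sketch follows, assuming that names are comma-separated and that the placeholders introduced during cleaning ("No Director", "No Cast") should be excluded. `top_names` is a hypothetical helper and the sample data is made up. Splitting the names may give a different ranking from counting the raw column values, where co-directors appear as one combined string.

```python
import pandas as pd

def top_names(frame, column, placeholder, n=5):
    """Count individual names in a comma-separated column, skipping the placeholder."""
    names = frame[column].dropna().str.split(",").explode().str.strip()
    names = names[names != placeholder]
    return names.value_counts().head(n)

# Made-up sample data standing in for the 'director' and 'cast' columns
sample = pd.DataFrame({
    "director": ["Director A, Director B", "Director B", "No Director"],
    "cast": ["Actor X, Actor Y", "Actor X", "No Cast"],
})

top_directors = top_names(sample, "director", "No Director")
top_actors = top_names(sample, "cast", "No Cast")
print(top_directors)
print(top_actors)
```

To restrict the ranking to movies, as described above, the function could be called on `df[df["type"] == "Movie"]`.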

TV Shows with the Most Seasons: The code identifies the five TV shows with the most seasons. The output includes the title, duration, type, and number of seasons of each show. The TV show with the most seasons has 16 seasons.
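As the code for this step is not shown, here is a sketch of how the season count could be derived, assuming `duration` holds strings such as "2 Seasons" for TV shows and "90 min" for movies. The sample data and the `seasons` column name are made up for illustration.

```python
import pandas as pd

# Made-up sample data standing in for the relevant columns
sample = pd.DataFrame({
    "title": ["Show A", "Show B", "Film C", "Show D"],
    "type": ["TV Show", "TV Show", "Movie", "TV Show"],
    "duration": ["2 Seasons", "1 Season", "90 min", "5 Seasons"],
})

# Keep TV shows only, then pull the leading number out of 'duration'
shows = sample[sample["type"] == "TV Show"].copy()
shows["seasons"] = (
    shows["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Use nlargest(5, ...) on the real data to get the top 5
most_seasons = shows.nlargest(2, "seasons")[
    ["title", "duration", "type", "seasons"]
]
print(most_seasons)
```

Converting the text to an integer matters here: sorting the raw strings would place "9 Seasons" after "16 Seasons".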

Conclusion

In conclusion, using Python and various data analysis libraries, we were able to gain valuable insights into the content on Netflix. These insights can help Netflix make better decisions about the content they add to their platform and how they market it to their users. Data analysis is a powerful tool that can be used in any industry to gain insights and make informed decisions. By using Python and data analysis techniques, we can gain a deeper understanding of any dataset and make data-driven decisions.

I hope you enjoyed this article. Thank you for reading!

