Project: Investigate a Dataset from TMDb_Movies Dataset

Table of Contents

1. Introduction

2. Data Wrangling

3. Exploratory Data Analysis

4. Conclusions

1. Introduction

I'll work with TMDb_Movies Dataset, and try to answer the following questions;
  1. Data Cleaning - Filling zero Values
  2. what is the average budget
  3. Creating new features
  4. What are the variables that affect a film's revenue and popularity?
  5. Ratings Distribution since 1960
  6. What are the top rated movies?

Data and Setup

Import visualization libraries and set %matplotlib inline and Importing numpy and pandas
In [8]:
import pandas as pd
import numpy as np
import operator
import csv
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
Read in the csv file as a DataFrame called df_movies from TMDb_Movies Dataset.
In [9]:
#loading the csv file

df_movies = pd.read_csv('tmdb-movies.csv')

2 Data Wrangling

Check the head of df_movies from TMDb_Movies Dataset.
In [10]:
# first of all, I have to print the first 5 rows and define columns from TMDb_Movies Dataset.

df_movies.head()
Out[10]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns

Checking the Dataset of df_movies, by using the function '.shape'
In [11]:
df_movies.shape
Out[11]:
(10866, 21)

Here we can see that we have 10866 rows and 21 columns

Checking the tail of df_movies from TMDb_Movies Dataset, using the function 'tail()'
In [12]:
df_movies.tail()
Out[12]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
10861 21 tt0060371 0.080598 0 0 The Endless Summer Michael Hynson|Robert August|Lord 'Tally Ho' B... NaN Bruce Brown NaN ... The Endless Summer, by Bruce Brown, is one of ... 95 Documentary Bruce Brown Films 6/15/66 11 7.4 1966 0.000000 0.0
10862 20379 tt0060472 0.065543 0 0 Grand Prix James Garner|Eva Marie Saint|Yves Montand|Tosh... NaN John Frankenheimer Cinerama sweeps YOU into a drama of speed and ... ... Grand Prix driver Pete Aron is fired by his te... 176 Action|Adventure|Drama Cherokee Productions|Joel Productions|Douglas ... 12/21/66 20 5.7 1966 0.000000 0.0
10863 39768 tt0060161 0.065141 0 0 Beregis Avtomobilya Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z... NaN Eldar Ryazanov NaN ... An insurance agent who moonlights as a carthie... 94 Mystery|Comedy Mosfilm 1/1/66 11 6.5 1966 0.000000 0.0
10864 21449 tt0061177 0.064317 0 0 What's Up, Tiger Lily? Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh... NaN Woody Allen WOODY ALLEN STRIKES BACK! ... In comic Woody Allen's film debut, he took the... 80 Action|Comedy Benedict Pictures Corp. 11/2/66 22 5.4 1966 0.000000 0.0
10865 22293 tt0060666 0.035919 19000 0 Manos: The Hands of Fate Harold P. Warren|Tom Neyman|John Reynolds|Dian... NaN Harold P. Warren It's Shocking! It's Beyond Your Imagination! ... A family gets lost on the road and stumbles up... 74 Horror Norm-Iris 11/15/66 15 1.5 1966 127642.279154 0.0

5 rows × 21 columns

I need to check the contents of my database from TMDb_Movies Dataset, using the info() function
In [13]:
df_movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
Checking type from TMDb_Movies Dataset,
In [28]:
df_movies.dtypes
Out[28]:
id                        int64
imdb_id                  object
popularity              float64
budget                  float64
revenue                 float64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                 float64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

3. Exploratory Data Analysis

Data Visualization

To create a histogram, I used pandas hist() method. Calling the hist() method on a pandas dataframe will return histograms for all non-nuisance series in the dataframe.

In [5]:
df_movies.hist(figsize=(20,20));
In [16]:
df_movies.describe()
Out[16]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09
Check the columns from TMDb_Movies Dataset.
In [23]:
df_movies.columns
Out[23]:
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

Q1 Data Cleaning - Filling zero Values

Looking at the data above, budget, revenue and runtime all contain values of 0. Fill these in with the average of each column.

In [29]:
# Get average of budget_adj
print(df_movies['budget'].mean())
22291841.833525978
In [30]:
# Get average of revenue
print(df_movies['revenue'].mean())
61877249.67804794
In [31]:
# Get average of runtime
print(df_movies['runtime'].mean())
102.36301856976878
In [32]:
# Replace 0 values with mean
df_movies['budget'] = df_movies['budget'].replace(0, 14625701.09414688)
In [33]:
# Replace 0 values with mean
df_movies['revenue'] = df_movies['revenue'].replace(0, 39823319.79339223)
In [34]:
# Replace 0 values with mean
df_movies['runtime'] = df_movies['runtime'].replace(0, 102.07086324314375)
df_movies.describe()
Out[34]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10865.000000 10865.000000 1.086500e+04 1.086500e+04 10865.000000 10865.000000 10865.000000 10865.000000 1.086500e+04 1.086500e+04
mean 66066.374413 0.646446 2.229184e+07 6.187725e+07 102.363019 217.399632 5.975012 2001.321859 1.754989e+07 5.136900e+07
std 92134.091971 1.000231 2.801364e+07 1.110239e+08 30.904043 575.644627 0.935138 12.813260 3.430753e+07 1.446383e+08
min 5.000000 0.000065 1.000000e+00 2.000000e+00 2.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.000000 0.207575 1.462570e+07 3.982332e+07 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20662.000000 0.383831 1.462570e+07 3.982332e+07 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75612.000000 0.713857 1.500000e+07 3.982332e+07 111.000000 146.000000 6.600000 2011.000000 2.085325e+07 3.370173e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09

After checking our new data, we can clearly notice the boxes with the mention 0 have been replaced by the average of budget, revenue and runtime.

Data and duplicates cleaning
Finding and removing duplicate rows
In [41]:
# Find out if there are any duplicate rows
sum(df_movies.duplicated())
Out[41]:
0
In [42]:
# Remove the duplicated rows
df_movies.drop_duplicates(inplace=True)

Q2 what is the average budget

To find the budget average from TMDb_Movies Dataset, by using Mean method
In [38]:
df_movies.budget.mean()
Out[38]:
22291841.833525978
What are the top 5 Budget from TMDb_Movies Dataset?
In [39]:
df_movies['budget'].value_counts().head(5)
Out[39]:
1.462570e+07    5696
2.000000e+07     190
1.500000e+07     183
2.500000e+07     178
1.000000e+07     176
Name: budget, dtype: int64
Histogram for budget
In [43]:
df_movies.budget.plot(kind = 'hist', color = 'red', bins = 25)
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f7890b9a08>
In [34]:
# plot relationship between budget and runtime output
df_movies.plot(x='budget', y='runtime', kind='scatter');
In [35]:
# plot relationship between budget and revenue output
df_movies.plot(x='budget', y='revenue', kind='scatter');
In [36]:
# plot relationship between budget and release_year output
df_movies.plot(x='budget', y='release_year', kind='scatter');
Take a look at the 'budget' column, how many unique title are from TMDb_Movies Dataset?
In [39]:
df_movies['budget'].nunique()
Out[39]:
557

Q3 Creating new features

Using .apply() with a custom lambda expression to create a new column called "date" that contains this string value.

In [46]:
df_movies['release_date'].iloc[0]
Out[46]:
'6/9/15'
In [47]:
df_movies['date'] = df_movies['release_date'].apply(lambda original_title: original_title.split(':')[0])
In [48]:
df_movies['date'].value_counts(10)
Out[48]:
1/1/09      0.002577
1/1/08      0.001933
1/1/07      0.001657
1/1/05      0.001472
10/10/14    0.001380
              ...   
9/16/02     0.000092
7/17/70     0.000092
4/15/05     0.000092
7/3/91      0.000092
7/28/11     0.000092
Name: date, Length: 5909, dtype: float64
In [7]:
df_movies['popularity'].unique()
Out[7]:
array([32.985763, 28.419936, 13.112507, ...,  0.065141,  0.064317,
        0.035919])
In [5]:
 
Out[5]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
id 1.000000 -0.014350 -0.141351 -0.099227 -0.088360 -0.035551 -0.058363 0.511364 -0.189015 -0.138477
popularity -0.014350 1.000000 0.545472 0.663358 0.139033 0.800828 0.209511 0.089801 0.513550 0.609083
budget -0.141351 0.545472 1.000000 0.734901 0.191283 0.632702 0.081014 0.115931 0.968963 0.622505
revenue -0.099227 0.663358 0.734901 1.000000 0.162838 0.791175 0.172564 0.057048 0.706427 0.919110
runtime -0.088360 0.139033 0.191283 0.162838 1.000000 0.163278 0.156835 -0.117204 0.221114 0.175676
vote_count -0.035551 0.800828 0.632702 0.791175 0.163278 1.000000 0.253823 0.107948 0.587051 0.707942
vote_average -0.058363 0.209511 0.081014 0.172564 0.156835 0.253823 1.000000 -0.117632 0.093039 0.193085
release_year 0.511364 0.089801 0.115931 0.057048 -0.117204 0.107948 -0.117632 1.000000 0.016793 -0.066256
budget_adj -0.189015 0.513550 0.968963 0.706427 0.221114 0.587051 0.093039 0.016793 1.000000 0.646607
revenue_adj -0.138477 0.609083 0.622505 0.919110 0.175676 0.707942 0.193085 -0.066256 0.646607 1.000000
Now use seaborn to create a countplot from TMDb_Movies Dataset.
Checking Vote_averange by using plot
In [56]:
df_movies.vote_average.plot(kind = 'hist', color = 'red', bins = 25)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f78fe94b88>
let us visualize also boxplot as kind
In [57]:
x = df_movies.vote_average.plot(kind = 'box') 

Q4 Ratings Distribution since 1960

The following code creates a graph showing the distribution of ratings by year.
In [58]:
sns.set(rc={'figure.figsize':(15,15)}, font_scale=1.3)

temp_df = df_movies[["release_year", "vote_average"]]


sns.set_style("whitegrid")
ax = sns.violinplot(x = temp_df.vote_average, y = temp_df.release_year, orient ="h")

ax.set(xlabel='movie ratings distributions', ylabel='years', title = 'movie ratings distributions per year')
plt.show()

release_year in percent

In [55]:
df = pd.DataFrame(dict(x=np.random.poisson(4, 500)))
ax = sns.barplot(x="release_year", y="release_year", data=df_movies, estimator=lambda x: len(x) / len(df_movies) * 100)
ax.set(ylabel="Percent")
Out[55]:
[Text(0, 0.5, 'Percent')]

Q5 What are the variables that affect a film's revenue and popularity?

In [59]:
df_movies.corr()
Out[59]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
id 1.000000 -0.014351 -0.100563 -0.067241 -0.078432 -0.035555 -0.058391 0.511393 -0.189008 -0.138487
popularity -0.014351 1.000000 0.516044 0.635291 0.136851 0.800828 0.209517 0.089806 0.513555 0.609085
budget -0.100563 0.516044 1.000000 0.705462 0.171059 0.615702 0.074241 0.129276 0.929266 0.599700
revenue -0.067241 0.635291 0.705462 1.000000 0.141847 0.772944 0.152055 0.065170 0.662337 0.898107
runtime -0.078432 0.136851 0.171059 0.141847 1.000000 0.162584 0.154552 -0.112168 0.219841 0.175041
vote_count -0.035555 0.800828 0.615702 0.772944 0.162584 1.000000 0.253818 0.107962 0.587062 0.707941
vote_average -0.058391 0.209517 0.074241 0.152055 0.154552 0.253818 1.000000 -0.117576 0.093079 0.193062
release_year 0.511393 0.089806 0.129276 0.065170 -0.112168 0.107962 -0.117576 1.000000 0.016771 -0.066236
budget_adj -0.189008 0.513555 0.929266 0.662337 0.219841 0.587062 0.093079 0.016771 1.000000 0.646627
revenue_adj -0.138487 0.609085 0.599700 0.898107 0.175041 0.707941 0.193062 -0.066236 0.646627 1.000000
In case of revenue
Strong correlation with popularity, budget and vote count.
Weak correlation with runtime.

In case of popularity,
Moderate correlation with budget.
Strong correlation with revenue, profit and loss and vote count

We will now examine one of each level of correlation.

In [61]:
# Plotting scatterplots to view correlation visually

sns.regplot(x = df_movies['budget'], y = df_movies['revenue'], fit_reg = False)
# Obtaining plot size
fig_size = plt.rcParams["figure.figsize"]

# Changing the length and width of the plot
fig_size[0] = 12
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

plt.title('Budget vs Revenue', fontsize = 18)
plt.xlabel('Budget', fontsize = 16)
plt.ylabel('Revenue', fontsize = 16);
In [ ]:
 
In [69]:
#function which will take any column as argument from and keep its track
def calculate_count(column):
    # Convert column to string and seperate it by '|'
    data = df_movies[column].str.cat(sep = '|')
    
    # Storing the values seperately in a Pandas series
    data = pd.Series(data.split('|'))
    count = data.value_counts(ascending = False)
    
    return count
In [70]:
#variable to store the retured value#variabl 
count = calculate_count('genres')

#printing top 5 values
count.head()
Out[70]:
Drama       4760
Comedy      3793
Thriller    2907
Action      2384
Romance     1712
dtype: int64
In [71]:
count.plot(kind='pie', figsize = (12, 12));

4. Conclusions

In conclusion, after observation and analysis of the data, I would say that the world of cinema has evolved considerably. From 1960 to 2015, the gap is really considerable, in terms of allocated budget, popularity and others. In terms of percentage, I would say the cement has reached the top, with the budgets out of the norm and the popularity. Talking about the genre of film, according to my analysis, films in the Drama genre have more popularity, followed by Comedy, Thriller, Action and then Romance, which are the top 5.

In [ ]: