Project: Investigate a Dataset from TMDb_Movies Dataset¶

Table of Contents¶

1. Introduction¶

2. Data Wrangling¶

3. Exploratory Data Analysis¶

4. Conclusions¶

1. Introduction¶

I'll work with TMDb_Movies Dataset, and try to answer the following questions;¶

Data Cleaning - Filling zero Values
what is the average budget
Creating new features
What are the variables that affect a film's revenue and popularity?
Ratings Distribution since 1960
What are the top rated movies?

Data and Setup¶

Import visualization libraries and set %matplotlib inline and Importing numpy and pandas¶

import pandas as pd
import numpy as np
import operator
import csv
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline

Read in the csv file as a DataFrame called df_movies from TMDb_Movies Dataset.¶

#loading the csv file

df_movies = pd.read_csv('tmdb-movies.csv')

2 Data Wrangling¶

Check the head of df_movies from TMDb_Movies Dataset.¶

# first of all, I have to print the first 5 rows and define columns from TMDb_Movies Dataset.

df_movies.head()

Checking the Dataset of df_movies, by using the function '.shape'¶

df_movies.shape

(10866, 21)

Here we can see that we have 10866 rows and 21 columns

Checking the tail of df_movies from TMDb_Movies Dataset, using the function 'tail()'¶

df_movies.tail()

I need to check the contents of my database from TMDb_Movies Dataset, using the info() function¶

df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Checking type from TMDb_Movies Dataset,¶

df_movies.dtypes

id                        int64
imdb_id                  object
popularity              float64
budget                  float64
revenue                 float64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                 float64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

3. Exploratory Data Analysis¶

Data Visualization¶

To create a histogram, I used pandas hist() method. Calling the hist() method on a pandas dataframe will return histograms for all non-nuisance series in the dataframe.

df_movies.hist(figsize=(20,20));

df_movies.describe()

Check the columns from TMDb_Movies Dataset.¶

df_movies.columns

Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

Q1 Data Cleaning - Filling zero Values¶

Looking at the data above, budget, revenue and runtime all contain values of 0. Fill these in with the average of each column.

# Get average of budget_adj
print(df_movies['budget'].mean())

22291841.833525978

# Get average of revenue
print(df_movies['revenue'].mean())

61877249.67804794

# Get average of runtime
print(df_movies['runtime'].mean())

102.36301856976878

# Replace 0 values with mean
df_movies['budget'] = df_movies['budget'].replace(0, 14625701.09414688)

# Replace 0 values with mean
df_movies['revenue'] = df_movies['revenue'].replace(0, 39823319.79339223)

# Replace 0 values with mean
df_movies['runtime'] = df_movies['runtime'].replace(0, 102.07086324314375)
df_movies.describe()

After checking our new data, we can clearly notice the boxes with the mention 0 have been replaced by the average of budget, revenue and runtime.

Data and duplicates cleaning¶

Finding and removing duplicate rows¶

# Find out if there are any duplicate rows
sum(df_movies.duplicated())

0

# Remove the duplicated rows
df_movies.drop_duplicates(inplace=True)

Q2 what is the average budget¶

To find the budget average from TMDb_Movies Dataset, by using Mean method¶

df_movies.budget.mean()

22291841.833525978

What are the top 5 Budget from TMDb_Movies Dataset?¶

df_movies['budget'].value_counts().head(5)

1.462570e+07    5696
2.000000e+07     190
1.500000e+07     183
2.500000e+07     178
1.000000e+07     176
Name: budget, dtype: int64

Histogram for budget¶

df_movies.budget.plot(kind = 'hist', color = 'red', bins = 25)

<matplotlib.axes._subplots.AxesSubplot at 0x1f7890b9a08>

# plot relationship between budget and runtime output
df_movies.plot(x='budget', y='runtime', kind='scatter');

# plot relationship between budget and revenue output
df_movies.plot(x='budget', y='revenue', kind='scatter');

# plot relationship between budget and release_year output
df_movies.plot(x='budget', y='release_year', kind='scatter');

Take a look at the 'budget' column, how many unique title are from TMDb_Movies Dataset?¶

df_movies['budget'].nunique()

557

Q3 Creating new features¶

Using .apply() with a custom lambda expression to create a new column called "date" that contains this string value.

df_movies['release_date'].iloc[0]

'6/9/15'

df_movies['date'] = df_movies['release_date'].apply(lambda original_title: original_title.split(':')[0])

df_movies['date'].value_counts(10)

1/1/09      0.002577
1/1/08      0.001933
1/1/07      0.001657
1/1/05      0.001472
10/10/14    0.001380
              ...   
9/16/02     0.000092
7/17/70     0.000092
4/15/05     0.000092
7/3/91      0.000092
7/28/11     0.000092
Name: date, Length: 5909, dtype: float64

df_movies['popularity'].unique()

array([32.985763, 28.419936, 13.112507, ...,  0.065141,  0.064317,
        0.035919])

Now use seaborn to create a countplot from TMDb_Movies Dataset.¶

Checking Vote_averange by using plot¶

df_movies.vote_average.plot(kind = 'hist', color = 'red', bins = 25)

<matplotlib.axes._subplots.AxesSubplot at 0x1f78fe94b88>

let us visualize also boxplot as kind¶

x = df_movies.vote_average.plot(kind = 'box')

Q4 Ratings Distribution since 1960¶

The following code creates a graph showing the distribution of ratings by year.¶

sns.set(rc={'figure.figsize':(15,15)}, font_scale=1.3)

temp_df = df_movies[["release_year", "vote_average"]]


sns.set_style("whitegrid")
ax = sns.violinplot(x = temp_df.vote_average, y = temp_df.release_year, orient ="h")

ax.set(xlabel='movie ratings distributions', ylabel='years', title = 'movie ratings distributions per year')
plt.show()

release_year in percent¶

df = pd.DataFrame(dict(x=np.random.poisson(4, 500)))
ax = sns.barplot(x="release_year", y="release_year", data=df_movies, estimator=lambda x: len(x) / len(df_movies) * 100)
ax.set(ylabel="Percent")

[Text(0, 0.5, 'Percent')]

Q5 What are the variables that affect a film's revenue and popularity?¶

df_movies.corr()

In case of revenue¶

Strong correlation with popularity, budget and vote count.
Weak correlation with runtime.

In case of popularity,¶

Moderate correlation with budget.
Strong correlation with revenue, profit and loss and vote count

We will now examine one of each level of correlation.¶

# Plotting scatterplots to view correlation visually

sns.regplot(x = df_movies['budget'], y = df_movies['revenue'], fit_reg = False)
# Obtaining plot size
fig_size = plt.rcParams["figure.figsize"]

# Changing the length and width of the plot
fig_size[0] = 12
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

plt.title('Budget vs Revenue', fontsize = 18)
plt.xlabel('Budget', fontsize = 16)
plt.ylabel('Revenue', fontsize = 16);

#function which will take any column as argument from and keep its track
def calculate_count(column):
    # Convert column to string and seperate it by '|'
    data = df_movies[column].str.cat(sep = '|')
    
    # Storing the values seperately in a Pandas series
    data = pd.Series(data.split('|'))
    count = data.value_counts(ascending = False)
    
    return count

#variable to store the retured value#variabl 
count = calculate_count('genres')

#printing top 5 values
count.head()

Drama       4760
Comedy      3793
Thriller    2907
Action      2384
Romance     1712
dtype: int64

count.plot(kind='pie', figsize = (12, 12));

4. Conclusions¶

In conclusion, after observation and analysis of the data, I would say that the world of cinema has evolved considerably. From 1960 to 2015, the gap is really considerable, in terms of allocated budget, popularity and others. In terms of percentage, I would say the cement has reached the top, with the budgets out of the norm and the popularity. Talking about the genre of film, according to my analysis, films in the Drama genre have more popularity, followed by Comedy, Thriller, Action and then Romance, which are the top 5.

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	...	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/15	5292	7.5	2015	1.839999e+08	1.902723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	...	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/15	2947	7.3	2015	1.747999e+08	1.385749e+09

	id	imdb_id	popularity	budget	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj
10861	21	tt0060371	0.080598	0	The Endless Summer	Michael Hynson\|Robert August\|Lord 'Tally Ho' B...	NaN	Bruce Brown	NaN	...	The Endless Summer, by Bruce Brown, is one of ...	95	Documentary	Bruce Brown Films	6/15/66	11	7.4	1966	0.000000
10862	20379	tt0060472	0.065543	0	Grand Prix	James Garner\|Eva Marie Saint\|Yves Montand\|Tosh...	NaN	John Frankenheimer	Cinerama sweeps YOU into a drama of speed and ...	...	Grand Prix driver Pete Aron is fired by his te...	176	Action\|Adventure\|Drama	Cherokee Productions\|Joel Productions\|Douglas ...	12/21/66	20	5.7	1966	0.000000
10863	39768	tt0060161	0.065141	0	Beregis Avtomobilya	Innokentiy Smoktunovskiy\|Oleg Efremov\|Georgi Z...	NaN	Eldar Ryazanov	NaN	...	An insurance agent who moonlights as a carthie...	94	Mystery\|Comedy	Mosfilm	1/1/66	11	6.5	1966	0.000000
10864	21449	tt0061177	0.064317	0	What's Up, Tiger Lily?	Tatsuya Mihashi\|Akiko Wakabayashi\|Mie Hama\|Joh...	NaN	Woody Allen	WOODY ALLEN STRIKES BACK!	...	In comic Woody Allen's film debut, he took the...	80	Action\|Comedy	Benedict Pictures Corp.	11/2/66	22	5.4	1966	0.000000
10865	22293	tt0060666	0.035919	19000	Manos: The Hands of Fate	Harold P. Warren\|Tom Neyman\|John Reynolds\|Dian...	NaN	Harold P. Warren	It's Shocking! It's Beyond Your Imagination!	...	A family gets lost on the road and stumbles up...	74	Horror	Norm-Iris	11/15/66	15	1.5	1966	127642.279154

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10866.000000	10866.000000	1.086600e+04	1.086600e+04	10866.000000	10866.000000	10866.000000	10866.000000	1.086600e+04	1.086600e+04
mean	66064.177434	0.646441	1.462570e+07	3.982332e+07	102.070863	217.389748	5.974922	2001.322658	1.755104e+07	5.136436e+07
std	92130.136561	1.000185	3.091321e+07	1.170035e+08	31.381405	575.619058	0.935142	12.812941	3.430616e+07	1.446325e+08
min	5.000000	0.000065	0.000000e+00	0.000000e+00	0.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	10596.250000	0.207583	0.000000e+00	0.000000e+00	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	20669.000000	0.383856	0.000000e+00	0.000000e+00	99.000000	38.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	75610.000000	0.713817	1.500000e+07	2.400000e+07	111.000000	145.750000	6.600000	2011.000000	2.085325e+07	3.369710e+07
max	417859.000000	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10865.000000	10865.000000	1.086500e+04	1.086500e+04	10865.000000	10865.000000	10865.000000	10865.000000	1.086500e+04	1.086500e+04
mean	66066.374413	0.646446	2.229184e+07	6.187725e+07	102.363019	217.399632	5.975012	2001.321859	1.754989e+07	5.136900e+07
std	92134.091971	1.000231	2.801364e+07	1.110239e+08	30.904043	575.644627	0.935138	12.813260	3.430753e+07	1.446383e+08
min	5.000000	0.000065	1.000000e+00	2.000000e+00	2.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	10596.000000	0.207575	1.462570e+07	3.982332e+07	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	20662.000000	0.383831	1.462570e+07	3.982332e+07	99.000000	38.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	75612.000000	0.713857	1.500000e+07	3.982332e+07	111.000000	146.000000	6.600000	2011.000000	2.085325e+07	3.370173e+07
max	417859.000000	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
id	1.000000	-0.014350	-0.141351	-0.099227	-0.088360	-0.035551	-0.058363	0.511364	-0.189015	-0.138477
popularity	-0.014350	1.000000	0.545472	0.663358	0.139033	0.800828	0.209511	0.089801	0.513550	0.609083
budget	-0.141351	0.545472	1.000000	0.734901	0.191283	0.632702	0.081014	0.115931	0.968963	0.622505
revenue	-0.099227	0.663358	0.734901	1.000000	0.162838	0.791175	0.172564	0.057048	0.706427	0.919110
runtime	-0.088360	0.139033	0.191283	0.162838	1.000000	0.163278	0.156835	-0.117204	0.221114	0.175676
vote_count	-0.035551	0.800828	0.632702	0.791175	0.163278	1.000000	0.253823	0.107948	0.587051	0.707942
vote_average	-0.058363	0.209511	0.081014	0.172564	0.156835	0.253823	1.000000	-0.117632	0.093039	0.193085
release_year	0.511364	0.089801	0.115931	0.057048	-0.117204	0.107948	-0.117632	1.000000	0.016793	-0.066256
budget_adj	-0.189015	0.513550	0.968963	0.706427	0.221114	0.587051	0.093039	0.016793	1.000000	0.646607
revenue_adj	-0.138477	0.609083	0.622505	0.919110	0.175676	0.707942	0.193085	-0.066256	0.646607	1.000000

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
id	1.000000	-0.014351	-0.100563	-0.067241	-0.078432	-0.035555	-0.058391	0.511393	-0.189008	-0.138487
popularity	-0.014351	1.000000	0.516044	0.635291	0.136851	0.800828	0.209517	0.089806	0.513555	0.609085
budget	-0.100563	0.516044	1.000000	0.705462	0.171059	0.615702	0.074241	0.129276	0.929266	0.599700
revenue	-0.067241	0.635291	0.705462	1.000000	0.141847	0.772944	0.152055	0.065170	0.662337	0.898107
runtime	-0.078432	0.136851	0.171059	0.141847	1.000000	0.162584	0.154552	-0.112168	0.219841	0.175041
vote_count	-0.035555	0.800828	0.615702	0.772944	0.162584	1.000000	0.253818	0.107962	0.587062	0.707941
vote_average	-0.058391	0.209517	0.074241	0.152055	0.154552	0.253818	1.000000	-0.117576	0.093079	0.193062
release_year	0.511393	0.089806	0.129276	0.065170	-0.112168	0.107962	-0.117576	1.000000	0.016771	-0.066236
budget_adj	-0.189008	0.513555	0.929266	0.662337	0.219841	0.587062	0.093079	0.016771	1.000000	0.646627
revenue_adj	-0.138487	0.609085	0.599700	0.898107	0.175041	0.707941	0.193062	-0.066236	0.646627	1.000000