import pandas as pd
import numpy as np
import operator
import csv
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
# Load the CSV file
df_movies = pd.read_csv('tmdb-movies.csv')
# Print the first 5 rows to inspect the columns of the TMDb movies dataset
df_movies.head()
df_movies.shape
The dataset has 10866 rows and 21 columns.
df_movies.tail()
df_movies.info()
df_movies.dtypes
To create the histograms, I used the pandas hist() method. Calling hist() on a DataFrame draws one histogram for each numeric column.
df_movies.hist(figsize=(20,20));
df_movies.describe()
df_movies.columns
The summary above shows that budget, revenue and runtime all contain values of 0. These zeros are filled in with the mean of each column.
# Mean of budget (zeros included)
budget_mean = df_movies['budget'].mean()
print(budget_mean)
# Mean of revenue (zeros included)
revenue_mean = df_movies['revenue'].mean()
print(revenue_mean)
# Mean of runtime (zeros included)
runtime_mean = df_movies['runtime'].mean()
print(runtime_mean)
# Replace 0 values in budget with the column mean
df_movies['budget'] = df_movies['budget'].replace(0, budget_mean)
# Replace 0 values in revenue with the column mean
df_movies['revenue'] = df_movies['revenue'].replace(0, revenue_mean)
# Replace 0 values in runtime with the column mean
df_movies['runtime'] = df_movies['runtime'].replace(0, runtime_mean)
df_movies.describe()
In the updated summary, the minimum of budget, revenue and runtime is no longer 0: the zeros have been replaced by the mean of each column.
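The effect of this replacement can be checked on a small example. The sketch below uses a made-up Series (not the TMDb data) and compares the mean computed with the zeros included, as in this notebook, against the mean of the non-zero values only, which is one possible alternative and not the method used here.

```python
import pandas as pd

# Made-up budget values for illustration only (not taken from the dataset)
budget = pd.Series([0, 100, 200, 0], name="budget")

# Mean with the zeros included, as computed in this notebook
mean_with_zeros = budget.mean()

# Alternative (an assumption, not the notebook's method): ignore the zeros
mean_without_zeros = budget[budget != 0].mean()

# Fill the zeros using each of the two means
filled_as_in_notebook = budget.replace(0, mean_with_zeros)
filled_alternative = budget.replace(0, mean_without_zeros)

print(mean_with_zeros, mean_without_zeros)
print(filled_as_in_notebook.tolist())
print(filled_alternative.tolist())
```

Because the zeros pull the first mean down, the two fills can differ noticeably when many values are 0.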
# Find out if there are any duplicate rows
sum(df_movies.duplicated())
# Remove the duplicated rows
df_movies.drop_duplicates(inplace=True)
df_movies.budget.mean()
df_movies['budget'].value_counts().head(5)
df_movies.budget.plot(kind = 'hist', color = 'red', bins = 25)
# plot relationship between budget and runtime output
df_movies.plot(x='budget', y='runtime', kind='scatter');
# plot relationship between budget and revenue output
df_movies.plot(x='budget', y='revenue', kind='scatter');
# plot relationship between budget and release_year output
df_movies.plot(x='budget', y='release_year', kind='scatter');
df_movies['budget'].nunique()
Use .apply() with a lambda expression to create a new column called "date" from release_date, keeping the text before the first ':' in each string.
df_movies['release_date'].iloc[0]
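# Keep the text before the first ':' of each release_date string
df_movies['date'] = df_movies['release_date'].apply(lambda release_date: release_date.split(':')[0])
# Show the 10 most frequent values
df_movies['date'].value_counts().head(10)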
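How .apply() and split() interact depends on the separator actually present in the strings. The sketch below uses made-up date strings and assumes a month/day/year layout separated by '/', which is an assumption about the format and not something confirmed above.

```python
import pandas as pd

# Made-up date strings; the '/'-separated layout is an assumption
dates = pd.Series(["6/9/15", "5/13/15", "6/9/15"])

# Splitting on ':' leaves a string unchanged when it contains no ':'
before_colon = dates.apply(lambda s: s.split(":")[0])

# Splitting on '/' keeps only the first field (the month, under the assumed layout)
month = dates.apply(lambda s: s.split("/")[0])

# Most frequent values first, limited to at most 10 rows
top_months = month.value_counts().head(10)
print(top_months)
```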
df_movies['popularity'].unique()
df_movies.vote_average.plot(kind = 'hist', color = 'red', bins = 25)
x = df_movies.vote_average.plot(kind = 'box')
sns.set(rc={'figure.figsize':(15,15)}, font_scale=1.3)
temp_df = df_movies[["release_year", "vote_average"]]
sns.set_style("whitegrid")
ax = sns.violinplot(x = temp_df.vote_average, y = temp_df.release_year, orient ="h")
ax.set(xlabel='movie ratings distributions', ylabel='years', title = 'movie ratings distributions per year')
plt.show()
# Bar plot of the percentage of movies released in each year
ax = sns.barplot(x="release_year", y="release_year", data=df_movies, estimator=lambda x: len(x) / len(df_movies) * 100)
ax.set(ylabel="Percent")
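The percentages behind this bar plot can also be computed as a table. This is a minimal sketch on made-up years, using value_counts(normalize=True) as an alternative to the custom estimator.

```python
import pandas as pd

# Made-up release years for illustration only
years = pd.Series([2014, 2015, 2015, 2015], name="release_year")

# Share of rows per year, converted to percent and ordered by year
percent_per_year = years.value_counts(normalize=True).sort_index() * 100
print(percent_per_year)
```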
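# Pairwise correlation of the numeric columns (numeric_only requires pandas 1.5 or later)
df_movies.corr(numeric_only=True)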
# Set the default figure size (width, height in inches) before plotting
plt.rcParams["figure.figsize"] = (12, 8)
# Scatter plot of budget against revenue to view the correlation visually
sns.regplot(x = df_movies['budget'], y = df_movies['revenue'], fit_reg = False)
plt.title('Budget vs Revenue', fontsize = 18)
plt.xlabel('Budget', fontsize = 16)
plt.ylabel('Revenue', fontsize = 16);
# Count how often each '|'-separated value occurs in the given column
def calculate_count(column):
    # Join the column's strings into one string, separated by '|'
    data = df_movies[column].str.cat(sep = '|')
    # Split into individual values and store them in a pandas Series
    data = pd.Series(data.split('|'))
    # Count occurrences, most frequent first
    count = data.value_counts(ascending = False)
    return count
# Store the returned counts
count = calculate_count('genres')
#printing top 5 values
count.head()
count.plot(kind='pie', figsize = (12, 12));
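The counting step can be tried on a small example. The sketch below uses a made-up genres column and compares the join-then-split approach with str.split() followed by explode(), which is an alternative and not the method used in this notebook.

```python
import pandas as pd

# Made-up genres column, including one missing value
toy = pd.DataFrame({"genres": ["Drama|Comedy", "Drama", None, "Action|Drama"]})

# Join-then-split: str.cat() skips missing values by default
joined = toy["genres"].str.cat(sep="|")
counts_join = pd.Series(joined.split("|")).value_counts()

# Alternative: split each row into a list, then give every genre its own row
counts_explode = toy["genres"].str.split("|").explode().value_counts()

print(counts_join)
print(counts_explode)
```

Both approaches count the same values; explode() keeps the link to the original row index, which can help when genres need to be related to other columns.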
In conclusion, the data suggest that cinema changed considerably between 1960 and 2015, notably in allocated budgets and popularity. In percentage terms, cinema appears to have reached a peak in the most recent years, with unusually high budgets and popularity. As for genre, Drama appears most often in the dataset, followed by Comedy, Thriller, Action and then Romance, which together make up the top 5.