A Chinese automobile company, Teclov_chinese, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts. It has contracted an automobile consulting company to understand the factors on which car pricing depends. Specifically, it wants to understand the factors affecting car prices in the American market, since those may be very different from the Chinese market.
In this project, I will work with three powerful Python packages: Pandas, Matplotlib, and Seaborn. All of them have extensive online documentation: there is an extensive tutorial on visualization with Pandas, the Seaborn tutorial contains many examples of data visualization, and the Matplotlib web site has additional resources for learning plotting with Python tools.
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('automobile-price-data.csv')
df.info()
Some columns have the object data type, and we will have to convert them into float or integer numeric variables in order to facilitate exploration of the dataset.
# Replace the "?" missing-value markers with NaN, then drop incomplete rows
df.replace("?", np.nan, inplace=True)
df.dropna(axis=0, inplace=True)
# Convert the numeric columns that pandas read in as object (string) dtype
num_cols = ['price', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm']
for column in num_cols:
    df[column] = df[column].astype(float)
df.info()
print(df.describe())
df.describe()
We can see there are 195 cars with a length value; the mean length is 174, the standard deviation is fairly small, and length ranges from a minimum of 141 to a maximum of 208. If we scroll over to price, the mean price is about 13,248 in this old dataset, with quite a wide standard deviation of roughly 8,000: prices range from about 5,000 to 45,000 dollars, and the median (10,000) is well below the mean (13,000). In terms of exploring this data, we know that price is highly skewed.
We can notice that the highest prices were 45,400, 41,315, 40,960, 37,028 USD and so on, and the lowest price was 5,118 USD.
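As a quick numeric check of the skew and the extreme values (a small sketch, assuming the cleaned df built above):
print("Skewness of price:", df['price'].skew())
print("Five most expensive cars:")
print(df['price'].nlargest(5))
print("Cheapest car:", df['price'].min())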
df.shape
print("The Shape of our DataSet is:", df.shape)
df.columns
print("Finding all the columns:", df.columns)
print("The data table has 26 columns")
print(" Which variables are significant in predicting the price of a car")
The goal is to model the price of a car using the available independent variables. Management needs to understand exactly how prices vary with these variables, so that they can adjust car design and marketing strategy to target certain price levels. Such a model will be an important asset for management to understand the price dynamics of a new market.
plt.figure(figsize=(15, 8))
sns.distplot(df['price'].dropna(),kde=False,color='darkred',bins=30);
According to the distribution, the price field has a mean of about 13,207.13 and a median of about 10,295, with the most expensive cars at 45,400 and the cheapest at 5,118; the standard deviation is about 8,056.33.
plt.figure(figsize=(15,8))
sns.distplot(df['price'])
print(df.price.describe())
# Plot the distribution of every float column
for col in df.select_dtypes('float'):
    plt.figure(figsize=(15,8))
    sns.distplot(df[col])
TASK: Let's explore the correlation between the continuous feature variables. Calculate the correlation between all continuous numeric variables using the .corr() method.
# Code
df.corr()
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(),annot=True,cmap='viridis');
df1 = df[['bore','curb-weight','engine-size', 'price']]
df.plot.scatter(x='bore',y='price',c='red',s=100,figsize=(12,3))
df.plot.scatter(x='curb-weight',y='price',c='orange',s=100,figsize=(12,3))
df.plot.scatter(x='engine-size',y='price',c='red',s=100,figsize=(12,3))
print("At first glance, the 3 variables are positively correlated but spread at higher values.")
plt.figure(figsize=(15,10))
sns.heatmap(df1.corr(), annot=True, linewidths=0.5)
We can confirm this by looking at the correlation coefficients.
Correlation coefficient between price and bore: about 54%.
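A quick way to verify these figures (a sketch, assuming the cleaned df defined above) is to compute the correlation of each of these columns with price directly; the same check applies to the variable groups that follow:
# Correlation of bore, curb-weight and engine-size with price
print(df[['bore', 'curb-weight', 'engine-size']].corrwith(df['price']))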
df.plot.scatter(x='length',y='price',c='green',s=100,figsize=(12,3))
df.plot.scatter(x='width',y='price',c='red',s=100,figsize=(12,3))
df.plot.scatter(x='height',y='price',c='blue',s=100,figsize=(12,3))
df2 = df[['length','width','height', 'price']]
plt.figure(figsize=(15,10))
sns.heatmap(df2.corr(), annot=True, linewidths=0.5)
Length and width are more strongly correlated with price than height, which is more spread out but still positively correlated.
We can confirm this with the same kind of correlation check as above.
Correlation coefficient between price and length: about 69%.
df.plot.scatter(x='wheel-base',y='price',c='green',s=100,figsize=(12,3))
df.plot.scatter(x='horsepower',y='price',c='black',s=100,figsize=(12,3))
df.plot.scatter(x='stroke',y='price',c='green',s=100,figsize=(12,3))
df3 = df[['wheel-base','horsepower','stroke', 'price']]
plt.figure(figsize=(15,10))
sns.heatmap(df3.corr(), annot=True, linewidths=0.5)
df.plot.scatter(x='compression-ratio',y='price',c='blue',s=100,figsize=(12,3))
df.plot.scatter(x='peak-rpm',y='price',c='orange',s=100,figsize=(12,3))
df.plot.scatter(x='symboling',y='price',c='blue',s=100,figsize=(12,3))
df4 = df[['compression-ratio','peak-rpm','symboling', 'price']]
plt.figure(figsize=(15,10))
sns.heatmap(df4.corr(), annot=True, linewidths=0.5)
df.plot.scatter(x='city-mpg',y='price',c='black',s=100,figsize=(12,3))
df.plot.scatter(x='highway-mpg',y='price',c='orange',s=100,figsize=(12,3))
df5 = df[['city-mpg','highway-mpg','price']]
plt.figure(figsize=(15,10))
sns.heatmap(df5.corr(), annot=True, linewidths=0.5)
plt.figure(figsize=(20,12))
df.corr()['price'].sort_values().drop('price').plot(kind='bar')
plt.figure(figsize=(18,7))
subgrade_order = sorted(df['engine-size'].unique())
sns.countplot(x='engine-size',data=df,order = subgrade_order,palette='coolwarm' )
plt.figure(figsize=(18,7))
subgrade_order = sorted(df['symboling'].unique())
sns.countplot(x='symboling',data=df,order = subgrade_order,palette='coolwarm' )
plt.figure(figsize=(20,12))
sns.countplot(x='symboling',data=df,hue='num-of-doors')
What we wanted to do here is essentially produce this plot, which shows which numeric features have the strongest correlation with the actual target (price).
plt.figure(figsize=(20,10))
ax = sns.boxplot(x="symboling", y="city-mpg", data=df)
ax = sns.swarmplot(x="symboling", y="city-mpg", data=df, color=".25")
sns.jointplot(x='symboling',y='city-mpg',data=df, kind="kde",space=0, color="g", size=10)
df.groupby('symboling')['city-mpg'].describe()
sns.jointplot(x='symboling',y='price',data=df,color='red',kind='kde',size=10);
df.groupby('symboling')['price'].describe()
plt.figure(figsize=(20,10))
df['price'].plot.kde()
plt.figure(figsize=(15,8))
sns.boxplot(x='body-style',y='price',data=df)
df.groupby('body-style')['price'].describe()
plt.figure(figsize=(15, 10))
sns.boxplot(x='symboling',y='price',data=df,palette='winter')
plt.figure(figsize=(30, 15))
plt.subplot(2,3,1)
sns.boxplot(x=df.symboling, y=df.price)
plt.subplot(2,3,2)
plt.title('Symboling Hist')
order = df['symboling'].value_counts(ascending=False).index
sns.countplot(x='symboling', data=df, order=order)
It seems that symboling values 0 and 1 are the most common.
Cars with symboling -1 and -2 are the most expensive, which is logical because a lower symboling means the car is considered safer (lower insurance risk).
Note: there are outliers for several symboling values.
plt.figure(figsize=(12,7))
sns.boxplot(x='fuel-type',y='price',data=df)
We can clearly see that the price distributions of gas and diesel cars differ; the summary below gives the exact figures per fuel type.
df.groupby('fuel-type')['price'].describe()
plt.figure(figsize=(20,10))
sns.boxplot(
data=df,
x='make',
y='price',
color='blue')
plt.figure(figsize=(20,10))
plt.bar(df['make'], df['price'], color="blue")
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
import chart_studio.plotly as py
# For Notebooks
init_notebook_mode(connected=True)
# For offline use
cf.go_offline()
df.iplot()
df.iplot(kind='scatter',x='engine-size',y='price',mode='markers',size=15)
sns.lmplot(x = "engine-size", y = "price", data = df.reset_index(), size=8),
As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.
We can examine the correlation between 'engine-size' and 'price' and see it's approximately 0.87
df.iplot(kind='scatter',x='highway-mpg',y='price',mode='markers',size=15)
sns.lmplot(x = "highway-mpg", y = "price", data = df.reset_index(), size=8),
As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.
We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704
df.iplot(kind='scatter',x='peak-rpm',y='price',mode='markers',size=15)
sns.lmplot(x = "peak-rpm", y = "price", data = df.reset_index(), size=8),
We can notice that peak rpm and price have a weak linear relationship.
Peak rpm does not seem like a good predictor of price at all, since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing a lot of variability. Therefore it is not a reliable variable.
We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616
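To verify the three correlation values quoted above (a sketch, again assuming the cleaned df), we can compute them directly:
# Pairwise correlation of each feature with price
for col in ['engine-size', 'highway-mpg', 'peak-rpm']:
    print(col, 'vs price:', round(df[col].corr(df['price']), 3))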
df.iplot(kind='bar',x='fuel-type',y='price')
df.count().iplot(kind='bar')
df.iplot(kind='bar',x='make',y='price')
plt.figure(figsize=(20,10))
sns.countplot(x='make',data=df)
plt.figure(figsize=(20,10))
df['make'].value_counts().head(30).plot(kind='barh')
df['make'].value_counts().head(2)
df['make'].value_counts().tail(2)
Examining this plot, we can easily see that several car makers have the same number of models, and that the difference in the number of models can be as small as 1. We can notice that Toyota and Nissan are the most frequent makes, and Jaguar and Porsche the least frequent.
df.make.value_counts()
plt.figure(figsize=(20,10))
df.make.value_counts().plot.pie()
df.iplot(x='make', y='price');
Looking at this plot, we can notice that the most expensive cars are German, namely Mercedes-Benz, BMW, and Porsche.
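As a quick check of that claim (a sketch assuming the cleaned df), we can rank the makes by their average price:
# Average price per manufacturer, most expensive first
print(df.groupby('make')['price'].mean().sort_values(ascending=False).head(5))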
df.iplot()
df[['symboling','engine-size']].iplot(kind='spread')
df[['price', 'engine-size']].iplot(kind='spread')
df[['horsepower', 'curb-weight']].iplot(kind='spread')
df.iplot(kind='hist')
df['price'].iplot(kind='hist',bins=25)
df['stroke'].iplot(kind='hist',bins=25)
df.iplot(kind='bubble',x='bore',y='price',size='peak-rpm')
df.iplot(kind='bubble',x='bore',y='price',size='wheel-base')
sns.set_style('whitegrid')
sns.lmplot(x='peak-rpm', y='city-mpg', data=df, hue='compression-ratio',
           palette='coolwarm', size=6, aspect=1, fit_reg=False)
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="engine-type",palette='coolwarm',size=8,aspect=2)
g = g.map(plt.hist,'wheel-base',bins=20,alpha=0.7)
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="engine-type",palette='coolwarm',size=8,aspect=2)
g = g.map(plt.hist,'horsepower',bins=20,alpha=0.7)
df.columns
X = df[['symboling','wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke',
'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
'highway-mpg']]
y = df['price'] # What I'm trying to predict in this case is the price column.
We want to split our data into a training set used to fit the model and a testing set used to evaluate the model once it has been trained.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
# train_test_split returns a tuple that we unpack into the training and testing sets:
# X_train and y_train are used to fit the model, X_test and y_test to evaluate it.
# We pass in our X data and our y data, and can optionally also pass the test_size
# (here 40% of the data) and a random_state for reproducibility.
# Safeguard: the "?" markers were already removed above, so this should change nothing
df.replace("?", np.nan, inplace=True)
df.dropna(axis=0, inplace=True)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
Let's evaluate the model by checking out its coefficients and how we can interpret them.
# print the intercept
print(lm.intercept_)
The next thing we can check out is the coefficients, which relate to each feature in our dataset. We can grab them with lm.coef_, which returns the coefficient for each feature.
lm.coef_
Each of these coefficients corresponds to a column in X (or X_train).
X_train.columns
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
Let's grab predictions off our test set and see how well it did!
predictions = lm.predict(X_test)
predictions
y_test
However, since we did the train/test split, we know that y_test contains the true prices of the cars, and we want to know how far off the predictions are from those actual prices. One quick way to analyze this visually is with a scatterplot.
plt.figure(figsize=(15,7))
plt.scatter(y_test,predictions)
Points lying on a perfect straight diagonal line would correspond to perfectly correct predictions.
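To make that easier to judge, we can redraw the scatterplot with the identity line overlaid (a sketch assuming y_test and predictions from above); points on the line are perfect predictions:
plt.figure(figsize=(15, 7))
plt.scatter(y_test, predictions)
# Reference line: a perfect model would put every point on y = x
lims = [min(y_test.min(), predictions.min()), max(y_test.max(), predictions.max())]
plt.plot(lims, lims, 'r--')
plt.xlabel('Actual price (y_test)')
plt.ylabel('Predicted price')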
Let's go ahead and actually create a histogram of the distribution of our residuals. The residuals are the difference between the actual values y_test and the predicted values.
plt.figure(figsize=(15,7))
sns.distplot((y_test-predictions),bins=50);
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac{1}{n}\sum_{i=1}^n|y_i-\hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
Comparing these metrics: MAE is the easiest to interpret, because it is just the average error; MSE penalizes larger errors more heavily, which is often useful in practice; RMSE is popular because it is expressed in the same units as the target variable (here, price).
All of these are loss functions, because we want to minimize them.
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
metrics.explained_variance_score(y_test,predictions)
plt.figure(figsize=(15,7))
sns.heatmap(df.corr())