A Chinese automobile company, Teclov_chinese, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts. It has contracted an automobile consulting company to understand the factors on which car pricing depends. Specifically, it wants to understand the factors affecting car prices in the American market, since those may be very different from the Chinese market.
In this project, I will work with three powerful Python packages: Pandas, Matplotlib, and Seaborn. All of them have extensive online documentation: there is an extensive tutorial on visualization with Pandas, the Seaborn tutorial contains many examples of data visualization, and the Matplotlib web site has additional resources for learning plotting with Python tools.
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('automobile-price-data.csv')
df.info()
Some columns have the object data type, and we will have to convert them into float or integer numeric variables in order to facilitate exploration of the dataset.
# Replace the "?" missing-value markers with NaN, then drop incomplete rows
df.replace("?", np.nan, inplace=True)
df.dropna(axis=0, inplace=True)
# Convert the numeric columns that pandas read in as object (string) dtype
num_cols = ['price', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm']
for column in num_cols:
    df[column] = df[column].astype(float)
df.info()
print(df.describe())
df.describe()
We can see there are 195 cars with a length value; the mean length is 174, the standard deviation is fairly small, and length ranges from a minimum of 141 to a maximum of 208. If we scroll over to price, the mean price is about 13,248 in this old dataset, with quite a wide standard deviation of roughly 8,000: prices range from about 5,000 to 45,000 dollars, and the median (10,000) is well below the mean (13,000). In terms of exploring this data, we know that price is highly skewed.
We can notice that the highest prices were 45,400, 41,315, 40,960, 37,028 USD and so on, and the lowest price was 5,118 USD.
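As a quick numeric check of the skew and the extreme values (a small sketch, assuming the cleaned df built above):
print("Skewness of price:", df['price'].skew())
print("Five most expensive cars:")
print(df['price'].nlargest(5))
print("Cheapest car:", df['price'].min())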
df.shape
print("The Shape of our DataSet is:", df.shape)
df.columns
print("Finding all the columns:", df.columns)
print("The data table has 26 columns")
print(" Which variables are significant in predicting the price of a car")
The goal is to model the price of a car using the available independent variables. Management needs to understand exactly how prices vary with these variables, so that they can adjust car design and marketing strategy to target certain price levels. Such a model will be an important asset for management to understand the price dynamics of a new market.
plt.figure(figsize=(15, 8))
sns.distplot(df['price'].dropna(),kde=False,color='darkred',bins=30);
According to the distribution, the price field has a mean of about 13,207.13 and a median of about 10,295, with the most expensive cars at 45,400 and the cheapest at 5,118; the standard deviation is about 8,056.33.
plt.figure(figsize=(15,8))
sns.distplot(df['price'])
print(df.price.describe())
# Plot the distribution of every float column
for col in df.select_dtypes('float'):
    plt.figure(figsize=(15,8))
    sns.distplot(df[col])
TASK: Let's explore the correlation between the continuous feature variables. Calculate the correlation between all continuous numeric variables using the .corr() method.
# Code
df.corr()
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(),annot=True,cmap='viridis');
df1 = df[['bore','curb-weight','engine-size', 'price']]
df.plot.scatter(x='bore',y='price',c='red',s=100,figsize=(12,3))
df.plot.scatter(x='curb-weight',y='price',c='orange',s=100,figsize=(12,3))
df.plot.scatter(x='engine-size',y='price',c='red',s=100,figsize=(12,3))
print("At first glance, the 3 variables are positively correlated but spread at higher values.")
plt.figure(figsize=(15,10))
sns.heatmap(df1.corr(), annot=True, linewidths=0.5)
We can confirm this by looking at the correlation coefficients.
Correlation coefficient between price and bore: about 54%.
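A quick way to verify these figures (a sketch, assuming the cleaned df defined above) is to compute the correlation of each of these columns with price directly; the same check applies to the variable groups that follow:
# Correlation of bore, curb-weight and engine-size with price
print(df[['bore', 'curb-weight', 'engine-size']].corrwith(df['price']))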
df.plot.scatter(x='length',y='price',c='green',s=100,figsize=(12,3))
df.plot.scatter(x='width',y='price',c='red',s=100,figsize=(12,3))
df.plot.scatter(x='height',y='price',c='blue',s=100,figsize=(12,3))
df2 = df[['length','width','height', 'price']]
plt.figure(figsize=(15,10))
sns.heatmap(df2.corr(), annot=True, linewidths=0.5)
Length and width are more strongly correlated with price than height, which is more spread out but still positively correlated.
We can confirm this with the same kind of correlation check as above.
Correlation coefficient between price and length: about 69%.
df.plot.scatter(x='wheel-base',y='price',c='green',s=100,figsize=(12,3))
df.plot.scatter(x='horsepower',y='price',c='black',s=100,figsize=(12,3))
df.plot.scatter(x='stroke',y='price',c='green',s=100,figsize=(12,3))
df3 = df[['wheel-base','horsepower','stroke', 'price']]
plt.figure(figsize=(15,10))
sns.heatmap(df3.corr(), annot=True, linewidths=0.5)
df.plot.scatter(x='compression-ratio',y='price',c='blue',s=100,figsize=(12,3))
df.plot.scatter(x='peak-rpm',y='price',c='orange',s=100,figsize=(12,3))
df.plot.scatter(x='symboling',y='price',c='blue',s=100,figsize=(12,3))
df4 = df[['compression-ratio','peak-rpm','symboling', 'price']]
plt.figure(figsize=(15,10))
sns.heatmap(df4.corr(), annot=True, linewidths=0.5)
df.plot.scatter(x='city-mpg',y='price',c='black',s=100,figsize=(12,3))
df.plot.scatter(x='highway-mpg',y='price',c='orange',s=100,figsize=(12,3))
df5 = df[['city-mpg','highway-mpg','price']]
plt.figure(figsize=(15,10))
sns.heatmap(df5.corr(), annot=True, linewidths=0.5)
plt.figure(figsize=(20,12))
df.corr()['price'].sort_values().drop('price').plot(kind='bar')
plt.figure(figsize=(18,7))
subgrade_order = sorted(df['engine-size'].unique())
sns.countplot(x='engine-size',data=df,order = subgrade_order,palette='coolwarm' )
plt.figure(figsize=(18,7))
subgrade_order = sorted(df['symboling'].unique())
sns.countplot(x='symboling',data=df,order = subgrade_order,palette='coolwarm' )
plt.figure(figsize=(20,12))
sns.countplot(x='symboling',data=df,hue='num-of-doors')
What we wanted to do here is essentially produce this plot, which shows which numeric features have the strongest correlation with the actual target (price).
plt.figure(figsize=(20,10))
ax = sns.boxplot(x="symboling", y="city-mpg", data=df)
ax = sns.swarmplot(x="symboling", y="city-mpg", data=df, color=".25")
sns.jointplot(x='symboling',y='city-mpg',data=df, kind="kde",space=0, color="g", size=10)
df.groupby('symboling')['city-mpg'].describe()
sns.jointplot(x='symboling',y='price',data=df,color='red',kind='kde',size=10);
df.groupby('symboling')['price'].describe()
plt.figure(figsize=(20,10))
df['price'].plot.kde()
plt.figure(figsize=(15,8))
sns.boxplot(x='body-style',y='price',data=df)
df.groupby('body-style')['price'].describe()
plt.figure(figsize=(15, 10))
sns.boxplot(x='symboling',y='price',data=df,palette='winter')
plt.figure(figsize=(30, 15))
plt.subplot(2,3,1)
sns.boxplot(x=df.symboling, y=df.price)
plt.subplot(2,3,2)
plt.title('Symboling Hist')
order = df['symboling'].value_counts(ascending=False).index
sns.countplot(x='symboling', data=df, order=order)
It seems that symboling values 0 and 1 are the most common.
Cars with symboling -1 and -2 are the most expensive, which is logical because a lower symboling means the car is considered safer (lower insurance risk).
Note: there are outliers for several symboling values.
plt.figure(figsize=(12,7))
sns.boxplot(x='fuel-type',y='price',data=df)
We can clearly see that the price distributions of gas and diesel cars differ; the summary below gives the exact figures per fuel type.
df.groupby('fuel-type')['price'].describe()
plt.figure(figsize=(20,10))
sns.boxplot(
data=df,
x='make',
y='price',
color='blue')
plt.figure(figsize=(20,10))
plt.bar(df['make'], df['price'], color="blue")
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
import chart_studio.plotly as py
# For Notebooks
init_notebook_mode(connected=True)
# For offline use
cf.go_offline()
df.iplot()
df.iplot(kind='scatter',x='engine-size',y='price',mode='markers',size=15)
sns.lmplot(x = "engine-size", y = "price", data = df.reset_index(), size=8),
As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.
We can examine the correlation between 'engine-size' and 'price' and see it's approximately 0.87
df.iplot(kind='scatter',x='highway-mpg',y='price',mode='markers',size=15)
sns.lmplot(x = "highway-mpg", y = "price", data = df.reset_index(), size=8),
As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.
We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704
df.iplot(kind='scatter',x='peak-rpm',y='price',mode='markers',size=15)
sns.lmplot(x = "peak-rpm", y = "price", data = df.reset_index(), size=8),
We can notice that peak rpm and price have a weak linear relationship.
Peak rpm does not seem like a good predictor of price at all, since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing a lot of variability. Therefore it is not a reliable variable.
We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616
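To verify the three correlation values quoted above (a sketch, again assuming the cleaned df), we can compute them directly:
# Pairwise correlation of each feature with price
for col in ['engine-size', 'highway-mpg', 'peak-rpm']:
    print(col, 'vs price:', round(df[col].corr(df['price']), 3))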
df.iplot(kind='bar',x='fuel-type',y='price')
df.count().iplot(kind='bar')
df.iplot(kind='bar',x='make',y='price')
plt.figure(figsize=(20,10))
sns.countplot(x='make',data=df)
plt.figure(figsize=(20,10))
df['make'].value_counts().head(30).plot(kind='barh')
df['make'].value_counts().head(2)
df['make'].value_counts().tail(2)
Examining this plot, we can easily see that several car makers have the same number of models, and that the difference in the number of models can be as small as 1. We can notice that Toyota and Nissan are the most frequent makes, and Jaguar and Porsche the least frequent.
df.make.value_counts()
plt.figure(figsize=(20,10))
df.make.value_counts().plot.pie()
df.iplot(x='make', y='price');
Looking at this plot, we can notice that the most expensive cars are German, namely Mercedes-Benz, BMW, and Porsche.
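As a quick check of that claim (a sketch assuming the cleaned df), we can rank the makes by their average price:
# Average price per manufacturer, most expensive first
print(df.groupby('make')['price'].mean().sort_values(ascending=False).head(5))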
df.iplot()
df[['symboling','engine-size']].iplot(kind='spread')
df[['price', 'engine-size']].iplot(kind='spread')
df[['horsepower', 'curb-weight']].iplot(kind='spread')
df.iplot(kind='hist')
df['price'].iplot(kind='hist',bins=25)
df['stroke'].iplot(kind='hist',bins=25)
df.iplot(kind='bubble',x='bore',y='price',size='peak-rpm')
df.iplot(kind='bubble',x='bore',y='price',size='wheel-base')
sns.set_style('whitegrid')
sns.lmplot(x='peak-rpm', y='city-mpg', data=df, hue='compression-ratio',
           palette='coolwarm', size=6, aspect=1, fit_reg=False)
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="engine-type",palette='coolwarm',size=8,aspect=2)
g = g.map(plt.hist,'wheel-base',bins=20,alpha=0.7)
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="engine-type",palette='coolwarm',size=8,aspect=2)
g = g.map(plt.hist,'horsepower',bins=20,alpha=0.7)
df.columns
X = df[['symboling','wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke',
'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
'highway-mpg']]
y = df['price'] # What I'm trying to predict in this case is the price column.
We want to split our data into a training set used to fit the model and a testing set used to evaluate the model once it has been trained.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
# train_test_split returns a tuple that we unpack into the training and testing sets:
# X_train and y_train are used to fit the model, X_test and y_test to evaluate it.
# We pass in our X data and our y data, and can optionally also pass the test_size
# (here 40% of the data) and a random_state for reproducibility.
# Safeguard: the "?" markers were already removed above, so this should change nothing
df.replace("?", np.nan, inplace=True)
df.dropna(axis=0, inplace=True)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
Let's evaluate the model by checking out its coefficients and how we can interpret them.
# print the intercept
print(lm.intercept_)
The next thing we can check out is the coefficients, which relate to each feature in our dataset. We can grab them with lm.coef_, which returns the coefficient for each feature.
lm.coef_
Each of these coefficients corresponds to a column in X (or X_train).
X_train.columns
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
Let's grab predictions off our test set and see how well it did!
predictions = lm.predict(X_test)
predictions
y_test
However, since we did the train/test split, we know that y_test contains the true prices of the cars, and we want to know how far off the predictions are from those actual prices. One quick way to analyze this visually is with a scatterplot.
plt.figure(figsize=(15,7))
plt.scatter(y_test,predictions)
Points lying on a perfect straight diagonal line would correspond to perfectly correct predictions.
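To make that easier to judge, we can redraw the scatterplot with the identity line overlaid (a sketch assuming y_test and predictions from above); points on the line are perfect predictions:
plt.figure(figsize=(15, 7))
plt.scatter(y_test, predictions)
# Reference line: a perfect model would put every point on y = x
lims = [min(y_test.min(), predictions.min()), max(y_test.max(), predictions.max())]
plt.plot(lims, lims, 'r--')
plt.xlabel('Actual price (y_test)')
plt.ylabel('Predicted price')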
Let's go ahead and actually create a histogram of the distribution of our residuals. The residuals are the difference between the actual values y_test and the predicted values.
plt.figure(figsize=(15,7))
sns.distplot((y_test-predictions),bins=50);
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac{1}{n}\sum_{i=1}^n|y_i-\hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
Comparing these metrics: MAE is the easiest to interpret, because it is just the average error; MSE penalizes larger errors more heavily, which is often useful in practice; RMSE is popular because it is expressed in the same units as the target variable (here, price).
All of these are loss functions, because we want to minimize them.
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
metrics.explained_variance_score(y_test,predictions)
plt.figure(figsize=(15,7))
sns.heatmap(df.corr())