Data Visualization Project 5 Udacity

by Ekofiongo Eale

from Germany

Problem Description

A Chinese automobile company, Teclov_chinese, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts. It has contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, it wants to understand the factors affecting car prices in the American market, since those may be very different from the Chinese market.

Resources

In this project, I will work with two powerful Python packages, Pandas and Seaborn. Both packages have extensive online documentation. There is an extensive tutorial on visualization with Pandas, and the Seaborn tutorial contains many examples of data visualization. The matplotlib web site has additional resources for learning plotting with Python tools.

In this project, I will use three powerful Python packages: Pandas, Matplotlib, and Seaborn.

In [38]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
In [39]:
plt.style.use('ggplot')
In [40]:
import warnings
warnings.filterwarnings('ignore')
In [41]:
df = pd.read_csv('automobile-price-data.csv')
In [42]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    205 non-null object
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-of-doors         205 non-null object
body-style           205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 205 non-null object
stroke               205 non-null object
compression-ratio    205 non-null float64
horsepower           205 non-null object
peak-rpm             205 non-null object
city-mpg             205 non-null int64
highway-mpg          205 non-null int64
price                205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB

Some columns are of object data type, and we will have to convert them into float or integer numerical variables, in order to facilitate the exploration of our DataSet.

To better explore the dataset, I will replace the question mark ("?") placeholder values with np.nan.

In [43]:
# Replace every "?" placeholder with np.nan, then drop rows that contain missing values.
df.replace("?", np.nan, inplace=True)
df.dropna(axis=0, inplace=True)
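As a quick sanity check, we can verify that the cleaning worked; a minimal sketch (the resulting row count of 159 matches the df.info() output further below):

# Verify the cleaning: no missing values should remain, and the row
# count should have dropped from 205 to 159.
assert df.isna().sum().sum() == 0
print(df.shape)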

As an extra safeguard before exploring further, we can replace any remaining NaN values with the number 0:

In [44]:
df.replace(np.nan,0)
Out[44]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.4 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.4 8.0 115 5500 18 22 17450
6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.4 8.5 110 5500 19 25 17710
8 1 158 audi gas turbo four sedan fwd front 105.8 ... 131 mpfi 3.13 3.4 8.3 140 5500 17 20 23875
10 2 192 bmw gas std two sedan rwd front 101.2 ... 108 mpfi 3.5 2.8 8.8 101 5800 23 29 16430
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
200 -1 95 volvo gas std four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 9.5 114 5400 23 28 16845
201 -1 95 volvo gas turbo four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 8.7 160 5300 19 25 19045
202 -1 95 volvo gas std four sedan rwd front 109.1 ... 173 mpfi 3.58 2.87 8.8 134 5500 18 23 21485
203 -1 95 volvo diesel turbo four sedan rwd front 109.1 ... 145 idi 3.01 3.4 23.0 106 4800 26 27 22470
204 -1 95 volvo gas turbo four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 9.5 114 5400 19 25 22625

159 rows × 26 columns

Converting data types to float

In [45]:
df["price"] = df["price"].astype(str).astype(float)
In [46]:
df["bore"] = df["bore"].astype(str).astype(float)
df["stroke"] = df["stroke"].astype(str).astype(float)
df["compression-ratio"] = df["compression-ratio"].astype(str).astype(float)
df["horsepower"] = df["horsepower"].astype(str).astype(float)   
df["peak-rpm"] = df["peak-rpm"].astype(str).astype(float)

Columns after the conversion

In [47]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 159 entries, 3 to 204
Data columns (total 26 columns):
symboling            159 non-null int64
normalized-losses    159 non-null object
make                 159 non-null object
fuel-type            159 non-null object
aspiration           159 non-null object
num-of-doors         159 non-null object
body-style           159 non-null object
drive-wheels         159 non-null object
engine-location      159 non-null object
wheel-base           159 non-null float64
length               159 non-null float64
width                159 non-null float64
height               159 non-null float64
curb-weight          159 non-null int64
engine-type          159 non-null object
num-of-cylinders     159 non-null object
engine-size          159 non-null int64
fuel-system          159 non-null object
bore                 159 non-null float64
stroke               159 non-null float64
compression-ratio    159 non-null float64
horsepower           159 non-null float64
peak-rpm             159 non-null float64
city-mpg             159 non-null int64
highway-mpg          159 non-null int64
price                159 non-null float64
dtypes: float64(10), int64(5), object(11)
memory usage: 33.5+ KB

Checking the descriptive statistics for numeric variables

In [48]:
print(df.describe())
        symboling  wheel-base      length       width      height  \
count  159.000000  159.000000  159.000000  159.000000  159.000000   
mean     0.735849   98.264151  172.413836   65.607547   53.899371   
std      1.193086    5.167416   11.523177    1.947883    2.268761   
min     -2.000000   86.600000  141.100000   60.300000   49.400000   
25%      0.000000   94.500000  165.650000   64.000000   52.250000   
50%      1.000000   96.900000  172.400000   65.400000   54.100000   
75%      2.000000  100.800000  177.800000   66.500000   55.500000   
max      3.000000  115.600000  202.600000   71.700000   59.800000   

       curb-weight  engine-size        bore      stroke  compression-ratio  \
count   159.000000   159.000000  159.000000  159.000000         159.000000   
mean   2461.138365   119.226415    3.300126    3.236352          10.161132   
std     481.941321    30.460791    0.267336    0.294888           3.889475   
min    1488.000000    61.000000    2.540000    2.070000           7.000000   
25%    2065.500000    97.000000    3.050000    3.105000           8.700000   
50%    2340.000000   110.000000    3.270000    3.270000           9.000000   
75%    2809.500000   135.000000    3.560000    3.410000           9.400000   
max    4066.000000   258.000000    3.940000    4.170000          23.000000   

       horsepower     peak-rpm    city-mpg  highway-mpg         price  
count  159.000000   159.000000  159.000000   159.000000    159.000000  
mean    95.836478  5113.836478   26.522013    32.081761  11445.729560  
std     30.718583   465.754864    6.097142     6.459189   5877.856195  
min     48.000000  4150.000000   15.000000    18.000000   5118.000000  
25%     69.000000  4800.000000   23.000000    28.000000   7372.000000  
50%     88.000000  5200.000000   26.000000    32.000000   9233.000000  
75%    114.000000  5500.000000   31.000000    37.000000  14719.500000  
max    200.000000  6600.000000   49.000000    54.000000  35056.000000  

We can see there are 159 cars with a length value; the mean length is about 172, the standard deviation is modest at about 11.5, and the range runs from a minimum of 141 to a maximum of 203. Scrolling over to price, the mean price in this old dataset is about 11,446. The standard deviation is quite wide at roughly 5,878, and prices range from 5,118 to 35,056 dollars, with the median (9,233) sitting well below the mean. In terms of exploring this data, we know that price is highly right-skewed.

After dropping the rows with missing values, the highest price is 35,056.0 USD and the lowest is 5,118.0 USD.
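As a quick numeric check of that skew, pandas can compute the sample skewness directly; a minimal sketch:

# A positive skewness value confirms the long right tail (mean > median).
print("Price skewness:", df['price'].skew())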

Checking the shape of our DataSet

In [50]:
print("The Shape of our DataSet is:", df.shape)
The Shape of our DataSet is: (159, 26)

Finding all the columns

In [51]:
print("Finding all the columns:", df.columns)
Finding all the columns: Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

The structure of our dataset?

In [52]:
print("The data table has 26 columns")
The data table has 26 columns

What are the main features of interest in your dataset?

In [53]:
print(" Which variables are significant in predicting the price of a car")
 Which variables are significant in predicting the price of a car

Business Goal

The goal is to model car price using the available independent variables. Management must be able to understand exactly how prices vary with these variables, so that they can adjust car design and marketing strategy to target specific price levels. Such a model will be an important asset for management in understanding the price dynamics of a new market.

Exploratory Data Analysis

Dependent variable: Price

Checking the Car price Distribution

In [54]:
plt.figure(figsize=(15, 8))
sns.distplot(df['price'].dropna(),kde=False,color='darkred',bins=30);

According to the distribution, the price field has a mean around 11445.73 and a median around 9233.00, with the most expensive cars at 35056.00, the cheapest cars at 5118.00, and a standard deviation of 5877.86.

In [55]:
plt.figure(figsize=(15,8))
sns.distplot(df['price'])
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57d975e48>
In [56]:
print(df.price.describe())
count      159.000000
mean     11445.729560
std       5877.856195
min       5118.000000
25%       7372.000000
50%       9233.000000
75%      14719.500000
max      35056.000000
Name: price, dtype: float64

Checking the distribution plot for each float column (distplot)

In [58]:
for col in df.select_dtypes('float'):
    plt.figure(figsize=(15, 8))  # size each figure inside the loop
    sns.distplot(df[col])

Calculate the correlation between variables of type "int64" or "float64" using the method "corr":

TASK: Let's explore correlation between the continuous feature variables. Calculate the correlation between all continuous numeric variables using .corr() method.

In [59]:
# Code
df.corr()
Out[59]:
symboling wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
symboling 1.000000 -0.520591 -0.336257 -0.219186 -0.475185 -0.251880 -0.109453 -0.256469 -0.021285 -0.138316 -0.003949 0.199106 0.089550 0.149830 -0.162794
wheel-base -0.520591 1.000000 0.871534 0.814991 0.555767 0.810181 0.649206 0.578159 0.167449 0.291431 0.516948 -0.289234 -0.580657 -0.611750 0.734419
length -0.336257 0.871534 1.000000 0.838338 0.499251 0.871291 0.725953 0.646318 0.121073 0.184814 0.672063 -0.234074 -0.724544 -0.724599 0.760952
width -0.219186 0.814991 0.838338 1.000000 0.292706 0.870595 0.779253 0.572554 0.196619 0.258752 0.681872 -0.232216 -0.666684 -0.693339 0.843371
height -0.475185 0.555767 0.499251 0.292706 1.000000 0.367052 0.111083 0.254836 -0.091313 0.233308 0.034317 -0.245864 -0.199737 -0.226136 0.244836
curb-weight -0.251880 0.810181 0.871291 0.870595 0.367052 1.000000 0.888626 0.645792 0.173844 0.224724 0.790095 -0.259988 -0.762155 -0.789338 0.893639
engine-size -0.109453 0.649206 0.725953 0.779253 0.111083 0.888626 1.000000 0.595737 0.299683 0.141097 0.812073 -0.284686 -0.699139 -0.714095 0.841496
bore -0.256469 0.578159 0.646318 0.572554 0.254836 0.645792 0.595737 1.000000 -0.102581 0.015119 0.560239 -0.312269 -0.590440 -0.590850 0.533890
stroke -0.021285 0.167449 0.121073 0.196619 -0.091313 0.173844 0.299683 -0.102581 1.000000 0.243587 0.148804 -0.011312 -0.020055 -0.012934 0.160664
compression-ratio -0.138316 0.291431 0.184814 0.258752 0.233308 0.224724 0.141097 0.015119 0.243587 1.000000 -0.162305 -0.416769 0.278332 0.221483 0.209361
horsepower -0.003949 0.516948 0.672063 0.681872 0.034317 0.790095 0.812073 0.560239 0.148804 -0.162305 1.000000 0.074057 -0.837214 -0.827941 0.759874
peak-rpm 0.199106 -0.289234 -0.234074 -0.232216 -0.245864 -0.259988 -0.284686 -0.312269 -0.011312 -0.416769 0.074057 1.000000 -0.052929 -0.032777 -0.171916
city-mpg 0.089550 -0.580657 -0.724544 -0.666684 -0.199737 -0.762155 -0.699139 -0.590440 -0.020055 0.278332 -0.837214 -0.052929 1.000000 0.971999 -0.692273
highway-mpg 0.149830 -0.611750 -0.724599 -0.693339 -0.226136 -0.789338 -0.714095 -0.590850 -0.012934 0.221483 -0.827941 -0.032777 0.971999 1.000000 -0.720090
price -0.162794 0.734419 0.760952 0.843371 0.244836 0.893639 0.841496 0.533890 0.160664 0.209361 0.759874 -0.171916 -0.692273 -0.720090 1.000000

Checking the correlated independent variables above and Price

In [60]:
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(),annot=True,cmap='viridis');

Determine the independent variable

Price VS bore-curb-weight-engine-size

In [61]:
df1 = df[['bore','curb-weight','engine-size', 'price']]
In [62]:
df.plot.scatter(x='bore',y='price',c='red',s=100,figsize=(12,3))
df.plot.scatter(x='curb-weight',y='price',c='orange',s=100,figsize=(12,3))
df.plot.scatter(x='engine-size',y='price',c='red',s=100,figsize=(12,3))
print("At first glance, the 3 variables are positively correlated but spread at higher values.")
At first glance, the 3 variables are positively correlated but spread at higher values.
In [63]:
plt.figure(figsize=(15,10))
sns.heatmap(df1.corr(), annot =True, linewidth = 0.5)
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57e2b8488>
  • We can confirm this by looking at the correlation coefficients (extracted in the sketch below):
  • Correlation coefficient between Price and bore: 53 %
  • Correlation coefficient between Price and curb-weight: 89 %
  • Correlation coefficient between Price and engine-size: 84 %
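These coefficients can be pulled straight out of the data; a minimal sketch:

# Extract the price correlation for each of the three features.
for col in ['bore', 'curb-weight', 'engine-size']:
    print(col, round(df[col].corr(df['price']), 2))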

Price VS length - width - height

In [64]:
df.plot.scatter(x='length',y='price',c='green',s=100,figsize=(12,3))
df.plot.scatter(x='width',y='price',c='red',s=100,figsize=(12,3))
df.plot.scatter(x='height',y='price',c='blue',s=100,figsize=(12,3))
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57e4525c8>
In [65]:
df2 = df[['length','width','height', 'price']]
In [66]:
plt.figure(figsize=(15,10))
sns.heatmap(df2.corr(), annot =True, linewidth = 0.5)
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57e45a748>
  • Length and width are more strongly correlated with price than height, which is more spread out but still positive.
  • We can confirm this by looking at the correlation coefficients:
  • Correlation coefficient between Price and length: 76 %
  • Correlation coefficient between Price and width: 84 %
  • Correlation coefficient between Price and height: 24 %

Price VS wheel-base - horsepower - stroke

In [67]:
df.plot.scatter(x='wheel-base',y='price',c='green',s=100,figsize=(12,3))
df.plot.scatter(x='horsepower',y='price',c='black',s=100,figsize=(12,3))
df.plot.scatter(x='stroke',y='price',c='green',s=100,figsize=(12,3))
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57da41748>
  • Wheel-base and horsepower are positively correlated with price, but stroke is more spread out (and may not be related).
In [68]:
df3 = df[['wheel-base','horsepower','stroke', 'price']]
In [69]:
plt.figure(figsize=(15,10))
sns.heatmap(df3.corr(), annot =True, linewidth = 0.5)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57d3f07c8>

We can confirm this by looking at the correlation coefficients:

  • Correlation coefficient between Price and wheel-base: 73 %
  • Correlation coefficient between Price and horsepower: 76 %
  • Correlation coefficient between Price and stroke: 16 %

Price VS compression-ratio - peak-rpm - symboling

In [70]:
df.plot.scatter(x='compression-ratio',y='price',c='blue',s=100,figsize=(12,3))
df.plot.scatter(x='peak-rpm',y='price',c='orange',s=100,figsize=(12,3))
df.plot.scatter(x='symboling',y='price',c='blue',s=100,figsize=(12,3))
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57d969dc8>
  • compression-ratio, peak-rpm and symboling are at most weakly correlated with price.
In [71]:
df4 = df[['compression-ratio','peak-rpm','symboling', 'price']]
In [72]:
plt.figure(figsize=(15,10))
sns.heatmap(df4.corr(), annot =True, linewidth = 0.5)
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57e3e1508>

We can confirm this by looking at the correlation coefficients:

  • Correlation coefficient between Price and compression-ratio: 21 %
  • Correlation coefficient between Price and peak-rpm: -17 %
  • Correlation coefficient between Price and symboling: -16 %

Price VS city-mpg - highway-mpg

In [73]:
df.plot.scatter(x='city-mpg',y='price',c='black',s=100,figsize=(12,3))
df.plot.scatter(x='highway-mpg',y='price',c='orange',s=100,figsize=(12,3))
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57de4c808>
  • City-mpg and highway-mpg are negatively correlated with price.
  • As mpg goes up, price goes down, which means the cheapest cars have better mileage than expensive cars.
In [74]:
df5 = df[['city-mpg','highway-mpg','price']]
In [75]:
plt.figure(figsize=(15,10))
sns.heatmap(df5.corr(), annot =True, linewidth = 0.5)
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57d88f108>

We can confirm this by looking at the correlation coefficients:

  • Correlation coefficient between Price and city-mpg: -69 %
  • Correlation coefficient between Price and highway-mpg: -72 %

Conclusion

  • Positively correlated variables with Price: wheel-base, length, width, curb-weight, engine-size, bore, horsepower
  • Negatively correlated variables with Price: city-mpg, highway-mpg
  • These variables should be kept for a better model; the other variables can be ignored, as they show little correlation with Price.

Create a bar plot showing the correlation of the numeric features to the price column.

In [76]:
plt.figure(figsize=(20,12))
df.corr()['price'].sort_values().drop('price').plot(kind='bar')
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57db27b08>
In [77]:
plt.figure(figsize=(18,7))
subgrade_order = sorted(df['engine-size'].unique())
sns.countplot(x='engine-size',data=df,order = subgrade_order,palette='coolwarm' )
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57d8b1c08>
In [78]:
plt.figure(figsize=(18,7))
subgrade_order = sorted(df['symboling'].unique())
sns.countplot(x='symboling',data=df,order = subgrade_order,palette='coolwarm' )
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57db836c8>

TASK: Create a countplot per symboling. Set the hue to the num-of-doors.

In [79]:
plt.figure(figsize=(20,12))
sns.countplot(x='symboling',data=df,hue='num-of-doors')
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57d9c1248>

And what we essentially wanted to do is produce this plot, which shows which numeric features have the strongest correlation with the actual label.

TASK: Create a boxplot showing the relationship between the symboling and the city-mpg.

In [80]:
plt.figure(figsize=(20,10))
ax = sns.boxplot(x="symboling", y="city-mpg", data=df)
ax = sns.swarmplot(x="symboling", y="city-mpg", data=df, color=".25")
  • Symboling values 0 and 1 are the most common.
  • Cars with symboling -1 and -2 are the most expensive, which is logical: a lower symboling means the car is rated safer (a lower insurance risk).

Checking the joint and marginal density estimates, with the marginal axes aligned tightly with the joint axes:

In [127]:
sns.jointplot(x='symboling',y='city-mpg',data=df, kind="kde",space=0, color="g", size=10)
Out[127]:
<seaborn.axisgrid.JointGrid at 0x1a50632fe48>

Calculate the summary statistics for city-mpg, grouped by symboling.

In [128]:
df.groupby('symboling')['city-mpg'].describe()
Out[128]:
count mean std min 25% 50% 75% max
symboling
-2 3.0 21.333333 3.785939 17.0 20.00 23.0 23.50 24.0
-1 20.0 23.300000 3.614299 17.0 21.50 23.0 26.25 30.0
0 48.0 25.812500 5.365993 15.0 22.75 27.0 28.00 38.0
1 46.0 29.521739 5.154239 17.0 26.00 31.0 31.00 45.0
2 29.0 28.551724 7.543144 18.0 24.00 26.0 31.00 49.0
3 13.0 20.153846 2.609155 16.0 19.00 19.0 21.00 25.0

Create a jointplot showing the kde distributions of price vs. symboling.

In [129]:
sns.jointplot(x='symboling',y='price',data=df,color='red',kind='kde',size=10);

Calculate the summary statistics for price, grouped by symboling.

In [130]:
df.groupby('symboling')['price'].describe()
Out[130]:
count mean std min 25% 50% 75% max
symboling
-2 3.0 15781.666667 2745.652624 12940.0 14462.50 15985.0 17202.50 18420.0
-1 20.0 16567.050000 6987.757817 8921.0 10520.50 16102.5 21731.25 31600.0
0 48.0 11867.229167 5562.315618 6575.0 7897.25 9754.5 13724.00 32250.0
1 46.0 8071.391304 3494.753251 5195.0 6404.75 7214.0 8043.25 23875.0
2 29.0 9907.034483 3962.416556 5118.0 7053.00 8449.0 11549.00 18620.0
3 13.0 16382.307692 6915.332836 8499.0 11850.00 15998.0 18150.00 35056.0

Kernel Density Estimation plot (KDE)

In [131]:
plt.figure(figsize=(20,10))
df['price'].plot.kde()
Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a506acf408>

Create a boxplot showing the relationship between the body-style and the price.

In [132]:
plt.figure(figsize=(15,8))
sns.boxplot(x='body-style',y='price',data=df)
Out[132]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a506b34508>

Calculate the summary statistics for price, grouped by body-style.

In [133]:
df.groupby('body-style')['price'].describe() 
Out[133]:
count mean std min 25% 50% 75% max
body-style
convertible 2.0 26362.500000 12294.465604 17669.0 22015.75 26362.5 30709.25 35056.0
hardtop 5.0 13142.400000 8485.769134 8249.0 8449.00 9639.0 11199.00 28176.0
hatchback 56.0 9220.160714 4102.849771 5118.0 6367.25 7847.0 10140.50 22018.0
sedan 79.0 12558.620253 5942.114349 5499.0 7836.50 10245.0 16737.50 32250.0
wagon 17.0 11351.411765 5617.408322 6918.0 7898.00 8921.0 13415.00 28248.0

Create a boxplot showing the relationship between the symboling and the price.

In [134]:
plt.figure(figsize=(15, 10))
sns.boxplot(x='symboling',y='price',data=df,palette='winter')
Out[134]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a507500808>
  • Symboling values 0 and 1 are the most common.
  • Cars with symboling -1 and -2 are the most expensive, which is logical: a lower symboling means the car is rated safer.

Price VS symboling

In [135]:
plt.figure(figsize=(30, 15))
plt.subplot(2,3,1)
sns.boxplot(x=df.symboling, y=df.price)


plt.subplot(2,3,2)
plt.title('Symboling Hist')
order = df['symboling'].value_counts(ascending=False).index
sns.countplot(x='symboling', data=df, order=order)
Out[135]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a507353588>

It seems that symboling values 0 and 1 are the most common.

Cars with symboling -1 and -2 are the most expensive, which is logical: a lower symboling means the car is rated safer.

Note: outliers exist for several symboling values.

Price Vs Fuel Type

In [136]:
plt.figure(figsize=(12,7))
sns.boxplot(x='fuel-type',y='price',data=df)
Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a507266b88>

We can clearly see that cars that run on diesel tend to be priced higher than cars that run on gas, even though gas cars are far more common.

Calculate the summary statistics for price, grouped by fuel-type.

In [137]:
df.groupby('fuel-type')['price'].describe() 
Out[137]:
count mean std min 25% 50% 75% max
fuel-type
diesel 15.0 16189.600000 8868.513282 7099.0 7946.5 13200.0 24011.0 31600.0
gas 144.0 10951.576389 5278.891595 5118.0 7295.0 8948.5 13499.0 35056.0

Price Vs Make

In [138]:
plt.figure(figsize=(20,10))
sns.boxplot(
    data=df,
    x='make',
    y='price',
    color='blue')
Out[138]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a507472848>
In [139]:
plt.figure(figsize=(20,10)) 
plt.bar(df['make'], df['price'], color="blue")
Out[139]:
<BarContainer object of 159 artists>

Conclusion

  • Mercedes-Benz and Jaguar produce expensive cars, priced above 30000.
  • Most car companies produce cars in the range below 25000.
  • Convertible models are the most expensive, followed by hardtop and sedan body styles.
  • Turbo models have higher prices than the standard models.
  • Convertibles come only in a standard edition, with expensive cars.
  • Hatchback and sedan turbo models are available below 20000.
  • rwd (rear-wheel-drive) vehicles have higher prices.

The interactive cufflinks/plotly charts below explore the data with several plot kinds: scatter, bar, box, spread, ratio, heatmap, surface, histogram and bubble.
In [140]:
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
In [141]:
import cufflinks as cf
In [142]:
import chart_studio.plotly as py
In [143]:
# For Notebooks
init_notebook_mode(connected=True)
In [144]:
# For offline use
cf.go_offline()
In [145]:
df.iplot()

Scatterplot

Price VS engine-size

In [146]:
df.iplot(kind='scatter',x='engine-size',y='price',mode='markers',size=15)
In [147]:
sns.lmplot(x = "engine-size", y = "price", data = df.reset_index(), size=8),
Out[147]:
(<seaborn.axisgrid.FacetGrid at 0x1a50625f108>,)

As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.

We can examine the correlation between 'engine-size' and 'price' and see it's approximately 0.84.

Price VS highway-mpg

In [148]:
df.iplot(kind='scatter',x='highway-mpg',y='price',mode='markers',size=15)
In [149]:
sns.lmplot(x = "highway-mpg", y = "price", data = df.reset_index(), size=8),
Out[149]:
(<seaborn.axisgrid.FacetGrid at 0x1a508416988>,)

As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.

We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.72.

In [150]:
df.iplot(kind='scatter',x='peak-rpm',y='price',mode='markers',size=15)
In [151]:
sns.lmplot(x = "peak-rpm", y = "price", data = df.reset_index(), size=8),
Out[151]:
(<seaborn.axisgrid.FacetGrid at 0x1a506770308>,)

We can notice that peak-rpm and price have a weak linear relationship.

Peak-rpm does not seem like a good predictor of price at all, since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore it is not a reliable variable.

We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.17 (the sketch below pulls all three correlations from the matrix at once).
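All three scatter-plot impressions can be verified numerically in one pass; a minimal sketch:

# Pull the three price correlations from the correlation matrix at once.
print(df.corr()['price'][['engine-size', 'highway-mpg', 'peak-rpm']])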

Bar Plots

In [152]:
df.iplot(kind='bar',x='fuel-type',y='price')
In [153]:
df.count().iplot(kind='bar')
In [154]:
df.iplot(kind='bar',x='make',y='price')

Checking Countplot for make

In [155]:
plt.figure(figsize=(20,10))
sns.countplot(x='make',data=df)
Out[155]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5086163c8>
In [156]:
plt.figure(figsize=(20,10))
df['make'].value_counts().head(30).plot(kind='barh')
Out[156]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5085f31c8>
In [157]:
df['make'].value_counts().head(2)
Out[157]:
toyota    31
nissan    18
Name: make, dtype: int64
In [158]:
df['make'].value_counts().tail(2)
Out[158]:
jaguar     1
porsche    1
Name: make, dtype: int64

Examining this plot, we can easily see that several car makers have the same number of models, and that the difference in the number of models can be as small as one. Toyota and Nissan are the most frequent makes, while Jaguar and Porsche are the least frequent.

In [159]:
df.make.value_counts()
Out[159]:
toyota           31
nissan           18
honda            13
subaru           12
volvo            11
mazda            11
mitsubishi       10
dodge             8
volkswagen        8
peugot            7
plymouth          6
saab              6
mercedes-benz     5
audi              4
bmw               4
chevrolet         3
jaguar            1
porsche           1
Name: make, dtype: int64

Making a pie chart of the number of autos by make

In [160]:
plt.figure(figsize=(20,10))
df.make.value_counts().plot.pie()
Out[160]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5085cf388>

Let's plot price by make

In [161]:
df.iplot(x='make', y='price');

Looking at this plot, we can notice that the most expensive cars are German, namely Mercedes-Benz, BMW, and Porsche.

In [162]:
df.iplot()
In [163]:
df[['symboling','engine-size']].iplot(kind='spread')
In [164]:
df[['price', 'engine-size']].iplot(kind='spread')
In [165]:
df[['horsepower', 'curb-weight']].iplot(kind='spread')

Overlaying the histograms of all the columns

In [166]:
df.iplot(kind='hist')
In [167]:
df['price'].iplot(kind='hist',bins=25)
In [168]:
df['stroke'].iplot(kind='hist',bins=25)

Bubbleplot

In [169]:
df.iplot(kind='bubble',x='bore',y='price',size='peak-rpm')
In [170]:
df.iplot(kind='bubble',x='bore',y='price',size='wheel-base')
In [171]:
sns.set_style('whitegrid')
sns.lmplot('peak-rpm','city-mpg',data=df, hue='compression-ratio',
           palette='coolwarm',size=6,aspect=1,fit_reg=False)
Out[171]:
<seaborn.axisgrid.FacetGrid at 0x1a508cb8108>
In [172]:
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="engine-type",palette='coolwarm',size=8,aspect=2)
g = g.map(plt.hist,'wheel-base',bins=20,alpha=0.7)
In [173]:
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="engine-type",palette='coolwarm',size=8,aspect=2)
g = g.map(plt.hist,'horsepower',bins=20,alpha=0.7)

Training a Linear Regression Model

In [174]:
df.columns
Out[174]:
Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')
In [175]:
X = df[['symboling','wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg']]
y = df['price'] # What I'm trying to predict in this case is the price column.

We want to split our data into a training set to fit the model and a testing set to evaluate the model once it has been trained.

Train Test Split

In [176]:
from sklearn.model_selection import train_test_split
In [177]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

# train_test_split returns a tuple that we unpack into our training set
# (X_train, y_train) and our test set (X_test, y_test).
# We pass in our X data and our y data, and can optionally pass the
# test_size and a random_state as well.

Creating and Training the Model

In [178]:
# Safety check: make sure no "?" placeholders or missing values remain
# in the data before fitting (a no-op if the earlier cleaning already ran).
df.replace("?", np.nan, inplace=True)
df.dropna(axis=0, inplace=True)
In [179]:
from sklearn.linear_model import LinearRegression
In [180]:
lm = LinearRegression()
In [181]:
lm.fit(X_train,y_train)
Out[181]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Model Evaluation

Let's evaluate the model by checking out its coefficients and how we can interpret them.

In [182]:
# print the intercept
print(lm.intercept_)
-52772.48931877864

The next thing to check out are the coefficients, which relate to each feature in our dataset. We can grab them with lm.coef_, which returns the coefficient for each feature.

In [183]:
lm.coef_
Out[183]:
array([-3.32058973e+01,  3.40840401e+02, -8.95766661e+01,  6.79651263e+02,
       -6.27057695e+01,  4.61242820e+00, -5.58779007e+00, -2.84120924e+03,
       -1.03895535e+03,  2.01640908e+02,  6.77560434e+01, -4.29839934e-01,
        2.96276613e+01, -9.79600116e+00])

Each of these coefficients corresponds to a column in X (i.e., X_train).

In [184]:
X_train.columns
Out[184]:
Index(['symboling', 'wheel-base', 'length', 'width', 'height', 'curb-weight',
       'engine-size', 'bore', 'stroke', 'compression-ratio', 'horsepower',
       'peak-rpm', 'city-mpg', 'highway-mpg'],
      dtype='object')
In [185]:
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
Out[185]:
Coefficient
symboling -33.205897
wheel-base 340.840401
length -89.576666
width 679.651263
height -62.705770
curb-weight 4.612428
engine-size -5.587790
bore -2841.209241
stroke -1038.955350
compression-ratio 201.640908
horsepower 67.756043
peak-rpm -0.429840
city-mpg 29.627661
highway-mpg -9.796001

Interpreting the coefficients (a sketch after this list shows how they combine into a prediction):

  • Holding all other features fixed, a 1 unit increase in symboling is associated with a decrease of \$33.21 in price.
  • Holding all other features fixed, a 1 unit increase in wheel-base is associated with an increase of \$340.84 .
  • Holding all other features fixed, a 1 unit increase in length is associated with a decrease of \$89.58 .
  • Holding all other features fixed, a 1 unit increase in width is associated with an increase of \$679.65 .
  • Holding all other features fixed, a 1 unit increase in height is associated with a decrease of \$62.71 .
  • Holding all other features fixed, a 1 unit increase in curb-weight is associated with an increase of \$4.61 .
  • Holding all other features fixed, a 1 unit increase in engine-size is associated with a decrease of \$5.59 .
  • Holding all other features fixed, a 1 unit increase in bore is associated with a decrease of \$2841.21 .
  • Holding all other features fixed, a 1 unit increase in stroke is associated with a decrease of \$1038.96 .
  • Holding all other features fixed, a 1 unit increase in compression-ratio is associated with an increase of \$201.64 .
  • Holding all other features fixed, a 1 unit increase in horsepower is associated with an increase of \$67.76 .
  • Holding all other features fixed, a 1 unit increase in peak-rpm is associated with a decrease of \$0.43 .
  • Holding all other features fixed, a 1 unit increase in city-mpg is associated with an increase of \$29.63 .
  • Holding all other features fixed, a 1 unit increase in highway-mpg is associated with a decrease of \$9.80 .
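Putting the intercept and coefficients together, a single prediction is just a dot product. A minimal sketch to confirm this against lm.predict (manual_prediction is an illustrative name, not from the dataset):

# A linear-regression prediction is intercept + sum(coefficient * feature).
first_row = X_test.iloc[0]
manual_prediction = lm.intercept_ + np.dot(lm.coef_, first_row)
print(manual_prediction, lm.predict(X_test.iloc[[0]])[0])  # the two values should match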

Predictions from our Model

Let's grab predictions off our test set and see how well it did!

In [186]:
predictions = lm.predict(X_test)
In [187]:
predictions
Out[187]:
array([11289.5016669 ,  9861.35484015,  8856.13828605,  8663.46262643,
        6101.68782648,  9118.93551026, 19333.70146888,  8750.97414362,
        6791.40611257, 16736.96704291,  6757.35530606, 11828.94826508,
        9792.16841709,  5521.27484195, 10916.41148947,  6599.76052048,
       19587.38502008,  5613.52340602, 10565.52931116, 13973.7067376 ,
       26297.19282931,  8850.31096898,  9792.16841709,  6934.71496975,
       19452.36678081, 10572.78253994, 12289.44486552,  9674.24765482,
       20712.2181174 , 17841.69554727, 10528.46740639, 15710.22394552,
        6234.87081128, 21702.05442994,  9208.20611952,  6838.23319135,
       13251.65187881,  6290.50084828,  7783.67561591, 11768.96832226,
       12034.38363682, 14884.81758687,  6689.93269209, 11949.07890003,
        7397.36700047, 12119.53124191, 14563.2340619 , 14279.44020234,
        5924.8057372 ,  8363.29402411,  5541.723773  ,  6920.2264159 ,
        7498.46705992,  6774.83260534,  5507.68838074, 19122.53410015,
        5369.06471123,  7427.46124801,  8850.31096898, 11050.82672146,
       16771.8442005 ,  9950.9592739 ,  6108.8677617 , 11309.11019571])

These are the predicted car prices.

In [188]:
y_test
Out[188]:
168     9639.0
64     11245.0
85      6989.0
142     7775.0
50      5195.0
        ...   
87      9279.0
107    11900.0
147    10198.0
51      6095.0
145    11259.0
Name: price, Length: 64, dtype: float64

Since we did the train test split, we know that y_test contains the actual car prices, and we want to know how far off the predictions are from those actual prices. One quick way to analyze this visually is a scatter plot of actual versus predicted values.

In [189]:
plt.figure(figsize=(15,7))
plt.scatter(y_test,predictions)
Out[189]:
<matplotlib.collections.PathCollection at 0x1a50b2c3608>

A perfectly accurate model would place every point on a straight diagonal line; the closer the points hug that diagonal, the better the predictions (the sketch below overlays that reference line explicitly).
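To make that reference explicit, we could overlay the identity line on the scatter plot; a minimal sketch:

# Scatter of actual vs. predicted prices with the identity line overlaid;
# points on the dashed line are predicted perfectly.
plt.figure(figsize=(15, 7))
plt.scatter(y_test, predictions)
lims = [min(y_test.min(), predictions.min()), max(y_test.max(), predictions.max())]
plt.plot(lims, lims, 'k--')
plt.xlabel('Actual price')
plt.ylabel('Predicted price')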

Residual Histogram

Let's go ahead and actually create a histogram of the distribution of our residuals. The residuals are the difference between the actual values y_test and the predicted values.

In [190]:
plt.figure(figsize=(15,7))
sns.distplot((y_test-predictions),bins=50);
  • This is a histogram of the residuals.
  • Notice that our residuals look to be normally distributed.
  • Normally distributed residuals suggest the model was a reasonable choice for the data (a quick numeric check is sketched below).
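As a numeric complement to the visual check, a Shapiro-Wilk normality test can be run on the residuals; a minimal sketch, assuming SciPy is available:

# Shapiro-Wilk test: a large p-value means we cannot reject the
# hypothesis that the residuals are normally distributed.
from scipy import stats
residuals = y_test - predictions
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)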

Regression Evaluation Metrics

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

  • MAE is the easiest to understand, because it's the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.
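To make the three formulas concrete, here is a small NumPy sketch that computes them directly from their definitions; the values should agree with the sklearn results in the next cell:

# MAE, MSE and RMSE computed straight from the formulas above.
errors = y_test - predictions
print('MAE: ', np.mean(np.abs(errors)))
print('MSE: ', np.mean(errors ** 2))
print('RMSE:', np.sqrt(np.mean(errors ** 2)))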

In [191]:
from sklearn import metrics
In [192]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
MAE: 2029.6889486260104
MSE: 8908596.027073476
RMSE: 2984.727127741073
In [193]:
metrics.explained_variance_score(y_test,predictions)
Out[193]:
0.7561827006534351
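Explained variance is closely related to R-squared; when the residuals have near-zero mean, the two nearly coincide. A minimal sketch using sklearn's r2_score:

# R^2 is the proportion of price variance the model accounts for.
print('R^2:', metrics.r2_score(y_test, predictions))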
In [194]:
plt.figure(figsize=(15,7))
sns.heatmap(df.corr())
Out[194]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a50b62e4c8>