Linear Regression is one of the most widely used statistical models in industry. Its main advantage lies in its simplicity and interpretability. Linear regression is used to forecast a company's revenue from business parameters, model a player's growth in sports, predict the price of a product given the cost of raw materials, predict crop yield from rainfall, and much more.
During our internship at Ambee, we were given a warm-up task to predict car prices given a dataset. This task strengthened our understanding of feature selection for multivariate linear regression and of the statistical measures used to choose the right model. You might be wondering why an environmental data company makes interns work on a car-pricing dataset. At Ambee, we celebrate outside data as much as inside data. That's what lets us relate things many don't see directly, like how a change in pollutant levels indirectly impacts health businesses' economies of scale. It is important for a data scientist to gain domain knowledge, but it is also important to keep an open mind about external factors that can be directly or indirectly related.
Regression is a statistical technique used to model continuous target variables. It has also been adopted in machine learning to predict continuous variables.
Regression models the target variable as a function of independent variables, also called predictors.
Linear Regression fits a straight line to our data. Simple Linear Regression (SLR) models the target variable as a function of a single predictor, whereas Multivariate Linear Regression (MLR) models the target variable as a function of multiple predictors.
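Concretely, SLR fits a model of the form y = β₀ + β₁x, while MLR fits y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ, where y is the target, the xᵢ are the predictors, and the βᵢ are coefficients estimated from the data.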
Problem Statement
A new car manufacturer is looking to set up business in the US market. To take on the competition, they need to know the factors on which the pricing of a car depends. The company wants to know which variables the price depends on and to what extent those variables explain the price of a car.
Business Goal
We need to build a model for the price of a car as a function of explanatory variables. The company will then use it to configure the price of a car according to its features or configure the features according to its price.
In this blog post, we shall go through the process of cleaning the data, understanding our variables and modelling using linear regression.
Let us import our libraries. NumPy is a fast matrix-computation library that most of the other libraries depend on, and we might need it directly at some point. Pandas is our data-manipulation library and one of the most important libraries in our pipeline. Matplotlib and Seaborn are used for plotting graphs.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Let us read our dataset.
cardata=pd.read_csv(r'CarPrice_Assignment.csv')
cardata.head()
5 rows × 26 columns
We can use head() to view the first five records. We observe that there are a lot of variables, and many of them are categorical. So, feature selection will play an important role going forward.
Let us check if there are any missing values.
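This can be done with a single pandas call, for example:

cardata.isnull().sum()   # number of missing values per column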
Turns out there aren’t any. So we do not need to worry about filling any missing values.
We can get descriptive statistics using pandas' describe().
cardata.describe()
Now, we shall do some processing of our data.
1) We only want the company name. So let's split CarName and extract just the company, renaming it to Company to avoid confusion.
2) We will calculate a total miles-per-gallon figure (total_mpg) and remove citympg and highwaympg.
3) We do not require the ID column, so let's remove that as well.
4) We will change the datatype of symboling to string, since it is a categorical variable and should not be mistaken for a continuous one.
A sketch of these steps is shown below.
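Here is a minimal sketch of the four steps, assuming the column names of the standard CarPrice_Assignment.csv file; how the two mpg columns are combined is our assumption (a simple average):

cardata['Company'] = cardata['CarName'].str.split(' ').str[0]            # keep only the company name
cardata['total_mpg'] = (cardata['citympg'] + cardata['highwaympg']) / 2  # assumed simple average
cardata = cardata.drop(columns=['CarName', 'citympg', 'highwaympg', 'car_ID'])
cardata['symboling'] = cardata['symboling'].astype(str)                  # treat symboling as categorical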
Now, let us explore our data. We will look at how our data is distributed by plotting a histogram. We've plotted it using both matplotlib and seaborn; both convey the same information.
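For reference, a minimal sketch of such a side-by-side plot (histplot assumes a recent seaborn version):

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(cardata['price'], bins=30)              # matplotlib version
axes[0].set(title='Price (matplotlib)', xlabel='price')
sns.histplot(cardata['price'], bins=30, ax=axes[1])  # seaborn version
axes[1].set(title='Price (seaborn)')
plt.show()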
We can see from the plots above that our data is skewed: there are more cheap cars in our dataset than expensive ones.
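The summary statistics below come straight from pandas, for example:

print('Mean:', cardata['price'].mean())
print('Median:', cardata['price'].median())
print('Standard Deviation:', cardata['price'].std())
print('Variance:', cardata['price'].var())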
Mean: 13276.710570731706
Median: 10295.0
Standard Deviation: 7988.85233174315
Variance: 63821761.57839796
Since the mean and median are not similar, the data is skewed, as seen in the histogram above. There is also a high amount of variance in the data.
cardata.price.describe()
count      205.000000
mean     13276.710571
std       7988.852332
min       5118.000000
25%       7788.000000
50%      10295.000000
75%      16503.000000
max      45400.000000
Name: price, dtype: float64
We shall plot a boxplot to see how our price is distributed. A boxplot shows the minimum, first quartile (25%), median, third quartile (75%), maximum, and outliers (represented as dots).
sns.boxplot(y=cardata.price,color='#13d2f2')
We can see below that most cars in our dataset are manufactured by Toyota.
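A hedged sketch of the company-count plot (the Company column comes from our preprocessing step above):

plt.figure(figsize=(12, 4))
sns.countplot(x=cardata['Company'], order=cardata['Company'].value_counts().index)
plt.xticks(rotation=90)
plt.show()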
Now, we need to select the features with which we can model price using linear regression. So, we will look at how much correlation each feature has with price. Correlation explains how strongly two things are related to each other. For example, the amount of rainfall is correlated with how wet your garden is. But correlation doesn't always mean causation: just because your garden is wet doesn't mean it was due to rain. It could also be a sprinkler or some other source of water. In general, correlation helps us choose the most important variables to model with.
Now, we shall look for multicollinearity. Two variables are collinear if they are highly correlated. Multicollinearity happens when there is high correlation between predictors. This is a problem because linear regression doesn't handle multicollinearity well.
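A correlation heatmap is one way to spot this; a minimal sketch:

plt.figure(figsize=(10, 8))
sns.heatmap(cardata.select_dtypes('number').corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()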
We see that enginesize, horsepower, curbweight, and total_mpg are highly correlated with price, and also strongly correlated with each other. We need to choose one predictor from among these.
We now need to handle the categorical variables. We can convert them to binary variables using get_dummies(). Each value in a column becomes a separate column after binarization.
cardata=pd.get_dummies(cardata)
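A toy column illustrates the idea (hypothetical values):

pd.get_dummies(pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas']}))
# Result: columns fueltype_diesel and fueltype_gas containing 0/1
# (or boolean, depending on the pandas version) flags per row.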
We need to select one predictor from enginesize, horsepower, curbweight, and total_mpg. We shall fit a simple linear regression model for each of them.
Before moving ahead, we need to look at some statistical measures for choosing the best model.
1) The p-value helps you determine the significance of your results: a relationship is considered statistically significant if the p-value is less than 0.05.
2) R2 measures the goodness of fit: a higher R2 score means our model fits the data better. But R2 also increases as the number of features increases, so we need to be careful.
A sketch of the per-predictor fits follows.
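The fits for the first three candidates were similar to one another; a minimal sketch, assuming the column names from our preprocessing step (the total_mpg model is shown separately below):

import statsmodels.api as sm

target = cardata['price']
for col in ['enginesize', 'horsepower', 'curbweight']:  # total_mpg is fitted below
    X = sm.add_constant(cardata[col])                   # add the intercept term
    lm = sm.OLS(target, X).fit()
    print(lm.summary())                                 # reports R2 and p-values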
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.31e+04. This might indicate that there are strong multicollinearity or other numerical problems.
import statsmodels.api as sm

predictors = cardata['total_mpg']
predictors = sm.add_constant(predictors)
lm_1 = sm.OLS(target, predictors).fit()
print(lm_1.summary())
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the above observations, we can select one of the predictors. Since enginesize has the highest R2 value and a p-value of 0, we select it. The reason for picking just one of these predictors is the high degree of multicollinearity between them; we shall drop the rest.
We will also drop all columns whose correlation with price is only around 0.5 or weaker. We will drop carlength and carwidth as well, since they are highly correlated with each other and significantly correlated with enginesize.
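One way to sketch this inspection (the cutoff value is our assumption; only the carlength/carwidth drop is stated above):

# List numeric columns that correlate weakly with price (cutoff assumed)
corr_with_price = cardata.select_dtypes('number').corr()['price'].abs().sort_values()
print(corr_with_price[corr_with_price <= 0.55])

# Drop the two dimensions that are collinear with each other and with enginesize
cardata = cardata.drop(columns=['carlength', 'carwidth'])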
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We get a decent model, with an R2 score of 0.76, using only enginesize. We will now add more variables to see if they improve the model.
Let us see if adding the front-wheel-drive or rear-wheel-drive dummy improves our model.
predictors2 = cardata[['enginesize', 'drivewheel_fwd']]  # select from the dummified frame
predictors2 = sm.add_constant(predictors2)
lm_2 = sm.OLS(target, predictors2).fit()
print(lm_2.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
We get a warning saying that there is strong multicollinearity. This can be explained by the fact that front-wheel drive and rear-wheel drive never coexist in the same car, so their dummy columns are perfectly anti-correlated. We also see that our p-values rise beyond 0.05, so we cannot go with these predictors.
We will add the cylindernumber_four variable to our existing model and see if it improves it.
predictors5 = cardata[['enginesize', 'drivewheel_rwd', 'cylindernumber_four']]
predictors5 = sm.add_constant(predictors5)
lm_5 = sm.OLS(target, predictors5).fit()
print(lm_5.summary())
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now, let us plot our predictions to see how our models perform.
pred = lm_6.predict(predictors6)  # lm_6 / predictors6: the final model iteration (not shown above)

# Actual vs Predicted
c = range(1, 206)
fig = plt.figure()
plt.plot(c, target, color="blue", linewidth=3.5, linestyle="-")  # actual prices
plt.plot(c, pred, color="red", linewidth=3.5, linestyle="-")     # predicted prices
fig.suptitle('Actual and Predicted', fontsize=20)
plt.xlabel('Index', fontsize=18)
plt.ylabel('Car Price', fontsize=16)

# Error terms
fig = plt.figure()
plt.plot(c, target - pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)
plt.xlabel('Index', fontsize=18)
plt.ylabel('Car Price', fontsize=16)
We can see that models 5 and 6 predict the price quite well. So, the answer to our problem statement is: to determine the price, look at enginesize, the drive-wheel type (front vs rear), and whether the number of cylinders is four (which has a negative correlation with price).
This task was done by Pareekshith US Katti and N Nithin Srivatsav during their internship at Ambee.