Linear regression is one of the most widely used statistical models in industry. Its main advantages are simplicity and interpretability. It is used to forecast a company's revenue from business parameters, project a player's growth in sports, predict the price of a product given the cost of raw materials, predict crop yield given rainfall, and much more. During our internship at Ambee, we were given a warm-up task: predict car prices from a given dataset. The task strengthened our understanding of feature selection for multivariate linear regression and of the statistical measures used to choose the right model. You might be wondering why an environmental data company makes interns work on a car pricing dataset. At Ambee, we value outside data as much as inside data. That is what lets us relate things like how a change in pollutants impacts health businesses' economies of scale, effects that aren't seen directly by many but matter indirectly. It is important for a data scientist to gain domain knowledge, but it is equally important to keep an open mind about external factors that can be directly or indirectly related.

Regression is a statistical technique used to model continuous target variables, and it has been adopted in machine learning to predict continuous variables. Regression models the target variable as a function of independent variables, also called predictors. Linear regression fits a straight line to the data: Simple Linear Regression (SLR) models the target as a function of a single predictor, whereas Multivariate Linear Regression (MLR) models it as a function of multiple predictors.
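To make the SLR/MLR distinction concrete, here is a tiny sketch on made-up numbers (purely illustrative, not part of the car dataset used below):
import numpy as np
# Toy data: SLR fits y = b0 + b1*x; MLR would simply add more predictors (x1, x2, ...)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, deg=1)   # slope (b1) and intercept (b0) of the best-fit line
print(f"SLR fit: y = {b0:.2f} + {b1:.2f} * x")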

Problem Statement

A new car manufacturer is looking to set up business in the US market. To take on the competition, the company needs to know which variables the price of a car depends on, and to what extent those variables explain the price.

Business Goal

We need to build a model for the price of a car as a function of explanatory variables. The company will then use it to set the price of a car according to its features, or to choose the features according to a target price. In this blog post, we shall go through the process of cleaning the data, understanding our variables and modelling with linear regression. Let us import our libraries. NumPy is a fast matrix computation library that most of the other libraries depend on, and we may need it at some point. pandas is our data manipulation library and one of the most important pieces of our pipeline. Matplotlib and seaborn are used for plotting graphs.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Let us read our dataset.
cardata=pd.read_csv(r'CarPrice_Assignment.csv')
cardata.head()
5 rows × 26 columns. We can use head() to view the first five records. We observe that there are a lot of variables and many of them are categorical, so feature selection will play an important role going forward. Let us check if there are any missing values.
cardata.isnull().sum()
car_ID              0
symboling 0
CarName 0
fueltype 0
aspiration 0
doornumber 0
carbody 0
drivewheel 0
enginelocation 0
wheelbase 0
carlength 0
carwidth 0
carheight 0
curbweight 0
enginetype 0
cylindernumber 0
enginesize 0
fuelsystem 0
boreratio 0
stroke 0
compressionratio 0
horsepower 0
peakrpm 0
citympg 0
highwaympg 0
price 0
dtype: int64
Turns out there aren't any, so we do not need to worry about filling in missing values. We can get descriptive statistics using pandas' describe().
cardata.describe()
Now, we shall do some processing of our data:
1) We only want the company name, so we split CarName, keep just the first word, and rename the column to Company to avoid confusion.
2) We compute a combined total_mpg as a weighted blend of city mileage (55%) and highway mileage (45%), and remove citympg and highwaympg.
3) We do not need car_ID, so we drop that as well.
4) We change the datatype of symboling to string, since it is a categorical variable and should not be mistaken for a continuous one.
cardata['CarName']=cardata['CarName'].apply(lambda name: name.split()[0])
cardata.rename(index=str,columns={'CarName':'Company'},inplace=True)
cardata['total_mpg']=(55*cardata['citympg']/100)+(45*cardata['highwaympg']/100)
cardata.drop(['car_ID','citympg','highwaympg'],axis=1,inplace=True)
cardata.symboling=cardata.symboling.astype(str)
cardata.head()
5 rows × 24 columns. Let us see the companies present in our dataset.
cardata.Company.unique()
array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
'isuzu', 'jaguar', 'maxda', 'mazda', 'buick', 'mercury',
'mitsubishi', 'Nissan', 'nissan', 'peugeot', 'plymouth', 'porsche',
'porcshce', 'renault', 'saab', 'subaru', 'toyota', 'toyouta',
'vokswagen', 'volkswagen', 'vw', 'volvo'], dtype=object)
We can see that some of the companies are misspelled or repeated. Let us fix that.
cardata.Company.replace('maxda','mazda',inplace=True)
cardata.Company.replace('Nissan','nissan',inplace=True)
cardata.Company.replace('porcshce','porsche',inplace=True)
cardata.Company.replace('toyouta','toyota',inplace=True)
cardata.Company.replace('vokswagen','volkswagen',inplace=True)
cardata.Company.replace('vw','volkswagen',inplace=True)
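Equivalently, the same fixes can be applied in a single call by passing a mapping to replace(), a more compact alternative to the one-by-one calls above:
name_fixes = {'maxda': 'mazda', 'Nissan': 'nissan', 'porcshce': 'porsche',
              'toyouta': 'toyota', 'vokswagen': 'volkswagen', 'vw': 'volkswagen'}
cardata['Company'] = cardata['Company'].replace(name_fixes)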
We see that the names are now fixed.
cardata.Company.unique()
array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
'isuzu', 'jaguar', 'mazda', 'buick', 'mercury', 'mitsubishi',
'nissan', 'peugeot', 'plymouth', 'porsche', 'renault', 'saab',
'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)
Now, let us explore our data. We will look at how the price is distributed by plotting a histogram. We've plotted it using both matplotlib and seaborn; the interpretation is the same either way.
sns.set_style('darkgrid')
plt.hist(cardata['price'],histtype='step')
(array([83., 45., 35., 18.,  6.,  3.,  5.,  7.,  2.,  1.]),
array([ 5118. , 9146.2, 13174.4, 17202.6, 21230.8, 25259. , 29287.2,
33315.4, 37343.6, 41371.8, 45400. ]),
<a list of 1 Patch objects>)
sns.distplot(cardata.price)
We can see from the plots above that our data is skewed: there are more cheap cars in our dataset than expensive ones.
print('Mean:',cardata.price.mean())
print('Median:',cardata.price.median())
print('Standard Deviation:',cardata.price.std())
print('Variance:',cardata.price.var())
Mean: 13276.710570731706
Median: 10295.0
Standard Deviation: 7988.85233174315
Variance: 63821761.57839796
Since the mean is noticeably higher than the median, the data is right-skewed, as seen in the histogram above. There is also a high amount of variance in the data.
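We can also quantify the asymmetry directly with pandas' skewness estimate (a quick check on the same column; a positive value indicates a right skew):
print('Skewness:', cardata.price.skew())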
cardata.price.describe()
count      205.000000
mean 13276.710571
std 7988.852332
min 5118.000000
25% 7788.000000
50% 10295.000000
75% 16503.000000
max 45400.000000
Name: price, dtype: float64
We shall plot a boxplot to see how our price is distributed. A boxplot shows the minimum, first quartile (25%), median, third quartile (75%), maximum and outliers (represented as dots).
sns.boxplot(y=cardata.price,color='#13d2f2')
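The same quantities the boxplot draws can be computed directly; a minimal sketch using the usual 1.5 × IQR rule for flagging outliers:
q1, q3 = cardata.price.quantile([0.25, 0.75])
iqr = q3 - q1  # interquartile range
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = cardata.price[(cardata.price < lower_fence) | (cardata.price > upper_fence)]
print('IQR:', iqr)
print('Number of outlier prices:', len(outliers))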
We can see below that most cars in our dataset are manufactured by Toyota.
plt.figure(figsize=[20,7])
sns.countplot(cardata.Company)
We also have more petrol cars than diesel cars.
sns.countplot(cardata.fueltype)
The plot below shows the number of cars for each body type.
sns.countplot(cardata.carbody)
Now, let us look at how each variable affects the price. We can see that Jaguar and Buick have the highest average prices, followed by Porsche and BMW.
plt.figure(figsize=[20,7])
sns.barplot(x=cardata.Company,y=cardata.price,ci=None)
Diesel cars are more costly than petrol cars.
sns.barplot(x=cardata.fueltype,y=cardata.price,ci=None)
The plot below gives the relation between body type and price.
sns.barplot(x=cardata.carbody,y=cardata.price,ci=None)
The plot below gives the relation between number of cylinders and price.
sns.barplot(x=cardata.cylindernumber,y=cardata.price,ci=None)
Now, we need to select the features with which to model the price using linear regression. So, we will look at how strongly each feature correlates with price. Correlation measures how much two things are related to each other. For example, the amount of rainfall is correlated with how wet your garden is. But correlation doesn't always mean causation: just because your garden is wet doesn't mean it was due to rain; it could also be the sprinkler or any other source of water. In general, correlation helps us choose the most important variables to include in the model.
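As a quick illustration, the correlation between any single predictor and price can be computed directly (enginesize is used here only as an example column):
print(cardata['enginesize'].corr(cardata['price']))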
plt.figure(figsize=[15,15])
sns.heatmap(cardata.corr(),annot=True)
The numerical variables with the highest correlation to price are:
  • Engine Size
  • Curb Weight
  • Horsepower
  • Car Width
  • Car Length
  • Total mpg (negative correlation)
Let us drop the other numeric variables.
cardata.drop(['wheelbase','carheight','boreratio','stroke', 'compressionratio','peakrpm'],axis=1,inplace=True)
cardata.head()
Now, we shall look for multicollinearity. Two variables are collinear if they are highly correlated; multicollinearity happens when there is high correlation among the predictors themselves. This is a problem because linear regression does not handle multicollinearity well: coefficient estimates become unstable and hard to interpret.
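Besides inspecting a heatmap, multicollinearity is often quantified with variance inflation factors (VIF); values above roughly 5-10 are usually taken as a warning sign. A minimal sketch over a few of the numeric predictors still in the frame:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
num_cols = ['enginesize', 'horsepower', 'curbweight', 'total_mpg']
X = sm.add_constant(cardata[num_cols])
for i, col in enumerate(X.columns):
    if col != 'const':
        print(col, round(variance_inflation_factor(X.values, i), 2))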
plt.figure(figsize=[10,10])
sns.heatmap(cardata.drop('price',axis=1).corr(),annot=True)
We see that enginesize, horsepower, curbweight and total_mpg are highly correlated with one another, so we will need to choose just one predictor from among them. We also need to handle the categorical variables. We can convert them to binary variables using get_dummies(): each value in a column becomes a separate column after binarization.
cardata=pd.get_dummies(cardata)
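For instance, on a toy single-column frame (made-up values), the encoding looks like this:
toy = pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas']})
print(pd.get_dummies(toy))  # produces binary indicator columns fueltype_diesel and fueltype_gas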
We need to select one predictor from enginesize, horsepower, curbweight and total_mpg. We shall build a simple linear regression model for each of them. Before moving ahead, we need to look at some statistical measures used to choose the best model:
1) The p-value helps determine the significance of a result; a relation is considered statistically significant if the p-value is less than 0.05.
2) R2 measures goodness of fit; a higher R2 means the model fits the data better. However, R2 also increases as more features are added, so we need to be careful (the adjusted R2 reported in the summaries below corrects for this).
predictors=cardata['horsepower']
target=cardata['price']
We use the statsmodels library to build the linear regression model, since it gives more information and statistics than scikit-learn.
import statsmodels.api as sm
predictors= sm.add_constant(predictors)
lm_1 = sm.OLS(target,predictors).fit()
print(lm_1.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.653
Model: OLS Adj. R-squared: 0.651
Method: Least Squares F-statistic: 382.2
Date: Thu, 11 Jul 2019 Prob (F-statistic): 1.48e-48
Time: 23:00:26 Log-Likelihood: -2024.0
No. Observations: 205 AIC: 4052.
Df Residuals: 203 BIC: 4059.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -3721.7615 929.849 -4.003 0.000 -5555.163 -1888.360
horsepower 163.2631 8.351 19.549 0.000 146.796 179.730
==============================================================================
Omnibus: 47.741 Durbin-Watson: 0.792
Prob(Omnibus): 0.000 Jarque-Bera (JB): 91.702
Skew: 1.141 Prob(JB): 1.22e-20
Kurtosis: 5.352 Cond. No. 314.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
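Instead of reading the full summary table each time, the headline numbers can also be pulled off the fitted result directly (these are standard attributes of a fitted statsmodels OLS result):
print('R-squared:', lm_1.rsquared)
print('Adj. R-squared:', lm_1.rsquared_adj)
print('p-values:')
print(lm_1.pvalues)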
predictors=cardata['enginesize']
import statsmodels.api as sm
predictors= sm.add_constant(predictors)
lm_1 = sm.OLS(target,predictors).fit()
print(lm_1.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.764
Model: OLS Adj. R-squared: 0.763
Method: Least Squares F-statistic: 657.6
Date: Thu, 11 Jul 2019 Prob (F-statistic): 1.35e-65
Time: 23:00:26 Log-Likelihood: -1984.4
No. Observations: 205 AIC: 3973.
Df Residuals: 203 BIC: 3979.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -8005.4455 873.221 -9.168 0.000 -9727.191 -6283.700
enginesize 167.6984 6.539 25.645 0.000 154.805 180.592
==============================================================================
Omnibus: 23.788 Durbin-Watson: 0.768
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.092
Skew: 0.717 Prob(JB): 6.52e-08
Kurtosis: 4.348 Cond. No. 429.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
predictors=cardata['curbweight']
import statsmodels.api as sm
predictors= sm.add_constant(predictors)
lm_1 = sm.OLS(target,predictors).fit()
print(lm_1.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.698
Model: OLS Adj. R-squared: 0.696
Method: Least Squares F-statistic: 468.6
Date: Thu, 11 Jul 2019 Prob (F-statistic): 1.21e-54
Time: 23:00:26 Log-Likelihood: -2009.8
No. Observations: 205 AIC: 4024.
Df Residuals: 203 BIC: 4030.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1.948e+04 1543.962 -12.614 0.000 -2.25e+04 -1.64e+04
curbweight 12.8162 0.592 21.647 0.000 11.649 13.984
==============================================================================
Omnibus: 85.362 Durbin-Watson: 0.575
Prob(Omnibus): 0.000 Jarque-Bera (JB): 382.847
Skew: 1.591 Prob(JB): 7.34e-84
Kurtosis: 8.890 Cond. No. 1.31e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.31e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
predictors=cardata['total_mpg']
import statsmodels.api as sm
predictors= sm.add_constant(predictors)
lm_1 = sm.OLS(target,predictors).fit()
print(lm_1.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.485
Model: OLS Adj. R-squared: 0.482
Method: Least Squares F-statistic: 191.0
Date: Thu, 11 Jul 2019 Prob (F-statistic): 4.74e-31
Time: 23:00:26 Log-Likelihood: -2064.5
No. Observations: 205 AIC: 4133.
Df Residuals: 203 BIC: 4140.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.645e+04 1724.687 21.137 0.000 3.31e+04 3.99e+04
total_mpg -836.4846 60.533 -13.819 0.000 -955.839 -717.130
==============================================================================
Omnibus: 58.414 Durbin-Watson: 0.820
Prob(Omnibus): 0.000 Jarque-Bera (JB): 104.935
Skew: 1.473 Prob(JB): 1.64e-23
Kurtosis: 4.900 Cond. No. 123.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the above results, we can select one of the predictors. Since enginesize has the highest R2 value and a p-value of essentially 0, we select it. The reason for selecting just one of these predictors is the high degree of multicollinearity between them. We shall drop the rest.
cardata.drop(['horsepower','curbweight','total_mpg'],axis=1,inplace=True)
cardata.shape
(205, 70)
We have 70 columns after binarization. We obviously can't inspect all of them individually, so we shall drop all columns that have a low correlation with price.
cols_to_drop=cardata.corr()[(cardata.corr()['price']<=0.5) & (cardata.corr()['price']>=-0.5)]  # rows of the correlation matrix with |correlation to price| <= 0.5
cols_to_drop=cols_to_drop.reset_index()['index']  # the corresponding column names
cols_to_drop=list(cols_to_drop)
cardata.drop(cols_to_drop,axis=1,inplace=True)  # drop the weakly correlated columns
We are left with only 10 variables.
cardata.shape
(205, 10)
cardata.head()
Let us look at the correlations of the variables left.
plt.figure(figsize=[10,10])
sns.heatmap(cardata.corr(),annot=True)
We will also drop the remaining columns whose correlation with price is only around 0.5. In addition, we drop carlength and carwidth, since they are highly correlated with each other and strongly correlated with enginesize.
cardata.drop(['carlength','carwidth','Company_buick','fuelsystem_2bbl', 'fuelsystem_mpfi'],axis=1,inplace=True)
Let us start with enginesize, since we already know it has a good R2 and p-value.
predictors=cardata.drop('price',axis=1)
target=cardata.price
predictors1=predictors['enginesize']
import statsmodels.api as sm
predictors1= sm.add_constant(predictors1)
lm_1 = sm.OLS(target,predictors1).fit()
print(lm_1.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.764
Model: OLS Adj. R-squared: 0.763
Method: Least Squares F-statistic: 657.6
Date: Thu, 11 Jul 2019 Prob (F-statistic): 1.35e-65
Time: 23:00:28 Log-Likelihood: -1984.4
No. Observations: 205 AIC: 3973.
Df Residuals: 203 BIC: 3979.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -8005.4455 873.221 -9.168 0.000 -9727.191 -6283.700
enginesize 167.6984 6.539 25.645 0.000 154.805 180.592
==============================================================================
Omnibus: 23.788 Durbin-Watson: 0.768
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.092
Skew: 0.717 Prob(JB): 6.52e-08
Kurtosis: 4.348 Cond. No. 429.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We get a decent model with an R2 of 0.76 using only enginesize. We will now add more variables and see if the model improves. Let us check whether adding front-wheel drive (drivewheel_fwd) or rear-wheel drive (drivewheel_rwd) improves the model.
predictors2=predictors[['enginesize','drivewheel_fwd']]
import statsmodels.api as sm
predictors2= sm.add_constant(predictors2)
lm_2 = sm.OLS(target,predictors2).fit()
print(lm_2.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.794
Model: OLS Adj. R-squared: 0.792
Method: Least Squares F-statistic: 390.3
Date: Thu, 11 Jul 2019 Prob (F-statistic): 4.11e-70
Time: 23:00:28 Log-Likelihood: -1970.3
No. Observations: 205 AIC: 3947.
Df Residuals: 202 BIC: 3957.
Df Model: 2
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
const -3510.5293 1160.633 -3.025 0.003 -5799.040 -1222.019
enginesize 147.4621 7.157 20.604 0.000 133.350 161.574
drivewheel_fwd -3291.5804 603.483 -5.454 0.000 -4481.515 -2101.646
==============================================================================
Omnibus: 21.512 Durbin-Watson: 0.819
Prob(Omnibus): 0.000 Jarque-Bera (JB): 35.873
Skew: 0.580 Prob(JB): 1.62e-08
Kurtosis: 4.689 Cond. No. 655.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We see that our R2 has improved.
predictors3=predictors[['enginesize','drivewheel_rwd']]
import statsmodels.api as sm
predictors3= sm.add_constant(predictors3)
lm_3 = sm.OLS(target,predictors3).fit()
print(lm_3.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.795
Model: OLS Adj. R-squared: 0.793
Method: Least Squares F-statistic: 391.4
Date: Thu, 11 Jul 2019 Prob (F-statistic): 3.26e-70
Time: 23:00:28 Log-Likelihood: -1970.1
No. Observations: 205 AIC: 3946.
Df Residuals: 202 BIC: 3956.
Df Model: 2
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
const -6378.7190 868.209 -7.347 0.000 -8090.634 -4666.804
enginesize 144.6322 7.412 19.512 0.000 130.017 159.248
drivewheel_rwd 3508.0574 637.511 5.503 0.000 2251.028 4765.087
==============================================================================
Omnibus: 19.717 Durbin-Watson: 0.781
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.357
Skew: 0.528 Prob(JB): 5.71e-08
Kurtosis: 4.670 Cond. No. 481.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We get a pretty good improvement by adding either of the two. Let us add both and check.
predictors4=predictors[['enginesize','drivewheel_fwd','drivewheel_rwd']]
import statsmodels.api as sm
predictors4= sm.add_constant(predictors4)
lm_4 = sm.OLS(target,predictors4).fit()
print(lm_4.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.797
Model: OLS Adj. R-squared: 0.794
Method: Least Squares F-statistic: 262.5
Date: Thu, 11 Jul 2019 Prob (F-statistic): 3.07e-69
Time: 23:00:28 Log-Likelihood: -1969.2
No. Observations: 205 AIC: 3946.
Df Residuals: 201 BIC: 3960.
Df Model: 3
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
const -4829.7218 1458.548 -3.311 0.001 -7705.740 -1953.703
enginesize 144.5557 7.399 19.537 0.000 129.966 159.145
drivewheel_fwd -1656.2169 1254.380 -1.320 0.188 -4129.650 817.216
drivewheel_rwd 1971.1119 1326.626 1.486 0.139 -644.777 4587.001
==============================================================================
Omnibus: 20.686 Durbin-Watson: 0.794
Prob(Omnibus): 0.000 Jarque-Bera (JB): 36.360
Skew: 0.539 Prob(JB): 1.27e-08
Kurtosis: 4.760 Cond. No. 1.13e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
We get a warning about strong multicollinearity. This makes sense: front-wheel drive and rear-wheel drive are almost mutually exclusive, so the two dummy columns carry nearly the same information with opposite signs, the classic dummy-variable trap. We also see that the p-values of both coefficients rise well above 0.05, so we cannot keep both of these predictors together. Next, we will add the cylindernumber_four variable to our existing models and see if it improves the fit.
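One standard way to sidestep the dummy-variable trap is to keep k-1 indicator columns per categorical variable when encoding; get_dummies supports this via drop_first=True. A toy sketch (not what was run above):
toy = pd.DataFrame({'drivewheel': ['fwd', 'rwd', '4wd', 'fwd']})
print(pd.get_dummies(toy, drop_first=True))  # the first level ('4wd') is dropped, keeping only drivewheel_fwd and drivewheel_rwd
With that noted, let us continue and add cylindernumber_four as planned.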
predictors5=predictors[['enginesize','drivewheel_rwd','cylindernumber_four']]
import statsmodels.api as sm
predictors5= sm.add_constant(predictors5)
lm_5 = sm.OLS(target,predictors5).fit()
print(lm_5.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.823
Model: OLS Adj. R-squared: 0.820
Method: Least Squares F-statistic: 311.8
Date: Thu, 11 Jul 2019 Prob (F-statistic): 2.55e-75
Time: 23:00:28 Log-Likelihood: -1954.9
No. Observations: 205 AIC: 3918.
Df Residuals: 201 BIC: 3931.
Df Model: 3
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
const 21.9214 1389.248 0.016 0.987 -2717.449 2761.292
enginesize 120.8814 8.074 14.971 0.000 104.960 136.803
drivewheel_rwd 3098.2795 597.869 5.182 0.000 1919.379 4277.180
cylindernumber_four -4170.3627 736.215 -5.665 0.000 -5622.059 -2718.666
==============================================================================
Omnibus: 12.155 Durbin-Watson: 0.948
Prob(Omnibus): 0.002 Jarque-Bera (JB): 25.608
Skew: 0.202 Prob(JB): 2.75e-06
Kurtosis: 4.684 Cond. No. 861.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
predictors6=predictors[['enginesize','drivewheel_fwd','cylindernumber_four']]
import statsmodels.api as sm
predictors6= sm.add_constant(predictors6)
lm_6 = sm.OLS(target,predictors6).fit()
print(lm_6.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable: price R-squared: 0.821
Model: OLS Adj. R-squared: 0.819
Method: Least Squares F-statistic: 308.0
Date: Thu, 11 Jul 2019 Prob (F-statistic): 6.99e-75
Time: 23:00:28 Log-Likelihood: -1955.9
No. Observations: 205 AIC: 3920.
Df Residuals: 201 BIC: 3933.
Df Model: 3
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
const 2314.0832 1515.502 1.527 0.128 -674.239 5302.406
enginesize 124.4012 7.893 15.761 0.000 108.838 139.965
drivewheel_fwd -2827.0579 570.267 -4.957 0.000 -3951.530 -1702.585
cylindernumber_four -4087.0246 742.670 -5.503 0.000 -5551.448 -2622.601
==============================================================================
Omnibus: 12.945 Durbin-Watson: 0.985
Prob(Omnibus): 0.002 Jarque-Bera (JB): 25.263
Skew: 0.273 Prob(JB): 3.27e-06
Kurtosis: 4.631 Cond. No. 913.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
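Before comparing the models' fits, it is worth noting how such a model would be used for the business goal of pricing a new configuration. A small sketch, assuming lm_5 from above and a hypothetical car (made-up feature values: engine size 130, rear-wheel drive, not a four-cylinder); the columns must follow the same order as the training design matrix:
new_car = pd.DataFrame({'const': [1.0],
                        'enginesize': [130],
                        'drivewheel_rwd': [1],
                        'cylindernumber_four': [0]})
print('Predicted price:', lm_5.predict(new_car)[0])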
Now, let us plot our predictions to see how our models perform.
pred=lm_6.predict(predictors6)
# Actual vs Predicted
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target, color="blue", linewidth=3.5, linestyle="-") #Plotting Actual
plt.plot(c,pred, color="red", linewidth=3.5, linestyle="-") #Plotting predicted
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
# Error terms
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target-pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
Mean_Squared_Error : 11347528.099728286
r_square_value : 0.8213281339239666
pred=lm_5.predict(predictors5)
# Actual vs Predicted
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target, color="blue", linewidth=3.5, linestyle="-") #Plotting Actual
plt.plot(c,pred, color="red", linewidth=3.5, linestyle="-") #Plotting predicted
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
# Error terms
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target-pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
Mean_Squared_Error : 11234025.847691916
r_square_value : 0.8231152772557074
pred=lm_4.predict(predictors4)
# Actual vs Predicted
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target, color="blue", linewidth=3.5, linestyle="-") #Plotting Actual
plt.plot(c,pred, color="red", linewidth=3.5, linestyle="-") #Plotting predicted
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
# Error terms
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target-pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
Mean_Squared_Error : 12915408.499455474
r_square_value : 0.7966411611893495
pred=lm_3.predict(predictors3)
# Actual vs Predicted
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target, color="blue", linewidth=3.5, linestyle="-") #Plotting Actual
plt.plot(c,pred, color="red", linewidth=3.5, linestyle="-") #Plotting predicted
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
# Error terms
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target-pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
Mean_Squared_Error : 13027426.534669885
r_square_value : 0.7948773875101807
pred=lm_2.predict(predictors2)
# Actual vs Predicted
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target, color="blue", linewidth=3.5, linestyle="-") #Plotting Actual
plt.plot(c,pred, color="red", linewidth=3.5, linestyle="-") #Plotting predicted
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
# Error terms
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target-pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
Mean_Squared_Error : 13057261.284937426
r_square_value : 0.7944076261262594
pred=lm_1.predict(predictors1)
# Actual vs Predicted
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target, color="blue", linewidth=3.5, linestyle="-") #Plotting Actual
plt.plot(c,pred, color="red", linewidth=3.5, linestyle="-") #Plotting predicted
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
# Error terms
c = [i for i in range(1,206,1)]
fig = plt.figure()
plt.plot(c,target-pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Car Price', fontsize=16)
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
Mean_Squared_Error : 14980261.405551314
r_square_value : 0.7641291357806177
We can see that models 5 and 6 predict the price fairly well. So, the answer to our problem statement is: look at the engine size, whether the car is front- or rear-wheel drive, and whether the number of cylinders is four (which is negatively correlated with price) to determine the price of a car. This task was done by Pareekshith US Katti and N Nithin Srivatsav during their internship at Ambee.