Linear regression is one of the most widely used statistical models in industry. Its main advantages are simplicity and interpretability. It is used to forecast a company's revenue from business parameters, project a player's growth in sports, predict the price of a product given the cost of raw materials, estimate crop yield given rainfall, and much more.

During our internship at Ambee, we were given a warm-up task: predict car prices from a given dataset. The task strengthened our understanding of feature selection for multivariate linear regression and of the statistical measures used to choose the right model. You might wonder why an environmental data company makes its interns work on a car pricing dataset. At Ambee, we value outside data as much as inside data. That is what lets us relate things like how a change in pollutants affects health businesses' economies of scale, effects that few see directly but that matter indirectly. It is important for a data scientist to gain domain knowledge, but it is just as important to keep an open mind about external factors that can be directly or indirectly related.

Regression is a statistical technique used to model continuous target variables, and it has been adopted in machine learning to predict continuous values. Regression models the target variable as a function of independent variables, also called predictors. Linear regression fits a straight line to our data. Simple Linear Regression (SLR) models the target variable as a function of a single predictor, whereas Multivariate Linear Regression (MLR) models it as a function of multiple predictors.
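To make the two forms concrete, here is a quick sketch (the column names are purely illustrative, borrowed from the dataset we use below):

```
SLR:  price = b0 + b1 * enginesize
MLR:  price = b0 + b1 * enginesize + b2 * horsepower + ... + bn * xn
```

where b0 is the intercept and b1 to bn are the coefficients the model estimates from the data.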

### Problem Statement

A new car manufacturer is looking to set up business in the US market. To take on the competition, they need to know the factors on which the price of a car depends. The company wants to know which variables the price depends on and to what extent those variables explain the price of a car.

We need to build a model of the price of a car as a function of explanatory variables. The company can then use it to configure the price of a car according to its features, or configure the features according to its price. In this blog post, we shall go through the process of cleaning the data, understanding our variables, and modelling with linear regression. Let us import our libraries. NumPy is a fast matrix computation library that most other libraries depend on, and we might need it at some point. pandas is our data manipulation library and one of the most important libraries in our pipeline. matplotlib and seaborn are used for plotting graphs.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
`cardata=pd.read_csv(r'CarPrice_Assignment.csv')`
`cardata.head()`
5 rows × 26 columns. We can use head() to view the first five records. We observe that there are a lot of variables, and many of them are categorical, so feature selection will play an important role going forward. Let us check whether there are any missing values.
`cardata.isnull().sum()`
```
car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64
```
Turns out there aren't any, so we do not need to worry about filling in missing values. We can get descriptive statistics using pandas' describe().
`cardata.describe()`
Now, we shall do some preprocessing of our data:

1) We only want the company name, so let us split CarName and keep only the first word. We will rename the column to Company to avoid confusion.

2) We will compute a combined miles-per-gallon figure (weighted 55% city, 45% highway) and remove citympg and highwaympg.

3) We do not require the ID column either, so let us remove that as well.

4) We will change the datatype of symboling to string, since it is a categorical variable and should not be mistaken for a continuous one.
```python
cardata['CarName'] = cardata['CarName'].apply(lambda name: name.split()[0])
cardata.rename(index=str, columns={'CarName': 'Company'}, inplace=True)
cardata['total_mpg'] = (55 * cardata['citympg'] / 100) + (45 * cardata['highwaympg'] / 100)
cardata.drop(['car_ID', 'citympg', 'highwaympg'], axis=1, inplace=True)
cardata.symboling = cardata.symboling.astype(str)
```
`cardata.head()`
5 rows × 24 columns. Let us see the companies present in our dataset.
`cardata.Company.unique()`
```
array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'maxda', 'mazda', 'buick', 'mercury',
       'mitsubishi', 'Nissan', 'nissan', 'peugeot', 'plymouth', 'porsche',
       'porcshce', 'renault', 'saab', 'subaru', 'toyota', 'toyouta',
       'vokswagen', 'volkswagen', 'vw', 'volvo'], dtype=object)
```
We can see that some of the companies are misspelled or repeated. Let us fix that.
```python
cardata.Company.replace('maxda', 'mazda', inplace=True)
cardata.Company.replace('Nissan', 'nissan', inplace=True)
cardata.Company.replace('porcshce', 'porsche', inplace=True)
cardata.Company.replace('toyouta', 'toyota', inplace=True)
cardata.Company.replace('vokswagen', 'volkswagen', inplace=True)
cardata.Company.replace('vw', 'volkswagen', inplace=True)
```
The names are now fixed, as we can verify:
`cardata.Company.unique()`
```
array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'mazda', 'buick', 'mercury', 'mitsubishi',
       'nissan', 'peugeot', 'plymouth', 'porsche', 'renault', 'saab',
       'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)
```
Now, let us explore our data. We will look at how it is distributed by plotting a histogram of price. We have plotted it using both matplotlib and seaborn; the two plots are interpreted the same way.
```python
sns.set_style('darkgrid')
plt.hist(cardata['price'], histtype='step')
```

```
(array([83., 45., 35., 18.,  6.,  3.,  5.,  7.,  2.,  1.]),
 array([ 5118. ,  9146.2, 13174.4, 17202.6, 21230.8, 25259. , 29287.2,
        33315.4, 37343.6, 41371.8, 45400. ]),
 <a list of 1 Patch objects>)
```
`sns.distplot(cardata.price)`
Looking at the plots above, we can see that our data is skewed: there are more cheap cars in our dataset than expensive ones.
```python
print('Mean:', cardata.price.mean())
print('Median:', cardata.price.median())
print('Standard Deviation:', cardata.price.std())
print('Variance:', cardata.price.var())
```

```
Mean: 13276.710570731706
Median: 10295.0
Standard Deviation: 7988.85233174315
Variance: 63821761.57839796
```
Since the mean and median are not close, the data is skewed, as seen in the histogram above. There is also a high amount of variance in the data.
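As a quick numeric check, which we add here for illustration (it was not in the original notebook), pandas can quantify the asymmetry directly:

```python
# positive skewness => long right tail (a few expensive cars pull the mean up)
print('Skewness:', cardata.price.skew())
```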
`cardata.price.describe()`
```
count      205.000000
mean     13276.710571
std       7988.852332
min       5118.000000
25%       7788.000000
50%      10295.000000
75%      16503.000000
max      45400.000000
Name: price, dtype: float64
```
We shall plot a boxplot to see how price is distributed. A boxplot shows the minimum, first quartile (25%), median, third quartile (75%), maximum, and outliers (represented as dots).
`sns.boxplot(y=cardata.price,color='#13d2f2')`
We can see from the plot below that most cars in our dataset are manufactured by Toyota.
```python
plt.figure(figsize=[20, 7])
sns.countplot(cardata.Company)
```
We also have more petrol cars than diesel cars.
`sns.countplot(cardata.fueltype)`
The plot below shows number of cars for each body type.
`sns.countplot(cardata.carbody)`
Now, let us look at how each variable affects the price. We can see that Jaguar and Buick have the highest prices, followed by Porsche and BMW.
```python
plt.figure(figsize=[20, 7])
sns.barplot(x=cardata.Company, y=cardata.price, ci=None)
```
Diesel cars are more costly than petrol cars.
`sns.barplot(x=cardata.fueltype,y=cardata.price,ci=None)`
The plot below shows the relation between body type and price.
`sns.barplot(x=cardata.carbody,y=cardata.price,ci=None)`
The plot below shows the relation between the number of cylinders and price.
`sns.barplot(x=cardata.cylindernumber,y=cardata.price,ci=None)`
Now, we need to select the features with which to model price using linear regression, so we will look at how strongly each feature correlates with price. Correlation measures how much two things move together. For example, the amount of rainfall is correlated with how wet your garden is. But correlation does not always mean causation: just because your garden is wet does not mean it rained; it could have been the sprinkler or any other source of water. In general, correlation helps us choose the most promising variables to model with.
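Before plotting the full heatmap, here is what a single pairwise correlation looks like (a small illustrative check; pandas' .corr() on a pair of Series computes the Pearson coefficient, a value between -1 and 1):

```python
# correlation of one predictor with the target
print(cardata['enginesize'].corr(cardata['price']))
```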
```python
plt.figure(figsize=[15, 15])
sns.heatmap(cardata.corr(), annot=True)
```
The numerical variables with the highest correlation to price are:
• Engine Size
• Curb Weight
• Horsepower
• Car Width
• Car Length
• Total mpg (negative correlation)
Let us drop all the other numerical variables.
```python
cardata.drop(['wheelbase', 'carheight', 'boreratio', 'stroke',
              'compressionratio', 'peakrpm'], axis=1, inplace=True)
```
`cardata.head()`
Now, we shall look for multicollinearity. Two variables are collinear if they are highly correlated, and multicollinearity occurs when there is high correlation among the predictors themselves. This is a problem because linear regression does not handle it well: the coefficient estimates become unstable and hard to interpret.
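A common diagnostic for this, which we did not use in the original walkthrough but sketch here as an optional check, is the variance inflation factor (VIF); values above roughly 5 to 10 usually flag a problematic predictor:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# VIF near 1 means a predictor is independent of the others;
# a large VIF means it is largely explained by the rest
X = sm.add_constant(cardata[['enginesize', 'horsepower', 'curbweight', 'total_mpg']])
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```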
```python
plt.figure(figsize=[10, 10])
sns.heatmap(cardata.drop('price', axis=1).corr(), annot=True)
```
We see that enginesize, horsepower, curbweight and total_mpg are highly correlated, so we will need to choose just one predictor among them. We also need to handle the categorical variables. We can convert them to binary variables using get_dummies(): each value in a column becomes a separate column after binarization.
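To see what binarization produces, here is a toy example (the values are made up purely for illustration):

```python
# each distinct category becomes its own indicator column (diesel, gas)
toy = pd.Series(['gas', 'diesel', 'gas'], name='fueltype')
print(pd.get_dummies(toy))
```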
`cardata=pd.get_dummies(cardata)`
We need to select one predictor from enginesize, horsepower, curbweight and total_mpg, so we shall make a simple linear regression model for each. Before moving ahead, let us look at the statistical measures we will use to pick the best one:

1) The p-value helps determine the significance of a result; a relation is conventionally considered statistically significant if the p-value is less than 0.05.

2) R² measures goodness of fit; a higher R² means our model explains more of the variance in the data. But R² also increases as more features are added, so we need to be careful.
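For reference, R² can be computed by hand; this small sketch (with hypothetical arrays y_true and y_pred) shows the formula the libraries implement:

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
```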
```python
predictors = cardata['horsepower']
target = cardata['price']
```
We use the statsmodels library to build our linear regression models, since it reports more information and statistics than scikit-learn.
```python
import statsmodels.api as sm
predictors = sm.add_constant(predictors)
lm_1 = sm.OLS(target, predictors).fit()
print(lm_1.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.653
Model:                            OLS   Adj. R-squared:                  0.651
Method:                 Least Squares   F-statistic:                     382.2
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           1.48e-48
Time:                        23:00:26   Log-Likelihood:                -2024.0
No. Observations:                 205   AIC:                             4052.
Df Residuals:                     203   BIC:                             4059.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -3721.7615    929.849     -4.003      0.000   -5555.163   -1888.360
horsepower   163.2631      8.351     19.549      0.000     146.796     179.730
==============================================================================
Omnibus:                       47.741   Durbin-Watson:                   0.792
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               91.702
Skew:                           1.141   Prob(JB):                     1.22e-20
Kurtosis:                       5.352   Cond. No.                         314.
==============================================================================
```
`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified.`
```python
import statsmodels.api as sm
predictors = cardata['enginesize']
predictors = sm.add_constant(predictors)
lm_1 = sm.OLS(target, predictors).fit()
print(lm_1.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.764
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     657.6
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           1.35e-65
Time:                        23:00:26   Log-Likelihood:                -1984.4
No. Observations:                 205   AIC:                             3973.
Df Residuals:                     203   BIC:                             3979.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -8005.4455    873.221     -9.168      0.000   -9727.191   -6283.700
enginesize   167.6984      6.539     25.645      0.000     154.805     180.592
==============================================================================
Omnibus:                       23.788   Durbin-Watson:                   0.768
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               33.092
Skew:                           0.717   Prob(JB):                     6.52e-08
Kurtosis:                       4.348   Cond. No.                         429.
==============================================================================
```
`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified.`
```python
import statsmodels.api as sm
predictors = cardata['curbweight']
predictors = sm.add_constant(predictors)
lm_1 = sm.OLS(target, predictors).fit()
print(lm_1.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.696
Method:                 Least Squares   F-statistic:                     468.6
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           1.21e-54
Time:                        23:00:26   Log-Likelihood:                -2009.8
No. Observations:                 205   AIC:                             4024.
Df Residuals:                     203   BIC:                             4030.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.948e+04   1543.962    -12.614      0.000   -2.25e+04   -1.64e+04
curbweight    12.8162      0.592     21.647      0.000      11.649      13.984
==============================================================================
Omnibus:                       85.362   Durbin-Watson:                   0.575
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              382.847
Skew:                           1.591   Prob(JB):                     7.34e-84
Kurtosis:                       8.890   Cond. No.                     1.31e+04
==============================================================================
```

`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified. The condition number is large, 1.31e+04. This might indicate that there are strong multicollinearity or other numerical problems.`
```python
import statsmodels.api as sm
predictors = cardata['total_mpg']
predictors = sm.add_constant(predictors)
lm_1 = sm.OLS(target, predictors).fit()
print(lm_1.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.482
Method:                 Least Squares   F-statistic:                     191.0
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           4.74e-31
Time:                        23:00:26   Log-Likelihood:                -2064.5
No. Observations:                 205   AIC:                             4133.
Df Residuals:                     203   BIC:                             4140.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       3.645e+04   1724.687     21.137      0.000    3.31e+04    3.99e+04
total_mpg   -836.4846     60.533    -13.819      0.000    -955.839    -717.130
==============================================================================
Omnibus:                       58.414   Durbin-Watson:                   0.820
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              104.935
Skew:                           1.473   Prob(JB):                     1.64e-23
Kurtosis:                       4.900   Cond. No.                         123.
==============================================================================
```
`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified.`
From the above observations, we can select one of the predictors. Since enginesize has the highest R² and a p-value of 0, we select it. The reason for selecting just one of these predictors is the high amount of multicollinearity between them. We shall drop the rest.
`cardata.drop(['horsepower','curbweight','total_mpg'],axis=1,inplace=True)`
`cardata.shape`
`(205, 70)`
We have 70 columns after binarization. We obviously can't inspect every column, so we shall drop all columns that have low correlation with price.
```python
cols_to_drop = cardata.corr()[(cardata.corr()['price'] <= 0.5) & (cardata.corr()['price'] >= -0.5)]
cols_to_drop = cols_to_drop.reset_index()['index']
cols_to_drop = list(cols_to_drop)
```
`cardata.drop(cols_to_drop,axis=1,inplace=True)`
We are left with only 10 variables.
`cardata.shape`
`(205, 10)`
`cardata.head()`
Let us look at the correlations of the variables left.
```python
plt.figure(figsize=[10, 10])
sns.heatmap(cardata.corr(), annot=True)
```
We will drop the remaining columns whose correlation with price is only around 0.5. We will also drop carlength and carwidth, since they are highly correlated with each other and significantly correlated with enginesize.
```python
cardata.drop(['carlength', 'carwidth', 'Company_buick', 'fuelsystem_2bbl',
              'fuelsystem_mpfi'], axis=1, inplace=True)
```
Let us start with enginesize, since we know it has a good R² and p-value.
```python
predictors = cardata.drop('price', axis=1)
target = cardata.price
```

```python
import statsmodels.api as sm
predictors1 = predictors['enginesize']
predictors1 = sm.add_constant(predictors1)
lm_1 = sm.OLS(target, predictors1).fit()
print(lm_1.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.764
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     657.6
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           1.35e-65
Time:                        23:00:28   Log-Likelihood:                -1984.4
No. Observations:                 205   AIC:                             3973.
Df Residuals:                     203   BIC:                             3979.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -8005.4455    873.221     -9.168      0.000   -9727.191   -6283.700
enginesize   167.6984      6.539     25.645      0.000     154.805     180.592
==============================================================================
Omnibus:                       23.788   Durbin-Watson:                   0.768
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               33.092
Skew:                           0.717   Prob(JB):                     6.52e-08
Kurtosis:                       4.348   Cond. No.                         429.
==============================================================================
```
`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified.`
We get a decent model with an R² of 0.76 using enginesize alone. We will now add more variables to see if they improve the model. Let us check whether adding front-wheel drive or rear-wheel drive helps.
```python
import statsmodels.api as sm
predictors2 = predictors[['enginesize', 'drivewheel_fwd']]
predictors2 = sm.add_constant(predictors2)
lm_2 = sm.OLS(target, predictors2).fit()
print(lm_2.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.794
Model:                            OLS   Adj. R-squared:                  0.792
Method:                 Least Squares   F-statistic:                     390.3
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           4.11e-70
Time:                        23:00:28   Log-Likelihood:                -1970.3
No. Observations:                 205   AIC:                             3947.
Df Residuals:                     202   BIC:                             3957.
Df Model:                           2
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const          -3510.5293   1160.633     -3.025      0.003   -5799.040   -1222.019
enginesize       147.4621      7.157     20.604      0.000     133.350     161.574
drivewheel_fwd -3291.5804    603.483     -5.454      0.000   -4481.515   -2101.646
==============================================================================
Omnibus:                       21.512   Durbin-Watson:                   0.819
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               35.873
Skew:                           0.580   Prob(JB):                     1.62e-08
Kurtosis:                       4.689   Cond. No.                         655.
==============================================================================
```
`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified.`
We see that our R2 has improved.
```python
import statsmodels.api as sm
predictors3 = predictors[['enginesize', 'drivewheel_rwd']]
predictors3 = sm.add_constant(predictors3)
lm_3 = sm.OLS(target, predictors3).fit()
print(lm_3.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.795
Model:                            OLS   Adj. R-squared:                  0.793
Method:                 Least Squares   F-statistic:                     391.4
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           3.26e-70
Time:                        23:00:28   Log-Likelihood:                -1970.1
No. Observations:                 205   AIC:                             3946.
Df Residuals:                     202   BIC:                             3956.
Df Model:                           2
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const          -6378.7190    868.209     -7.347      0.000   -8090.634   -4666.804
enginesize       144.6322      7.412     19.512      0.000     130.017     159.248
drivewheel_rwd  3508.0574    637.511      5.503      0.000    2251.028    4765.087
==============================================================================
Omnibus:                       19.717   Durbin-Watson:                   0.781
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               33.357
Skew:                           0.528   Prob(JB):                     5.71e-08
Kurtosis:                       4.670   Cond. No.                         481.
==============================================================================
```
`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified.`
We get a pretty good improvement by adding either of the two. Let us add both and check.
```python
import statsmodels.api as sm
predictors4 = predictors[['enginesize', 'drivewheel_fwd', 'drivewheel_rwd']]
predictors4 = sm.add_constant(predictors4)
lm_4 = sm.OLS(target, predictors4).fit()
print(lm_4.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.797
Model:                            OLS   Adj. R-squared:                  0.794
Method:                 Least Squares   F-statistic:                     262.5
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           3.07e-69
Time:                        23:00:28   Log-Likelihood:                -1969.2
No. Observations:                 205   AIC:                             3946.
Df Residuals:                     201   BIC:                             3960.
Df Model:                           3
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const          -4829.7218   1458.548     -3.311      0.001   -7705.740   -1953.703
enginesize       144.5557      7.399     19.537      0.000     129.966     159.145
drivewheel_fwd -1656.2169   1254.380     -1.320      0.188   -4129.650     817.216
drivewheel_rwd  1971.1119   1326.626      1.486      0.139    -644.777    4587.001
==============================================================================
Omnibus:                       20.686   Durbin-Watson:                   0.794
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               36.360
Skew:                           0.539   Prob(JB):                     1.27e-08
Kurtosis:                       4.760   Cond. No.                     1.13e+03
==============================================================================
```

`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified. The condition number is large, 1.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.`
We get a warning saying there is strong multicollinearity. This makes sense: front-wheel drive and rear-wheel drive are mutually exclusive, so the two dummy columns carry nearly the same information. We also see that the p-values rise well beyond 0.05, so we cannot use these predictors together. Instead, we will add the cylindernumber_four variable to our existing models and see if it improves them.
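As an aside, a standard way to avoid this dummy variable trap from the start (an alternative to manually picking one dummy, not what we did above) is to drop one category per variable at binarization time:

```python
# drop_first=True keeps k-1 dummies per categorical column, so no dummy
# is a linear combination of the others; cardata_raw is a hypothetical
# pre-binarization copy of the frame
dummies = pd.get_dummies(cardata_raw, drop_first=True)
```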
```python
import statsmodels.api as sm
predictors5 = predictors[['enginesize', 'drivewheel_rwd', 'cylindernumber_four']]
predictors5 = sm.add_constant(predictors5)
lm_5 = sm.OLS(target, predictors5).fit()
print(lm_5.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.823
Model:                            OLS   Adj. R-squared:                  0.820
Method:                 Least Squares   F-statistic:                     311.8
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           2.55e-75
Time:                        23:00:28   Log-Likelihood:                -1954.9
No. Observations:                 205   AIC:                             3918.
Df Residuals:                     201   BIC:                             3931.
Df Model:                           3
Covariance Type:            nonrobust
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  21.9214   1389.248      0.016      0.987   -2717.449    2761.292
enginesize            120.8814      8.074     14.971      0.000     104.960     136.803
drivewheel_rwd       3098.2795    597.869      5.182      0.000    1919.379    4277.180
cylindernumber_four -4170.3627    736.215     -5.665      0.000   -5622.059   -2718.666
==============================================================================
Omnibus:                       12.155   Durbin-Watson:                   0.948
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               25.608
Skew:                           0.202   Prob(JB):                     2.75e-06
Kurtosis:                       4.684   Cond. No.                         861.
==============================================================================
```
`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified.`
```python
import statsmodels.api as sm
predictors6 = predictors[['enginesize', 'drivewheel_fwd', 'cylindernumber_four']]
predictors6 = sm.add_constant(predictors6)
lm_6 = sm.OLS(target, predictors6).fit()
print(lm_6.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.821
Model:                            OLS   Adj. R-squared:                  0.819
Method:                 Least Squares   F-statistic:                     308.0
Date:                Thu, 11 Jul 2019   Prob (F-statistic):           6.99e-75
Time:                        23:00:28   Log-Likelihood:                -1955.9
No. Observations:                 205   AIC:                             3920.
Df Residuals:                     201   BIC:                             3933.
Df Model:                           3
Covariance Type:            nonrobust
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                2314.0832   1515.502      1.527      0.128    -674.239    5302.406
enginesize            124.4012      7.893     15.761      0.000     108.838     139.965
drivewheel_fwd      -2827.0579    570.267     -4.957      0.000   -3951.530   -1702.585
cylindernumber_four -4087.0246    742.670     -5.503      0.000   -5551.448   -2622.601
==============================================================================
Omnibus:                       12.945   Durbin-Watson:                   0.985
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               25.263
Skew:                           0.273   Prob(JB):                     3.27e-06
Kurtosis:                       4.631   Cond. No.                         913.
==============================================================================
```
`Warnings: Standard Errors assume that the covariance matrix of the errors is correctly specified.`
Now, let us plot our predictions to see how our models perform.
`pred=lm_6.predict(predictors6)`
```python
# Actual vs Predicted
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target, color="blue", linewidth=3.5, linestyle="-")  # plot actual
plt.plot(c, pred, color="red", linewidth=3.5, linestyle="-")     # plot predicted
fig.suptitle('Actual and Predicted', fontsize=20)                # plot heading
plt.xlabel('Index', fontsize=18)                                 # x-label
plt.ylabel('Car Price', fontsize=16)
```
```python
# Error terms
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target - pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)  # plot heading
plt.xlabel('Index', fontsize=18)          # x-label
plt.ylabel('Car Price', fontsize=16)
```

```python
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :', mse)
print('r_square_value :', r_squared)
```

```
Mean_Squared_Error : 11347528.099728286
r_square_value : 0.8213281339239666
```
`pred=lm_5.predict(predictors5)`
```python
# Actual vs Predicted
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target, color="blue", linewidth=3.5, linestyle="-")  # plot actual
plt.plot(c, pred, color="red", linewidth=3.5, linestyle="-")     # plot predicted
fig.suptitle('Actual and Predicted', fontsize=20)                # plot heading
plt.xlabel('Index', fontsize=18)                                 # x-label
plt.ylabel('Car Price', fontsize=16)
```
```python
# Error terms
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target - pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)  # plot heading
plt.xlabel('Index', fontsize=18)          # x-label
plt.ylabel('Car Price', fontsize=16)
```

```python
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :', mse)
print('r_square_value :', r_squared)
```

```
Mean_Squared_Error : 11234025.847691916
r_square_value : 0.8231152772557074
```
`pred=lm_4.predict(predictors4)`
```python
# Actual vs Predicted
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target, color="blue", linewidth=3.5, linestyle="-")  # plot actual
plt.plot(c, pred, color="red", linewidth=3.5, linestyle="-")     # plot predicted
fig.suptitle('Actual and Predicted', fontsize=20)                # plot heading
plt.xlabel('Index', fontsize=18)                                 # x-label
plt.ylabel('Car Price', fontsize=16)
```
```python
# Error terms
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target - pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)  # plot heading
plt.xlabel('Index', fontsize=18)          # x-label
plt.ylabel('Car Price', fontsize=16)
```

```python
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :', mse)
print('r_square_value :', r_squared)
```

```
Mean_Squared_Error : 12915408.499455474
r_square_value : 0.7966411611893495
```
`pred=lm_3.predict(predictors3)`
```python
# Actual vs Predicted
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target, color="blue", linewidth=3.5, linestyle="-")  # plot actual
plt.plot(c, pred, color="red", linewidth=3.5, linestyle="-")     # plot predicted
fig.suptitle('Actual and Predicted', fontsize=20)                # plot heading
plt.xlabel('Index', fontsize=18)                                 # x-label
plt.ylabel('Car Price', fontsize=16)
```
```python
# Error terms
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target - pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)  # plot heading
plt.xlabel('Index', fontsize=18)          # x-label
plt.ylabel('Car Price', fontsize=16)
```

```python
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :', mse)
print('r_square_value :', r_squared)
```

```
Mean_Squared_Error : 13027426.534669885
r_square_value : 0.7948773875101807
```
`pred=lm_2.predict(predictors2)`
```python
# Actual vs Predicted
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target, color="blue", linewidth=3.5, linestyle="-")  # plot actual
plt.plot(c, pred, color="red", linewidth=3.5, linestyle="-")     # plot predicted
fig.suptitle('Actual and Predicted', fontsize=20)                # plot heading
plt.xlabel('Index', fontsize=18)                                 # x-label
plt.ylabel('Car Price', fontsize=16)
```
```python
# Error terms
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target - pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)  # plot heading
plt.xlabel('Index', fontsize=18)          # x-label
plt.ylabel('Car Price', fontsize=16)
```

```python
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :', mse)
print('r_square_value :', r_squared)
```

```
Mean_Squared_Error : 13057261.284937426
r_square_value : 0.7944076261262594
```
`pred=lm_1.predict(predictors1)`
```python
# Actual vs Predicted
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target, color="blue", linewidth=3.5, linestyle="-")  # plot actual
plt.plot(c, pred, color="red", linewidth=3.5, linestyle="-")     # plot predicted
fig.suptitle('Actual and Predicted', fontsize=20)                # plot heading
plt.xlabel('Index', fontsize=18)                                 # x-label
plt.ylabel('Car Price', fontsize=16)
```
```python
# Error terms
c = [i for i in range(1, 206, 1)]
fig = plt.figure()
plt.plot(c, target - pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)  # plot heading
plt.xlabel('Index', fontsize=18)          # x-label
plt.ylabel('Car Price', fontsize=16)
```

```python
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(target, pred)
r_squared = r2_score(target, pred)
print('Mean_Squared_Error :', mse)
print('r_square_value :', r_squared)
```

```
Mean_Squared_Error : 14980261.405551314
r_square_value : 0.7641291357806177
```
We can see that models 5 and 6 predict the price quite well. So, the answer to our problem statement: to determine the price of a car, look at its engine size, whether it is front- or rear-wheel drive, and whether it has four cylinders (which correlates negatively with price). This task was done by Pareekshith US Katti and N Nithin Srivatsav during their internship at Ambee.
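As a final illustration of how the chosen model is used, we can plug a hypothetical car into the lm_5 equation by hand (coefficients copied from the summary above; the example car is made up):

```python
# price ~ const + b1*enginesize + b2*drivewheel_rwd + b3*cylindernumber_four
const, b_size, b_rwd, b_four = 21.9214, 120.8814, 3098.2795, -4170.3627

# hypothetical car: engine size 130, rear-wheel drive, not a four-cylinder
price = const + b_size * 130 + b_rwd * 1 + b_four * 0
print(round(price, 2))  # ~18834.78
```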