Implementing a Multivariate Linear Regression model in python

Earlier, I wrote about how to implement a simple linear regression (SLR) model in python. SLR is probably the easiest model to implement among the most popular machine learning algorithms. In this post, we are going to take it one step further and instead of working with just one independent variable, we will be working with multiple independent variables. Such a model is called a multivariate linear regression (MLR) model.

How does the model work?

A multivariate linear model can be described by a linear equation consisting of multiple independent variables.

For example:

In this equation, ß (beta) defines all the coefficients, x defines all the independent variables and y defines dependent variable.

An SLR model is a simplified version of an MLR model where there is only one x. Linear regression models use a technique called Ordinary Least Squares (OLS) to find the optimum value for the betas. OLS consists of calculating the error which is the difference between predicted value and actual value and then taking square of it. The goal is to find the betas that minimize the sum of the squared errors.

If you want to learn more about SLM and OLS, I highly recommend this visual explanation.

How can I implement MLR model?

Implementing an MLR model is pretty much the same as implementing an SLR model. The only difference is that because you have additional independent variables, you need to make sure you select only the most relevant features/independent varibles in your final model. Not all of your data is necessary to build your model. In fact, most of the times, a lot of your data will be garbage and as you know garbige in means garbage out! The process of refining your model by selecting ideal features is called ‘feature selection’. I will not be covering that in this post but will definitely do in a separate post.

As discussed before, we are going to be following these steps:

  • Exploring the dataset
  • Preprocessing the dataset
  • Splitting the dataset into training and testing set
  • Building the model
  • Evaluating the model

Exploring the dataset

We will be working with automobile dataset provided by UCI. This dataset has a lot of features and instead of looking at them all, we will select only a few important ones for simplicity.

In [34]:
# Let's load the data into python and take a look at it
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  

# Loading entire dataset
dataset = pd.read_csv('car_data.csv', names=['symboling','normalized_losses','make','fuel_type','aspiration',
                                             'price'], index_col=None)

# Select few important columns
dataset = dataset[['make', 'engine_size', 'horsepower', 'price']]

As we can see below, our dataset has four columns:

  • make: car company
  • engine_size: size of the engine
  • horsepower: horsepower
  • price: price of the car
In [35]:
make engine_size horsepower price
0 alfa-romero 130 111 13495
1 alfa-romero 130 111 16500
2 alfa-romero 152 154 16500
3 audi 109 102 13950
4 audi 136 115 17450
In [36]:
# Taking a deeper look, we can see that we have missing values denoted as '?' in our dataset
# For example, in the example below, you can see that last row has a price of '?' which doesn't make sense.
make engine_size horsepower price
0 alfa-romero 130 111 13495
1 alfa-romero 130 111 16500
2 alfa-romero 152 154 16500
3 audi 109 102 13950
4 audi 136 115 17450
5 audi 136 110 15250
6 audi 136 110 17710
7 audi 136 110 18920
8 audi 131 140 23875
9 audi 131 160 ?
In [37]:
# Let's remove these '?' values from our dataset entirely
import numpy as np
dataset = dataset.replace('?', np.nan)

# Let's drop any null values
dataset = dataset.dropna()

Now, we are going to plot to scatter plots to see if there is any obvious relationship between our independent variables (features) and price. Of course, we can't plot 'make' so we will only focus on engine_size and horsepower.

In [38]:
# As we can see from the scatter plot below, there seems to be a linear relationship between horsepower and price.
# This makes sense because a car with higher horsepower is usually more expensive. 
plt.scatter(dataset.horsepower, dataset.price);
plt.title('Horsepower vs Price');
In [39]:
# The graph below also shows that there is a linear relationship between engine size. 
# We can see that as the engine size increases, the price also goes up. 
plt.scatter(dataset.engine_size, dataset.price);
plt.title('Engine Size vs Price');
plt.xlabel('engine size');

Preprocessing the dataset

Looking at the dataset, we can see that we have a categorical feature (make of the car). In one of my earlier posts, I had covered how to handle categorical features so that they can be fed into our machine learning algorithms.

But before we take care of that, let's separate our independent variables from our dependent variable.

In [40]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

Now, let's encode our categorical feature.

In [41]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
X_label_encoder = LabelEncoder()
X[:,0] = X_label_encoder.fit_transform(X[:,0])
one_hot_encoder = OneHotEncoder(categorical_features=[0])
X = one_hot_encoder.fit_transform(X).toarray()

As we can see below, we now have additional features that replace our categorical feature. Specifically, the number of additional features is equal to different categories that our categorical feature contained.

In [42]:
array([[   1.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,  130.,  111.]])

Splitting the dataset into training and testing set

Let's split our dataset into training and testing sets so that we can build our model using the training set and then evaluate its performance against the testing set.

In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

At this point, we have four different arrays:

  • X_train - independent variable (training set)
  • X_test - independent variable (testing set)
  • y_train - dependent variable (training set)
  • y_test - dependent variable (testing set)

We will use X_train and y_train to train our MLR model and then feed MLR model with X_test to get our predictions. We will compare these predictions that our model gave us with the actual data, y_test. By comparing these two datasets, we can evaluate our model's performance.

Building the model

We will now train our model using the training dataset. This part of the code is exactly the same as simple linear regression model which is convenient.

In [44]:
# Training our model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression(), y_train)

# Predicting values using our trained model
y_pred = regressor.predict(X_test)

We can see the values of betas (coefficients) that the model decided worked best.

In [45]:
# We can see that some coefficients are positive while others are negative.
array([ -1516.3446665 ,   2272.32332627,   4934.5416987 ,  -3444.61060831,
        -3631.62722277,  -2717.60124225,  -3331.47776225,   7471.81657909,
         -665.32087109,  10733.19687431,  -2203.29512067,  -4215.7795186 ,
        -3655.55602433,   1341.23937185,  -3815.74011407,   7673.87780497,
          816.73779978,  -3222.32699847,  -2963.15098654,  -1444.18288889,
         1583.28056975,     64.98959376,     53.11623642])
In [46]:
# Let's look at the output of y_test and y_pred
In [47]:
array(['9538', '8189', '11900', '13295', '9095', '7957', '16695', '9995',
       '28248', '12940', '5572', '7895', '12629', '37028', '9639', '5151',
       '11850', '13860', '15645', '6189', '13950', '10595', '41315',
       '7775', '36000', '6095', '16500', '7603', '7395', '11845', '23875',
       '11248', '9959', '31600', '9495', '7689', '36880', '8948', '8495',
       '22470'], dtype=object)
In [48]:
array([  9667.25830159,   8699.59034567,  14604.67617639,  13549.59848931,
         9311.66102404,   8283.50680983,  14498.44370355,  11263.71709354,
        29472.00023277,  17114.47486245,   6141.7509127 ,   9311.66102404,
        13936.86200959,  31589.33055424,  12999.22374781,   3381.74457964,
        14835.67527156,  16578.11070391,  12016.94916703,   5687.57780439,
        15086.45578155,  12037.58404749,  28496.93244384,   8464.49113461,
        42887.28870804,   9173.04685814,  13140.61538556,   7986.44500681,
         9173.04685814,   9561.10119108,  18534.64382832,  10164.68382342,
         9406.96984035,  29472.00023277,   8784.12240291,   7883.4674053 ,
        28496.93244384,  10164.68382342,  12037.58404749,  16949.50334612])
In [49]:
# Let's convert all values to integers for better comparison
array([ 9667,  8699, 14604, 13549,  9311,  8283, 14498, 11263, 29472,
       17114,  6141,  9311, 13936, 31589, 12999,  3381, 14835, 16578,
       12016,  5687, 15086, 12037, 28496,  8464, 42887,  9173, 13140,
        7986,  9173,  9561, 18534, 10164,  9406, 29472,  8784,  7883,
       28496, 10164, 12037, 16949])

Evaluating the Model

At this point, we have two arrays with different values of our dependent variable:

  • y_test - data that we set aside and didn't expost to our model when we were training it
  • y_pred - data which is the output of our model

We would like to compare the two datasets to see how our model performed. How accurate was our model when predicting the price of cars?

We are going to use some common metrics like variance score, mean absolute error, mean squared error and r squared error to evaluate our model.

In [50]:
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, r2_score
ex_var_score = explained_variance_score(y_test, y_pred)
m_absolute_error = mean_absolute_error(y_test, y_pred)
m_squared_error = mean_squared_error(y_test, y_pred)
r_2_score = r2_score(y_test, y_pred)

print("Explained Variance Score: "+str(ex_var_score))
print("Mean Absolute Error "+str(m_absolute_error))
print("Mean Squared Error "+str(m_squared_error))
print("R Squared Error "+str(r_2_score))
Explained Variance Score: 0.865080396011
Mean Absolute Error 2480.75774823
Mean Squared Error 12641258.8543
R Squared Error 0.864080747279

An r squared error of 0.864 is good but not great. I think we can do better. However, I am going to end this post here as we have successfully implemented a multivariate linear regression model in python using scikit learn. In a different post, we will look at the same example with the same dataset and see how we can use feature selection to possibly improve this model!

Hope you liked this post and let me know your thoughts and your suggestions!

You can download this code from my github.

Leave a Reply

Your email address will not be published. Required fields are marked *