Implementing a Multivariate Linear Regression model in Python
Earlier, I wrote about how to implement a simple linear regression (SLR) model in Python. SLR is probably the easiest model to implement among the most popular machine learning algorithms. In this post, we are going to take it one step further: instead of working with just one independent variable, we will be working with multiple independent variables. Such a model is called a multivariate linear regression (MLR) model.
How does the model work?
A multivariate linear model can be described by a linear equation consisting of multiple independent variables:

y = β0 + β1x1 + β2x2 + … + βnxn

In this equation, β (beta) denotes the coefficients, the x's denote the independent variables, and y denotes the dependent variable.
An SLR model is a simplified version of an MLR model where there is only one x. Linear regression models use a technique called Ordinary Least Squares (OLS) to find the optimal values for the betas. OLS computes the error for each observation, which is the difference between the predicted value and the actual value, and then squares it. The goal is to find the betas that minimize the sum of these squared errors.
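To make that objective concrete, here is a minimal sketch of the quantity OLS minimizes, using made-up prices purely for illustration:

```python
import numpy as np

# Hypothetical actual and predicted prices (made-up numbers, not from the dataset)
y_actual = np.array([9500.0, 13000.0, 16500.0])
y_predicted = np.array([10000.0, 12500.0, 16000.0])

# Error for each observation: predicted value minus actual value
errors = y_predicted - y_actual        # [500., -500., -500.]

# OLS searches for the betas that make this sum as small as possible
sum_squared_errors = np.sum(errors ** 2)
print(sum_squared_errors)              # 750000.0
```

Squaring keeps positive and negative errors from cancelling each other out and penalizes large misses more heavily.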
If you want to learn more about SLR and OLS, I highly recommend this visual explanation.
How can I implement MLR model?
Implementing an MLR model is pretty much the same as implementing an SLR model. The only difference is that because you have additional independent variables, you need to make sure you select only the most relevant features/independent variables for your final model. Not all of your data is necessary to build your model. In fact, much of the time, a lot of your data will be garbage, and as you know, garbage in means garbage out! The process of refining your model by selecting the ideal features is called 'feature selection'. I will not be covering that in this post but will definitely do so in a separate post.
As discussed before, we are going to be following these steps:
- Exploring the dataset
- Preprocessing the dataset
- Splitting the dataset into training and testing set
- Building the model
- Evaluating the model
```python
# Let's load the data into Python and take a look at it
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Loading entire dataset
dataset = pd.read_csv('car_data.csv',
                      names=['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
                             'num_of_doors', 'body_style', 'drive_wheels', 'engine_location',
                             'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
                             'num_of_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke',
                             'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg',
                             'price'],
                      index_col=None)

# Select a few important columns
dataset = dataset[['make', 'engine_size', 'horsepower', 'price']]
```
As we can see below, our dataset has four columns:
- make: car company
- engine_size: size of the engine
- horsepower: horsepower
- price: price of the car
```python
# Taking a deeper look, we can see that we have missing values denoted as '?' in our dataset.
# For example, in the output below, you can see that the last row has a price of '?',
# which doesn't make sense.
dataset[:10]
```
```python
# Let's remove these '?' values from our dataset entirely
import numpy as np
dataset = dataset.replace('?', np.nan)

# Let's drop any null values
dataset = dataset.dropna()
```
Now, we are going to plot two scatter plots to see if there is any obvious relationship between our independent variables (features) and price. Of course, we can't plot 'make' since it's categorical, so we will only focus on engine_size and horsepower.
```python
# As we can see from the scatter plot below, there seems to be a linear relationship
# between horsepower and price. This makes sense because a car with higher horsepower
# is usually more expensive.
plt.scatter(dataset.horsepower, dataset.price);
plt.title('Horsepower vs Price');
plt.xlabel('horsepower');
plt.ylabel('price');
```
```python
# The graph below also shows that there is a linear relationship between engine size and price.
# We can see that as the engine size increases, the price also goes up.
plt.scatter(dataset.engine_size, dataset.price);
plt.title('Engine Size vs Price');
plt.xlabel('engine size');
plt.ylabel('price');
```
Preprocessing the dataset
Looking at the dataset, we can see that we have a categorical feature (make of the car). In one of my earlier posts, I had covered how to handle categorical features so that they can be fed into our machine learning algorithms.
But before we take care of that, let's separate our independent variables from our dependent variable.
```python
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
```
Now, let's encode our categorical feature.
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Convert the 'make' strings (column 0) into integer labels
X_label_encoder = LabelEncoder()
X[:, 0] = X_label_encoder.fit_transform(X[:, 0])

# One-hot encode the integer labels in column 0
one_hot_encoder = OneHotEncoder(categorical_features=[0])
X = one_hot_encoder.fit_transform(X).toarray()
```
As we can see below, we now have additional features that replace our categorical feature. Specifically, the number of additional features is equal to the number of distinct categories that our categorical feature contained.
```
array([[   1.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,  130.,  111.]])
```
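As an aside, the "one column per category" behavior is easy to see on a toy example. The sketch below uses pandas' get_dummies (an alternative to the LabelEncoder/OneHotEncoder combination above) on a hypothetical three-make column, not the actual dataset:

```python
import pandas as pd

# Toy column with three distinct categories (hypothetical makes)
toy = pd.DataFrame({'make': ['audi', 'bmw', 'audi', 'toyota']})

# One-hot encoding replaces the single column with one 0/1 column per category
encoded = pd.get_dummies(toy['make'])
print(encoded.shape[1])  # 3 columns: one each for audi, bmw, toyota
```

Three distinct makes produce exactly three indicator columns, which is why our 'make' feature above expanded into as many columns as there are car companies in the data.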
Splitting the dataset into training and testing set
Let's split our dataset into training and testing sets so that we can build our model using the training set and then evaluate its performance against the testing set.
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
At this point, we have four different arrays:
- X_train - independent variable (training set)
- X_test - independent variable (testing set)
- y_train - dependent variable (training set)
- y_test - dependent variable (testing set)
We will use X_train and y_train to train our MLR model and then feed the trained model X_test to get our predictions. We will compare the predictions our model gives us with the actual data, y_test. By comparing these two datasets, we can evaluate our model's performance.
Building the model
We will now train our model using the training dataset. This part of the code is exactly the same as for the simple linear regression model, which is convenient.
```python
# Training our model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting values using our trained model
y_pred = regressor.predict(X_test)
```
We can see the values of the betas (coefficients) that the model decided worked best.
```python
# We can see that some coefficients are positive while others are negative.
regressor.coef_
```
```
array([ -1516.3446665 ,   2272.32332627,   4934.5416987 ,  -3444.61060831,
        -3631.62722277,  -2717.60124225,  -3331.47776225,   7471.81657909,
         -665.32087109,  10733.19687431,  -2203.29512067,  -4215.7795186 ,
        -3655.55602433,   1341.23937185,  -3815.74011407,   7673.87780497,
          816.73779978,  -3222.32699847,  -2963.15098654,  -1444.18288889,
         1583.28056975,     64.98959376,     53.11623642])
```
Let's look at the output of y_test and y_pred.
```
array(['9538', '8189', '11900', '13295', '9095', '7957', '16695', '9995',
       '28248', '12940', '5572', '7895', '12629', '37028', '9639', '5151',
       '11850', '13860', '15645', '6189', '13950', '10595', '41315', '7775',
       '36000', '6095', '16500', '7603', '7395', '11845', '23875', '11248',
       '9959', '31600', '9495', '7689', '36880', '8948', '8495', '22470'],
      dtype=object)
```

```
array([  9667.25830159,   8699.59034567,  14604.67617639,  13549.59848931,
         9311.66102404,   8283.50680983,  14498.44370355,  11263.71709354,
        29472.00023277,  17114.47486245,   6141.7509127 ,   9311.66102404,
        13936.86200959,  31589.33055424,  12999.22374781,   3381.74457964,
        14835.67527156,  16578.11070391,  12016.94916703,   5687.57780439,
        15086.45578155,  12037.58404749,  28496.93244384,   8464.49113461,
        42887.28870804,   9173.04685814,  13140.61538556,   7986.44500681,
         9173.04685814,   9561.10119108,  18534.64382832,  10164.68382342,
         9406.96984035,  29472.00023277,   8784.12240291,   7883.4674053 ,
        28496.93244384,  10164.68382342,  12037.58404749,  16949.50334612])
```
```python
# Let's convert all values to integers for better comparison
y_pred.astype(int)
```
```
array([ 9667,  8699, 14604, 13549,  9311,  8283, 14498, 11263, 29472,
       17114,  6141,  9311, 13936, 31589, 12999,  3381, 14835, 16578,
       12016,  5687, 15086, 12037, 28496,  8464, 42887,  9173, 13140,
        7986,  9173,  9561, 18534, 10164,  9406, 29472,  8784,  7883,
       28496, 10164, 12037, 16949])
```
Evaluating the Model
At this point, we have two arrays with different values of our dependent variable:
- y_test - data that we set aside and didn't expose to our model when we were training it
- y_pred - data which is the output of our model
We would like to compare the two datasets to see how our model performed. How accurate was our model when predicting the price of cars?
We are going to use some common metrics like explained variance score, mean absolute error, mean squared error and R squared score to evaluate our model.
```python
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, r2_score

ex_var_score = explained_variance_score(y_test, y_pred)
m_absolute_error = mean_absolute_error(y_test, y_pred)
m_squared_error = mean_squared_error(y_test, y_pred)
r_2_score = r2_score(y_test, y_pred)

print("Explained Variance Score: " + str(ex_var_score))
print("Mean Absolute Error " + str(m_absolute_error))
print("Mean Squared Error " + str(m_squared_error))
print("R Squared Error " + str(r_2_score))
```
```
Explained Variance Score: 0.865080396011
Mean Absolute Error 2480.75774823
Mean Squared Error 12641258.8543
R Squared Error 0.864080747279
```
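As a sanity check on what that last number means, R² can be computed by hand: it is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with made-up values (not our car prices):

```python
import numpy as np

# Made-up actual and predicted values, purely for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 38.0])

# Residual sum of squares: squared differences between actual and predicted
ss_res = np.sum((y_true - y_hat) ** 2)            # 4 + 4 + 9 + 4 = 21
# Total sum of squares: squared differences between actual and its mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 225 + 25 + 25 + 225 = 500

r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.958
```

An R² of 1 would mean the predictions match the actual values exactly, while an R² of 0 would mean the model does no better than always predicting the mean.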
An R squared score of 0.864 is good but not great. I think we can do better. However, I am going to end this post here, as we have successfully implemented a multivariate linear regression model in Python using scikit-learn. In a different post, we will look at the same example with the same dataset and see how we can use feature selection to possibly improve this model!
Hope you liked this post, and let me know your thoughts and suggestions!
You can download this code from my GitHub.