Implementing the Support Vector Machine (SVM) algorithm in Python

As you have probably noticed by now, there are several machine learning algorithms at your disposal. In my previous post, I covered a very popular classification algorithm called K-Nearest Neighbors (KNN). In today’s post, I will cover another very common and powerful classification algorithm called Support Vector Machine (SVM).

What is SVM and how does it work?

Just like KNN, SVM is a supervised learning model, which means that it learns from the labeled training set that we feed it. It can be used for both classification and regression problems, but it’s mostly used for classification. In this post, we will focus on using SVM for classification.

SVM consists of picking support vectors and then using them to define a decision boundary for classifying data points into different classes. The decision boundary is more formally known as a hyperplane. Points on different sides of the hyperplane belong to different classes. However, the same sets of points can often be separated by numerous hyperplanes, so how do you decide which hyperplane to select? That’s where the support vectors come into the picture.

 

Support vectors are just data points in your dataset that help decide which hyperplane to pick. These data points are the edge cases that lie closest to the boundary between your classes. For example, let’s say you have a dataset consisting of different car models belonging to two different classes: Ferrari and Porsche. The training data might have Ferraris that have the same characteristics as a Porsche and vice versa. These are the data points that will end up being our support vectors. To pick the hyperplane, you calculate the distance from the hyperplane to the nearest data point (which can be of any class). This distance is called the margin. The hyperplane that results in the largest margin is the one that gets chosen.
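To make the idea of a margin concrete, here is a minimal sketch in plain Python. The four points and the two candidate lines are made up for illustration (they are not from the Iris data), and a real SVM searches over all possible hyperplanes rather than comparing two hand-picked ones.

```python
import math

def distance_to_hyperplane(point, w, b):
    """Perpendicular distance from `point` to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

def margin(points, w, b):
    """Margin of a hyperplane: distance to the nearest point of any class."""
    return min(distance_to_hyperplane(p, w, b) for p in points)

# Hypothetical 2-D points: the first two belong to one class,
# the last two belong to the other class.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]

# Two hypothetical candidate lines given as (w, b); both separate the classes.
candidate_a = ((1.0, 0.0), -3.0)   # the vertical line x = 3
candidate_b = ((1.0, 1.0), -5.5)   # the diagonal line x + y = 5.5

margin_a = margin(points, *candidate_a)
margin_b = margin(points, *candidate_b)

print(margin_a, margin_b)
```

Of these two candidates, the diagonal line leaves more room between itself and the nearest points, so it is the one a maximum-margin rule would prefer.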

This is easy to accomplish when you have two classes that are linearly separable. What do you do when they aren’t? In such scenarios, SVM maps the data points to a higher-dimensional space (e.g. from 2D to 3D) in which the classes can be separated by a hyperplane.
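Here is a tiny illustration of why moving to a higher dimension helps. The numbers are invented, and the mapping (adding the square of the value as a second feature) is just one simple choice I picked for the example. In practice, scikit-learn's kernels handle this implicitly, so you never build the extra features by hand.

```python
def lift(x):
    """Map a 1-D point to 2-D by adding its square as a second feature."""
    return (x, x * x)

def separable_by_threshold(class_a, class_b):
    """True if a single cut point puts each class entirely on its own side."""
    return max(class_a) < min(class_b) or max(class_b) < min(class_a)

# Hypothetical 1-D data: one class sits in the middle, the other on both sides.
inner = [-1.0, 0.0, 1.0]
outer = [-3.0, -2.0, 2.0, 3.0]

# On the original axis, no single cut point separates the two classes.
separable_1d = separable_by_threshold(inner, outer)

# After lifting, the second coordinate (x squared) separates them cleanly.
inner_sq = [lift(x)[1] for x in inner]
outer_sq = [lift(x)[1] for x in outer]
separable_2d = separable_by_threshold(inner_sq, outer_sq)

print(separable_1d, separable_2d)
```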

Like KNN, SVM is better suited to small and medium-sized datasets, where it often achieves high accuracy; training becomes slow as the number of samples grows.


Exploring the dataset

For this example, we are going to use one of the most popular datasets in machine learning, the Iris dataset. Some of you might have heard of this dataset already. It was made famous by Ronald Fisher, who used it in a 1936 paper. It contains 4 features and a label. The features are:

  • sepal length
  • sepal width
  • petal length
  • petal width

and the label is the type of iris plant.

Our task is to use these four features to correctly predict the type of iris plant.

More information about this dataset can be found here.

In [97]:
# Let's load the data into python and take a look at it
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  

dataset = pd.read_csv(r'iris.csv')

dataset.head()
Out[97]:
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
In [98]:
# Let's modify the class values so that they don't have "Iris-" prefix
dataset['class'] = [x.split('-')[-1] for x in dataset['class']]
dataset.head()
Out[98]:
   sepal_length  sepal_width  petal_length  petal_width   class
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
In [99]:
# Let's get some more information about our dataset
dataset.describe()
Out[99]:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

We can see there are 150 rows. The mean values of sepal_length, sepal_width, petal_length and petal_width are 5.84 cm, 3.05 cm, 3.76 cm and 1.20 cm, respectively.

We don't see information about our label because its values are not quantitative.

Preprocessing the dataset

Iris is a very standard dataset and doesn't have any missing or null values, so we don't need to do any cleaning.
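If you want to verify a claim like this yourself, pandas makes it a short check. The sketch below rebuilds the first five rows shown earlier by hand so that it runs without iris.csv; on the real data you would run the same two calls on `dataset` instead of `sample`.

```python
import pandas as pd

# The first five rows shown earlier in this post, rebuilt by hand so the
# example does not depend on iris.csv being present.
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6],
    'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4],
    'petal_width': [0.2, 0.2, 0.2, 0.2, 0.2],
    'class': ['setosa'] * 5,
})

# Number of missing (NaN/None) values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# True when the frame has no missing values at all
has_no_missing = not sample.isnull().values.any()
print(has_no_missing)
```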

The only thing we need to do is separate the independent variables (the four features) from the dependent variable (the class label) so that we can feed them to our machine learning model.

Keep in mind that life is never this easy: data preprocessing is a crucial step in the machine learning process and often the most time-consuming one.

In [100]:
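# Independent variables: every column except the last one (the four features)
X = dataset.iloc[:, :-1].values
# Dependent variable: the last column (the class label)
y = dataset.iloc[:, -1].values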

Splitting the dataset into training and testing set

SVM is a supervised learning model (see my earlier post on supervised and unsupervised models), which means we need to feed it some data first to train it. Once it is trained, we need to test it on a different set of data to evaluate it. To do this, we need to split our dataset into a training set and a testing set. A common rule of thumb is to assign 80% of the dataset to training and 20% to testing.

In [101]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

At this point, we have four different arrays:

  • X_train - independent variable (training set)
  • X_test - independent variable (testing set)
  • y_train - dependent variable (training set)
  • y_test - dependent variable (testing set)
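One thing to be aware of: the split above is random, so you will get slightly different arrays (and results) on every run. Below is a sketch of a reproducible variant. It uses scikit-learn's bundled copy of Iris via `load_iris` as a stand-in for iris.csv, and the seed value 42 is an arbitrary choice of mine. Passing `stratify=y` also keeps the three classes equally represented in both sets.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load_iris ships with scikit-learn; it stands in for iris.csv here.
# Note that its labels are the integers 0, 1, 2 rather than species names.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% of the rows go to the testing set
    random_state=42,  # arbitrary seed so the split is the same on every run
    stratify=y,       # keep class proportions equal in both sets
)

# How many samples of each class ended up in the testing set
test_counts = Counter(y_test.tolist())
print(len(X_train), len(X_test), test_counts)
```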

Building the model

This is the step where we actually build our model.

In [102]:
from sklearn.svm import SVC

# Create classifier using linear kernel
# There are other kernels available as well such as poly, rbf, sigmoid etc.
classifier = SVC(kernel='linear')

# Training our model
classifier.fit(X_train, y_train)

# Predicting values using our trained model
y_pred = classifier.predict(X_test)
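Once a model is fitted, you can look at the support vectors it actually picked. The sketch below again uses `load_iris` as a stand-in for iris.csv and, to keep it short, fits on the whole dataset rather than on a training split, so the counts will not match what you get from the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Bundled copy of Iris, used here instead of iris.csv
X, y = load_iris(return_X_y=True)

classifier = SVC(kernel='linear')
classifier.fit(X, y)

# The data points the model kept as support vectors (one row per vector)
support_vectors = classifier.support_vectors_
# Number of support vectors for each class
per_class = classifier.n_support_

print(support_vectors.shape)
print(per_class)
```

Only a fraction of the 150 points end up as support vectors; the rest could be removed without changing the decision boundary.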

Evaluating the model

We now have our predicted values, y_pred, and our actual values, y_test. Let's look at how well we predicted.

In [103]:
# Generating a confusion matrix
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)

print(conf_matrix)
[[10  0  0]
 [ 0 10  1]
 [ 0  0  9]]
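In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes, both in sorted label order (setosa, versicolor, virginica here). So the single off-diagonal 1 means one versicolor was predicted as virginica. As a rough sketch, here is how the usual metrics fall out of the matrix above using plain Python:

```python
# The confusion matrix printed above (rows: actual class, columns: predicted class)
conf_matrix = [
    [10, 0, 0],
    [0, 10, 1],
    [0, 0, 9],
]
n_classes = len(conf_matrix)

# Accuracy: correct predictions (the diagonal) divided by all predictions
total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall per class: diagonal entry divided by its row total
recall = [conf_matrix[i][i] / sum(conf_matrix[i]) for i in range(n_classes)]

# Precision per class: diagonal entry divided by its column total
precision = [
    conf_matrix[i][i] / sum(row[i] for row in conf_matrix)
    for i in range(n_classes)
]

print(accuracy, recall, precision)
```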

Let's calculate the accuracy score, which is one of the most commonly used metrics for classification models:

In [104]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.966666666667

With a 96.67% accuracy score (29 of the 30 test samples classified correctly), I would say our model is very accurate. We may be able to improve performance further by fine-tuning the hyperparameters, for example by picking a different kernel. A great way to decide which values to pick for our hyperparameters is cross-validation, which we will cover in a different post later.
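As a small preview, here is a sketch of what comparing kernels with cross-validation could look like. It uses `load_iris` as a stand-in for iris.csv, the choice of 5 folds is mine, and every other hyperparameter is left at scikit-learn's default, so treat the numbers as a rough comparison rather than a tuned result.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Bundled copy of Iris, used here instead of iris.csv
X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each kernel
mean_scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    mean_scores[kernel] = scores.mean()

for kernel, score in mean_scores.items():
    print(f'{kernel}: {score:.3f}')
```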

I hope you found this post useful. You can download this code from my GitHub.
