Implementing k-nearest neighbors in Python

Last time, we looked at binomial logistic regression, one of the simplest classification algorithms in machine learning. In this post, I am going to cover another common classification algorithm: K Nearest Neighbors, otherwise known as KNN.

To recap, we have mostly discussed regression models, such as simple and multivariate linear regression and polynomial regression, which are used for predicting a quantity. Classification models, on the other hand, are used for predicting a category such as yes/no, will buy car/scooter/truck, will turn pink/green/red, etc.

KNN is another such classification model, and once I am done explaining how it works, you will understand where it gets its name. KNN is a supervised learning algorithm that follows 4 simple steps to predict a class:

1. Pick a value for k – this is the number of neighbors you will consider for each new data point you want to classify.
2. Measure distances – calculate the distance from the new data point to each of the existing data points. The distance can be Euclidean distance, Manhattan distance or some other metric.
3. Identify categories – pick the k data points that are closest to the new data point and note which category each of them belongs to.
4. Label – assign the new data point the category that most of its k nearest neighbors belong to.
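
The four steps above can be sketched in plain Python. This is a minimal illustration rather than how scikit-learn implements KNN; the function names (`euclidean`, `manhattan`, `knn_predict`) and the toy data are made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences along each feature."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_points, train_labels, new_point, k=3, distance=euclidean):
    # Step 2: distance from the new point to every training point
    distances = [(distance(p, new_point), label)
                 for p, label in zip(train_points, train_labels)]
    # Step 3: keep the k closest points and collect their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # Step 4: the most common label among the k neighbors wins
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two features, two classes
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))                      # small
print(knn_predict(points, labels, (4.9, 5.0), k=3, distance=manhattan))  # large
```

Step 1 is simply the `k` argument. Note that there is no training phase in this sketch: all the work happens when a prediction is requested.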

Key points to remember about the KNN algorithm:

• KNN is non-parametric: unlike some other popular machine learning algorithms, it doesn't make any assumptions about how the data is distributed. This means that KNN can work with more diverse datasets, and it can be more accurate when those assumptions would not hold.
• KNN computes the results on the go. It memorizes the training dataset and then uses it to label new data points when a prediction is requested.
• KNN doesn't scale well. In its simplest form, every prediction compares the new data point against all stored training points, so it can be very slow for large datasets.

Exploring the dataset

For this example, we are going to use one of the most popular datasets in machine learning: the Iris dataset. Some of you might have heard of this dataset already. The measurements were collected by Edgar Anderson and published by Ronald Fisher in 1936. It contains 4 features and a label. The features are:

• sepal length
• sepal width
• petal length
• petal width

and the label is the type of iris plant.

Our task is to use these four features to correctly predict the type of iris plant.

More information about this dataset can be found here.

In [1]:
# Let's load the data into python and take a look at it
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

dataset = pd.read_csv(r'iris.csv')

dataset.head()

Out[1]:
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
In [2]:
# Let's modify the class values so that they don't have "Iris-" prefix
dataset['class'] = [x.split('-')[-1] for x in dataset['class']]
dataset.head()

Out[2]:
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [3]:
# Let's get some more information about our dataset
dataset.describe()

Out[3]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

We can see there are 150 rows. The mean values of sepal_length, sepal_width, petal_length and petal_width are 5.84 cm, 3.05 cm, 3.76 cm and 1.20 cm, respectively.

We don't see information about our label because its values are not quantitative.
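
To summarize a categorical column, counting the rows per category is usually more useful. Here is a small sketch using a stand-in DataFrame (`sample` and its values are made up for illustration, not loaded from iris.csv):

```python
import pandas as pd

# Small stand-in frame in the style of the rows shown earlier (illustrative values)
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3],
    "class": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

# describe() on a mixed frame only summarizes the numeric columns by default
print(sample.describe().columns.tolist())

# value_counts() shows how many rows fall into each category
class_counts = sample["class"].value_counts()
print(class_counts)
```

Running `dataset['class'].value_counts()` on the full dataset would show whether the classes are balanced.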

Preprocessing the dataset

Iris is a very standard dataset and doesn't have any missing or null values so we don't need to do any preprocessing.
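
If you want to verify this on your own copy of the data, counting missing values per column is a quick check. The sketch below uses a stand-in frame with one deliberately missing value (not real Iris data) so that the output shows a non-zero count:

```python
import pandas as pd

# Stand-in frame with one deliberately missing value (illustrative only)
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, None],
    "sepal_width": [3.5, 3.0, 3.2],
})

# Number of missing values per column; all zeros means nothing to clean up
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

On the Iris dataset, `dataset.isnull().sum()` should report zero for every column if the file is complete.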

The only thing we need to do is split the dataset into independent variables (the four features) and the dependent variable (the class label) so that we can feed them to our machine learning model.

Keep in mind that life is never this easy: data preprocessing is a crucial step in the machine learning process and often the most time-consuming one.

In [4]:
# Features: every column except the last one
X = dataset.iloc[:, :-1].values
# Label: the last column (class)
y = dataset.iloc[:, -1].values


Splitting the dataset into training and testing set

KNN is a supervised learning model (see my earlier post on supervised and unsupervised models), which means we need to feed it some data first to train it. Once it is trained, we need to test it on a different set of data to evaluate it. To do this, we need to split our dataset into a training set and a testing set. A common rule of thumb is to assign 80% of the dataset to training and 20% to testing.

In [5]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; the split is random, so results vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


At this point, we have four different arrays:

• X_train - independent variable (training set)
• X_test - independent variable (testing set)
• y_train - dependent variable (training set)
• y_test - dependent variable (testing set)
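
Conceptually, an 80/20 split shuffles the rows and then slices off the testing share. The sketch below shows that idea in plain Python; it is not scikit-learn's implementation, and `simple_train_test_split` and its `seed` parameter are names invented for this example. Fixing the seed makes the split repeatable (in scikit-learn, the `random_state` argument of `train_test_split` serves that purpose).

```python
import random


def simple_train_test_split(rows, labels, test_size=0.2, seed=0):
    """Shuffle row indices with a seeded RNG, then slice off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Placeholder data with the same number of rows as Iris (values are made up)
rows = [[i, i * 2] for i in range(150)]
labels = [i % 3 for i in range(150)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(rows, labels)
print(len(X_tr), len(X_te))  # 120 30
```

Because features and labels are selected with the same shuffled indices, each row stays paired with its own label.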

Building the model

This is the step where we actually build our model.

In [6]:
from sklearn.neighbors import KNeighborsClassifier

# Create classifier using K Neighbors Classifier with k=5
classifier = KNeighborsClassifier(n_neighbors=5)

# Training our model
classifier.fit(X_train, y_train)

# Predicting values using our trained model
y_pred = classifier.predict(X_test)


Evaluating the model

We now have our predicted values, y_pred, and our actual values, y_test. Let's look at how well we predicted.

In [7]:
# Generating a confusion matrix
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)

print(conf_matrix)

[[ 8  0  0]
 [ 0 12  2]
 [ 0  0  8]]


Let's calculate the accuracy score, one of the most widely used metrics for classification models:

In [8]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.933333333333
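
This score can be read straight off the confusion matrix printed earlier. In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes, so the diagonal holds the correct predictions. A short sketch that redoes the arithmetic by hand (the matrix values are copied from the output above):

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
conf_matrix = [
    [8, 0, 0],
    [0, 12, 2],
    [0, 0, 8],
]

total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))
accuracy = correct / total
print(correct, total, accuracy)

# Recall per class: correct predictions for a class / actual members of that class
recall = [conf_matrix[i][i] / sum(conf_matrix[i]) for i in range(len(conf_matrix))]
print(recall)
```

So 28 of the 30 test samples were labeled correctly, and both mistakes come from the second class, where 2 of its 14 samples were predicted as the third class.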


With a 93.3% accuracy score, I would say our model is quite accurate. We can improve the performance further by tuning hyperparameters such as the value of k we selected when building our model. A good way to decide which values to pick for our hyperparameters is cross-validation, which we will cover in a later post.
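
As a brief preview, here is one way comparing a few values of k with 5-fold cross-validation might look. This is a sketch under assumptions: it loads scikit-learn's bundled copy of Iris via `load_iris` instead of iris.csv, and the candidate values of k are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Bundled copy of the Iris data (assumption: used here instead of iris.csv)
X, y = load_iris(return_X_y=True)

# Mean accuracy over 5 folds for each candidate k (candidates chosen arbitrarily)
mean_scores = {}
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```

Each candidate is scored on several train/test splits instead of one, which gives a steadier estimate than a single 80/20 split.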

I hope you found this post useful. You can download this code from my GitHub.