Implementing k-nearest neighbors in Python
Last time, we looked into one of the simplest classification algorithms in machine learning called binomial logistic regression. In this post, I am going to cover another common classification algorithm called K Nearest Neighbors, otherwise known as KNN.
To recap, we have mostly discussed regression models such as simple and multivariate linear regression and polynomial regression, which are used for predicting a quantity. On the other hand, classification models are used for predicting a category such as yes/no, will buy a car/scooter/truck, will turn pink/green/red, etc.
KNN is another such classification model and once I am done explaining how it works, you will understand where it gets its name. KNN is a supervised learning algorithm that follows four simple steps to predict a class (a quick from-scratch sketch follows the steps below):
- Pick a value for k – this is the number of neighbors you will consider for each new data point you want to predict a class for.
- Pick the neighbors – calculate the distance from the new data point to the rest of the data points. The distance can be Euclidean distance, Manhattan distance or something else.
- Identify categories – pick the k data points that are closest to the new data point and identify which category each of them belongs to.
- Label – label the new data point with the category that most of its k nearest neighbors belong to.
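To make these steps concrete, here is a minimal from-scratch sketch of the algorithm. This is purely illustrative: the helper names euclidean and knn_predict are my own, and later in this post we will use scikit-learn's implementation instead.

from collections import Counter
import math

def euclidean(a, b):
    # Step 2: Euclidean distance between two points
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(training_data, new_point, k=5):
    # training_data is a list of (features, label) pairs
    # Step 2: distance from the new point to every training point
    distances = [(euclidean(features, new_point), label)
                 for features, label in training_data]
    # Step 3: keep the k closest points and note their categories
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Step 4: label the new point with the majority category among the neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

For example, knn_predict([((1, 1), 'a'), ((5, 5), 'b'), ((1, 2), 'a')], (0, 0), k=3) returns 'a', because two of the three nearest points belong to category 'a'.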
Key points to remember about the KNN algorithm:
- KNN doesn't make any assumptions about the underlying dataset, unlike some other popular machine learning algorithms. This means KNN can be applied to a wider variety of datasets without worrying about whether such assumptions hold.
- KNN computes its results on the go. It memorizes the training dataset and only uses it to label new data points at prediction time, which is why it is sometimes called a "lazy learner".
- KNN doesn't scale well. Because every prediction requires computing the distance to every training point, it can be very slow for large datasets.
Exploring the dataset
For this example, we are going to use one of the most popular datasets in machine learning, the Iris dataset. Some of you might have heard of this dataset already. This data was published by Fisher in 1936. It contains 4 features and a label. The features are:
- sepal length
- sepal width
- petal length
- petal width
and the label is the type of iris plant.
Our task is to use these four features to correctly predict the type of iris plant.
More information about this dataset can be found here.
# Let's load the data into Python and take a look at it
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

dataset = pd.read_csv(r'iris.csv')
dataset.head()
# Let's modify the class values so that they don't have the "Iris-" prefix
dataset['class'] = [x.split('-')[-1] for x in dataset['class']]
dataset.head()
# Let's get some more information about our dataset
dataset.describe()
We can see there are 150 rows. The mean values of sepal_length, sepal_width, petal_length and petal_width are 5.84 cm, 3.05 cm, 3.76 cm and 1.20 cm, respectively.
We don't see information about our label because its values are not quantitative.
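The label column can still be summarized separately, though. A quick way (just a sketch) is to count the rows per class; for Iris, this should show 50 rows of each type:

# Count how many rows belong to each class
dataset['class'].value_counts()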
Preprocessing the dataset
Iris is a very standard dataset and doesn't have any missing or null values, so we don't need to do any preprocessing.
The only thing we need to do is split the dataset into the independent variables (the four features) and the dependent variable (the label) so that we can feed them to our machine learning model.
Keep in mind that life is never this easy: data preprocessing is a crucial step in the machine learning process, and often the most time-consuming one.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
Splitting the dataset into training and testing set
KNN is a supervised learning model (see my earlier post on supervised and unsupervised models), which means we need to feed it some data first to train it. And once it is trained, we need to test it on a different set of data to evaluate it. To do this, we split our dataset into a training set and a testing set. A common rule of thumb is to assign 80% of the dataset to training and 20% to testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
At this point, we have four different arrays:
- X_train - independent variable (training set)
- X_test - independent variable (testing set)
- y_train - dependent variable (training set)
- y_test - dependent variable (testing set)
from sklearn.neighbors import KNeighborsClassifier

# Create classifier using K Neighbors Classifier with k=5
classifier = KNeighborsClassifier(n_neighbors=5)

# Training our model
classifier.fit(X_train, y_train)

# Predicting values using our trained model
y_pred = classifier.predict(X_test)
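Once trained, the classifier can also label a brand-new measurement. For example, with made-up values for a single flower:

# Classify one new flower: [sepal length, sepal width, petal length, petal width]
print(classifier.predict([[5.1, 3.5, 1.4, 0.2]]))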
Evaluating the model
We now have our predicted values, y_pred, and our actual values, y_test. Let's look at how well we predicted.
# Generating a confusion matrix
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
[[ 8  0  0]
 [ 0 12  2]
 [ 0  0  8]]
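To read this matrix: each row is an actual class and each column is a predicted class, with the classes sorted alphabetically (setosa, versicolor, virginica). Correct predictions sit on the diagonal, so we can compute the accuracy by hand from the conf_matrix array above:

# correct predictions / all predictions = (8 + 12 + 8) / 30 ≈ 0.933
print(conf_matrix.trace() / conf_matrix.sum())

Also note that train_test_split shuffles the data randomly, so your exact counts may differ from mine.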
Let's calculate the accuracy score, which is the most popular metric for classification models:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))
With a 93.3% accuracy score, I would say our model is quite accurate. We can improve performance further by fine-tuning the hyperparameters, such as the value of k we selected earlier to build our model. A great way to decide which values to pick for our hyperparameters is known as cross-validation, which we will cover in a different post later.
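As a preview, here is a minimal sketch of that idea, reusing the X_train and y_train arrays from above: loop over candidate values of k and keep the one with the best cross-validated accuracy.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Evaluate odd values of k with 5-fold cross-validation on the training set
for k in range(1, 21, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5)
    print(f"k={k}: mean accuracy={scores.mean():.3f}")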
Hope you found this post useful. You can download this code from my GitHub.