Implementing k-nearest neighbors in Python
Last time, we looked into one of the simplest classification algorithms in machine learning called binomial logistic regression. In this post, I am going to cover another common classification algorithm called K Nearest Neighbors, otherwise known as KNN.
To recap, we have mostly discussed regression models such as simple and multivariate linear regression and polynomial regression, which are used for predicting a quantity. On the other hand, classification models are used for predicting a category such as yes/no, will buy a car/scooter/truck, will turn pink/green/red, etc.
KNN is another such classification model and once I am done explaining how it works, you will understand where it gets its name. KNN is a supervised learning algorithm that follows four simple steps to predict a class (a quick from-scratch sketch follows the steps below):
- Pick a value for k – this is the number of neighbors you will consider for each new data point you want to predict a class for.
- Pick the neighbors – calculate the distance from the new data point to the rest of the data points. The distance can be Euclidean distance, Manhattan distance or something else.
- Identify categories – pick the k data points that are closest to the new data point and identify which category each of them belongs to.
- Label – label the new data point with the category that most of its k nearest neighbors belong to.
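To make these steps concrete, here is a minimal from-scratch sketch of the algorithm. This is purely illustrative: the helper names euclidean and knn_predict are my own, and later in this post we will use scikit-learn's implementation instead.

from collections import Counter
import math

def euclidean(a, b):
    # Step 2: Euclidean distance between two points
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(training_data, new_point, k=5):
    # training_data is a list of (features, label) pairs
    # Step 2: distance from the new point to every training point
    distances = [(euclidean(features, new_point), label)
                 for features, label in training_data]
    # Step 3: keep the k closest points and note their categories
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Step 4: label the new point with the majority category among the neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

For example, knn_predict([((1, 1), 'a'), ((5, 5), 'b'), ((1, 2), 'a')], (0, 0), k=3) returns 'a', because two of the three nearest points belong to category 'a'.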
Key points to remember about the KNN algorithm:
- KNN doesn't make any assumptions about the underlying dataset, unlike some other popular machine learning algorithms. This means KNN can be applied to a wider variety of datasets without worrying about whether such assumptions hold.
- KNN computes its results on the go. It memorizes the training dataset and only uses it to label new data points at prediction time, which is why it is sometimes called a "lazy learner".
- KNN doesn't scale well. Because every prediction requires computing the distance to every training point, it can be very slow for large datasets.
Exploring the dataset
For this example, we are going to use one of the most popular datasets in machine learning, the Iris dataset. Some of you might have heard of this dataset already. This data was published by Fisher in 1936. It contains 4 features and a label. The features are:
- sepal length
- sepal width
- petal length
- petal width
and the label is the type of iris plant.
Our task is to use these four features to correctly predict the type of iris plant.
More information about this dataset can be found here.
# Let's load the data into Python and take a look at it
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

dataset = pd.read_csv(r'iris.csv')
dataset.head()
# Let's modify the class values so that they don't have the "Iris-" prefix
dataset['class'] = [x.split('-')[-1] for x in dataset['class']]
dataset.head()
# Let's get some more information about our dataset
dataset.describe()
We can see there are 150 rows. The mean values of sepal_length, sepal_width, petal_length and petal_width are 5.84 cm, 3.05 cm, 3.76 cm and 1.20 cm, respectively.
We don't see information about our label because its values are not quantitative.
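The label column can still be summarized separately, though. A quick way (just a sketch) is to count the rows per class; for Iris, this should show 50 rows of each type:

# Count how many rows belong to each class
dataset['class'].value_counts()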
Preprocessing the dataset
Iris is a very standard dataset and doesn't have any missing or null values, so we don't need to do any preprocessing.
The only thing we need to do is split the dataset into the independent variables (the four features) and the dependent variable (the label) so that we can feed them to our machine learning model.
Keep in mind that life is never this easy: data preprocessing is a crucial step in the machine learning process, and often the most time-consuming one.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
Splitting the dataset into training and testing set
KNN is a supervised learning model (see my earlier post on supervised and unsupervised models), which means we need to feed it some data first to train it. And once it is trained, we need to test it on a different set of data to evaluate it. To do this, we split our dataset into a training set and a testing set. A common rule of thumb is to assign 80% of the dataset to training and 20% to testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
At this point, we have four different arrays:
- X_train - independent variable (training set)
- X_test - independent variable (testing set)
- y_train - dependent variable (training set)
- y_test - dependent variable (testing set)
from sklearn.neighbors import KNeighborsClassifier

# Create classifier using K Neighbors Classifier with k=5
classifier = KNeighborsClassifier(n_neighbors=5)

# Training our model
classifier.fit(X_train, y_train)

# Predicting values using our trained model
y_pred = classifier.predict(X_test)
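Once trained, the classifier can also label a brand-new measurement. For example, with made-up values for a single flower:

# Classify one new flower: [sepal length, sepal width, petal length, petal width]
print(classifier.predict([[5.1, 3.5, 1.4, 0.2]]))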
Evaluating the model
We now have our predicted values, y_pred, and our actual values, y_test. Let's look at how well we predicted.
# Generating a confusion matrix
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
[[ 8  0  0]
 [ 0 12  2]
 [ 0  0  8]]
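To read this matrix: each row is an actual class and each column is a predicted class, with the classes sorted alphabetically (setosa, versicolor, virginica). Correct predictions sit on the diagonal, so we can compute the accuracy by hand from the conf_matrix array above:

# correct predictions / all predictions = (8 + 12 + 8) / 30 ≈ 0.933
print(conf_matrix.trace() / conf_matrix.sum())

Also note that train_test_split shuffles the data randomly, so your exact counts may differ from mine.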
Let's calculate the accuracy score, which is the most popular metric for classification models:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))
With a 93.3% accuracy score, I would say our model is quite accurate. We can improve performance further by fine-tuning the hyperparameters, such as the value of k we selected earlier to build our model. A great way to decide which values to pick for our hyperparameters is known as cross-validation, which we will cover in a different post later.
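As a preview, here is a minimal sketch of that idea, reusing the X_train and y_train arrays from above: loop over candidate values of k and keep the one with the best cross-validated accuracy.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Evaluate odd values of k with 5-fold cross-validation on the training set
for k in range(1, 21, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5)
    print(f"k={k}: mean accuracy={scores.mean():.3f}")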
Hope you found this post useful. You can download this code from my GitHub.