Feature encoding in Python using scikit-learn

A key step in applying machine learning models to your data is feature encoding. In this post, we are going to discuss what it consists of and how we can do it in Python using scikit-learn.

Not all the fields in your dataset will be numerical. Often you will have at least one non-numerical feature whose values come from a fixed set of labels; this is known as a categorical feature. For example, your dataset might have a feature called ‘ethnicity’ to describe the ethnicity of employees at a company. Similarly, you can have a categorical dependent variable if you are dealing with a classification problem, where the model predicts a class, rather than a regression problem, where it predicts a number.

For example, let’s look at a famous machine learning dataset called Iris. This dataset has 4 numerical features: sepal length, sepal width, petal length and petal width. The output is the species, which can be one of three classes: setosa, versicolor and virginica.

Loading the Iris dataset

In [55]:
# Importing pandas
import pandas as pd

# Importing the dataset
dataset = pd.read_csv(r'/Users/himanshugupta/Documents/iris.csv')

# Show the first five rows
dataset.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
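
If you don't have the CSV file on disk, one option is the copy of Iris that ships with scikit-learn. The sketch below is only an illustration: it assumes scikit-learn 0.23 or later (for as_frame=True), and the column names are set by hand to mirror the CSV layout shown above.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled Iris data as pandas objects (as_frame=True needs scikit-learn >= 0.23)
iris = load_iris(as_frame=True)

# Rename the columns to mirror the CSV used in this post (an assumption about that file)
dataset = iris.data.copy()
dataset.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Map the integer targets (0, 1, 2) back to the species names
dataset['species'] = iris.target_names[iris.target.to_numpy()]

print(dataset.head())
```

The rest of the post should work the same way with this DataFrame, since it has the same five columns.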

Let's reduce the dataset to a random sample of 10 rows so it's easier to inspect. We can see the different values of species by grouping the data by 'species' and getting the count.

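In [56]:
# Keep a random sample of 10 rows
dataset = dataset.sample(10)

# Count the rows for each species
dataset.groupby('species').count()
            sepal_length  sepal_width  petal_length  petal_width
species
setosa                 4            4             4            4
versicolor             3            3             3            3
virginica              3            3             3            3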

Most machine learning algorithms work on numbers and cannot use categories directly. For example, to them, 'setosa' means nothing. So, how do we get these algorithms to analyze our dataset? One way is to encode the data: we enumerate the classes so that each one is represented by a number. For example, we can assign 0, 1 and 2 to our three classes. While the algorithms do not know what these numbers represent, we do, because we keep the original mapping: 0 represents setosa, 1 represents versicolor and 2 represents virginica.

The example below shows how to do that.

In [57]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
species = dataset['species']

# Learn the classes and replace each label with its integer code
species_encoded = encoder.fit_transform(species)
species_encoded
array([0, 1, 0, 2, 1, 2, 0, 1, 2, 0])
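
To check which number stands for which class, the fitted encoder keeps the mapping around. Here is a small sketch using a hypothetical four-item list in place of the 'species' column; classes_ and inverse_transform are part of LabelEncoder, while the variable names are mine.

```python
from sklearn.preprocessing import LabelEncoder

# A small stand-in for the 'species' column (hypothetical values)
species = ['setosa', 'versicolor', 'setosa', 'virginica']

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the labels in sorted order; the position of each label is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels
decoded = [str(label) for label in encoder.inverse_transform(codes)]
print(decoded)
```

Because the classes are sorted before codes are assigned, setosa gets 0, versicolor gets 1 and virginica gets 2.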

We still have one problem. Our algorithm is stupid: if we give it these numbers, it is going to think that because 1 is greater than 0 and 2 is greater than 1, there must be some order or distance relationship between the classes. However, we know that the classes have no such ordering. The numerical values are arbitrary labels; we could just as easily have picked other numbers and it shouldn't make a difference. Sadly, the machines are not that smart (yet).
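
To make the problem concrete, here is a tiny illustration in plain Python. The helper code_distance is hypothetical and exists only to show what a model that treats the codes as magnitudes would "see".

```python
# Integer codes for the three classes, as in the enumeration above
codes = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

def code_distance(a, b):
    """Absolute difference between the integer codes of two classes."""
    return abs(codes[a] - codes[b])

near = code_distance('setosa', 'versicolor')
far = code_distance('setosa', 'virginica')

# The codes suggest virginica is twice as far from setosa as versicolor is,
# which is an artifact of the numbering and not a property of the flowers
print(near, far)
```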

To avoid this problem, we need to create a new column for each class that represents whether a flower belongs to that class or not. For example, for setosa, we will have is_setosa: if the class is setosa, we assign '1' to is_setosa, and '0' if it is not. We then do the same for the other two classes, which means we convert our 'species' column into 3 columns. This process is called 'one hot encoding'.
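
Before reaching for a library, the idea can be sketched by hand with pandas. This is only an illustration on a hypothetical four-row column; the is_ prefix follows the naming used in the paragraph above.

```python
import pandas as pd

# A small stand-in for the 'species' column (hypothetical values)
species = pd.Series(['setosa', 'versicolor', 'setosa', 'virginica'], name='species')

# One column per class: 1 where the row belongs to that class, 0 otherwise
one_hot = pd.DataFrame({
    'is_' + label: (species == label).astype(int)
    for label in sorted(species.unique())
})

print(one_hot)
```

Each row has exactly one '1', which is where the name 'one hot' comes from. pandas also provides pd.get_dummies, which builds similar indicator columns in a single call.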

To do this with scikit-learn, we are going to take the encoding from the earlier example and transform it a bit more. Here is an example of how to do that. As you can see, our final output has three columns and only uses '1' and '0'.

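In [58]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()

# OneHotEncoder expects a 2D array, so reshape the codes into a single column
species_one_hot = encoder.fit_transform(species_encoded.reshape(-1,1))
species_one_hot
<10x3 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>
In [59]:
# Convert the sparse matrix to a dense array and show the first five rows
species_one_hot.toarray()[:5]
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.]])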
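
The first output above is a SciPy sparse matrix, which stores only the nonzero entries; that is why it reports 10 stored elements for a 10x3 result. The sketch below shows the same behavior on a hypothetical array of four codes, relying on the encoder's default sparse output.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Integer codes standing in for species_encoded (hypothetical values)
codes = np.array([0, 1, 0, 2])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(codes.reshape(-1, 1))

# Only the nonzero entries are stored: one per row
print(one_hot.shape, one_hot.nnz)

# Convert to a regular NumPy array when a dense view is more convenient
dense = one_hot.toarray()
print(dense)
```

For a column with only three classes the saving is small, but with many distinct categories most entries are zero and the sparse format can use far less memory.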

Wouldn't it be great if we could do all this directly, without having to first enumerate our data and then use one hot encoding? It turns out that you can, by simply using the LabelBinarizer class.

Here is an example of how to use LabelBinarizer.

In [60]:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

# Go from the string labels to one hot columns in a single step
species_one_hot = encoder.fit_transform(dataset.species)
In [61]:
species_one_hot
array([[1, 0, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])
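
As with LabelEncoder, the fitted LabelBinarizer remembers which column belongs to which class. A short sketch on a hypothetical four-item list, using the classes_ attribute and inverse_transform method that LabelBinarizer provides:

```python
from sklearn.preprocessing import LabelBinarizer

# A small stand-in for dataset.species (hypothetical values)
species = ['setosa', 'versicolor', 'setosa', 'virginica']

encoder = LabelBinarizer()
one_hot = encoder.fit_transform(species)

# classes_ gives the label for each column, in column order
columns = [str(label) for label in encoder.classes_]
print(columns)

# inverse_transform maps each one hot row back to its label
decoded = [str(label) for label in encoder.inverse_transform(one_hot)]
print(decoded)
```

One thing to keep in mind: when there are only two classes, LabelBinarizer returns a single 0/1 column instead of two columns.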

Feature encoding is a required step whenever you are dealing with categorical data. Hopefully, this post gave you a good understanding of what feature encoding is and how you can apply it to your dataset.

You can download this code from my GitHub.
