Note: You can now subscribe to my blog updates here to receive latest updates.
A key step in applying machine learning models to your data is feature encoding and in this post, we are going to discuss what that consists of and how we can do that in python using scikit-learn.
Not all the fields in your dataset will be numerical. Many times you will have at least one non-numerical feature, which is also known as a categorical feature. For example, your dataset might have a feature called ‘ethnicity’ to describe the ethnicity of employees at a company. Similarly, you can also have a categorical dependent variable if you are dealing with a classification problem where your dataset is used to predict a class instead of a number (regression).
For example, let’s look at a famous machine learning dataset called Iris. This dataset has 4 numerical features: sepal length, sepal width, petal length and petal width. The output is a type of species which can be one of these three classes: setosa, versicolor and virginica.