Feature encoding in python using scikit-learn

Note: You can now subscribe to my blog updates here to receive latest updates.

A key step in applying machine learning models to your data is feature encoding. In this post, we are going to discuss what that consists of and how to do it in python using scikit-learn.

Not all the fields in your dataset will be numerical. Many times you will have at least one non-numerical feature, which is also known as a categorical feature. For example, your dataset might have a feature called ‘ethnicity’ to describe the ethnicity of employees at a company. Similarly, you can also have a categorical dependent variable if you are dealing with a classification problem where your dataset is used to predict a class instead of a number (regression).

For example, let’s look at a famous machine learning dataset called Iris. This dataset has 4 numerical features: sepal length, sepal width, petal length and petal width. The output is a type of species which can be one of these three classes: setosa, versicolor and virginica.
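To make the idea concrete, here is a minimal plain-python sketch of two common encodings applied to the species labels. The sample list is made up for illustration, and sorting the classes before numbering them is a choice of this sketch. In scikit-learn, the LabelEncoder and OneHotEncoder classes in sklearn.preprocessing are the usual tools for this job.

```python
# Hypothetical sample of Iris species labels (the categorical output).
species = ['setosa', 'versicolor', 'virginica', 'setosa', 'virginica']

# Label encoding: map each distinct class to an integer, in sorted order.
classes = sorted(set(species))
class_to_int = {name: i for i, name in enumerate(classes)}
label_encoded = [class_to_int[name] for name in species]

# One-hot encoding: one 0/1 column per class, with a single 1 per row.
one_hot_encoded = [[1 if name == c else 0 for c in classes] for name in species]

print(classes)          # ['setosa', 'versicolor', 'virginica']
print(label_encoded)    # [0, 1, 2, 0, 2]
print(one_hot_encoded)
```

Label encoding is compact but implies an ordering between classes, while one-hot encoding avoids that at the cost of extra columns.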


Difference between supervised and unsupervised learning models

In my introductory post about machine learning (ML), I listed a bunch of ML models by their output (regression, classification and clustering). These models can be classified differently as either supervised or unsupervised learning models.

Supervised learning models

In a supervised learning model, your data consists of independent variable(s) and dependent variable(s). You build your model by feeding it this data. The goal is to have a model which can take values of your independent variable and accurately predict corresponding values of the dependent variable.

These types of models are called supervised learning models because your training dataset is labeled with the right ‘answers’ for the model to learn from. You can say that you are supervising the model during its learning process.
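As a rough illustration of learning from labeled data, the sketch below uses a 1-nearest-neighbour rule: it predicts the label of whichever labeled example is closest to the new input. The measurements and the predict function are hypothetical and chosen only for illustration. This is one simple supervised technique, not the only one.

```python
# Hypothetical labeled data: (petal length, petal width) -> species.
# The first element is the independent variables, the second is the label.
training_data = [
    ((1.4, 0.2), 'setosa'),
    ((1.3, 0.2), 'setosa'),
    ((4.7, 1.4), 'versicolor'),
    ((4.5, 1.5), 'versicolor'),
    ((6.0, 2.5), 'virginica'),
    ((5.9, 2.1), 'virginica'),
]


def predict(features):
    """Return the label of the closest labeled example (1-nearest-neighbour)."""
    def squared_distance(example):
        point, _label = example
        return sum((a - b) ** 2 for a, b in zip(point, features))

    _point, label = min(training_data, key=squared_distance)
    return label


print(predict((1.5, 0.3)))  # setosa
print(predict((4.6, 1.3)))  # versicolor
```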


A brief overview of machine learning

A few years ago, ‘big data’ was the latest buzzword. Over the last year or two, we have moved on to ‘machine learning’ (and blockchain). Machine learning (ML) is nothing new. It has been used for at least a few decades, but only recently has it become accessible enough to be used in the mainstream. Only recently have ML models been made so easily available to everyone through open source libraries. And only recently has computer processing power become cheap enough that deploying computation-heavy ML models is affordable. There has never been a better time to learn ML and start using it!

I am no ML expert. Maybe one day I can become one but for now, I am simply an ML enthusiast. One of the main issues with getting started with ML is that it sounds very sophisticated and by all means, it is! ML consists of a lot of complex models that have been optimized over several decades. Unless you have a strong background in mathematics, you will find it extremely difficult to understand the inner workings of these models. But don’t let that intimidate you! Thanks to all the recent development in open source software, most of the ML models are easily available to be used. All you need to do is understand how the models work and when to use them! Instead of getting bogged down by the mathematical details, try to focus on high level theory and how to easily apply ML models. Once you build your basic understanding, you can pick which model(s) you want to explore further.


Python API for getting market and financial data from IEX

Most of you have probably heard about IEX: The Investors Exchange. IEX is the exchange started by Brad Katsuyama, who was the protagonist of Michael Lewis’s famous book Flash Boys (review). Just last year, IEX scored a major win when the SEC approved its application to register as a national securities exchange. As time passes, IEX continues to gain more and more market share.

Just like any other exchange, one of IEX’s most valuable assets is the market data generated by all the trading. However, unlike other exchanges, IEX makes its data available to the public for free via a web API. On February 22, 2017, IEX wrote a blog post announcing the release of its web API. Since then, IEX has made quite a few enhancements and added support for newer datasets as well.

As of today, some of the data that IEX provides includes:

  • pricing data (latest trade and quote data as well as summary data going back up to 5 years),
  • reference data,
  • news data,
  • earnings data, and
  • financial data.


Book Review: Efficiently Inefficient by Lasse Heje Pedersen


One of the first few books that I read about asset management firms and hedge funds was Inside the Black Box (Review | Amazon) by Rishi Narang. It was a great high-level book which covered all the different components of a hedge fund such as market data, backtester, order management system, risk models, portfolio management, transaction cost analysis etc. As I continued to learn more about the industry, I wanted to read a more in-depth book that covered more than just the basics. Efficiently Inefficient by Lasse Pedersen is such a book, and it covers a variety of topics on hedge funds.


The book is divided into four sections:

  1. Active investment
  2. Equity strategies
  3. Asset allocation and Macro strategies
  4. Arbitrage strategies


Understanding sets in python

As I learn more and more about python’s different data types, I find myself surprised that not enough people use (or even know) sets. At my job, I am often taking some data and transforming it. Once transformed, I have to do analysis on how the data may have changed and sets are great for such comparisons.

In this post, I will cover how to create sets and show some examples on how to use them.

What is a set?
A set is an unordered collection of unique items in python. They are sort-of like lists but they only contain unique items and don’t maintain order. They also have a lot of helpful unique operations.
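Here is a small sketch of the kind of before-and-after comparison described above. The ticker lists and variable names are made up for illustration.

```python
# Hypothetical data before and after a transformation.
before = ['AAPL', 'MSFT', 'IBM', 'AAPL']
after = ['MSFT', 'IBM', 'GOOG']

# Building a set drops duplicates ('AAPL' appears only once).
before_set = set(before)
after_set = set(after)

removed = before_set - after_set     # items only in 'before'
added = after_set - before_set       # items only in 'after'
unchanged = before_set & after_set   # items in both
everything = before_set | after_set  # items in either

# Sets are unordered, so sort before printing for a stable display.
print(sorted(removed))     # ['AAPL']
print(sorted(added))       # ['GOOG']
print(sorted(unchanged))   # ['IBM', 'MSFT']
print(sorted(everything))  # ['AAPL', 'GOOG', 'IBM', 'MSFT']
```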


Book Review: The WSJ Complete Money and Investing Guidebook

If you had asked me to discuss bonds on the first day of my first job, I would probably have started talking about the ionic or covalent bonds that I had learnt about in my high school chemistry class. I knew close to nothing about finance and financial securities. Derivatives only reminded me of differentiation and integration. Options were of no significance and I was clueless about trade and quote data. Now that I look back, I wish someone had given me The WSJ Complete Money and Investing Guidebook (referred to as CM&IG henceforth) as the first thing to read.

(Side note: I was given a book to read by my boss and it was q for mortals, so I could learn q.)

This ~200 page book is written by Dave Kansas and covers every asset class and the major investment vehicles. It never goes too deep into any one topic, which I really like. It also assumes that you have barely any knowledge of the financial markets and investing.


Understanding list, set and dict comprehensions

Just a few days ago, you were having a good time with your friends and counting down to 2017. A few days have passed and you are left with a typical cold, snowy day in January. You are busy writing code for a high profile project at work. Suddenly, a situation arises where you need to create a new list from an existing one. You code it like you have always been coding:

>>> old = ['adam', 'mike', 'olga']
>>> new = []
>>> for name in old:
...     new.append(name + ' last')
...
>>> new
['adam last', 'mike last', 'olga last']

But then you realize that one of your 5 new year’s resolutions is to start using list comprehensions! You have heard about them but were always a little intimidated by them. You were also not really sure of their point.

This post will help you with your new year’s resolution. However, it won’t help you with the other one about going to gym 4 times a week.
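As a quick sketch, the loop above collapses into a single list comprehension, and the same syntax extends to dicts and sets. The name_lengths and first_letters examples are made up for illustration.

```python
old = ['adam', 'mike', 'olga']

# List comprehension: build the new list in one expression.
new = [name + ' last' for name in old]

# Dict comprehension: map each name to its length.
name_lengths = {name: len(name) for name in old}

# Set comprehension: collect the unique first letters.
first_letters = {name[0] for name in old}

print(new)                    # ['adam last', 'mike last', 'olga last']
print(name_lengths)           # {'adam': 4, 'mike': 4, 'olga': 4}
print(sorted(first_letters))  # ['a', 'm', 'o']
```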


2016: Year in Review

I started EnlistQ in December 2014 as a way for me to express my thoughts and to reinforce technical concepts that I was learning (on my job or elsewhere). This is the first time I have thought of doing a review of my year. It probably has to do with the fact that a lot has changed this year – EnlistQ, my personal and professional life, and the world! All these changes convinced me to take a step back and look at what I have accomplished this year.

Let’s begin!


10 python idioms to help you improve your code

If you have ever tried to learn a new language (not a programming language), you know that you always think in your native language before translating to the new one. This can lead you to form sentences that don’t make sense in the new language but are perfectly normal in your native language. For example, in a lot of languages, you ‘open’ an electronic gadget such as a fan, AC or cell phone. When you say that in English, it means to literally open the gadget instead of turning it on.

The same is true for programming languages. As we pick up new languages, such as python, we are using our prior knowledge of programming in another language (q, java, c++ etc) and translating that to python. Many times, your code will work but it won’t be ‘pretty’ or fast. In python terms, your code won’t be ‘pythonic’.

In this post, I would like to cover some python idioms that can be very helpful. These idioms will:

  1. Help your code look better,
  2. Speed up your code, and
  3. Set you apart from beginners

Let’s begin!

Note: All examples are written in python 2.

Update: Thanks to Diane and my other readers for pointing out some errors in my examples!
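To give a flavour of what ‘pythonic’ means, here is a small sketch of two widely used idioms. They are illustrations of the general idea and not necessarily among the ten covered in the post. The code is written to run unchanged on both python 2 and python 3.

```python
names = ['adam', 'mike', 'olga']

# Translated from another language: loop over indices, then look up items.
indexed_loop = []
for i in range(len(names)):
    indexed_loop.append((i, names[i]))

# Pythonic: enumerate yields (index, item) pairs directly.
indexed_enumerate = list(enumerate(names))

# Pythonic swap: tuple unpacking, no temporary variable needed.
a, b = 1, 2
a, b = b, a

print(indexed_enumerate)  # [(0, 'adam'), (1, 'mike'), (2, 'olga')]
```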