Setting up Apache Spark on an AWS EC2 instance

I am currently learning Apache Spark and how to use it for in-memory analytics as well as machine learning (ML). Scikit-learn is a great library for ML, but when you want to deploy an ML model in production to analyze billions of rows (‘big data’), you want to work with a technology or framework, such as Hadoop, that supports distributed computing.

Apache Spark is an open-source engine that can run on top of Hadoop and provides a significant improvement over native Hadoop MapReduce operations thanks to its support for in-memory computing. Spark also has a very nice API available in Scala, Java, Python and R, which makes it easy to use. Of course, I will be focusing on Python since that’s the language I am most familiar with.

Moreover, when working with a distributed computing system, you want it running on a cloud platform such as AWS, Azure or Google Cloud, which allows you to scale your cluster flexibly. For example, if you had to quickly analyze billions of rows, you could spin up a bunch of EC2 instances with Spark running and run your ML models on the cluster. Once you are done, you can easily terminate the instances.

In this blog post, I will be showing you how to spin up a free instance of AWS Elastic Compute Cloud (EC2) and install Spark on it. Let’s get started!


Implementing a Multivariate Linear Regression model in Python

Earlier, I wrote about how to implement a simple linear regression (SLR) model in Python. SLR is probably the easiest model to implement among the most popular machine learning algorithms. In this post, we are going to take it one step further: instead of working with just one independent variable, we will be working with multiple independent variables. Such a model is called a multivariate linear regression (MLR) model.

How does the model work?

A multivariate linear model can be described by a linear equation consisting of multiple independent variables.

For example:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

In this equation, the β (beta) terms are the coefficients, the x terms are the independent variables and y is the dependent variable.

An SLR model is a simplified version of an MLR model where there is only one x. Linear regression models use a technique called Ordinary Least Squares (OLS) to find the optimum values for the betas. OLS calculates the error for each observation, which is the difference between the predicted value and the actual value, and then squares it. The goal is to find the betas that minimize the sum of the squared errors.
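
As a rough sketch of the idea (not necessarily how the full post implements it), the snippet below fits an MLR model with NumPy's least-squares solver. The toy data is made up for illustration: it is generated from y = 1 + 2·x₁ + 3·x₂ with no noise, so the fitted betas should come out close to 1, 2 and 3.

```python
import numpy as np

# Hypothetical toy data: 5 observations, 2 independent variables.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Generated from y = 1 + 2*x1 + 3*x2 with no noise (an assumption for the demo).
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first beta acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# np.linalg.lstsq returns the betas that minimize the sum of squared errors.
betas, *_ = np.linalg.lstsq(X_design, y, rcond=None)

predictions = X_design @ betas
sse = float(np.sum((y - predictions) ** 2))

print("betas:", np.round(betas, 6))
print("sum of squared errors:", round(sse, 6))
```

With real, noisy data the sum of squared errors would not be zero, but the betas returned would still be the ones that make it as small as possible.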

If you want to learn more about SLR and OLS, I highly recommend this visual explanation.


q/kdb+ API for getting market and financial data from IEX

A few months ago, I wrote an API for getting market and financial data from IEX in Python. As discussed earlier, IEX makes a lot of its data available to the public through its web service API (link).

In this post, I will show you how to use the API I wrote in q/kdb+. Let’s get started.

You can find the code here.


Use this function to get last trade data (price and size) for one or more securities.

sym  price  size time
AAPL 174.66 100  2017.11.10D20:59:58.008999936
IBM  149.18 300  2017.11.10D20:59:59.724999936 


Implementing Simple Linear Regression Model in Python

So far, I have discussed some of the theory behind machine learning algorithms and shown you how to perform vital data preprocessing steps such as feature scaling and feature encoding. We are now ready to start with the simplest machine learning algorithm, which is simple linear regression (SLR).

Remember how, back in school, you would collect data for one of your science lab experiments and then use it to predict some values by plotting your data in Microsoft Excel and drawing a line of best fit through it? That line of best fit is the outcome of an SLR model. An SLR model is a linear model which assumes that two variables (the independent and dependent variables) exhibit a linear relationship. A linear model with multiple independent variables is called a multiple linear regression model.

How does the model work?

Since an SLR model assumes a linear relationship, we know the line of best fit is described by a linear equation of the form y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope and b is the y-intercept. m and b are also known as betas. The key to finding a good SLR model is to find the values for these betas that give you the most accurate predictions.

An SLR model uses a technique called Ordinary Least Squares (OLS) to find the optimum values for the betas. OLS calculates the error for each observation, which is the difference between the predicted value and the actual value, and then squares it. The goal is to find the betas that minimize the sum of the squared errors.
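
To make this concrete, here is a minimal sketch in plain Python that computes m and b with the standard closed-form OLS formulas for a single independent variable. The data points and the helper names (fit_slr, sum_squared_errors) are made up for illustration and are not taken from the rest of the post.

```python
# Hypothetical toy data lying close to the line y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]


def fit_slr(x, y):
    """Return (m, b) that minimize the sum of squared errors for y = m*x + b."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    m = sxy / sxx
    # The best-fit line always passes through the point (x_mean, y_mean).
    b = y_mean - m * x_mean
    return m, b


def sum_squared_errors(x, y, m, b):
    """Sum of squared differences between actual and predicted values."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))


m, b = fit_slr(x, y)
print(f"m = {m:.3f}, b = {b:.3f}")
print(f"SSE = {sum_squared_errors(x, y, m, b):.4f}")
```

Any other choice of m and b, including the 'true' values 2 and 1 used to invent the data, gives a sum of squared errors at least as large as the fitted pair.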

If you want to learn more about SLR and OLS, I highly recommend this visual explanation.


Feature scaling in Python using scikit-learn

In my previous post, I explained the importance of feature encoding and how to do it in Python using scikit-learn. In this post, we are going to talk about another component of the preprocessing step in applying machine learning models: feature scaling. Very rarely will you be dealing with features that share the same scale. What do I mean by that? For example, let’s look at the famous wine dataset which can be found here. This dataset contains several features, such as alcohol content, malic acid and color intensity, which describe a type of wine. Focusing on just these three features, we can see that they do not share the same scale. Alcohol content is measured in alcohol/volume whereas malic acid is measured in g/l.

Why is feature scaling important?

If we were to leave the features as they are and feed them to a machine learning algorithm, we may get poor predictions. This is because many algorithms, such as SVM, K-nearest neighbors and logistic regression, expect features to be on a similar scale. If the features are not scaled, your machine learning algorithm might assign more weight to one feature than to another based solely on the magnitude of its values.
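
One common way to put features on a similar scale is standardization: subtract each feature's mean and divide by its standard deviation. The sketch below does this by hand with NumPy so the arithmetic is visible; scikit-learn's StandardScaler applies the same transformation. The numbers are invented stand-ins for the three wine features, not values from the real dataset.

```python
import numpy as np

# Hypothetical values (not from the real wine dataset).
# Columns: alcohol, malic acid, color intensity.
X = np.array([
    [14.2, 1.7, 5.6],
    [13.2, 1.8, 4.4],
    [12.4, 3.1, 2.9],
    [13.7, 4.0, 7.5],
])

# Standardization: subtract each column's mean, divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X - mean) / std

print(np.round(X_scaled, 3))
print("column means:", np.round(X_scaled.mean(axis=0), 6))
print("column stds: ", np.round(X_scaled.std(axis=0), 6))
```

After scaling, every column has a mean of zero and a standard deviation of one, so no feature dominates simply because its raw values are larger.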


Feature encoding in Python using scikit-learn

Note: You can now subscribe to my blog here to receive the latest updates.

A key step in applying machine learning models to your data is feature encoding. In this post, we are going to discuss what that consists of and how to do it in Python using scikit-learn.

Not all the fields in your dataset will be numerical. Many times you will have at least one non-numerical feature, which is also known as a categorical feature. For example, your dataset might have a feature called ‘ethnicity’ to describe the ethnicity of employees at a company. Similarly, you can also have a categorical dependent variable if you are dealing with a classification problem where your dataset is used to predict a class instead of a number (regression).

For example, let’s look at a famous machine learning dataset called Iris. This dataset has 4 numerical features: sepal length, sepal width, petal length and petal width. The output is a type of species which can be one of these three classes: setosa, versicolor and virginica.
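
As a small illustration of what encoding means for the Iris output, the sketch below turns the three class names into integers (label encoding) and then into binary columns (one-hot encoding) using plain Python. The sample list is made up; scikit-learn's LabelEncoder and OneHotEncoder cover the same ground for real datasets.

```python
# Hypothetical sample of the categorical output column.
species = ["setosa", "virginica", "versicolor", "setosa", "virginica"]

# Label encoding: map each class to an integer, with classes in sorted order.
classes = sorted(set(species))
class_to_index = {name: i for i, name in enumerate(classes)}
labels = [class_to_index[name] for name in species]

# One-hot encoding: one binary column per class.
one_hot = [
    [1 if i == label else 0 for i in range(len(classes))]
    for label in labels
]

print(classes)
print(labels)
print(one_hot)
```

Label encoding is compact but implies an ordering between classes; one-hot encoding avoids that at the cost of one extra column per class.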


Difference between supervised and unsupervised learning models

In my introductory post about machine learning (ML), I listed a bunch of ML models by their output (regression, classification and clustering). These models can be classified differently as either supervised or unsupervised learning models.

Supervised learning models

In a supervised learning model, your data consists of independent variable(s) and dependent variable(s). You build your model by feeding it this data. The goal is to have a model which can take values of your independent variable and accurately predict corresponding values of the dependent variable.

These types of models are called supervised learning models because your training dataset is labeled with the right ‘answers’ for the model to learn from. You can say that you are supervising the model during its learning process.
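
Here is a toy sketch of that idea in plain Python. The 'model' is a 1-nearest-neighbour classifier, chosen only because it is short; the features, labels and the predict function are all hypothetical. The point is that the labeled examples are the ‘answers’ the model relies on when it sees new data.

```python
# Hypothetical labeled data: (height_cm, weight_kg) -> label.
training_features = [(150.0, 50.0), (160.0, 55.0), (180.0, 85.0), (190.0, 95.0)]
training_labels = ["small", "small", "large", "large"]


def predict(features):
    """Return the label of the closest labeled example (1-nearest neighbour)."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    closest = min(
        range(len(training_features)),
        key=lambda i: squared_distance(training_features[i], features),
    )
    return training_labels[closest]


print(predict((155.0, 52.0)))
print(predict((185.0, 90.0)))
```

Without the labels there would be nothing for predict to return, which is exactly what separates supervised from unsupervised models.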


A brief overview of machine learning

A few years ago, ‘big data’ was the latest buzzword. Over the last year or two, we have moved on to ‘machine learning’ (and blockchain). Machine learning (ML) is nothing new. It has been used for at least a few decades, but only recently has it become accessible enough to be used in the mainstream. Only recently have ML models been made so easily available to everyone through open source libraries. And only recently has computer processing power become so cheap that deploying computation-heavy ML models is easily affordable. There has never been a better time to learn ML and start using it!

I am no ML expert. Maybe one day I can become one but for now, I am simply an ML enthusiast. One of the main issues with getting started with ML is that it sounds very sophisticated and by all means, it is! ML consists of a lot of complex models that have been optimized over several decades. Unless you have a strong background in mathematics, you will find it extremely difficult to understand the inner workings of these models. But don’t let that intimidate you! Thanks to all the recent development in open source software, most of the ML models are easily available to be used. All you need to do is understand how the models work and when to use them! Instead of getting bogged down by the mathematical details, try to focus on high level theory and how to easily apply ML models. Once you build your basic understanding, you can pick which model(s) you want to explore further.


Python API for getting market and financial data from IEX

Most of you have probably heard about IEX: The Investors Exchange. IEX is the exchange started by Brad Katsuyama, who was the protagonist of Michael Lewis’s famous book Flash Boys (review). Just last year, IEX scored a major win when the SEC approved its application to register as a national securities exchange. As time passes, IEX continues to gain more and more market share.

Just like any other exchange, one of IEX’s most valuable assets is the market data generated by all the trading. However, unlike other exchanges, IEX makes its data available to the public for free via a web API. On February 22, 2017, IEX wrote a blog post announcing the release of its web API. Since then, IEX has made quite a few enhancements and added support for newer datasets as well.

As of today, some of the data that IEX provides includes:

  • pricing data (latest trade and quote data as well as summary data going back up to 5 years),
  • reference data,
  • news data,
  • earnings data, and
  • financial data.


Book Review: Efficiently Inefficient by Lasse Heje Pedersen


One of the first few books that I read about asset management firms and hedge funds was Inside the Black Box (Review | Amazon) by Rishi Narang. It was a great high-level book which covered all the different components of a hedge fund, such as market data, backtester, order management system, risk models, portfolio management, transaction cost analysis, etc. As I continued to learn more about the industry, I wanted to read a more in-depth book that covered more than just the basics. Efficiently Inefficient by Lasse Pedersen is such a book, covering a variety of topics on hedge funds.


The book is divided into four sections:

  1. Active investment
  2. Equity strategies
  3. Asset allocation and Macro strategies
  4. Arbitrage strategies