Parsing command line arguments in python

Python is a very popular language for many reasons, and one of them is that it works equally well for quick scripting and for enterprise applications. Professionally, I have used Python to write many scripts: some quick and temporary, others more complex and long-term.

Whatever the purpose of a script, most of them start by parsing command line arguments. It’s always a good idea to let users customize the behavior of a script by passing in different values for its command line arguments. For example, if you have a script that runs data quality checks against a database, you might want to pass an argument that decides which day to run the checks for.

So, how do you parse command line arguments in Python? Python ships with a standard library module called argparse that does the job for you.
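As a minimal sketch of the data quality example above, the snippet below uses argparse to accept the day to run the checks for. The `--date` and `--verbose` options and the `parse_day` and `parse_args` helpers are hypothetical names chosen for illustration, not taken from the full post.

```python
import argparse
from datetime import date, datetime


def parse_day(text):
    """Convert a YYYY-MM-DD string into a date object."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_args(argv=None):
    """Parse command line arguments; argv=None makes argparse read sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Run data quality checks for a given day."
    )
    # Hypothetical option: which day to run the checks for (defaults to today).
    parser.add_argument(
        "--date",
        type=parse_day,
        default=date.today(),
        help="day to check, in YYYY-MM-DD format (default: today)",
    )
    # Hypothetical flag: print extra detail while the checks run.
    parser.add_argument("--verbose", action="store_true", help="print extra detail")
    return parser.parse_args(argv)


# Passing an explicit list simulates: python checks.py --date 2017-12-01
args = parse_args(["--date", "2017-12-01"])
print(f"Running checks for {args.date} (verbose={args.verbose})")
```

In a real script you would call `parse_args()` with no arguments so that the values come from the actual command line.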

Continue reading “Parsing command line arguments in python”

Why documentation is more important than code

Taking a break from the machine learning heavy posts, I would like to talk about something slightly different but very important: documentation. Which is more important: code or documentation? If asked, most developers would say that code is more important than documentation. As a developer, you are given some business requirements and are asked to deliver a solution. That solution is your code. It works and solves the problem. Done!

There is nothing wrong with this argument. At the end of the day, code gets the job done. And if you’re on a tight schedule, such as trying to fix a production issue, then yes, your code is way more important. But that’s not the scenario I am talking about. I am talking about the bulk of development work, where developers are given a sufficient amount of time. That’s when you must document your code. Whether through comments or separately on a wiki page, it is your responsibility as a developer to add that additional information for your team.

Continue reading “Why documentation is more important than code”

Implementing a Binomial Logistic Regression model in Python

So far, we have only discussed regression modelling. However, there is another type of modelling called classification modelling. The primary difference between regression models and classification models is that while regression models are used to predict a quantity, classification models are used to predict a category.

For example, in my post on simple linear regression, we tried to predict soda sales from the day’s temperature. Total soda sales (our label) is a quantitative value, and hence we used a regression model. In today’s example, we are going to predict whether someone will purchase soda by looking at the day’s temperature. Here we have two categories: the customer either will or will not purchase soda. This makes our label (dependent variable) categorical and suitable for logistic regression. Just as there are different variations of the linear regression model, there are also different types of logistic regression models.
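To make the idea concrete, here is a rough sketch of a binomial logistic regression fitted with plain gradient descent, using only the standard library. The temperatures and purchase labels are made up for illustration, and the helper names (`sigmoid`, `fit_logistic`, `purchase_probability`) are hypothetical; a library such as scikit-learn would normally do this work for you.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent on the log loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # predicted probability minus label
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


# Made-up data: day's temperature and whether a customer bought soda (1) or not (0).
temperatures = [15, 18, 21, 24, 27, 30, 33, 36]
purchased = [0, 0, 0, 1, 0, 1, 1, 1]

# Center and rescale the feature so gradient descent behaves well.
mean_temp = sum(temperatures) / len(temperatures)
scaled = [(t - mean_temp) / 10.0 for t in temperatures]

b0, b1 = fit_logistic(scaled, purchased)


def purchase_probability(temperature):
    """Predicted probability of a purchase at the given temperature."""
    return sigmoid(b0 + b1 * (temperature - mean_temp) / 10.0)


print(f"P(purchase at 36 degrees) = {purchase_probability(36):.2f}")
print(f"P(purchase at 15 degrees) = {purchase_probability(15):.2f}")
```

A probability above 0.5 is read as "will purchase" and anything below as "will not purchase", which is what turns the regression output into a category.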

 

Continue reading “Implementing a Binomial Logistic Regression model in Python”

2017: Year in Review

And just like that, 2017 is almost gone. Last year, around this time, I wrote a post in which I reflected on what I did or did not accomplish in 2016. It is now time to do the same for 2017. A lot happened in 2017 and I would like to take a step back and reflect to make sure I am still pursuing my goals.

Let’s begin!

I started learning AWS

There are a lot of options out there these days when it comes to cloud computing, such as Azure, AWS, Google Cloud etc. I decided to pick AWS and learn the basics. There is no doubt that cloud is the way to go. 5-10 years from now, most companies will not have an infrastructure department. Infrastructure will be provided by third-party companies as a service. As far as I am concerned, I am interested in spinning up some EC2 instances on AWS, using S3/Glacier to store files and storing data in an AWS database such as Redshift.

Continue reading “2017: Year in Review”

Implementing a Polynomial Regression Model in Python

So far, we have looked at two types of linear regression models and how to implement them in Python using scikit-learn. To recap, we began with a simple linear regression (SLR) model where we have one independent variable (feature) and one dependent variable (label). We then expanded it slightly to a more general use case where we had multiple independent variables and one dependent variable. We called it a multiple linear regression (MLR) model.

Both of these models result in a straight line or a plane (if in multiple dimensions), which is very convenient but a bit too simplistic for the real world. Most real-world problems cannot be easily modeled by a simple or multiple linear regression model. For them, you need a non-linear model such as a polynomial regression model.

A polynomial regression model can be represented by an equation of this form:

y = β0 + β1x + β2x^2 + ... + βnx^n

A polynomial regression model is considered a type of linear regression model, which can be confusing to some. The reason is that while the fitted curve is nonlinear in x, the regression function is linear in the coefficients being estimated. In fact, polynomial regression is a special case of multiple linear regression in which the powers of x act as the independent variables.

How can I implement a polynomial regression model?

Implementing a polynomial regression model is slightly different from implementing a simple or multiple linear regression model. You still use the linear regression model, but before you do that, you have to construct polynomial features from your independent variables.
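As a small illustration of that feature construction step, the sketch below expands a single independent variable into its powers and evaluates a polynomial model as an ordinary linear combination of those features. The function names and the beta values are hypothetical; scikit-learn's `PolynomialFeatures` transformer performs a similar expansion (and by default also adds a constant column).

```python
def polynomial_features(x, degree):
    """Expand a single value x into [x, x**2, ..., x**degree]."""
    return [x ** power for power in range(1, degree + 1)]


def predict(intercept, betas, x):
    """Evaluate intercept + beta_1 * x + beta_2 * x**2 + ... as a linear model."""
    features = polynomial_features(x, degree=len(betas))
    return intercept + sum(beta * value for beta, value in zip(betas, features))


# Each row is what a linear regression model would receive for one observation.
for x in [1.0, 2.0, 3.0]:
    print(x, polynomial_features(x, degree=3))

# Hypothetical coefficients: y = 1 + 2x + 3x^2
print(predict(1.0, [2.0, 3.0], x=2.0))
```

Because the model only ever sees the expanded columns, the fitting step is the same linear regression as before, just with more features.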

Here are the steps we are going to follow as usual:

  • Exploring the dataset
  • Splitting the dataset into training and testing sets
  • Building the model
  • Evaluating the model

Continue reading “Implementing a Polynomial Regression Model in Python”

Analyzing NYC motor vehicle data in Spark

A while back I wrote about analyzing NYC’s traffic (motor vehicle) data in q/kdb+. Soon afterwards, I showed how to analyze that data in Python using the pandas library. Now, I would like to analyze the same dataset again, but this time in Apache Spark. As I mentioned in my last post, I am currently learning Spark, so you will be seeing a lot more posts about it in the near future.

If you don’t have Spark installed, please see my previous post on how to set it up on AWS.

In this post, I will show you how to:

  • Load data from a csv
  • Transform a dataframe
  • Aggregate data
  • Sort data
  • Filter data

Continue reading “Analyzing NYC motor vehicle data in Spark”

Setting up Apache Spark on an AWS EC2 instance

I am currently learning Apache Spark and how to use it for in-memory analytics as well as machine learning (ML). Scikit-learn is a great library for ML, but when you want to deploy an ML model in prod to analyze billions of rows (‘big data’), you want to be working with a technology or framework, such as Hadoop, that supports distributed computing.

Apache Spark is an open-source engine that can run on top of Hadoop and provides a significant improvement over native Hadoop MapReduce operations due to its support for in-memory computing. Spark also has a very nice API available in Scala, Java, Python and R, which makes it easy to use. Of course, I will be focusing on Python since that’s the language I am most familiar with.

Moreover, when working with a distributed computing system, you want to make sure that it’s running on a cloud platform such as AWS, Azure or Google Cloud, which allows you to scale your cluster flexibly. For example, if you have to quickly analyze billions of rows, you can spin up a bunch of EC2 instances with Spark running and run your ML models on the cluster. After you are done, you can easily terminate your session.

In this blog post, I will be showing you how to spin up a free instance of AWS Elastic Compute Cloud (EC2) and install Spark on it. Let’s get started!

Continue reading “Setting up Apache Spark on an AWS EC2 instance”

Implementing a Multiple Linear Regression model in python

Earlier, I wrote about how to implement a simple linear regression (SLR) model in Python. SLR is probably the easiest model to implement among the most popular machine learning algorithms. In this post, we are going to take it one step further: instead of working with just one independent variable, we will be working with multiple independent variables. Such a model is called a multiple linear regression (MLR) model.

How does the model work?

A multiple linear regression model can be described by a linear equation consisting of multiple independent variables.

For example:

y = β0 + β1x1 + β2x2 + ... + βnxn

In this equation, β (beta) denotes the coefficients, x denotes the independent variables and y denotes the dependent variable.

An SLR model is a simplified version of an MLR model where there is only one x. Linear regression models use a technique called Ordinary Least Squares (OLS) to find the optimum values for the betas. OLS consists of calculating the error, which is the difference between the predicted value and the actual value, and then squaring it. The goal is to find the betas that minimize the sum of the squared errors.
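One common way to find those betas is to solve the so-called normal equations, (XᵀX)β = Xᵀy. The sketch below does this in plain Python with Gaussian elimination; the helper names and the tiny dataset are hypothetical, the data was generated from y = 1 + 2·x1 + 3·x2 so the expected betas are known, and a real project would more likely rely on scikit-learn or NumPy.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(features, targets):
    """Return [b0, b1, ..., bk] that minimize the sum of squared errors."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up observations generated from y = 1 + 2*x1 + 3*x2 (no noise).
features = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
targets = [9, 8, 19, 18, 26]

betas = fit_ols(features, targets)
print([round(beta, 6) for beta in betas])
```

This approach assumes the independent variables are not perfectly collinear; otherwise the system has no unique solution and the elimination step would divide by zero.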

If you want to learn more about SLR and OLS, I highly recommend this visual explanation.

Continue reading “Implementing a Multiple Linear Regression model in python”

q/kdb+ api for getting market and financial data from IEX

A few months ago, I wrote an API in Python for getting market and financial data from IEX. As discussed earlier, IEX makes a lot of its data available to the public through its web service API (link).

In this post, I will show you how to use the API I wrote in q/kdb+. Let’s get started.

You can find the code here.

get_last_trade

Use this function to get last trade data (price and size) for one or more securities.

q)get_last_trade`aapl`ibm
sym  price  size time
---------------------------------------------
AAPL 174.66 100  2017.11.10D20:59:58.008999936
IBM  149.18 300  2017.11.10D20:59:59.724999936
Continue reading “q/kdb+ api for getting market and financial data from IEX”

Implementing Simple Linear Regression Model in Python

So far, I have discussed some of the theory behind machine learning algorithms and shown you how to perform vital steps when it comes to data preprocessing such as feature scaling and feature encoding. We are now ready to start with the simplest machine learning algorithm which is simple linear regression (SLR).

Remember how, back in school, you would collect data for one of your science lab experiments and then use it to predict some values by plotting your data in Microsoft Excel and drawing a line of best fit through it? That line of best fit is the outcome of an SLR model. An SLR model is a linear model which assumes that two variables (the independent and dependent variables) exhibit a linear relationship. A linear model with multiple independent variables is called a multiple linear regression model.

How does the model work?

Since an SLR model assumes a linear relationship, we know the line of best fit is described by a linear equation of the form y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope and b is the y-intercept. m and b are also known as the betas. The key to finding a good SLR model is to find the values for these betas that get you the most accurate predictions.

An SLR model uses a technique called Ordinary Least Squares (OLS) to find the optimum values for the betas. OLS consists of calculating the error, which is the difference between the predicted value and the actual value, and then squaring it. The goal is to find the betas that minimize the sum of the squared errors.
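For a single independent variable, the betas that minimize the sum of squared errors have a well-known closed form: m is the covariance of x and y divided by the variance of x, and b is mean(y) - m * mean(x). The sketch below applies that formula to a tiny made-up dataset; the function names are hypothetical.

```python
def fit_slr(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def sum_squared_errors(xs, ys, m, b):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Made-up data points for illustration.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

m, b = fit_slr(xs, ys)
print(f"slope={m:.3f}, intercept={b:.3f}")
print(f"sum of squared errors={sum_squared_errors(xs, ys, m, b):.3f}")
```

Nudging m or b away from the fitted values and recomputing `sum_squared_errors` is a quick way to convince yourself that the fitted line really is the minimum.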

If you want to learn more about SLR and OLS, I highly recommend this visual explanation.

Continue reading “Implementing Simple Linear Regression Model in Python”