Kaggle: Solving Titanic challenge using logistic regression in Spark

In this post, I will show how to tackle Kaggle’s entry level challenge called Titanic. In this challenge, you are given training and test dataset. Your goal is to use the training dataset to build and train a model and then use it to predict whether a passenger will survive or not listed in test …

Analyzing NYC motor vehicle data in Spark

A while back I wrote about analyzing NYC’s traffic (motor vehicle) data in q/kdb+. Then, soon afterwards, I showed how to analyze that data in python using pandas library. Now, I would like to again analyze the same dataset but this time, in Apache Spark. As I mentioned in my last post, I am currently …

Setting up Apache Spark on an AWS EC2 instance

I am currently learning Apache Spark and how to use it for in-memory analytics as well as machine learning (ML). Scikit-learn is a great library for ML but when you want to deploy an ML model in prod to analyze billions of rows (‘big data’), you want to be working with some technology or framework …