Getting started with data science

Recently, a few of my friends have shown interest in what I do and the skill set required for my job. For those who don’t know, I am a market data developer. This means that I work with time-series databases to capture and store both real-time and historical data. I am also responsible for writing queries to help my users (i.e. researchers) analyze this data efficiently. Most of the time, when people say they work with big data, they are exaggerating. But I can promise you, market data is big data. In case you don’t believe me, let me tell you that our system captures around 4 billion rows daily.

Once I explain this to my friends, they are interested in finding out more. How do I capture so much data? What tools do I use to analyze this data? How can they get into data analysis? Where should they begin?


Analyzing NYC traffic data using q

In my previous post, I analyzed some NYC traffic data using Pandas. In this post, I would like to perform the same analysis using q. As far as graphing is concerned, I won’t be showing that. q is not really used for graphing. You can use GUIs like qPad or qStudio to chart the data on your own.

This post will help you see how one can achieve the same results using different methods. It’s nice to have these kinds of options. Earlier, a major reason for not using q was its cost, but now you can just use the 32-bit version, which is available for free!

One thing to note is that the analysis done in this post is very straightforward. Both Pandas and q are capable of handling much more complex analysis. The point of these posts is to show you the different tools available to you.

For my analysis, I will be using the same data set that I used for the earlier post.

Let’s begin!


OOP concepts in q

Many of my friends (almost 90%) are programmers…object-oriented programmers. They also have much more programming experience than I do; many of them have at least 4 more years of experience than me. I always hear them engaging in discussions about OOP concepts and, to be honest, I feel left out sometimes. I have never programmed in an object-oriented language professionally. I read a book on Java before I started working and took 2 C++ classes in college. That’s pretty much it. I have no professional experience with them, but I am familiar with the basic concepts.

I thought I would cater this post to programmers who have an object-oriented background. It is usually said that functional languages are much easier to learn if you don’t have much programming experience with an object-oriented language. In this post, I am going to take some basic OOP concepts and find their respective matches in q.


qSQL queries for performing analytics

I realize that there are many developers out there (e.g. quants) who are not looking to get into q completely and are simply using q/kdb+ along with qSQL to perform analytics. My job requires all of this, so I have some good experience running qSQL queries. Of course, the type of query you need to run really depends on what kind of data you are looking to retrieve, so I can’t possibly cover them all in this post. But I will mention some common queries you can run.

All the examples will focus on these two tables:

time         sym  ask bid
02:59:16.636 IBM  40  2
14:35:31.860 AAPL 88  39
16:36:29.214 AAPL 77  64
08:31:52.958 IBM  30  49
07:14:12.294 AAPL 17  82

time         sym  price size
10:25:30.322 AAPL 8     36
14:17:41.480 AAPL 97    12
08:50:31.645 MSFT 52    45
15:20:08.925 AAPL 66    83
09:01:27.840 MSFT 24    94
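
To make the kind of query concrete, here is a rough sketch in plain Python (not q) that runs three common analytics over the two tables above: a filter on symbol, a grouped maximum, and a size-weighted average price. The table names `quotes` and `trades` and the helper function names are my own assumptions, since the tables are not named here, and the equivalent qSQL would be far more compact.

```python
from collections import defaultdict

# Hypothetical table names; the rows are copied from the two tables above.
quotes = [
    {"time": "02:59:16.636", "sym": "IBM", "ask": 40, "bid": 2},
    {"time": "14:35:31.860", "sym": "AAPL", "ask": 88, "bid": 39},
    {"time": "16:36:29.214", "sym": "AAPL", "ask": 77, "bid": 64},
    {"time": "08:31:52.958", "sym": "IBM", "ask": 30, "bid": 49},
    {"time": "07:14:12.294", "sym": "AAPL", "ask": 17, "bid": 82},
]
trades = [
    {"time": "10:25:30.322", "sym": "AAPL", "price": 8, "size": 36},
    {"time": "14:17:41.480", "sym": "AAPL", "price": 97, "size": 12},
    {"time": "08:50:31.645", "sym": "MSFT", "price": 52, "size": 45},
    {"time": "15:20:08.925", "sym": "AAPL", "price": 66, "size": 83},
    {"time": "09:01:27.840", "sym": "MSFT", "price": 24, "size": 94},
]


def where_sym(table, sym):
    """Return only the rows for one symbol (a simple filter)."""
    return [row for row in table if row["sym"] == sym]


def max_by_sym(table, column):
    """Return the largest value of `column` for each symbol (a grouped max)."""
    result = {}
    for row in table:
        sym = row["sym"]
        result[sym] = max(result.get(sym, row[column]), row[column])
    return result


def vwap_by_sym(table):
    """Return the size-weighted average price for each symbol."""
    notional = defaultdict(float)
    volume = defaultdict(int)
    for row in table:
        notional[row["sym"]] += row["price"] * row["size"]
        volume[row["sym"]] += row["size"]
    return {sym: notional[sym] / volume[sym] for sym in volume}
```

For example, `where_sym(trades, "AAPL")` returns the three AAPL trades, and `max_by_sym(quotes, "ask")` gives the highest ask seen for each symbol.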


Overview of kdb+ architecture

So far, I have covered quite a few intermediate topics without covering the basics. In this post, I would like to take a step back and talk about the general architecture of kdb+. A simple kdb+ setup includes a feed handler (fh), a ticker plant (tp), a real-time database (rdb), and a historical database (hdb). These processes work together to manage the data flow.


Feed Handler

Before you can start capturing data, you first need a feed handler to get you that data. In the case of stock data, there are ssl feed handlers that can be used. A feed handler’s job is to parse incoming data from the source (e.g. Reuters) and push it to a ticker plant.
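
The parse-and-push job described above can be sketched in a few lines of plain Python (not q). Everything here is an assumption made for illustration: the `sym|ask|bid` message format is made up, and the `TickerPlant` class is only a stand-in that records what it receives. Real feed handlers deal with vendor-specific protocols.

```python
class TickerPlant:
    """Stand-in for a ticker plant: it only records what it is sent."""

    def __init__(self):
        self.received = []

    def push(self, table, row):
        # Remember which table the row is meant for, along with the row.
        self.received.append((table, row))


def parse_quote(raw):
    """Parse a made-up 'sym|ask|bid' message into a row."""
    sym, ask, bid = raw.split("|")
    return {"sym": sym, "ask": float(ask), "bid": float(bid)}


def feed_handler(raw_messages, ticker_plant):
    """Parse each incoming message and push it to the ticker plant."""
    for raw in raw_messages:
        ticker_plant.push("quote", parse_quote(raw))
```

Calling `feed_handler(["IBM|40|2", "AAPL|88|39"], tp)` with `tp = TickerPlant()` leaves two parsed quote rows in `tp.received`.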


Tables, keyed tables and dictionaries

When I first started learning q, I had a difficult time understanding the differences between tables, keyed tables and dictionaries. The differences seemed very subtle at the time. Just recently, I was explaining to a colleague (a Java developer with some exposure to q/kdb) how you can check the meta of a table and look up its keys and types. All this seems trivial to those of us who are full-time q developers, but someone who only touches the surface of q/kdb in their daily job will have no idea about these features.

q is all about tables/dictionaries (and lists) and if you don’t know how to differentiate them properly, then you are going to have a tough time. In this post, I have highlighted some of the key similarities and differences between tables/keyed tables and dictionaries.
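
As a rough mental model, here is how the three structures relate, sketched in plain Python rather than q. A dictionary maps keys to values, a table behaves like a dictionary from column names to equal-length lists, and a keyed table maps the key columns of each row to its remaining columns. The sample data and the helper names (`row`, `meta`, `key_by`) are assumptions for illustration. They are not q functions, although `meta` loosely mimics what checking the meta of a table tells you.

```python
# A dictionary: keys mapped to values.
quote = {"sym": "IBM", "ask": 40, "bid": 2}

# A table: column names mapped to equal-length lists (made-up sample data).
table = {
    "sym": ["IBM", "AAPL", "MSFT"],
    "ask": [40, 88, 52],
    "bid": [2, 39, 50],
}


def row(table, i):
    """Return row i of a table as a dictionary."""
    return {col: values[i] for col, values in table.items()}


def meta(table):
    """Return each column's type name, based on its first value."""
    return {col: type(values[0]).__name__ for col, values in table.items()}


def key_by(table, key_cols):
    """Return a keyed table: key-column values mapped to the other columns."""
    n_rows = len(next(iter(table.values())))
    keyed = {}
    for i in range(n_rows):
        r = row(table, i)
        key = tuple(r[c] for c in key_cols)
        keyed[key] = {c: v for c, v in r.items() if c not in key_cols}
    return keyed
```

With this model, `row(table, 0)` is itself a dictionary, and `key_by(table, ["sym"])[("AAPL",)]` looks up the ask and bid for AAPL by its key.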