How data is saved in a historical database

Real-time databases are easy to understand because everything happens in memory. Historical databases are a different story. Recently, I have been getting much more involved with hdbs at work, and the way they work is fascinating to me. If you are serious about kdb+, you should know how hdbs work under the hood. It will help you understand why qsql queries are written a certain way (specifying the date parameter first). It will also help you understand why moving data from one hdb to another is not as simple as copying it from one rdb to another.

In this blog post, I will cover how data is stored on disk (splayed/partitioned tables) and why that makes data retrieval fast. I will also touch on attributes briefly.

Splayed tables

If a table is too big to fit in memory, it is best to save it to disk. Recall that a table is simply a collection of column lists. We can store each of those columns as a separate file on disk, and a table stored this way is called a splayed table. A major benefit of splayed tables is that when you query the table, kdb+ loads into memory only the columns you asked for and not the entire table. For example:

q)select time,sym,price from table where sym=`AAPL

will only load the time, sym and price files from the directory into memory. This speeds up data retrieval.
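To make this concrete, here is a minimal sketch of writing a table splayed and reading it back. The table contents and the /tmp/splaydb path are made up for illustration, and the sketch assumes that directory is an empty scratch location you can write to.

```q
/ hypothetical in-memory table; names and values are made up for illustration
table:([] time:09:30:00 09:30:01 09:30:02; sym:`AAPL`MSFT`AAPL; price:101.5 45.2 101.7)

/ symbol columns must be enumerated before splaying; .Q.en enumerates them
/ against a sym file in the given directory (an assumed scratch path)
/ the trailing slash on the target tells set to write one file per column
`:/tmp/splaydb/table/ set .Q.en[`:/tmp/splaydb] table

/ list the files in the splayed directory: one per column, plus .d
key `:/tmp/splaydb/table

/ map the splayed table and query only the columns we need
t:get `:/tmp/splaydb/table/
select time,sym,price from t where sym=`AAPL
```

The trailing slash is the important detail. Without it, set writes the whole table as a single file instead of one file per column.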

Partitioned tables

A partitioned table is a type of splayed table. Often, you will have tables so big that you can’t easily hold even one of their columns in memory. Think of a trade table with five years of data. For data that large, splitting a table by column is not enough, so you need to go one step further and split it again. Often, this is done on the date column. Such tables are called partitioned tables.

Their directory structure would look like:

/kdb/stock

  • 2014.11.12
    • trade
      • sym
      • price
      • exch
      • .d
  • 2014.11.13
    • trade
      • sym
      • price
      • exch
      • .d
  • 2014.11.14
    • trade
      • sym
      • price
      • exch
      • .d

As you can see, each date partition contains a splayed trade folder. Note that there is no date file in the trade directory, because kdb+ is smart enough to get the date from the name of the date folder. The fact that most historical tables (especially trade, quote and depth) are date partitioned is the reason why you should always specify the date parameter first when writing a qsql query.

q)select from trade where date=2014.11.13, sym=`AAPL

When you run such a query, kdb+ knows to go ONLY to the 2014.11.13 directory. If you don’t specify the date parameter, kdb+ will search ALL the date partitions, which is too time-consuming and will most likely lead to your query crashing.
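If you want to try this yourself, the sketch below builds a tiny two-day database and runs the query above against it. It uses .Q.dpft, which writes a global table into a date partition. The trade data and the /tmp/hdb path are made up, and the sketch assumes that directory is an empty scratch location.

```q
/ hypothetical one-day trade table; values are made up
trade:([] sym:`MSFT`AAPL`AAPL; price:45.2 101.5 101.7; exch:`Q`N`N)

/ .Q.dpft[directory;partition;parted field;table name] writes the global table
/ as a splayed table under directory/partition, enumerating symbols against the
/ sym file, sorting by the field and applying the parted attribute to it
.Q.dpft[`:/tmp/hdb;2014.11.12;`sym;`trade]
.Q.dpft[`:/tmp/hdb;2014.11.13;`sym;`trade]

/ the database root now holds one directory per date plus the sym file
key `:/tmp/hdb

/ load the database; trade is now a partitioned table with a virtual date column
\l /tmp/hdb

/ date first, so only the 2014.11.13 directory is read
select from trade where date=2014.11.13, sym=`AAPL
```

The same day of data is written twice here only to keep the sketch short. In a real hdb each partition would of course hold that day's trades.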

You are probably wondering about the .d file in each date partition. The .d file is simply a list of the column names in order, so kdb+ knows how to reconstruct the table in memory.

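The sym file in each trade directory holds the sym column in enumerated form: it stores positions in a list of distinct symbols rather than the symbols themselves. That list usually lives in a separate sym file at the root of the database. You can read more about enumeration in my earlier post.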
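Both files are easy to inspect from q. The sketch below is only an illustration: the table and the /tmp/enumdb path are made up, and it assumes that directory is an empty scratch location so the sym list starts out empty.

```q
/ hypothetical table written to an assumed, empty scratch directory
t:([] sym:`AAPL`MSFT`AAPL; price:101.5 45.2 101.7)

/ enumerate the symbol column against /tmp/enumdb/sym, then splay the result
e:.Q.en[`:/tmp/enumdb] t
`:/tmp/enumdb/t/ set e

/ .d holds the column names in order
get `:/tmp/enumdb/t/.d

/ the sym file at the root holds the list of distinct symbols seen so far
get `:/tmp/enumdb/sym

/ the enumerated column stores positions in that list, not the symbols
`long$e`sym

/ value resolves the positions back to symbols
value e`sym
```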

Attributes

I have covered attributes earlier, but I would like to revisit them in this post to emphasize their importance. At the end of the day, the rdb processes the data and pushes it to the hdb. The processing involves reorganizing the data and adding attributes to it. The rdb originally stores the data in chronological order as new trades come in, but that’s not efficient for historical data because clients are more interested in data by sym for specific dates.

Historical data is usually saved with the parted (`p#) attribute on the sym column, whereas real-time data is stored with the grouped (`g#) attribute. The parted attribute requires all rows for a given sym to sit next to each other, which speeds up data retrieval as kdb+ doesn’t need to search the entire table.
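Here is a small in-memory sketch of the two layouts. The table and its values are made up, and this is only meant to show how the attributes are applied, not how any particular end-of-day process does it.

```q
/ hypothetical trades in arrival order
raw:([] time:09:30:00 09:30:01 09:30:02 09:30:03; sym:`MSFT`AAPL`MSFT`AAPL; price:45.2 101.5 45.3 101.7)

/ real-time style: keep arrival order and apply the grouped attribute to sym
rt:update `g#sym from raw

/ historical style: sort by sym so equal syms are contiguous, then apply parted
ht:update `p#sym from `sym xasc raw

/ the a column of meta shows the attribute on each column
meta rt
meta ht

/ xasc is a stable sort, so rows stay in time order within each sym
select from ht where sym=`AAPL
```

Note that the parted attribute can only be applied once the rows for each sym are contiguous, which is why the table is sorted by sym first.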

I hope this post helped you understand the magic that happens under the hood in an hdb. In my next post, I will discuss how to move data from one hdb to another hdb. Most kdb+ developers have done so at some point in their careers.
