artisanal bytes

“Hand-crafted in San Francisco from locally sourced bits.”

Shifting Time In HBase

For our HBase table layout, we are following an entity-centric model, evangelized to us by our friends at WibiData. The idea is to put all of the data about a single entity into a single row in HBase. When you need to do a computation that involves that entity’s data, you have quick access to it by the row key, and all of the data is stored close together on disk. Additionally, against many suggestions from the HBase community, and despite general confusion about how timestamps work, we are using timestamps with logical values. Instead of just letting the region server assign a timestamp version to each cell, we explicitly set those values so that we can use the timestamp as a true queryable dimension in our gets and scans. In addition to the real timeseries data that is indexed using the cell timestamp, we also have other columns that store metadata about the entity. That data does not need to be logically timestamped, and we always just want the most recent version of it. Given this description, rows in the table look something like this:

timestamp   timeseries   metadata
2013        1.09
2012        0.87
2011        0.93         "house"
2010        1.02
2009        0.98

This shows a row that has five timeseries values in it and one piece of metadata that we keep: a label for where the data came from. The metadata is stored at the timestamp when we wrote it into the table, which in this example is in 2011. Since we imported historical data, and we use logical timestamps for the timeseries column, we have data going back to 2009. One typical access pattern is to query for all data from the beginning of “last year” up to now, because we need that much data for many analyses. If we ran a get for this data, it would normally be set up like this (in pseudo-code, of course):

Get get = new Get(rowkey);
get.addColumn("timeseries");
get.addColumn("metadata");
// HBase time ranges are [min, max), so 2014 is needed to include the 2013 cell.
get.setTimeRange(2012, 2014);

Unfortunately, that get would not return any data for the metadata column, because the column has no cell in that time range. We have two options: make a second get for the metadata, or figure out some other solution. That second solution is timeshifting.
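To see why the bounded range drops the metadata, here is a minimal Java sketch that models the two columns of the example row as sorted maps. It is an in-memory stand-in, not the HBase client API: the class `BoundedRangeDemo` and its methods are hypothetical names used only for illustration. It assumes the half-open range that HBase documents for `setTimeRange`, [minStamp, maxStamp), which is why the upper bound is 2014 rather than 2013.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the example row; not the HBase client API.
class BoundedRangeDemo {

    // The timeseries column: logical timestamp (a year here) -> value.
    static NavigableMap<Long, String> timeseries() {
        NavigableMap<Long, String> cells = new TreeMap<>();
        cells.put(2009L, "0.98");
        cells.put(2010L, "1.02");
        cells.put(2011L, "0.93");
        cells.put(2012L, "0.87");
        cells.put(2013L, "1.09");
        return cells;
    }

    // The metadata column: a single cell stamped with its write time.
    static NavigableMap<Long, String> metadata() {
        NavigableMap<Long, String> cells = new TreeMap<>();
        cells.put(2011L, "house");
        return cells;
    }

    // Keeps cells with minStamp <= timestamp < maxStamp (assumed half-open range).
    static NavigableMap<Long, String> inRange(
            NavigableMap<Long, String> column, long minStamp, long maxStamp) {
        return column.subMap(minStamp, true, maxStamp, false);
    }

    public static void main(String[] args) {
        // The timeseries column has cells in the range; the metadata column does not.
        System.out.println("timeseries: " + inRange(timeseries(), 2012L, 2014L));
        System.out.println("metadata:   " + inRange(metadata(), 2012L, 2014L));
    }
}
```

Running it prints the 2012 and 2013 timeseries cells and an empty map for metadata, which is the gap that timeshifting is meant to close.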

Instead of storing the metadata column with the true server timestamp of when it was written, we shift the timestamp forward by 50,000 years, which makes the row data now look like this:

timestamp   timeseries   metadata
52011                    "house"
2013        1.09
2012        0.87
2011        0.93
2010        1.02
2009        0.98
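The shift itself is plain timestamp arithmetic. The tables here use years for readability; as a rough sketch, assuming the real cell timestamps are epoch milliseconds, the shift could be computed with `java.time` as below. The class `TimeShift` and its method names are hypothetical, not part of HBase.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helper for shifting metadata timestamps; not part of HBase.
class TimeShift {

    // How far into the future metadata cells are moved.
    static final long SHIFT_YEARS = 50_000L;

    // Moves an epoch-millisecond timestamp forward by SHIFT_YEARS calendar years.
    static long shift(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .atOffset(ZoneOffset.UTC)
                .plusYears(SHIFT_YEARS)
                .toInstant()
                .toEpochMilli();
    }

    // Recovers the original write time from a shifted timestamp.
    static long unshift(long shiftedMillis) {
        return Instant.ofEpochMilli(shiftedMillis)
                .atOffset(ZoneOffset.UTC)
                .minusYears(SHIFT_YEARS)
                .toInstant()
                .toEpochMilli();
    }

    // Calendar year (UTC) of an epoch-millisecond timestamp.
    static int yearOf(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atOffset(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        long written = Instant.parse("2011-06-15T00:00:00Z").toEpochMilli();
        long shifted = shift(written);
        System.out.println(yearOf(written) + " -> " + yearOf(shifted));
    }
}
```

Keeping an `unshift` alongside `shift` means the original write time can still be recovered from the stored cell if it is ever needed.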

Since we imported the data in 2011 and timeshifted the metadata column timestamp, its new timestamp is 52,011. We now change our get slightly by setting an ending timestamp to be the logical “end of time”:

Get get = new Get(rowkey);
get.addColumn("timeseries");
get.addColumn("metadata");
get.setTimeRange(2012, Long.MAX_VALUE);

Now, we will get data for the timeseries column and for the metadata column using only one RPC. The benefit of HBase being a sparse datastore is that the data for the metadata column is stored on disk right next to the data for the timeseries column even though they are logically separated by 50,000 years. There is no added overhead to the storage to account for this, and thus no added processing when fetching it.
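The effect of the open-ended range can be checked with a small in-memory model of the timeshifted row. This is a sketch under stated assumptions, not HBase client code: `ShiftedRangeDemo`, `shiftedRow`, and `get` are hypothetical names, and the range is assumed to be half-open, [minStamp, maxStamp).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the timeshifted row; not the HBase client API.
class ShiftedRangeDemo {

    // Builds the example row: column name -> (timestamp -> value).
    static Map<String, NavigableMap<Long, String>> shiftedRow() {
        NavigableMap<Long, String> timeseries = new TreeMap<>();
        timeseries.put(2009L, "0.98");
        timeseries.put(2010L, "1.02");
        timeseries.put(2011L, "0.93");
        timeseries.put(2012L, "0.87");
        timeseries.put(2013L, "1.09");

        NavigableMap<Long, String> metadata = new TreeMap<>();
        metadata.put(52011L, "house"); // written in 2011, shifted by 50,000

        Map<String, NavigableMap<Long, String>> row = new LinkedHashMap<>();
        row.put("timeseries", timeseries);
        row.put("metadata", metadata);
        return row;
    }

    // One lookup over several columns, keeping minStamp <= timestamp < maxStamp.
    static Map<String, NavigableMap<Long, String>> get(
            Map<String, NavigableMap<Long, String>> row,
            List<String> columns, long minStamp, long maxStamp) {
        Map<String, NavigableMap<Long, String>> result = new LinkedHashMap<>();
        for (String column : columns) {
            result.put(column, row.get(column).subMap(minStamp, true, maxStamp, false));
        }
        return result;
    }

    public static void main(String[] args) {
        // A single call now returns both the recent timeseries cells and the metadata.
        System.out.println(get(shiftedRow(),
                Arrays.asList("timeseries", "metadata"), 2012L, Long.MAX_VALUE));
    }
}
```

In a real client, a `Get` returns only the newest version of each column unless more versions are requested, so the timeseries column would also need a call such as `setMaxVersions()` (or `readAllVersions()` in newer clients) to bring back every cell in the range.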

Yes, there is a problem for the future us of the year 52,011, but I’m betting we will all be using relational databases again by that point.