Nathan Keyes edited this page May 16, 2013 · 31 revisions

ok-hbase [Work In Progress]

Totally cribbed from HappyBase

Why?

There is a lack of good gems to interface with HBase from Ruby (not JRuby).
Gems like hbase-ruby and hbaserb are too basic; hbase-stargate uses REST, not Thrift; Rhino hasn't been updated in 3 years; and finally, Massive Record does not meet our performance expectations.

After limping along with Massive Record for too long, I decided it would be better and quicker to start from scratch than to try to fix Massive Record.

"But Kiss doesn't have enough message data to really require HBase! Isn't this just wasted effort?"

True, Kiss doesn't yet have the volume of data to require HBase, but we do generate a lot of event data, and, hopefully, so will other Labs products. It is the analysis of and metrics on this data that HBase would be ideal for.
Besides, it's an interesting exercise!

Goals

Speed

One of the biggest issues we've seen with Massive Record is speed, or the lack thereof. We use HBase to back the messaging system on Kiss.com, and for users with large inboxes, retrieval is painfully slow (1000+ messages take 20 seconds or more). Writes are slow too; ~500ms is the norm. Some of this slowness is probably due to improper HBase configuration, but we saw significant differences when reading the same data with different libraries, so we know faster HBase code in Ruby is possible.

Simplicity

We have come to the conclusion (somewhat obvious in hindsight) that a traditional ORM pattern is a poor fit for HBase, and the overhead not only slows things down, but unnecessarily complicates things. With ok-hbase I am trying to find a balance: keep it simple, but still provide some nice features above and beyond being just a thin wrapper around the Thrift libraries.

Flexibility

Building just another thin wrapper around the Thrift libraries wouldn't be very useful, so with ok-hbase, I hope to provide a bit more. While a traditional ORM is not a good fit for HBase, there are some ORM-like features that would be nice to have. These features will be implemented as mixins or concerns: You can use the basic table class, or subclass it and add the mixins you want. Check out the TODO section to see what's planned.
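
To make the mixin idea concrete, here is a minimal sketch. The names `BasicTable`, `Timestamps`, and `MessagesTable` are hypothetical stand-ins and not ok-hbase's actual API, and an in-memory Hash takes the place of HBase so the example runs on its own.

```ruby
# Hypothetical stand-in for a basic table class. Rows live in a Hash
# instead of HBase so the sketch is runnable without a cluster.
class BasicTable
  attr_reader :name

  def initialize(name)
    @name = name
    @rows = {}
  end

  # Merge the given cells into the row stored under row_key.
  def put(row_key, data)
    (@rows[row_key] ||= {}).merge!(data)
  end

  def row(row_key)
    @rows.fetch(row_key, {})
  end
end

# Hypothetical ORM-like feature packaged as a mixin: it stamps every
# write with an updated_at cell, then defers to the table's own put.
module Timestamps
  def put(row_key, data)
    super(row_key, data.merge('d:updated_at' => Time.now.to_i.to_s))
  end
end

# Opt in to the feature by subclassing and including the mixin.
class MessagesTable < BasicTable
  include Timestamps
end
```

A plain `BasicTable` stays a thin wrapper, while `MessagesTable.new('messages').put('row1', 'd:body' => 'hi')` would also write a `d:updated_at` cell. The `d` column family is an assumption made for the example.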

Insights

  • So far, ok-hbase is performing as fast as, if not faster than, the fastest Ruby test code we've written.
  • It appears that we can make use of basic filters at the region server, which will hopefully improve performance.
  • We can make use of compression in HBase.
  • We can make use of the in_memory=true setting for column families to improve performance in HBase.

Progress

2013-05-12

  • Basic Connection class done
    • Table creation/deletion
    • Table enabling/disabling
    • Table listing
  • Basic Table class started
    • scanning implemented, including support for:
      • start/stop scans
      • prefix scans
      • timestamps (versions)
      • limits
      • caching (batch size)
    • column family listing completed
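
A prefix scan can be expressed as an ordinary start/stop scan: start at the prefix and stop at the smallest key that sorts after everything beginning with it. The sketch below shows one way to compute that stop row. The helper name `prefix_stop_row` is made up for illustration, and this is not necessarily how ok-hbase does it.

```ruby
# Returns the smallest row key that sorts after every key starting with
# prefix, or nil when no such key exists (prefix is empty or all 0xFF
# bytes), in which case the scan simply has no stop row.
def prefix_stop_row(prefix)
  bytes = prefix.b.bytes
  # Trailing 0xFF bytes cannot be incremented, so drop them first.
  bytes.pop while bytes.last == 0xFF
  return nil if bytes.empty?

  # Bump the last remaining byte to get the exclusive upper bound.
  bytes[-1] += 1
  bytes.pack('C*')
end
```

For example, `prefix_stop_row('user:')` returns `'user;'`, so scanning from `'user:'` up to (but not including) `'user;'` covers exactly the keys with that prefix.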

2013-05-14

  • Table class
    • Basic functionality
      • Cell retrieval
      • Row retrieval by id
      • Row retrieval by id list
      • Data writing
      • Data deletion
      • Atomic counter operations
  • Batch class
    • Batch writes
    • Batch deletes
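
Batching can be pictured as buffering mutations locally and sending them in one call. Below is a minimal sketch of that idea with a block-style helper in the spirit of HappyBase's context managers. The class layout, method names, and the callable `sender` are assumptions for illustration, not the actual ok-hbase Batch API.

```ruby
# Hypothetical batch: buffers mutations and hands them to a sender in
# one call. The sender is any callable; a real client would issue a
# Thrift request there.
class Batch
  def initialize(sender)
    @sender = sender
    @mutations = []
  end

  def put(row_key, data)
    @mutations << [:put, row_key, data]
  end

  def delete(row_key, columns = nil)
    @mutations << [:delete, row_key, columns]
  end

  # Sends everything buffered so far, then clears the buffer.
  def send_batch
    return if @mutations.empty?

    @sender.call(@mutations.dup)
    @mutations.clear
  end

  # Block form: send on normal completion; if the block (or the send)
  # raises, drop the buffered mutations and re-raise.
  def transaction
    yield self
    send_batch
  rescue StandardError
    @mutations.clear
    raise
  end
end
```

Where HappyBase code would use a `with` block around a batch, the equivalent here would be `batch.transaction { |b| b.put('r1', 'd:a' => '1') }`. Discarding the buffer on error is a design choice of this sketch.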

2013-05-15

  • Batch class
    • Added transaction method to emulate HappyBase's use of context managers
  • The Good Stuff™
    • Table class
      • Support for optional default column families, so cells can be referenced without an explicit column family
    • Row class
      • Implicit cell access through meta-programming.
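
To illustrate how default column families and implicit cell access could fit together, here is a rough sketch built on `method_missing`. The `Row` class below, its constructor, and the `default_family:` option are hypothetical; the real ok-hbase classes may look quite different.

```ruby
# Hypothetical row wrapper: cells are stored under "family:qualifier"
# keys, and unknown method names are resolved against the default
# column family.
class Row
  def initialize(cells, default_family: nil)
    @cells = cells
    @default_family = default_family
  end

  # Explicit access; a bare qualifier falls back to the default family.
  def [](column)
    @cells[qualify(column.to_s)]
  end

  # Implicit access: row.subject looks up "<default family>:subject".
  def method_missing(name, *args)
    key = qualify(name.to_s)
    return @cells[key] if args.empty? && @cells.key?(key)

    super
  end

  def respond_to_missing?(name, include_private = false)
    @cells.key?(qualify(name.to_s)) || super
  end

  private

  # Prepend the default family unless the column is already qualified.
  def qualify(column)
    return column if column.include?(':') || @default_family.nil?

    "#{@default_family}:#{column}"
  end
end
```

With `row = Row.new({ 'd:subject' => 'hello' }, default_family: 'd')`, both `row.subject` and `row['subject']` return `'hello'`, while a name with no matching cell still raises `NoMethodError`.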

TODO

  • Basic Functionality
  • The Good Stuff™
    • Table class
      • Support for optional default column families, so cells can be referenced without an explicit column family
      • filter_string generators for table scanning (analogous to a SQL where clause that is processed at the region server)
      • Support optional indexers: save multiple copies of a row with different row keys for different access patterns
    • Row class
      • Implicit cell access through meta-programming.
      • Support optional per-cell serializers and deserializers (for numbers, hashes, etc)
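
As an illustration of the planned filter_string generators, here is a small sketch that builds strings in HBase's Thrift filter language, which the region server evaluates during a scan. The `FilterString` module and its method names are hypothetical; only `PrefixFilter`, `SingleColumnValueFilter`, the `binary:` comparator, and `AND` come from HBase itself.

```ruby
# Hypothetical helpers that build strings in HBase's Thrift filter language.
module FilterString
  OPERATORS = %w[< <= = != >= >].freeze

  # Single quotes inside an argument are escaped by doubling them.
  def self.quote(value)
    "'#{value.to_s.gsub("'", "''")}'"
  end

  # Matches rows whose key starts with row_prefix.
  def self.prefix(row_prefix)
    "PrefixFilter(#{quote(row_prefix)})"
  end

  # Matches rows where family:qualifier compares to value byte-wise.
  def self.column_value(family, qualifier, operator, value)
    raise ArgumentError, "unknown operator: #{operator}" unless OPERATORS.include?(operator)

    comparator = quote("binary:#{value}")
    "SingleColumnValueFilter(#{quote(family)}, #{quote(qualifier)}, #{operator}, #{comparator})"
  end

  # Combines filters so that a row must pass all of them.
  def self.all(*filters)
    filters.map { |f| "(#{f})" }.join(' AND ')
  end
end
```

Something like `FilterString.all(FilterString.prefix('user:'), FilterString.column_value('d', 'status', '=', 'unread'))` would then play the role of a SQL where clause, with the resulting string handed to the scanner. The `d` family and `status` qualifier are made up for the example.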