Secondary Indexes
At the end of this guide, you should be familiar with:
- Adding secondary indexes to your values
- Performing equality and range queries across indexes
- Using the special $bucket and $key indexes
- Using index queries as input to MapReduce
This and the other guides in this wiki assume you have Riak installed locally. If you don't have Riak already, please read and follow how to install Riak and then come back to this guide. If you haven't yet installed the client library, please do so before starting this guide.
This guide also assumes you know how to connect the client to Riak and store and retrieve values. All examples assume a local variable named client, which is an instance of Riak::Client and points at some Riak node you want to work with.
Secondary indexes are one of the most popular recent features of Riak since the 1.0 release; however, they work very differently than both Search and a traditional index in a relational database. Here's how:
- Secondary indexes are discrete; that is, you can only query on the entire secondary key. Search, on the other hand, lets you query inside each field.
- Secondary indexes are defined per object. There is no schema or automatic indexing for 2I; you just add the indexes you want to the object before you store it.
- Secondary indexes are stored in the same location as your regular value. This means that while the index remains more consistent in the long run, Riak has to query a large portion of the cluster to satisfy any query. (This is also called a "coverage query" and is implemented similarly to list-keys and list-buckets, although it is much more efficient.)
- Secondary indexes are currently only supported on the LevelDB storage engine (and the memory engine in the upcoming 1.2 release), so if you want to query them, make sure you have the below snippet in your app.config file:
  {storage_backend, riak_kv_eleveldb_backend}
Now that you've got that set, restart your Riak node if necessary and let's start playing with 2I!
In order to find things with 2I, we have to add some indexes first.
Let's say I'm storing user profile information in Riak, and I want to
look profiles up by email address or handle. Naturally, you'd want
users to be able to change their email address and handle too, so we
can't use either as the key. Instead, let's use an arbitrary identifier
(chosen by Riak in this case), and add indexes on those fields so we
can look them up later. First, I'll initialize a new RObject to
store my profile data:
sean = client['users'].new
# => #<Riak::RObject {users} [application/json]:nil>
sean.data = {:name => "Sean Cribbs",
             :email => "sean@basho.com",
             :handle => "seancribbs"}
Now I'll add an index entry for the email address, and for the handle
by working with the indexes accessor.
sean.indexes['email_bin'] << 'sean@basho.com'
sean.indexes['handle_bin'] << 'seancribbs'
You should notice two things in the above snippet:
- The key in the indexes Hash ends with _bin. This means that the index we're storing is a String, or "binary".
- I didn't set the value, but instead appended it to the entry in the
  Hash. This is because indexes can have more than one value, which
  is useful if you want to, say, "tag" something like a blog post
  with multiple "tags". The indexes accessor is always initialized as a Hash whose default value is a Set for this reason.
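To see why appending works, here is a plain-Ruby sketch of a Hash whose default value is a Set. The empty_indexes helper and the tag_bin index name are made up for illustration; this is a stand-in for the accessor's behavior, not the client's actual code.

```ruby
require 'set'

# Hypothetical stand-in for the indexes accessor: a Hash that creates an
# empty Set the first time an index name is looked up.
def empty_indexes
  Hash.new { |hash, name| hash[name] = Set.new }
end

post_indexes = empty_indexes
post_indexes['tag_bin'] << 'ruby'
post_indexes['tag_bin'] << 'riak'
post_indexes['tag_bin'] << 'ruby' # a Set silently ignores the duplicate

post_indexes['tag_bin'].to_a
# => ["ruby", "riak"]
```

Because the default is a Set, adding the same value twice leaves a single entry for it.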
Let's look at the value of indexes and then store the object.
sean.indexes
# => {"email_bin"=>#<Set: {"sean@basho.com"}>, "handle_bin"=>#<Set: {"seancribbs"}>}
sean.store
# => #<Riak::RObject {users,RgOVpKn6yirTTiOjlogMpkTlV1U} [application/json]:{"name"=>"Sean Cribbs", "email"=>"sean@basho.com", "handle"=>"seancribbs"}>
You'll see that Riak picked a long, quasi-random key for me. Now let's see if we can find my profile.
The simplest secondary-index query is equality, which we'll use to
look up my user profile by email and handle. Both queries will use the
get_index method on the Bucket. The first argument is the index to
query, the second is the value of that index to look up.
client['users'].get_index('handle_bin', 'seancribbs')
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U"]
client['users']["RgOVpKn6yirTTiOjlogMpkTlV1U"]
# => #<Riak::RObject {users,RgOVpKn6yirTTiOjlogMpkTlV1U} [application/json]:{"name"=>"Sean Cribbs", "email"=>"sean@basho.com", "handle"=>"seancribbs"}>
Now let's try the email:
client['users'].get_index('email_bin', 'sean@basho.com')
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U"]
Alright, we got the same answer! Our secondary index worked.
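Since get_index hands back keys rather than objects, you will often follow it with a fetch per key, as we did above. Here is a minimal sketch of that pattern. The fetch_by_index helper is hypothetical (it is not part of the client) and relies only on the get_index and [] calls already shown in this guide.

```ruby
# Hypothetical helper: run an index query, then fetch each matching object.
# `bucket` is assumed to respond to get_index and [] the way client['users'] does.
def fetch_by_index(bucket, index, query)
  bucket.get_index(index, query).map { |key| bucket[key] }
end

# With a running node, usage would look something like:
#   fetch_by_index(client['users'], 'handle_bin', 'seancribbs').map(&:data)
```

Note that this costs one extra request per matching key, so it suits small result sets best.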
We mentioned earlier that indexes can have multiple values. Let's add another email address to my profile and query for it.
sean.indexes['email_bin'] << 'sean.cribbs@private-mail.com'
# => #<Set: {"sean@basho.com", "sean.cribbs@private-mail.com"}>
sean.store
# Now let's query it.
client['users'].get_index('email_bin', 'sean.cribbs@private-mail.com')
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U"]
The indexes we have added so far on this user profile wouldn't be very
meaningful to query in a range, so let's add another index, and also
put some more keys in our 'users' bucket with indexes on them.
Let's assume we want to track when the user signed up, so we'll add an integer index which is a UNIX timestamp.
# We'll cheat for the first one and use the last_modified metadata.
sean.indexes["joined_int"] << sean.last_modified.utc.to_i
# => #<Set: {1335541214}>
sean.store
# Now let's make another user profile and store it
brian = client['users'].new.tap do |b|
  b.data = {:name => "Brian Roach",
            :email => "roach@basho.com",
            :handle => "roach"}
  b.indexes['email_bin'] << 'roach@basho.com'
  b.indexes['handle_bin'] << 'roach'
  b.indexes['joined_int'] << Time.now.utc.to_i
  b.store
end
# => #<Riak::RObject {users,ITbrdX4MdIfONI9YL7bCpv4nmYV} [application/json]:{"name"=>"Brian Roach", "email"=>"roach@basho.com", "handle"=>"roach"}>
brian.indexes['joined_int']
# => #<Set: {1335548562}>
Now we can query on that index. Let's find the users that joined
today. We do that with the same get_index method on the bucket, but
pass a Range object as the query argument.
# Let's first figure out the boundaries of the day. If you're using
# Rails, use Time#end_of_day and Time#beginning_of_day.
now = Time.now.utc
start_of_today = Time.utc(now.year, now.month, now.day, 0, 0, 0).to_i
end_of_today = Time.utc(now.year, now.month, now.day, 23, 59, 59).to_i
# Now we can query the range.
client['users'].get_index('joined_int', start_of_today..end_of_today)
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U", "ITbrdX4MdIfONI9YL7bCpv4nmYV"]
Good, we got both of our users' keys back. Now let's pick a moment between the two indexes so we can see the range query returning only a portion of our keyspace.
# Find the midpoint between when they joined:
midpoint = (brian.indexes['joined_int'].first - sean.indexes['joined_int'].first) / 2 + sean.indexes['joined_int'].first
client['users'].get_index('joined_int', start_of_today..midpoint)
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U"]
sean.key
# => "RgOVpKn6yirTTiOjlogMpkTlV1U"
client['users'].get_index('joined_int', midpoint..end_of_today)
# => ["ITbrdX4MdIfONI9YL7bCpv4nmYV"]
brian.key
# => "ITbrdX4MdIfONI9YL7bCpv4nmYV"
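If you query by day a lot, the boundary arithmetic used earlier can be wrapped in a small helper. This is only a sketch: utc_day_range is a hypothetical name, not something the client provides, and it assumes your joined_int values are UTC UNIX timestamps like the ones in this guide.

```ruby
# Hypothetical helper (not part of the Riak client): returns the inclusive
# Range of UNIX timestamps covering the UTC day that contains `time`.
def utc_day_range(time = Time.now)
  utc = time.getutc # getutc returns a copy, so the caller's Time is untouched
  first_second = Time.utc(utc.year, utc.month, utc.day, 0, 0, 0).to_i
  last_second  = Time.utc(utc.year, utc.month, utc.day, 23, 59, 59).to_i
  first_second..last_second
end

# Sean's joined_int value from above falls on 2012-04-27 UTC.
utc_day_range(Time.at(1335541214))
# => 1335484800..1335571199

# With a running node, the query would then read:
#   client['users'].get_index('joined_int', utc_day_range)
```

Both sample timestamps shown in this guide (1335541214 and 1335548562) fall inside that range, which is why the full-day query returned both keys.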
Riak also has two built-in indexes that you don't have to define:
$bucket and $key, which, unsurprisingly, are indexes over
the bucket and key, respectively. While using them is still not
recommended, in some cases they will be more efficient than the list-keys
functionality. Each index supports only one query type:
the bucket index only supports equality, and the key index only
supports range.
# Bucket equality query
client['users'].get_index('$bucket', 'users')
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U", "ITbrdX4MdIfONI9YL7bCpv4nmYV"]
# Key range query
client['users'].get_index('$key', 'H'..'J')
# => ["ITbrdX4MdIfONI9YL7bCpv4nmYV"]
One point this example drives home about ranges on binary/String indexes is that they are strictly by byte-order, so when using them, be aware of the raw byte-ordering of your Ruby Strings.
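You can get a feel for byte-ordering without touching Riak, because Ruby's own String comparison is also byte-wise. The sketch below only mirrors the ordering locally (the real comparison happens inside Riak), and the 'ibrdX' key is invented for the example.

```ruby
# Ruby compares Strings byte by byte, so uppercase ASCII sorts before lowercase.
%w[seancribbs Roach roach].sort
# => ["Roach", "roach", "seancribbs"]

# The same ordering decides what falls inside a String range.
key_range = 'H'..'J'
key_range.cover?('ITbrdX4MdIfONI9YL7bCpv4nmYV') # => true
key_range.cover?('RgOVpKn6yirTTiOjlogMpkTlV1U') # => false
key_range.cover?('ibrdX')                       # => false, 'i' (0x69) sorts after 'J' (0x4A)
```

So a range written with uppercase bounds will miss keys that start with the lowercase versions of the same letters.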
As with Full-text Search, you can feed the results of a secondary index query into a MapReduce job. We'll just do a simple one; the MapReduce guide will have more detailed examples.
On a Riak::MapReduce object, call the index method to add a
secondary index query as the input. The first argument is the bucket,
followed by the index and the query (both equality and range are
supported).
Riak::MapReduce.new(client).
index('users', 'email_bin', 'sean@basho.com').
map('Riak.mapValuesJson', :keep => true).run
# => [{"name"=>"Sean Cribbs", "email"=>"sean@basho.com", "handle"=>"seancribbs"}]
Riak::MapReduce.new(client).
index('users', 'joined_int', start_of_today..end_of_today).
map('Riak.mapValuesJson', :keep => true).run
# => [{"name"=>"Sean Cribbs", "email"=>"sean@basho.com", "handle"=>"seancribbs"},
# {"name"=>"Brian Roach", "email"=>"roach@basho.com", "handle"=>"roach"}]
So, combining secondary indexes with MapReduce, we can fetch the values in a single round-trip, or if we choose, do more complicated processing.
Congratulations, you finished the Secondary Indexes guide! You might next want to compare them to Full-text Search or go into more detail of processing query outputs with MapReduce.