Better Querying in Ripple
The purpose of this document is to discuss and plan enhancements to the Ripple ODM to facilitate more flexible and powerful querying.
Several querying mechanisms exist today, each with trade-offs:
- Key lookup: works for lots of cases, including link-walking from related documents. Simple and predictable, but not flexible (see the sketch after this list).
- MapReduce: the building blocks for more sophisticated queries are there; loading RObjects from MapReduce is easy (as long as the object is in its original JSON form or the link-walking multipart/mixed form). Its phases can load, parse, and filter single objects; traverse to related objects; and aggregate, sort, and limit.
- Link-walking: easy to understand and already well-supported, but only useful for traversing relationships.
- Key-filters: pre-filter objects based on key name.
- Search: full-text, Solr-style search with automatic indexing capability. Powerful in many ways, limiting in others.
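For example, with the Person model defined later in this document, the two mechanisms that work well today look like this (the email key shown is illustrative):

# Key lookup: the key is the person's email address (via key_on :email).
person = Person.find("sean@basho.com")

# Link-walking through an existing linked association.
person.friends.each do |friend|
  puts friend.email
end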
The query API should feel familiar to users of ActiveRecord and Mongoid/MongoMapper: lower the barrier to entry by making scopes and abstract query-building the focus.
Add metadata to Ripple::Document that allows the engine to create near-optimal queries based on the features available and in play, especially:
- key-format knowledge to support key-filters
- search availability and schema, and when it’s appropriate to use it
- format of datetimes
- types of associations
- Advanced devs able to mix-and-match high-level scopes with their own map and reduce phases
- Able to produce equivalent MapReduce code for inspection or other meta-programming (see the sketch after this list)
- Instrumented to find bottlenecks and suggest areas for optimization.
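A rough sketch of the last two goals in action; the chained #map call on a scope and the #to_mapreduce inspection method are hypothetical names, not existing API:

# Mix a high-level scope with a hand-written map phase (hypothetical).
query = Person.filter(:last_name => "Smith").
               map("MyApp.scoreCandidate", :keep => true)

# Emit the equivalent MapReduce job for inspection (hypothetical method).
puts query.to_mapreduce.to_json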
We need to strike a balance between familiarity/ease-of-use and promoting proper, idiomatic usage of Riak's features. Whitewashing everything with an abstract API makes it difficult to be performant and instructive. In general, developers using Riak should structure their data in the form in which they want to retrieve it, so Ripple's API may need to morph to better suit that paradigm, rather than appearing to be a generic "object store". Because Riak is optimized for writes, we might need to increase data duplication to improve queryability.
Accomplishing something of this size is daunting. Other desirable features in the client and ODM are put at risk.
Before the query construction/planning engine can be started, other supporting features will need to be added to the ODM (see above).
If the API is too comfortable, developers will assume it works just like ActiveRecord and get burned or turned off when it doesn’t. We see this already with Ripple on occasion.
Basho devs are working on proofs-of-concept for secondary indices and a textual query language that may obviate the query engine or significantly alter its shape.
Ripple already generates its MapReduce helper functions in JavaScript.
Make a clear plan that can be executed by multiple people in parallel.
Educate through good documentation and tutorial material. A screencast series would be good, even if this project takes a long time to bear fruit.
Design a flexible, abstract solution that can generate queries in multiple forms from the same criteria.
A ThinkingSphinx-style API or property annotations for marking documents as searchable, and what attributes are searchable. Able to generate a search schema simply from the document metadata. Able to be switched on via the commit hook, or submitted singly or in batches to the Solr interface. Support cross-model search by submitting to a global search index (the default index “search”, probably).
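A hypothetical annotation in that spirit; the searchable macro and its options are illustrative, not an existing API:

class Person
  include Ripple::Document
  property :first_name, String
  property :biography,  String

  # Hypothetical macro: marks fields for full-text indexing, derives the
  # search schema from document metadata, and enables the commit hook.
  searchable :first_name, :biography, :index => "search"
end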
Make the property-as-key feature more flexible and self-aware. Allow composite keys in the #key_on class method, which in turn makes those properties known to be filterable with key-filters. Automatically require key_on attributes to validate_presence.
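A sketch of the composite form; the multi-attribute key_on and the derived key format are proposals, not current behavior:

class Vote
  include Ripple::Document
  property :poll_id, String, :presence => true
  property :user_id, String, :presence => true

  # Hypothetical composite key: the document key would become
  # "<poll_id>:<user_id>", making both segments usable in key-filters.
  key_on :poll_id, :user_id
end

# Knowing the key format, the planner could emit a key-filter such as:
Vote.key_filter do
  tokenize ":", 1
  eq "some-poll-id"
end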
Allow transitive relationships (:through) on linked and embedded associations. Allow one-to-one by-key associations. Allow one-to-many by-bucket associations.
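Illustrating those association proposals; the :through and :using options shown here are hypothetical spellings:

class Person
  include Ripple::Document
  many :friends, :class_name => "Person"

  # Hypothetical transitive association: walk :friends, then their
  # :friends, in a single traversal.
  many :friends_of_friends, :through => :friends

  # Hypothetical one-to-one by-key association: the profile document
  # lives in another bucket under this document's key.
  one :profile, :using => :key
end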
Bring Justin Marney's patch for manual secondary indices up to date. Evaluate the feasibility and applicability of this approach vis-a-vis Search and the to-be-built-in secondary indices (2I).
Define call-sequence and scope-definition patterns that will be used to construct an abstract query form. Define how that abstract form will be turned into concrete MapReduce and Search queries.
First, let’s assume that the necessary enhancements to other areas listed in the “Steps forward” section above have been mostly completed. There are a number of general operations (patterns?) that can happen within a MapReduce query, and these should be the basis for the syntax used by Ripple queries.
Filtering: selecting or rejecting objects from a collection via predicates.
Riak features that support this pattern:
- Map phases (inspecting the objects themselves)
- Reduce phases (inspecting collections of objects)
- Key-filters (inspecting keys)
- Search (limiting by full-text index)
Map phases are the simplest to implement, and the JavaScript functions to perform filtering map phases are already present in the generated Ripple files.
Reduce phases can be used to filter in situations where the data are not Riak objects, but values computed from previous phases.
Key-filters require knowledge of the format of the key and are also limited in application by the predefined filters in the riak_kv_mapred_filter module.
Search is the most flexible in query construction, but has its own limitations (which may be alleviated in the future), including integer padding, limited date/time support, and limited support for multi-valued fields.
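For comparison, a hand-rolled filtering map phase against the raw client might look like the following sketch (riak-client's MapReduce interface; the inline JavaScript is illustrative):

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
# Keep only the people whose last_name is "Smith".
mr.map("function(v){ var doc = JSON.parse(v.values[0].data);" \
       " return doc.last_name == 'Smith' ? [doc] : []; }",
       :keep => true)
results = mr.run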
Transformation: converting data from one format or type to another, or decomposing complex datatypes.
Riak features that support this pattern:
- Map phases
Map phases are the only way to implement this pattern, assuming it is limited to single objects; Aggregation, by contrast, operates over multiple objects.
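A minimal transformation sketch in the same raw-client style (illustrative inline JavaScript):

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
# Transform each person into a single derived value: biography length.
mr.map("function(v){ return [JSON.parse(v.values[0].data).biography.length]; }",
       :keep => true)
lengths = mr.run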
Traversal: a special case of transformation that produces references to other objects. Analogous to "joins" in SQL or to following outgoing edges in a directed graph.
Riak features that support this pattern:
- Map phases
- Link phases
- Link-walking
Map phases may produce [bucket, key] pairs or [bucket, key, keydata] triples that can be loaded by subsequent map phases.
Link-walking and link phases are special cases of Map phases that simply traverse link specifications.
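A traversal sketch using a link phase on the raw client; the bucket and key are illustrative:

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people", "sean@basho.com")
# Follow outgoing links tagged "friends"...
mr.link(:tag => "friends")
# ...then load the related objects as parsed JSON.
mr.map("Riak.mapValuesJson", :keep => true)
friends = mr.run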
Aggregation: coalescing a value by folding across a collection of values. Typical aggregate operations are sum, minimum, maximum, count, and statistics.
Riak features that support this pattern:
- Reduce phases
Reduce phases are the canonical way to produce aggregates, and some typical operations have JavaScript built-ins: sum, min, max. These built-ins operate only on a single dimension, however, and so are not effective at producing partitioned or qualified aggregates.
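An aggregation sketch; the numeric age field is hypothetical (Person stores a birthday), and Riak.reduceMax is one of the built-ins mentioned above:

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
# Extract a single numeric dimension (hypothetical "age" field)...
mr.map("function(v){ return [JSON.parse(v.values[0].data).age]; }")
# ...then fold across the collection with a built-in reduce.
mr.reduce("Riak.reduceMax", :keep => true)
oldest = mr.run.first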
Sorting: ordering a collection of values.
Riak features that support this pattern:
- Reduce phases
- Search
Reduce phases operate over collections of values and so allow sorting across the collection. On the other hand, they may apply the sort multiple times depending on where the phase executes.
Search will sort results by relevance or user-specified criteria and may potentially be used with MapReduce to significantly reduce the quantity of inputs.
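A sorting sketch, assuming Riak.reduceSort accepts a JavaScript comparator source string as the phase argument:

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
mr.map("Riak.mapValuesJson")
# Sort the collected values by last_name (comparator passed as :arg).
mr.reduce("Riak.reduceSort",
          :arg => "function(a,b){ return a.last_name < b.last_name ? -1 : 1; }",
          :keep => true)
sorted = mr.run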
Limiting: restricting the quantity of results to a user-specified limit.
Riak features that support this pattern:
- Reduce phases
- Search
Similar to sorting, Reduce phases operate over collections of values and so allow restriction of the quantity. However, results are not reliable unless preceded by another reduce phase, such that the limiting function is applied only once.
Search will limit results from a query to a default or user-specified quantity. Those results could then be fed to a MapReduce query for more processing.
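A limiting sketch, assuming the built-in Riak.reduceSlice, which slices the collected values by a [start, end] argument:

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
mr.map("Riak.mapValuesJson")
# Keep only the first 25 values. Per the caveat above, a preceding
# reduce helps ensure the slice is applied to the full collection once.
mr.reduce("Riak.reduceSlice", :arg => [0, 25], :keep => true)
page = mr.run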
Below are some proposals for aspects of the query syntax/API. Commentary is inline with the code.
class Person
  include Ripple::Document
  property :first_name, String
  property :last_name, String
  property :email, String, :presence => true
  property :birthday, Date
  property :biography, String
  key_on :email
  many :addresses
  many :friends, :class_name => "Person"
  timestamps!
end

class Address
  include Ripple::EmbeddedDocument
  property :street, String
  property :city, String
  property :province, String
  property :postcode, String
  property :country, String
end
##### PRIMITIVE OPS #####
# These will likely be building-blocks for the operations below.
Person.key_filter do
  # Just like the syntax available to Riak::MapReduce
  ends_with "gmail.com"
end
Person.map("Ripple.mapIdentity")
Person.link(:tag => "friends")
Person.reduce("Riak.reduceMax")
Person.search("first_name:John AND last_name:Smith")
##### FILTERING #####
# Simple filtering (equality)
Person.filter(:last_name => "Smith")
Person.filter(:birthday => Date.civil(2001, 3, 16))
# Implicit invocation of more complicated filters (matches, between, includes)
Person.filter(:email => /gmail\.com/)
Person.filter(:last_name => "A".."B")
Person.filter(:first_name => ["Adam", "Duff", "Sean"])
# Explicit invocation of filters
Person.filter(:birthday => { "<=" => Date.civil(1997, 1, 1) })
Person.filter(:email => { :ends_with => "basho.com" })
# Filtering on embedded/nested data
Person.filter("addresses/state" => "NY")
##### TRANSFORMATION #####
# Implicit transformations
Person.user_scope.extract(:birthday)
Person.user_scope.extract(:addresses)
Person.user_scope.extract(:first_name, :last_name)
Person.user_scope.count
# Explicit transformation functions
Person.user_scope.transform("MyApp.munge")
Person.user_scope.transform("function(v){ return [JSON.parse(v.values[0]).biography.length]; }")
##### TRAVERSAL #####
# Existing linked associations (not special)
person.friends
# Scope-changing link-traversal
Person.filter(:birthday => Date.civil(2001, 3, 16)).traverse(:friends)
##### AGGREGATION #####
Person.transform("MyApp.extractAge").average
Person.transform("MyApp.extractAge").filter(18..25).count
Person.extract(:created_at).statistics
##### SORTING #####
Person.user_scope.order(:last_name, :first_name)
Person.filter(:last_name => "Smith").order(:last_name => :desc)
##### LIMITING #####
Person.order(:last_name => :desc).limit(25)