Better Querying in Ripple
The purpose of this document is to discuss and plan enhancements to the Ripple ODM to facilitate more flexible and powerful querying.
Several querying mechanisms exist today, each with trade-offs:
- Key lookup: works for lots of cases, including link-walking from related documents. Simple and predictable, but not flexible (see the sketch after this list).
- MapReduce: the building blocks for more sophisticated queries are there; loading RObjects from MapReduce is easy (as long as the object is in its original JSON form or the link-walking multipart/mixed form). Its phases can load, parse, and filter single objects; traverse to related objects; and aggregate, sort, and limit.
- Link-walking: easy to understand and already well-supported, but only useful for traversing relationships.
- Key-filters: pre-filter objects based on key name.
- Search: full-text, Solr-style search with automatic indexing capability. Powerful in many ways, limiting in others.
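For example, with the Person model defined later in this document, the two mechanisms that work well today look like this (the email key shown is illustrative):

# Key lookup: the key is the person's email address (via key_on :email).
person = Person.find("sean@basho.com")

# Link-walking through an existing linked association.
person.friends.each do |friend|
  puts friend.email
end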
The query API should feel familiar to users of ActiveRecord and Mongoid/MongoMapper: lower the barrier to entry by making scopes and abstract query-building the focus.
Add metadata to Ripple::Document that allows the engine to create near-optimal queries based on the features available and in play, especially:
- key-format knowledge to support key-filters
- search availability and schema, and when it’s appropriate to use it
- format of datetimes
- types of associations
- Advanced devs able to mix-and-match high-level scopes with their own map and reduce phases
- Able to produce equivalent MapReduce code for inspection or other meta-programming (see the sketch after this list)
- Instrumented to find bottlenecks and suggest areas for optimization.
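A rough sketch of the last two goals in action; the chained #map call on a scope and the #to_mapreduce inspection method are hypothetical names, not existing API:

# Mix a high-level scope with a hand-written map phase (hypothetical).
query = Person.filter(:last_name => "Smith").
               map("MyApp.scoreCandidate", :keep => true)

# Emit the equivalent MapReduce job for inspection (hypothetical method).
puts query.to_mapreduce.to_json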
We need to strike a balance between familiarity/ease-of-use and promoting proper, idiomatic usage of Riak's features. Whitewashing everything with an abstract API makes it difficult to be performant and instructive. In general, developers using Riak should structure their data in the form in which they want to retrieve it, so Ripple's API may need to morph to better suit that paradigm, rather than appearing to be a generic "object store". Because Riak is optimized for writes, we might need to increase data duplication to improve queryability.
Accomplishing something of this size is daunting. Other desirable features in the client and ODM are put at risk.
Before the query construction/planning engine can be started, other supporting features will need to be added to the ODM (see above).
If the API is too comfortable, developers will assume it works just like ActiveRecord and get burned or turned off when it doesn’t. We see this already with Ripple on occasion.
Basho devs are working on proofs-of-concept for secondary indices and a textual query language that may obviate the query engine or significantly alter its shape.
Ripple already generates its MapReduce helper functions in JavaScript.
Make a clear plan that can be executed by multiple people in parallel.
Educate through good documentation and tutorial material. A screencast series would be good, even if this project takes a long time to bear fruit.
Design a flexible, abstract solution that can generate queries in multiple forms from the same criteria.
A ThinkingSphinx-style API or property annotations for marking documents as searchable, and what attributes are searchable. Able to generate a search schema simply from the document metadata. Able to be switched on via the commit hook, or submitted singly or in batches to the Solr interface. Support cross-model search by submitting to a global search index (the default index “search”, probably).
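A hypothetical annotation in that spirit; the searchable macro and its options are illustrative, not an existing API:

class Person
  include Ripple::Document
  property :first_name, String
  property :biography,  String

  # Hypothetical macro: marks fields for full-text indexing, derives the
  # search schema from document metadata, and enables the commit hook.
  searchable :first_name, :biography, :index => "search"
end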
Make the property-as-key feature more flexible and self-aware. Allow composite keys in the #key_on class method, which in turn makes those properties known to be filterable with key-filters. Automatically require key_on attributes to validate_presence.
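A sketch of the composite form; the multi-attribute key_on and the derived key format are proposals, not current behavior:

class Vote
  include Ripple::Document
  property :poll_id, String, :presence => true
  property :user_id, String, :presence => true

  # Hypothetical composite key: the document key would become
  # "<poll_id>:<user_id>", making both segments usable in key-filters.
  key_on :poll_id, :user_id
end

# Knowing the key format, the planner could emit a key-filter such as:
Vote.key_filter do
  tokenize ":", 1
  eq "some-poll-id"
end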
Allow transitive relationships (:through) on linked and embedded associations. Allow one-to-one by-key associations. Allow one-to-many by-bucket associations.
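Illustrating those association proposals; the :through and :using options shown here are hypothetical spellings:

class Person
  include Ripple::Document
  many :friends, :class_name => "Person"

  # Hypothetical transitive association: walk :friends, then their
  # :friends, in a single traversal.
  many :friends_of_friends, :through => :friends

  # Hypothetical one-to-one by-key association: the profile document
  # lives in another bucket under this document's key.
  one :profile, :using => :key
end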
Bring Justin Marney's patch for manual secondary indices up to date. Evaluate the feasibility and applicability of this approach vis-a-vis Search and the to-be-built-in secondary indices (2I).
Define call-sequence and scope-definition patterns that will be used to construct an abstract query form. Define how that abstract form will be turned into concrete MapReduce and Search queries.
First, let’s assume that the necessary enhancements to other areas listed in the “Steps forward” section above have been mostly completed. There are a number of general operations (patterns?) that can happen within a MapReduce query, and these should be the basis for the syntax used by Ripple queries.
Filtering: selecting or rejecting objects from a collection via predicates.
Riak features that support this pattern:
- Map phases (inspecting the objects themselves)
- Reduce phases (inspecting collections of objects)
- Key-filters (inspecting keys)
- Search (limiting by full-text index)
Map phases are the simplest to implement, and the JavaScript functions to perform filtering map phases are already present in the generated Ripple files.
Reduce phases can be used to filter in situations where the data are not Riak objects, but values computed from previous phases.
Key-filters require knowledge of the format of the key and are also limited in application by the predefined filters in the riak_kv_mapred_filter module.
Search is the most flexible in query construction, but has its own limitations (which may be alleviated in the future), including integer padding, limited date/time support, and limited support for multi-valued fields.
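For comparison, a hand-rolled filtering map phase against the raw client might look like the following sketch (riak-client's MapReduce interface; the inline JavaScript is illustrative):

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
# Keep only the people whose last_name is "Smith".
mr.map("function(v){ var doc = JSON.parse(v.values[0].data);" \
       " return doc.last_name == 'Smith' ? [doc] : []; }",
       :keep => true)
results = mr.run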
Transformation: converting data from one format or type to another, or decomposing complex datatypes.
Riak features that support this pattern:
- Map phases
Map phases are the only way to implement this pattern, assuming it is limited to single objects; Aggregation, by contrast, operates over multiple objects.
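A minimal transformation sketch in the same raw-client style (illustrative inline JavaScript):

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
# Transform each person into a single derived value: biography length.
mr.map("function(v){ return [JSON.parse(v.values[0].data).biography.length]; }",
       :keep => true)
lengths = mr.run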
Traversal: a special case of transformation that produces references to other objects. Analogous to "joins" in SQL or to following outgoing edges in a directed graph.
Riak features that support this pattern:
- Map phases
- Link phases
- Link-walking
Map phases may produce [bucket, key] pairs or [bucket, key, keydata] triples that can be loaded by subsequent map phases.
Link-walking and link phases are special cases of Map phases that simply traverse link specifications.
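A traversal sketch using a link phase on the raw client; the bucket and key are illustrative:

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people", "sean@basho.com")
# Follow outgoing links tagged "friends"...
mr.link(:tag => "friends")
# ...then load the related objects as parsed JSON.
mr.map("Riak.mapValuesJson", :keep => true)
friends = mr.run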
Aggregation: coalescing a value by folding across a collection of values. Typical aggregate operations are sum, minimum, maximum, count, and statistics.
Riak features that support this pattern:
- Reduce phases
Reduce phases are the canonical way to produce aggregates, and some typical operations have JavaScript built-ins: sum, min, max. These built-ins operate only on a single dimension, however, and so are not effective at producing partitioned or qualified aggregates.
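An aggregation sketch; the numeric age field is hypothetical (Person stores a birthday), and Riak.reduceMax is one of the built-ins mentioned above:

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
# Extract a single numeric dimension (hypothetical "age" field)...
mr.map("function(v){ return [JSON.parse(v.values[0].data).age]; }")
# ...then fold across the collection with a built-in reduce.
mr.reduce("Riak.reduceMax", :keep => true)
oldest = mr.run.first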
Sorting: ordering a collection of values.
Riak features that support this pattern:
- Reduce phases
- Search
Reduce phases operate over collections of values and so allow sorting across the collection. On the other hand, they may apply the sort multiple times depending on where the phase executes.
Search will sort results by relevance or user-specified criteria and may potentially be used with MapReduce to significantly reduce the quantity of inputs.
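A sorting sketch, assuming Riak.reduceSort accepts a JavaScript comparator source string as the phase argument:

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
mr.map("Riak.mapValuesJson")
# Sort the collected values by last_name (comparator passed as :arg).
mr.reduce("Riak.reduceSort",
          :arg => "function(a,b){ return a.last_name < b.last_name ? -1 : 1; }",
          :keep => true)
sorted = mr.run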
Limiting: restricting the quantity of results to a user-specified limit.
Riak features that support this pattern:
- Reduce phases
- Search
Similar to sorting, Reduce phases operate over collections of values and so allow restriction of the quantity. However, results are not reliable unless preceded by another reduce phase, such that the limiting function is applied only once.
Search will limit results from a query to a default or user-specified quantity. Those results could then be fed to a MapReduce query for more processing.
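A limiting sketch, assuming the built-in Riak.reduceSlice, which slices the collected values by a [start, end] argument:

require 'riak'

client = Riak::Client.new
mr = Riak::MapReduce.new(client)
mr.add("people")
mr.map("Riak.mapValuesJson")
# Keep only the first 25 values. Per the caveat above, a preceding
# reduce helps ensure the slice is applied to the full collection once.
mr.reduce("Riak.reduceSlice", :arg => [0, 25], :keep => true)
page = mr.run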
Below are some proposals for aspects of the query syntax/API. Commentary is inline with the code.
class Person
  include Ripple::Document
  property :first_name, String
  property :last_name, String
  property :email, String, :presence => true
  property :birthday, Date
  property :biography, String
  key_on :email
  many :addresses
  many :friends, :class_name => "Person"
  timestamps!
end

class Address
  include Ripple::EmbeddedDocument
  property :street, String
  property :city, String
  property :province, String
  property :postcode, String
  property :country, String
end
##### PRIMITIVE OPS #####
# These will likely be building-blocks for the operations below.
Person.key_filter do
  # Just like the syntax available to Riak::MapReduce
  ends_with "gmail.com"
end
Person.map("Ripple.mapIdentity")
Person.link(:tag => "friends")
Person.reduce("Riak.reduceMax")
Person.search("first_name:John AND last_name:Smith")
##### FILTERING #####
# Simple filtering (equality)
Person.filter(:last_name => "Smith")
Person.filter(:birthday => Date.civil(2001, 3, 16))
# Implicit invocation of more complicated filters (matches, between, includes)
Person.filter(:email => /gmail\.com/)
Person.filter(:last_name => "A".."B")
Person.filter(:first_name => ["Adam", "Duff", "Sean"])
# Explicit invocation of filters
Person.filter(:birthday => { "<=" => Date.civil(1997, 1, 1) })
Person.filter(:email => { :ends_with => "basho.com" })
# Filtering on embedded/nested data
Person.filter("addresses/state" => "NY")
##### TRANSFORMATION #####
# Implicit transformations
Person.user_scope.extract(:birthday)
Person.user_scope.extract(:addresses)
Person.user_scope.extract(:first_name, :last_name)
Person.user_scope.count
# Explicit transformation functions
Person.user_scope.transform("MyApp.munge")
Person.user_scope.transform("function(v){ return [JSON.parse(v.values[0]).biography.length]; }")
##### TRAVERSAL #####
# Existing linked associations (not special)
person.friends
# Scope-changing link-traversal
Person.filter(:birthday => Date.civil(2001, 3, 16)).traverse(:friends)
##### AGGREGATION #####
Person.transform("MyApp.extractAge").average
Person.transform("MyApp.extractAge").filter(18..25).count
Person.extract(:created_at).statistics
##### SORTING #####
Person.user_scope.order(:last_name, :first_name)
Person.filter(:last_name => "Smith").order(:last_name => :desc)
##### LIMITING #####
Person.order(:last_name => :desc).limit(25)