
Add Bucket Types to Riak #362

Closed
jrwest opened this issue Aug 12, 2013 · 14 comments

@jrwest
Contributor

jrwest commented Aug 12, 2013

Today, keys in Riak are made up of two parts: the bucket they belong to and a unique identifier within that bucket. Buckets act as a namespace and allow for similar keys to be grouped. In addition, they provide a means of configuring how Riak treats that data.

In the next release of Riak several new features will take advantage of this namespacing. In discussing these features, it has become apparent that to properly support them a change to Riak's key structure must be made. The change is to introduce another layer of namespacing called the "Bucket Type". With Bucket Types each key is now made up of three parts: the bucket type it belongs to, a unique bucket within that type, and a unique identifier within that bucket.

Why?

The primary driver for adding Bucket Types is the introduction of features like security and strong consistency, which may wish to deal with groups of buckets (see the Security RFC's wildcard proposal). Without Bucket Types, these features must rely on special prefixes in Bucket names. Since Bucket names have never been restricted, this has undesirable edge cases. By introducing a new layer of namespacing, these features can refer to groups of Buckets by Bucket Type.

There is an additional user-facing benefit of Bucket Types. Currently, Riak users have the choice of either: a) conforming all buckets to the default bucket properties (stored in app.config), or b) setting many custom properties per bucket, which is known to scale poorly. This is especially frustrating if the user has some buckets that fit the default properties and another set that all use a different set of properties. Using Bucket Types, users will be able to define groups of buckets that share the same properties, storing that information once per Bucket Type instead of per bucket. Combined with internal changes made in Riak to help this issue, this will scale much better.

Goals and Requirements

  • Opt-In: All changes must not affect old APIs. For existing buckets (with or without custom properties) and the data stored within them, access and storage will remain the same.
  • Zero data migration on upgrade (or downgrade if the user has not stored any data in a Bucket Type other than the default type).
  • Properties can be set on Bucket Types. Buckets within that type inherit those properties; any properties set on the Bucket override those set on the Bucket Type.
  • Unlike Buckets, Bucket Types must be explicitly created. If a Bucket Type does not exist requests for data stored in that type will be rejected. This makes listing types quick and painless (a common gripe with Buckets).
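The property inheritance described above can be sketched as a simple precedence merge (a sketch only, with illustrative property names; this is not Riak's actual Erlang implementation):

```python
# Cluster-wide defaults (illustrative values, as would come from app.config).
DEFAULT_PROPS = {"n_val": 3, "allow_mult": False}

def effective_props(type_props, bucket_props):
    """Merge properties with increasing precedence:
    defaults < bucket type < bucket."""
    merged = dict(DEFAULT_PROPS)
    merged.update(type_props)    # type-level settings override defaults
    merged.update(bucket_props)  # bucket-level settings override the type
    return merged

# A type that enables siblings; one bucket within it overrides n_val.
props = effective_props({"allow_mult": True}, {"n_val": 5})
```

The key point is that a bucket only needs stored properties when it deviates from its type, which is why grouping buckets under a type scales better than per-bucket custom properties.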

The Default Bucket Type

Internally, Riak will be changed to be aware of Bucket Types as necessary. However, they are still an opt-in feature and as such existing data (and new data written to existing buckets) must be handled appropriately.

To allow existing data to live in the world of Bucket Types, all existing buckets are assigned to the default type. Riak's existing APIs will inject and strip the type information where necessary.

To opt in, a user must write data to a non-default bucket type using the new APIs. Unfortunately, this means additional work for client developers, but since Riak will continue to work happily with the old APIs, it is not necessary for clients to be updated before the release.
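The default-type mapping can be sketched as follows (an illustrative sketch, not Riak's internals, which do this in Erlang with `{<<"default">>, Bucket}` tuples):

```python
DEFAULT_TYPE = "default"

def to_typed(bucket):
    """Inject the type: a legacy bucket name becomes a (type, bucket) pair."""
    if isinstance(bucket, tuple):
        return bucket
    return (DEFAULT_TYPE, bucket)

def to_legacy(typed_bucket):
    """Strip the type for the old APIs; only default-type data is reachable."""
    bucket_type, bucket = typed_bucket
    if bucket_type != DEFAULT_TYPE:
        raise ValueError("non-default types are not visible via old APIs")
    return bucket
```

This is why the upgrade requires zero data migration: existing keys are simply reinterpreted as living under the default type.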

APIs

As mentioned previously, no changes will be made to existing APIs, and in the case of protocol buffers all added fields will be optional.

Create/Update/Read Bucket Types

HTTP endpoints and PB messages will be added to create, update and fetch the properties associated with a Bucket Type. In HTTP, this might look like:

POST /types/<type>/props
PUT  /types/<type>/props
GET  /types/<type>/props
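For illustration, the body of a create/update request might carry the properties as JSON (the `"props"` wrapper and field names here are assumptions for the sketch, not confirmed by the RFC):

```python
import json

# Hypothetical payload for PUT /types/<type>/props. The property names
# mirror existing bucket properties; the exact wire format is undecided.
body = json.dumps({"props": {"n_val": 3, "allow_mult": True}})

# A client would send this with Content-Type: application/json.
decoded = json.loads(body)
```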

Object/Key API

The HTTP Fetch/Store/Delete API will be mirrored to take the Bucket Type. This might look like:

GET /types/<type>/buckets/<bucket>/keys/<key>
PUT /types/<type>/buckets/<bucket>/keys/<key>
..etc..

The corresponding protocol buffers messages will have an optional Bucket Type field added.
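A client-side sketch of how the typed path nests the existing one (function name and base-URL handling are illustrative, not a real client API):

```python
def object_url(base, bucket, key, bucket_type=None):
    """Build an object URL; omitting bucket_type falls back to the
    existing (default-type) API, preserving the opt-in guarantee."""
    if bucket_type is None:
        return f"{base}/buckets/{bucket}/keys/{key}"
    return f"{base}/types/{bucket_type}/buckets/{bucket}/keys/{key}"
```

Because the old form remains valid, existing client code needs no changes until it opts in to a non-default type.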

2i

The HTTP 2i API will be mirrored to take the Bucket Type. This might look like:

GET /types/<type>/buckets/<bucket>/index/<index>/...

The corresponding protocol buffers messages will have an optional Bucket Type field added.

Link Walking

Link walking will only be supported for data stored within the default bucket type. No API changes will be made.

MapReduce

MapReduce will be extended to take inputs and queries in buckets other than the default type.

Datatypes API

How the counters API will be extended is still undecided, since work is ongoing to extend the API to support other datatypes. Of course, the existing API will continue to work as promised by this RFC.

Upgrading and Downgrading

Upgrading is handled naturally by the treatment of all existing data as a member of the default type.

Downgrading, however, is a bit more complicated. If a user has not opted in by using the new APIs (or has only written to the default bucket type), then downgrading works as it does for all versions of Riak.

For users that have opted in to the new features, downgrading becomes more painful because older versions of Riak will not understand data stored in Bucket Types other than default. A user should only opt in after upgrading to the new release and using the existing APIs until satisfied that the upgrade does not harm the application. Rolling back to a version that does not support bucket types will require removing (or moving aside) all data stored in non-default types.

@gideondk

Very good addition IMHO. Next to easier bucket configuration, an extra dimension really helps in structuring equally named buckets for different scenarios (instead of prefixing them yourself).

Don't see any large problems in client implementation though... :-)

@brunogirin

Is there any particular reason why link walking will be restricted to the default bucket type?

@peschkaj

I don't see any large problems in client implementation, either. I'm going to guess that the .proto implementation will include an optional field for type which means sane protocol buffers clients shouldn't have to make any changes or do any version detection dance.

Barring a sudden introduction of new PBC methods, this looks good to me.

@jrwest
Contributor Author

jrwest commented Aug 12, 2013

@gideondk @peschkaj great! Glad to hear you guys won't find it to be too much of a pain.

@brunogirin There is no technical reason why it cannot be added. We may do so in a future release depending on demand but for the upcoming release it probably won't make it in. For existing link walking users nothing will break, which I think is the most important thing. As usual an early community contribution is always welcome and I'll be happy to squeeze in a review where I can.

@uberbrady

These don't really feel like types to me - more like "bucket groups". Separately, there seem to be some optional settings you can put in for all buckets that are in one group.

I don't know if I'm just bikeshedding here - but I feel like you could just allow for some "/" characters in a bucket - a sort of simple bucket hierarchy - and get most of what you need.

Buckets that aren't in any group - (normal-looking Riak URL's) would maybe stay the same; and grouped buckets would just have another slash in the bucket part of the URL.

e.g.

GET /riak/bucket/key

would be a GET for an object in a classic 'default' bucket

GET /riak/mygroup/bucket/key

would be a GET for an object in a grouped one.

I'm just looking at this from the point of someone who doesn't know the internals of Riak at all - so there could be all kinds of terrible things wrong with my counterproposal.

And ultimately I don't think it matters too much. But I do like the fact that the URL's I'm fetching things from stay looking pretty much the same, even if I use the new features.

@seancribbs
Contributor

@uberbrady I think it's more constructive to think of them outside the context of the HTTP API. It clouds the issue.

@jrwest
Contributor Author

jrwest commented Aug 13, 2013

These don't really feel like types to me - more like "bucket groups".
Separately, there seem to be some optional settings you can put in for all
buckets that are in one group.

We considered "group" as well, and ultimately the decision was between
that and "type". In the end we chose "type" because more of us in the room
felt it was a better name. One fear was that from "group" one may or may
not infer that data may be grouped (either on disk or within nodes of the
cluster), which isn't the case. Ultimately, I think "type" does a good job
of conveying that these buckets have a set of settings in common. We
also considered "family" (which has too much resemblance to a similar name
in another data model) and "class" (which has so many parallels to
object-oriented programming that we decided against it).

I don't know if I'm just bikeshedding here - but I feel like you could
just allow for some "/" characters in a bucket - a sort of simple bucket
hierarchy - and get most of what you need.

This is the problem Bucket Types set out to solve. Since there are no
restrictions on bucket names if we decide to bless "prefix/..." and an
application already uses it we have a problem. The proposed security
feature's wildcard support is another area where this clashes. It would be
possible to provide a way to audit and migrate these buckets but that would
probably be slow and not ops-friendly.

Buckets that aren't in any group - (normal-looking Riak URL's) would maybe
stay the same; and grouped buckets would just have another slash in the
bucket part of the URL.

e.g.

GET /riak/bucket/key

would be a GET for an object in a classic 'default' bucket

GET /riak/mygroup/bucket/key

would be a GET for an object in a grouped one.

This is exactly what we plan to do with bucket types.

I'm just looking at this from the point of someone who doesn't know the
internals of Riak at all - so there could be all kinds of terrible things
wrong with my counterproposal.

And ultimately I don't think it matters too much. But I do like the fact
that the URL's I'm fetching things from stay looking pretty much the same,
even if I use the new features.



@jrwest
Contributor Author

jrwest commented Aug 13, 2013

@uberbrady one thing I didn't realize upon first read is that your version uses the already "old" API while the RFC uses the "newer" APIs. The APIs you are referring to are not yet deprecated within Riak, but they are also not commonly extended (for example). The effect will be the same, and the API you refer to will continue to work for at least buckets in the default type (it may be extended as well, depending on time and further discussion).

@jadeallenx

We have hit one of the problems described with the current design which is: we have some data where "last write wins" is fine and other data where "allow_multi: true" is the right way to handle things. So I support this proposal because it sounds like we can easily share the riak infrastructure even though we might want mixed behaviors about concurrent writes within the same cluster.

@brunogirin

@jrwest Thanks for the answer. If there is no technical blocker, that's brilliant and it makes complete sense to me that you would want to limit the scope of a first delivery.

From the point of view of what it would solve for me, it would make the multi backend concept more usable by being able to configure one backend for one bucket type and a different backend for another. I also like @mrallen1's use case.

@lafka

lafka commented Oct 26, 2013

I've been doing some minor testing and have some questions:

  • Currently there is no way to delete a bucket type, is the only way around this to delete the datadir?
  • Calling riak_core_bucket_type:reset/1 will only set the default values for some keys; for instance, yz_index is not altered. Should reset be equivalent to calling
  • When using riak-admin, should some bucket properties be read only? (like active should not be changeable through bucket-type update). It's already there for datatype.
  • Is there a way to store custom attributes in the bucket properties? i.e. I would like to store information about the structure of the keys in this bucket.

@jrwest
Contributor Author

jrwest commented Oct 26, 2013

Awesome to see you taking things for a spin! My comments are below.

I've been doing some minor testing and have some questions:

  • Currently there is no way to delete a bucket type, is the only way
    around this to delete the datadir?

    There is not currently a way to delete a bucket type similar to buckets in
    Riak. This is not something we plan to address for Riak 2.0 (but we may in
    the future).

  • Calling riak_core_bucket_type:reset/1 will only set the default
    values for some keys; for instance, yz_index is not altered. Should
    reset be equivalent to calling

As mentioned elsewhere, reset is not ready for use and the example you
found is one reason why. This will be addressed or reset will be removed
before Riak 2.0.

  • When using riak-admin, should some bucket properties be read only?
    (like active should not be changeable through bucket-type update).
    It's already there for datatype.

Good catch. We are aware not all validation is completely implemented,
however there was no issue tracking these specific cases so I opened one:
basho/riak_core#442

  • Is there a way to store custom attributes in the bucket
    properties? i.e. I would like to store information about the structure of
    the keys in this bucket.

You should be able to store your own properties. The validation does not
take into account custom properties and should not reject them. However,
you may have problems using the same property names Riak uses (e.g.
datatype or n_val).

@lafka

lafka commented Oct 26, 2013

@jrwest the custom attribute is limited by list_to_existing_atom/1 in riak_kv_wm_utils:erlify_bucket_prop. The same conversion is used in HTTP API, so might not be feasible to use list_to_atom/1.

@jrwest
Contributor Author

jrwest commented Oct 26, 2013

@lafka ah right. There is nothing stopping you from using internal APIs but the command-line parsing code will prevent it. Although we may not stay w/ JSON [1] for the riak-admin bucket-type commands I'm not sure if this restriction will be lifted. I haven't run into this issue myself as a user of Riak but I imagine there are existing workarounds (loading a module w/ your atoms?) since buckets have had the same restriction when setting them via the API.

[1] #424

rzezeski added a commit to basho/yokozuna that referenced this issue Nov 11, 2013
Integrate the bucket types functionality.

basho/riak#362

Overall
-------

Bucket types are the future of namespacing and property creation in
Riak. They allow efficient storage of "bucket properties" outside of
the Ring and 2-level namespacing of `Type` and `Name`.

Essentially the bucket type can now be either a lone `binary()`
(legacy) or a 2-tuple of `{Type :: binary(), Name ::
binary()}`. Internally, when the legacy version is encountered it is
considered to live under the `default` bucket type. For example the
bucket `<<"my_bucket">>` would become `{<<"default">>, <<"my_bucket">>}`.

Up until this point Yokozuna has used the bucket property `yz_index`
to determine where to index data. This commit changes that in some
ways. Legacy users will have existing data in buckets. Those buckets,
in 2.0, will be considered to live under the default type as described
above. For legacy buckets (the default type) Yokozuna will NOT respect
the `yz_index` property. Rather it will act like Riak Search and use
an index named the same as the bucket AS LONG AS the `search` property
is set to true. Once users upgrade to 2.0 they should start opting
into non-default bucket types since it is more efficient and newer
features require the use of non-default type. For these types of
buckets Yokozuna will still use the `yz_index` property. This property
will typically be set at the type level but can also be overridden per
name under a type. Yokozuna doesn't care. If that `{Type, Name}` has a
`yz_index` property then it will be indexed.

In summary:

* Legacy buckets (default type) will act like Riak Search. The index
  used must have the same name as the bucket and the `search` property
  must be true. This is to aid users migrating from Riak Search.

* All new users MUST use new style buckets made of Type + Name. In
  most cases the `yz_index` property will be set on the type and thus
  inherited by all names under it (many buckets to one index). The
  index DOES NOT have to have the same name.

Handoff
-------

Another important change revolves around handoff. Since Yokozuna
leeches off the KV vnode, it doesn't have control over handoff like it
would if it were a true vnode. When a node joins, KV can start shipping
data before the bucket type data has been shipped over. In that case
there will be no `yz_index` property and indexes will be missing. AAE
would eventually catch this, but it is poor form that a node join would
cause a degradation in harvest, especially in a quiescent cluster.

To fix this Yokozuna needs more control over the lifecycle of the KV
vnode. Yokozuna needs to hook into the `handoff_starting` stage and
verify that the bucket types data is shipped before data handoff
begins. This is accomplished by adding the `yz_kv:should_handoff` hook
which is hard-coded in the KV vnode for now.

This is important for removing the hack around index creation as
well. Currently Yokozuna has a pretty horrible busy-wait hack in its
index hook to make sure indexes are created on joining nodes before
doing the first write of a handoff. This busy-wait blocks the KV vnode
and is dangerous for vnode latency. In a future commit this busy-wait
will be replaced with a check in this new handoff hook.

Removal of Automatic AAE Tree Clearing
--------------------------------------

Remove all functionality around automatic clearing of trees when
adding or removing the `yz_index` property on a bucket with data. This
was referred to as `sync_data` in the `yz_events` module. Also called
"flags" harking back to when Yokozuna had a one-to-one bucket-to-index
mapping.

The original intention was that adding an index to a bucket with data
should clear the AAE trees so that exchanges would start repairing
missing indexes. If setting the index property to the tombstone value
(removal) then a) data for that bucket should be purged from the index
and b) AAE trees should be cleared. After much thought I think this
implicit behavior hurts more than helps.

Actions like clearing all AAE trees can be very expensive. It will not
be obvious to all users that adding or changing `yz_index` could cause
expensive operations to occur. For example, clearing the AAE trees for
a database with billions or trillions of objects will be expensive to
rebuild. Rather than relying on AAE a more direct operation could be
offered that allows the user to re-index a bucket or subset of
data. When removing an index, it makes more sense to let the user
delete the index entirely rather than do an implicit delete-by-query,
which is doing a bunch of extra work for an index that is going to be
deleted anyway.

Misc Changes
------------

* Update all tests to work with bucket types.

* Update Basho Bench driver to work with bucket types.

* Make map-reduce extraction more efficient. This is the ugly hack
  found in `yokozuna:positions`.
rzezeski added a commit to basho/yokozuna that referenced this issue Nov 12, 2013
Integrate the bucket types functionality.

basho/riak#362
Licenser pushed a commit to Kyorai/riak_core that referenced this issue Nov 15, 2013
As with buckets, we must teach core about Bucket Types because a few
subsystems use them. Bucket Types provide a method for grouping buckets
logically (see basho/riak#362).
@jaredmorrow jaredmorrow added this to the 2.0 milestone Mar 24, 2014
@rzezeski rzezeski modified the milestones: 2.0-beta, 2.0 Mar 25, 2014
@ghost ghost mentioned this issue Jan 27, 2015
hmmr pushed a commit that referenced this issue Nov 8, 2016
Update locked deps and fix reltool.config for 2.1.2 rc6