Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collection Merge Operator #3620

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

adamretter
Copy link
Collaborator

This is a new merge operator which allows you to store a collection of records within the value of a key/value pair.

A record is just a fixed length byte string. Features include:

  1. Options for controlling the collection semantics, supporting both:

    1. a Set like collection of records where each record is unique, and,
    2. a Vector like collection where there may be duplicate records.
  2. The Collection Merge Operator also optionally supports ordering, where a Comparator can be provided to ensure the ordering of records.

A simple example of vector like behaviour without ordering would look like:

DB* db;
Options options;
options.create_if_missing = true;

auto collection_merge_operator = std::make_shared<CollectionMergeOperator>(4);
options.merge_operator = collection_merge_operator;

Status status = DB::Open(options, "/tmp/testdb", &db);
assert(status.ok());

Slice key1("key1");

Slice add1(std::string("100020003000).insert(0, 1, CollectionOperator::kAdd));
db->merge(key1, add1);

Slice remove1(std::string("2000).insert(0, 1, CollectionOperator::kRemove));
db->merge(key1, remove1);

assert(db.get(key1) == Slice("10003000");

@facebook-github-bot
Copy link
Contributor

@adamretter has updated the pull request.

@facebook-github-bot
Copy link
Contributor

@adamretter has updated the pull request. View: changes

@adamretter adamretter force-pushed the feature/collection-merge-operator branch from f6cf56e to e28b938 Compare March 21, 2018 16:33
@facebook-github-bot
Copy link
Contributor

@adamretter has updated the pull request.

@adamretter adamretter force-pushed the feature/collection-merge-operator branch from e28b938 to 2fb6642 Compare March 25, 2018 17:44
@facebook-github-bot
Copy link
Contributor

@adamretter has updated the pull request.

@adamretter adamretter force-pushed the feature/collection-merge-operator branch from 2fb6642 to f31546e Compare March 26, 2018 19:01
@facebook-github-bot
Copy link
Contributor

@adamretter has updated the pull request.

@adamretter adamretter force-pushed the feature/collection-merge-operator branch from f31546e to dfb9bb5 Compare March 26, 2018 20:57
@facebook-github-bot
Copy link
Contributor

@adamretter has updated the pull request.

@adamretter adamretter force-pushed the feature/collection-merge-operator branch from dfb9bb5 to 6c37e0a Compare March 29, 2018 15:24
@facebook-github-bot
Copy link
Contributor

@adamretter has updated the pull request.

@adamretter
Copy link
Collaborator Author

adamretter commented Apr 6, 2018

@yuslepukhin I am getting a compile error here with Visual Studio that I don't get with clang or gcc, I wonder if there is something simple you could point out to me that is Visual Studio specific? The error is:

(ClCompile target) -> 
    c:\projects\rocksdb\utilities\merge_operators\collection\collection_merge_operator.h(83): error C3861: 'toHex': identifier not found [C:\projects\rocksdb\build\rocksdb-shared.vcxproj]
    c:\projects\rocksdb\utilities\merge_operators\collection\collection_merge_operator.h(87): error C2440: 'default argument': cannot convert from 'rocksdb::<lambda_2b3d5b4d148b31b228973f801694c3b2>' to 'const std::function<std::string (const char *,::size_t,::size_t)>' [C:\projects\rocksdb\build\rocksdb-shared.vcxproj]

@yuslepukhin
Copy link
Contributor

I am tied on some production issues so it is going to be awhile before I take a look at it.

@yuslepukhin
Copy link
Contributor

Did not get a change to fetch and compile. However,
const size_t is misleading and does nothing in this case, that is why the error message does not mention constness, so is const char* const should simply be const char*. The same applies to toHex.
My recommendation is to introduce two overloads and avoid default arguments (not just in this case).
The reason toHex() is not found is probably a bug. Is this VS 2017?

@adamretter adamretter force-pushed the feature/collection-merge-operator branch 2 times, most recently from 65494a9 to 6eada61 Compare June 29, 2018 08:23
@adamretter adamretter force-pushed the feature/collection-merge-operator branch from 6eada61 to 1cf6a0c Compare April 22, 2019 10:33
@adamretter adamretter force-pushed the feature/collection-merge-operator branch 2 times, most recently from 5699cb7 to 8ed5bd9 Compare June 13, 2019 21:28
@adamretter
Copy link
Collaborator Author

@gfosco @siying I have rebased this to bring it up to date. Could someone take a look please?

@adamretter adamretter force-pushed the feature/collection-merge-operator branch from 8ed5bd9 to 7e47bf4 Compare June 17, 2019 18:15
@siying siying requested a review from vjnadimpalli September 9, 2019 19:18
@vjnadimpalli
Copy link
Contributor

@adamretter Thanks for the effort and sorry for the delay in reviewing this PR. What is the status on this one? Are you still actively interested in contributing this? The overall idea is fantastic, we have a new API (#5604) to get all the merge operands in DB, I was wondering if that can be used to do what is done in this PR. In the value users could store the operation like you mentioned [OperationType, Operand] and when they get all the merge operands then in user code they code simply iterate over all the operands and do what ever they like which includes sorting. That would avoid a lot of Ser/DeSer operations. Let me know your thoughts.

@adamretter
Copy link
Collaborator Author

@vjnadimpalli I think these are quite different things if I understand you correctly.

Your feature allows to bring all merge operands into user code. Whereas this feature allows you to just call db.merge and have it manage a collection of values for a key for you in an atomic and safe manner without you having to know the specifics. Does that make sense?

@vjnadimpalli
Copy link
Contributor

vjnadimpalli commented Sep 13, 2019

@adamretter overall the change is awesome. One thing I was curious about in collection_merge_operator.cc was why not use set data structure when set type of semantic is configured by the user. Let's say in full merge the merge_in.operand_list may first be converted to a set so all subsequent operations like add and remove can be O(1) as opposed to doing scans. Maybe it's there and I missed it so just clarifying.
Also maybe rebase so we can get it ready to merge in.

@adamretter adamretter force-pushed the feature/collection-merge-operator branch 3 times, most recently from 0d3f01c to 634a364 Compare November 24, 2019 20:16
@adamretter
Copy link
Collaborator Author

@vjnadimpalli we could use a set, but I wanted to avoid making copies which would use more memory than necessary. Also the code at the moment is succint as the algorithms work on both set and vector like storage. This could of course be further changed/enhanced in future if needed, but I think it serves quite well as a first version.

I have also rebased as requested.

@adamretter adamretter force-pushed the feature/collection-merge-operator branch from 634a364 to 66fd869 Compare February 12, 2020 17:36
@adamretter adamretter force-pushed the feature/collection-merge-operator branch from aa7e4e0 to 3f3ef65 Compare February 24, 2020 22:50
@adamretter
Copy link
Collaborator Author

@pdillinger Can you take a look at this please?

@adamretter adamretter force-pushed the feature/collection-merge-operator branch from 3f3ef65 to bdba1cd Compare February 29, 2020 14:19
@adamretter
Copy link
Collaborator Author

@pdillinger @siying could you take a look at this PR please?

@pdillinger pdillinger assigned pdillinger and unassigned vjnadimpalli Mar 2, 2020
@pdillinger pdillinger self-requested a review March 2, 2020 19:45
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

*/
struct SliceHasher {
size_t operator()(const Slice& slice) const {
size_t result = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use GetSliceNPHash64(slice) cast to size_t

* your records into some format that is readable for you. The default
* is just a simple hex serialization.
*/
CollectionMergeOperator(const uint16_t fixed_record_len,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there is often compiler tolerance for implicit narrowing conversions, I suggest using size_t with a check on supported record len.


bool CollectionMergeOperator::exists(Slice& value_record, std::string& existing) const {
const char* existing_ptr = existing.data();
for (size_t i = 0; i < existing.size(); i += fixed_record_len_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: use binary search if ordered?


// loop through existing records, remove the first that matches the record to remove
const char* existing_ptr = existing.data();
for (size_t j = 0; j < existing_records_len; j++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO?: consider sharing code with exists()


namespace ROCKSDB_NAMESPACE {

std::string Clear() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We internally have a test that makes sure that everything in the repo can be compiled together. This means we should avoid populating the top-level rocksdb namespace with generically-named but specific-use symbols, even in test code. Can you put this stuff in a sub-namespace and import that in the test?


namespace ROCKSDB_NAMESPACE {

const uint16_t DEFAULT_TEST_RECORD_SIZE = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a test using size other than DEFAULT_TEST_RECORD_SIZE?

return result;
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on adding a factory function to MergeOperators?

@adamretter adamretter force-pushed the feature/collection-merge-operator branch from bdba1cd to 4175971 Compare January 6, 2021 17:50
@facebook-github-bot
Copy link
Contributor

@adamretter has updated the pull request. You must reimport the pull request before landing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants