Delete duplicated key/value pairs recursively #2829

Open
wants to merge 1 commit into
base: main

@lixiaoy1 lixiaoy1 commented Sep 4, 2017

This implements the idea described at http://pad.ceph.com/p/rocksdb-wal-improvement.
It adds a new flush style, kFlushStyleDedup, which users enable by setting
flush_style=kFlushStyleDedup. When a flush is triggered, the key/value pairs
in the oldest memtable are deduplicated against the other memtables before the
oldest memtable is flushed into L0.

This flush style helps workloads whose data is duplicated across memtables,
where it can substantially reduce the amount of data flushed into L0.

Signed-off-by: Xiaoyan Li <xiaoyan.li@intel.com>
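
The dedup step described above can be illustrated with a minimal sketch. It is not the PR's code: `Memtable` and `DedupOldest` are hypothetical names, a memtable is modeled as a plain `std::map`, and a deletion is assumed to appear as an ordinary entry (a tombstone) under the same key.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a memtable: user key -> latest value in that table.
using Memtable = std::map<std::string, std::string>;

// Returns the entries of `oldest` whose key does not appear in any newer
// memtable. Only these entries would need to be written to L0.
Memtable DedupOldest(const Memtable& oldest,
                     const std::vector<Memtable>& newer) {
  Memtable to_flush;
  for (const auto& kv : oldest) {
    bool shadowed = false;
    for (std::size_t i = 0; i < newer.size(); ++i) {
      if (newer[i].count(kv.first) > 0) {
        shadowed = true;  // a newer memtable already holds this key
        break;
      }
    }
    if (!shadowed) to_flush.insert(kv);
  }
  return to_flush;
}
```

In this model, a key that is rewritten in every memtable never reaches L0 from the older tables, which is where the savings would come from.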

@facebook-github-bot
Contributor

@lixiaoy1 updated the pull request - view changes

@lixiaoy1
Author

lixiaoy1 commented Sep 5, 2017

retest please.

@siying siying requested a review from ajkr September 11, 2017 16:47
@ajkr
Contributor

ajkr commented Sep 13, 2017

I didn't understand why we're adding write_buffer_number_to_flush. I feel it creates the problem this PR intends to solve. Restricting the number of immutable memtables in a flush job makes it likely older versions of a key are flushed and newer versions aren't flushed yet. If we leave it unrestricted, we also get the benefits of larger L0 files: accommodate bigger write bursts and lower write-amp.

@ajkr
Contributor

ajkr commented Sep 13, 2017

Also for the benchmark results, do you mind sharing the full options used for before and after? You can find them either in a file whose name begins with "OPTIONS" in the db directory or near the top of the info log.

Also, did you use upstream rocksdb as the baseline (i.e., without limiting how many memtables a flush can contain)? Thanks!

@lixiaoy1
Author

lixiaoy1 commented Sep 14, 2017

@ajkr Thank you for your comments.
I am sorry that the name write_buffer_number_to_flush is confusing.
When there are N immutable memtables to flush and write_buffer_number_to_flush is set to M (M < N), this PR first merges M memtables (as the master branch does) and then compares the merged data against the remaining N-M memtables. A key/value pair is flushed into the L0 SST files only if it is still valid, that is, not deleted or updated in the remaining N-M memtables. Once there are N immutable memtables again, the same steps are repeated.
This decreases the amount of data flushed into L0 but increases the number of files in L0.
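
The merge-then-compare procedure can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `MergeThenDedup` is a hypothetical name, memtables are modeled as `std::map`, the input is assumed to be ordered oldest first, and a delete is assumed to be stored as an entry under the deleted key.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Memtable = std::map<std::string, std::string>;

// `immutables` is ordered oldest first. Merges the first `m` tables (values
// from later tables win), then drops every key that reappears in the
// remaining tables. The result is what would be written to L0.
Memtable MergeThenDedup(const std::vector<Memtable>& immutables,
                        std::size_t m) {
  Memtable merged;
  for (std::size_t i = 0; i < m && i < immutables.size(); ++i) {
    for (const auto& kv : immutables[i]) {
      merged[kv.first] = kv.second;  // the later table overwrites
    }
  }
  Memtable to_flush;
  for (const auto& kv : merged) {
    bool superseded = false;
    for (std::size_t i = m; i < immutables.size(); ++i) {
      if (immutables[i].count(kv.first) > 0) {
        superseded = true;  // updated or deleted in a remaining table
        break;
      }
    }
    if (!superseded) to_flush.insert(kv);
  }
  return to_flush;
}
```

With M = 1 this reduces to deduplicating only the oldest memtable against all the others.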

I used this branch as baseline: https://github.com/ceph/rocksdb/tree/e15382c09c87a65eaeca9bda233bab503f1e5772

For the test obj40960.xlsx:
https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8

The tests cover 4 scenarios: normal_merge* and dup*. The normal_merge* scenarios were run on the baseline branch above (e15382c), and the dup* scenarios were run with this PR.

(The test environment no longer exists, but I recorded the options.)
The common changed options are:
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=2,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,disableWAL=true,stats_dump_period_sec=600,

The scenarios differ in the following options:
normal_merge2:
min_write_buffer_number_to_merge=2

dup2:
max_background_flushes=1, level0_file_num_compaction_trigger=8

normal_merge3:
min_write_buffer_number_to_merge=3

dup3:
max_background_flushes=1, min_write_buffer_number_to_merge=3, level0_file_num_compaction_trigger=8

Note: I set write_buffer_number_to_flush to 1 in the dup* scenarios.

The default level0_file_num_compaction_trigger is 4. I raised it to 8 in the dup* scenarios because the L0 files generated there are much smaller than in normal_merge*.

[Attachments: dup3, normal_merge3]

@ajkr
Contributor

ajkr commented Sep 19, 2017

Thanks, @lixiaoy1, I understand the use case better now.

Contributor

@ajkr ajkr left a comment

Can we always set write_buffer_number_to_flush to one when kFlushStyleDedup is enabled? We want to minimize the number of options introduced.

@ajkr
Contributor

ajkr commented Sep 25, 2017

Also, btw, we plan to extend this feature to repeatedly compact the oldest two immutable memtables into one larger immutable memtable. We'll flush the compacted memtable into an L0 file only once it exceeds some size (maybe just write_buffer_size). The point is to get the same benefits without creating smaller L0 files, which generally have caused problems like write stalling. Let us know if you have any thoughts on this :).
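
One way to picture the planned extension is the sketch below. It only illustrates the idea in this comment: `CompactOldest` and `ApproxSize` are hypothetical names, memtables are modeled as `std::map`, and the byte count is a rough stand-in for a real size limit such as write_buffer_size.

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <string>

using Memtable = std::map<std::string, std::string>;

// Hypothetical size measure: total bytes of keys and values.
std::size_t ApproxSize(const Memtable& m) {
  std::size_t bytes = 0;
  for (const auto& kv : m) bytes += kv.first.size() + kv.second.size();
  return bytes;
}

// `immutables` is ordered oldest first. Folds the two oldest tables into one
// until the front table reaches `flush_threshold` or only one table is left.
// Returns true when the front table is large enough to flush to L0.
bool CompactOldest(std::deque<Memtable>& immutables,
                   std::size_t flush_threshold) {
  while (immutables.size() >= 2 &&
         ApproxSize(immutables.front()) < flush_threshold) {
    Memtable combined = immutables[1];
    // std::map::insert keeps existing keys, so adding the older entries to a
    // copy of the newer table lets the newer value win on duplicates.
    combined.insert(immutables[0].begin(), immutables[0].end());
    immutables.pop_front();
    immutables.front() = combined;
  }
  return !immutables.empty() &&
         ApproxSize(immutables.front()) >= flush_threshold;
}
```

Because duplicates collapse during each fold, the table that is eventually flushed can approach the full threshold size instead of producing several small L0 files.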

@lixiaoy1
Author

@ajkr Good idea to repeatedly compact the oldest immutable memtables! It seems that repeated compaction fits the current merge style in the master branch better than this PR. In the master branch, when two, three, or more immutable memtables are merged, it keeps the merged result in memory instead of flushing it into L0. A flush is then triggered when the merged memtable exceeds its size limit, the number of logs exceeds its limit, or db_write_buffer exceeds its limit.

@facebook-github-bot
Contributor

@lixiaoy1 has updated the pull request. View: changes

@facebook-github-bot
Contributor

@lixiaoy1 has updated the pull request.

@facebook-github-bot
Contributor

@lixiaoy1 has updated the pull request.

@lixiaoy1
Author

@ajkr The option write_buffer_number_to_flush is removed.

@lixiaoy1 lixiaoy1 changed the title [WIP] Delete duplicated key/value pairs recursively Delete duplicated key/value pairs recursively Sep 29, 2017
@lixiaoy1
Author

lixiaoy1 commented Oct 10, 2017

I also ran the following test:

  1. Record the sequence of KV pairs generated by 30 minutes of 4k IO with Ceph/BlueStore.
  2. Create a new RocksDB instance.
  3. Inject the recorded KV pairs into the db one by one.
  4. Compare the sizes of the L0 SST files.

I ran steps 3 and 4 with flush_style=kFlushStyleDedup and with flush_style=kFlushStyleMerge, using the following settings:
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=2,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152.

The total size of data written into L0 SST files with flush_style=kFlushStyleMerge is 46653MB, and the total size of data written into L0 SST files with flush_style=kFlushStyleDedup is 36313MB.

This PR decreases the amount of data written into SST files, which can improve performance when the disk is busy.
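
As a rough worked figure, taking the two totals as reported for this single run, the smaller total is about 22% below the larger one:

$$
1 - \frac{36313}{46653} = \frac{10340}{46653} \approx 0.222
$$

This is simple arithmetic on the numbers above, not an additional measurement.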

@lixiaoy1
Author

@ajkr Any further questions about the PR?

@facebook-github-bot
Contributor

@lixiaoy1 has updated the pull request.

@lixiaoy1
Author

This change updates range_del parts.

@facebook-github-bot
Contributor

@lixiaoy1 has updated the pull request.

@lixiaoy1
Author

I got this message from the AppVeyor build: "Build execution time has reached the maximum allowed time for your plan (60 minutes)."
Please retest.

@lixiaoy1
Author

retest please.

@facebook-github-bot
Contributor

@lixiaoy1 has updated the pull request.

@ltamasi ltamasi self-requested a review September 10, 2019 17:12
@ajkr
Contributor

ajkr commented Jul 6, 2023

We revisited this internally, as it is still an interesting idea for reducing flush bytes. One thing we realized this time that we didn't notice previously is the assumption that the newer memtable data used to deduplicate data during flush must be recoverable after a crash. If an older version of a key is deduplicated but the newer version of the key is lost in a crash, then recovery will have a hole at the seqno of the older version of the key. The newer version of the key could be lost simply because WriteOptions::disableWAL was used, or because of something more complicated, such as the host crashing while the newer key version was not yet fsynced.
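
The recovery hole can be reproduced in a toy model. The sketch below is an assumption-laden illustration, not RocksDB code: `Entry`, `DedupFlush`, and `Recover` are hypothetical names, and durability of the newer memtable is reduced to a single boolean.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical record: the sequence number of a write and its value.
struct Entry {
  std::uint64_t seqno;
  std::string value;
};
using Memtable = std::map<std::string, Entry>;

// Dedup flush: keep only entries of `oldest` with no newer version in `newer`.
Memtable DedupFlush(const Memtable& oldest, const Memtable& newer) {
  Memtable sst;
  for (const auto& kv : oldest) {
    if (newer.count(kv.first) == 0) sst.insert(kv);
  }
  return sst;
}

// Recovery after a crash: the SST survives; the newer memtable's writes are
// replayed only if they were durably logged.
Memtable Recover(const Memtable& sst, const Memtable& newer,
                 bool newer_is_durable) {
  Memtable db = sst;
  if (newer_is_durable) {
    for (const auto& kv : newer) db[kv.first] = kv.second;
  }
  return db;
}
```

If key k1 was written at seqno 1 and again at seqno 3, and k2 at seqno 2, a dedup flush of the older memtable writes only k2. When the seqno 3 write is then lost, the recovered state contains seqno 2 but not seqno 1, which is the hole described above.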
