Delete duplicated key/value pairs recursively #2829
Conversation
lixiaoy1 commented on Sep 4, 2017 (edited)
lixiaoy1 force-pushed from c2131f2 to eeb93ff
@lixiaoy1 updated the pull request.
retest please.
I didn't understand why we're adding
Also, for the benchmark results, do you mind sharing the full options used for before and after? You can find them either in a file whose name begins with "OPTIONS" in the db directory or near the top of the info log. Also, did you use upstream rocksdb as the baseline (i.e., without limiting how many memtables a flush can contain)? Thanks!
@ajkr Thank you for your comments. I used this branch as the baseline: https://github.com/ceph/rocksdb/tree/e15382c09c87a65eaeca9bda233bab503f1e5772

For the test obj40960.xlsx, there are 4 scenarios: normal_merge* and dup*. The normal_merge* scenarios were tested on the baseline branch e15382c, and the dup* scenarios were tested with this PR. (The test environment no longer exists, but I recorded the options.) The scenarios used different options for dup2, normal_merge3, and dup3.

Note: I set write_buffer_number_to_flush to 1 in the dup* scenarios. The default level0_file_num_compaction_trigger is 4; I changed it to 8 in the dup* scenarios because they generate far fewer L0 files than normal_merge*.
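For reference, a minimal sketch of how a dup* scenario could be configured, based on the values mentioned above. Note that flush_style, kFlushStyleDedup, and write_buffer_number_to_flush are fields introduced by this PR's branch, not upstream rocksdb::Options, and anything not stated above is a placeholder.

```cpp
#include <rocksdb/options.h>

// Sketch only: dup* scenario configuration as described above.
// flush_style, kFlushStyleDedup and write_buffer_number_to_flush are
// assumed to be the fields added by this PR; they do not exist upstream.
rocksdb::Options MakeDupScenarioOptions() {
  rocksdb::Options options;
  options.flush_style = rocksdb::kFlushStyleDedup;   // PR-only option
  options.write_buffer_number_to_flush = 1;          // PR-only option
  // Raised from the default of 4 because dup* generates far fewer L0 files.
  options.level0_file_num_compaction_trigger = 8;
  return options;
}
```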
Thanks, @lixiaoy1, I understand the use case better now.
Can we always set write_buffer_number_to_flush to one when kFlushStyleDedup is enabled? We want to minimize the number of options introduced.

Also, btw, we plan to extend this feature to repeatedly compact the oldest two immutable memtables into one larger immutable memtable. We'll flush the compacted memtable into an L0 file only once it exceeds some size (maybe just
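A possible shape of that suggestion, sketched with assumed type and field names (both options are introduced by this PR, and the actual wiring may differ): pin write_buffer_number_to_flush to 1 whenever kFlushStyleDedup is selected, so the extra knob never has to be exposed.

```cpp
#include <rocksdb/options.h>

// Sketch only: force write_buffer_number_to_flush to 1 when the dedup
// flush style is enabled. Field names are this PR's, not upstream's.
void SanitizeDedupFlushOptions(rocksdb::ColumnFamilyOptions* cf_options) {
  if (cf_options->flush_style == rocksdb::kFlushStyleDedup) {
    // Dedup always flushes exactly the oldest memtable.
    cf_options->write_buffer_number_to_flush = 1;
  }
}
```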
@ajkr Good idea to repeatedly compact the oldest immutable memtables! It seems the repeated compaction works well with the current merge style in the master branch rather than with this PR: in master, when merging two, three, or more immutable memtables, the merged result is kept in memory instead of being flushed into L0. A flush is triggered when the merged memtable exceeds its size limit, when the number of logs exceeds its limit, or when the db_write_buffer limit is exceeded.
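A rough sketch of those three triggers, using illustrative parameter names rather than RocksDB's actual internals (the real logic lives in the write path and is more involved):

```cpp
#include <cstdint>

// Sketch of the flush triggers described above when merged immutable
// memtables are kept in memory instead of being flushed to L0 immediately.
bool ShouldFlushMergedMemtable(uint64_t merged_memtable_bytes,
                               uint64_t write_buffer_size,
                               uint64_t num_live_wal_files,
                               uint64_t max_wal_files,
                               uint64_t total_memtable_bytes,
                               uint64_t db_write_buffer_size) {
  if (merged_memtable_bytes > write_buffer_size) {
    return true;  // the merged memtable exceeded its own size limit
  }
  if (num_live_wal_files > max_wal_files) {
    return true;  // too many WAL files are pinned by unflushed data
  }
  if (db_write_buffer_size > 0 &&
      total_memtable_bytes > db_write_buffer_size) {
    return true;  // total memtable memory exceeded db_write_buffer_size
  }
  return false;
}
```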
lixiaoy1 force-pushed from 41f3aab to eeb93ff
@lixiaoy1 has updated the pull request.
lixiaoy1 force-pushed from eeb93ff to 873b7fc
@lixiaoy1 has updated the pull request.
@ajkr The option write_buffer_number_to_flush has been removed.
I also did the following tests:
I did steps 3 and 4 in the following setting with flush_style=kFlushStyleDedup or flush_style=kFlushStyleMerge. The total size of data written into L0 SST files with flush_style=kFlushStyleDedup was 46653 MB, and the total size written with flush_style=kFlushStyleMerge was 36313 MB. This PR can decrease the data written into SST files, which improves performance when the disk is busy.
@ajkr Any further questions about the PR?
lixiaoy1 force-pushed from 873b7fc to 0c59462
@lixiaoy1 has updated the pull request.
This change updates the range_del parts.
lixiaoy1 force-pushed from 0c59462 to 91d5f2e
@lixiaoy1 has updated the pull request.
I get the message from the AppVeyor build: "Build execution time has reached the maximum allowed time for your plan (60 minutes)."
retest please.
@lixiaoy1 has updated the pull request.
This implements the idea from http://pad.ceph.com/p/rocksdb-wal-improvement: add a new flush style, kFlushStyleDedup, which users can enable by setting flush_style=kFlushStyleDedup. When a flush is triggered, the key/value pairs in the oldest memtable are deduplicated against the other memtables before the oldest memtable is flushed into L0. This flush style benefits workloads whose data is duplicated across memtables; it can significantly decrease the amount of data flushed into L0.

Signed-off-by: Xiaoyan Li <xiaoyan.li@intel.com>
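A toy sketch of the dedup step, to make the idea concrete. It uses a simplified std::map in place of real memtables and ignores sequence numbers, snapshots, deletions, and merge operands, so it is illustrative only, not the PR's implementation.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for a memtable: latest value per user key.
using MemTable = std::map<std::string, std::string>;

// Returns the subset of the oldest memtable worth flushing to L0: any key
// that also appears in a newer memtable is dropped, because the newer
// version supersedes it and will be written out by a later flush anyway.
MemTable DedupOldestAgainstNewer(const MemTable& oldest,
                                 const std::vector<MemTable>& newer) {
  MemTable to_flush;
  for (const auto& kv : oldest) {
    bool superseded = false;
    for (const auto& memtable : newer) {
      if (memtable.count(kv.first) > 0) {
        superseded = true;
        break;
      }
    }
    if (!superseded) {
      to_flush.insert(kv);
    }
  }
  return to_flush;
}
```

In the PR itself the comparison would also have to respect sequence numbers and snapshots so that older versions still visible to readers are not dropped.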
We revisited this internally as it is still an interesting idea for reducing flush bytes. One thing we realized this time that we didn't notice previously is the assumption that the newer memtable data used to deduplicate data during flush must itself be recoverable after a crash. If an older version of a key is deduplicated away but the newer version is lost in a crash, recovery will have a hole at the seqno of the older version: for example, if an older version of a key is dropped from the flush because a newer version exists in another memtable, and that newer version is then lost, neither version is recoverable, whereas without dedup the older version would still have been flushed. The newer version of the key could be lost simply because of