-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Delete duplicated key/value pairs recursively #19
base: master
Are you sure you want to change the base?
Conversation
will this change go to upstream? |
As this idea is from Sage http://pad.ceph.com/p/rocksdb-wal-improvement, at first hope Sage or you can review the changes. |
@liewegas Please help to review. |
@@ -236,6 +236,9 @@ class ColumnFamilyData { | |||
MemTable* mem() { return mem_; } | |||
Version* current() { return current_; } | |||
Version* dummy_versions() { return dummy_versions_; } | |||
void set_stop() { stop = true; } | |||
bool is_stop() { return stop; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the stop flag needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used to decide whether choosing all immutable tables to flush when MemTableList::PickMemtablesToFlush. In the new flush style, if the db process is going to shutdown, it needs to flush all existed immutable memtables. https://github.com/ceph/rocksdb/pull/19/files#diff-f8dc8f7b1ea2cb77b0eb9b156bc434f2R133
db/builder.cc
Outdated
assert((column_family_id == | ||
TablePropertiesCollectorFactory::Context::kUnknownColumnFamily) == | ||
column_family_name.empty()); | ||
// Reports the IOStats for flush for every following bytes. | ||
const size_t kReportFlushIOStatsEvery = 1048576; | ||
Status s; | ||
uint64_t num_duplicated = 0, num_total=0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: spaces around =
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update it later.
db/builder.cc
Outdated
@@ -138,17 +140,52 @@ Status BuildTable( | |||
&snapshots, earliest_write_conflict_snapshot, env, | |||
true /* internal key corruption is not ok */, range_del_agg.get()); | |||
c_iter.SeekToFirst(); | |||
|
|||
ParsedInternalKey c_ikey, comp_ikey; | |||
if(compare_iter != nullptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: space between if and (
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update it later.
db/builder.cc
Outdated
break; | ||
} | ||
else if(result > 0) | ||
compare_iter->Next(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: braces around single statement blocks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update it later.
Yay! This looks reasonable to me, but I have pretty limited rocksdb experience so it really needs to be looked at by those folks. I'd fix the style issues before sending it there since they'll undoubtable complain about it. (Also, the automated github pull request checks include a style check.) I don't understand why the new set_stop/is_stop checks are needed; what makes this change need it? A description of the approach and new flag somewhere is probably needed. It looks like you are always flushing a single memtable at a time and always deduping against all others. We might need a tunable there? (In contrast, the current behavior is to flush all memtables? That doesn't seem quite right as there are a few tunables controlling the how many to flush at a time... sorry, I'm not very familiar with this code.) The expected savings is probably most evident in the overall write-amp and amount of data going into level0. Did you happen to look at this metrics? @markhpc might know which values to look for. Thanks! |
The current behavior is in following scenarios it begins to flush:
In this branch, it changes scenario 1. You are right. In this patch, it flushes a single memtable at a time and always deduping against other min_write_buffer_number_to_merge-1 memtables. I agree that we can make it more flexible: use min_write_buffer_number_to_merge to indicate how many memtables to flush, and add another options to specify how many other memtables to compare/dedup. And sum of these two numbers should be <= max_write_buffer_number. The branch decreases the data flushed to L0. When min_write_buffer_number_to_merge is set 2, the following are summary of L0 sst files every 10 minutes from my tests: Master branch: |
Awesome work Li! I'm really excited to take a look at this. |
awesome work. actually the minor-compaction which called kFlushStyleDedup is similar with L0 sstable's compaction. we could set |
level0_file_num_compaction_trigger=1 means to trigger L0->L1 compaction when number of l0 sst files is 1. I think their mechanism are different. Meanwhile if you set level0_file_num_compaction_trigger as 1, more duplicated data will put into l2. |
@lixiaoy1 oh....sorry, |
@liuchang0812 I think you mean min_write_buffer_number_to_merge. When setting min_write_buffer_number_to_merge as 2, the test result on this branch is better. You may have a look at |
@markhpc Could you please have a look? |
dad3d2a
to
bf29157
Compare
This is to implement the idea: http://pad.ceph.com/p/rocksdb-wal-improvement Add a new flush style called kFlushStyleDedup which users can config by setting flush_style=kFlushStyleDedup and write_buffer_number_to_flush. Flush is triggered while number of non-flushed memtables is >= min_write_buffer_number_to_merge. When flushing memtables to L0, it merges write_buffer_number_to_flush memtables and deduping against other non-flushed memtables. With bluestore fio plugin to do 4k random write test, I compared the two stats db.get.micros and db.write.micros from rocksdb every ten minutes. db.get.micros is the average of get operations, and db.write.micros is the average time of put/delete functions. From the result, rocksdb read performance can be improved up to 38%, and write performance can be improved to 15%. More detailed info please see the excel: https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8 And to complement, the io perfermance of bluestore fio doesn't improve much. From time of txc states, the time spends a lot in kv_state_commiting_lat (up to 80%), but submit_transaction and submit_transaction_sync takes little time compared. Signed-off-by: Xiaoyan Li xiaoyan.li@intel.com
This pr is based on Rocksdb@e15382c09c87a65eaeca9bda233bab503f1e5772, I submitted a new PR to facebook/Rocksdb facebook#2829, which is based on Rocksdb master branch. |
Hi Li, Sorry, I haven't gotten a chance to test this yet. Have you gotten any feedback from the rocksdb folks? |
@markhpc currently Andrew Kryczka from Facebook asked some questions. What I am doing now is that I get the kv operation sequences out from BlueStore when running fio+librbd, and use the kv operation seqs to feed into Rocksdb, and check how Rocksdb is improved. lixiaoy1/ceph@e6a4e8d |
Hi Li, I'm still swamped, but I wanted to mention that we're going to hopefully look at this soon (either me or someone else). Any update from the rocksdb folks? Mark |
Hi Mark,
I have some discussion about the PR with Facebook people. It is still under review.
And another point is that Facebook people raised the following idea which I think also can achieve same objective:
Repeatedly compact the oldest two immutable memtables into one larger immutable memtable. We'll flush the compacted memtable into an L0 file only once it exceeds some size.
Anyway I think both of above solution can benefit BlueStore.
Best wishes
Lisa
From: Mark Nelson [mailto:notifications@github.com]
Sent: Wednesday, October 11, 2017 10:42 PM
To: ceph/rocksdb <rocksdb@noreply.github.com>
Cc: Li, Xiaoyan <xiaoyan.li@intel.com>; Mention <mention@noreply.github.com>
Subject: Re: [ceph/rocksdb] [WIP] Delete duplicated key/value pairs recursively (#19)
Hi Li,
I'm still swamped, but I wanted to mention that we're going to hopefully look at this soon (either me or someone else). Any update from the rocksdb folks?
Mark
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#19 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AM1sAUICifAcsMCK7NiNPvawQemttHZHks5srNO4gaJpZM4OVSCp>.
|
This is to implement the idea: http://pad.ceph.com/p/rocksdb-wal-improvement
Add a new flush style called kFlushStyleDedup which users can config by setting
flush_style=kFlushStyleDedup and write_buffer_number_to_flush. Flush
is triggered while number of non-flushed memtables is >= min_write_buffer_number_to_merge.
When flushing memtables to L0, it merges write_buffer_number_to_flush memtables
and deduping against other non-flushed memtables.
With bluestore fio plugin to do 4k random write test, I compared the two stats
db.get.micros and db.write.micros from rocksdb every ten minutes. db.get.micros
is the average of get operations, and db.write.micros is the average time of put/delete
functions. From the result, rocksdb read performance can be improved up to 38%,
and write performance can be improved to 15%. More detailed info please see the excel:
https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8
And to complement, the io perfermance of bluestore fio doesn't improve much. From time of
txc states, the time spends a lot in kv_state_commiting_lat (up to 80%), but submit_transaction and
submit_transaction_sync takes little time compared.
Signed-off-by: Xiaoyan Li xiaoyan.li@intel.com