Optimize for serial commits in 2PC #2345
Conversation
Seems to me now all synchronization is mutex-based. Will we get more improvement if we do the following?
And I personally prefer to have a separate WriteImpl method for the mutex-based write, to make the implementation clean. Not sure what @siying thinks.
Great! I think this is a reasonable direction to go. My main comments are:
(1) have a clear separation of the new logic for MyRocks and the current behavior to other RocksDB users.
(2) think through the design of the locking so that it is easier to understand and can perform better.
Also, just think about what will make the code easier to maintain going forward: patching the code everywhere, or having a new function and reusing code via helper functions. I'm worried that mixing the two locking approaches everywhere in the code may make it harder to maintain in the future.
db/db_impl_write.cc
Outdated
@@ -395,9 +441,20 @@ Status DBImpl::WriteToWAL(const autovector<WriteThread::Writer*>& write_group,
  WriteBatchInternal::SetSequence(merged_batch, sequence);

  Slice log_entry = WriteBatchInternal::Contents(merged_batch);
  if (concurrent) {
    // We need to lock mutex_ since logs_ etc. might change concurrently
    mutex_.Lock();
Holding this lock in LogWriter::AddRecord() makes me nervous. This heavy-duty lock can block reads and compactions. Even if it is not doing I/O, LogWriter::AddRecord() is still a relatively expensive operation: it calculates a CRC for everything and makes several copies. Is there a way to define another mutex just to protect those WAL-related data structures? You already defined two more mutexes; a third one doesn't feel like a big deal.
Makes sense. It would require more work since log_ (and its family) are assumed to be protected by mutex_ all over the code. But it is doable.
include/rocksdb/db.h
Outdated
  // Flush the WAL memory buffer to the file. If sync is true, it calls SyncWAL
  // afterwards.
  virtual Status FlushWAL(bool sync) {
    throw std::runtime_error("FlushWAL not implemented");
Should return Status::NotSupported() instead of a runtime error.
Sure.
db/db_impl_write.cc
Outdated
    serial_memtable_guarantee = std::unique_ptr<InstrumentedMutexLock>(
        new InstrumentedMutexLock(&mem_mutex_));
  }
  const bool batch_writes = false;
What will be the interface eventually? Add one more parameter to WriteImpl()?
Yeah I am thinking something like options.disable_group_batch and options.enable_wal_buffer
db/version_set.h
Outdated
    last_sequence_.store(s, std::memory_order_release);
  }

  void SetLastToBeWrittenSequence(uint64_t s) {
    assert(s >= last_to_be_written_sequence_);
    last_to_be_written_sequence_.store(s, std::memory_order_release);
These are only accessed within the DB mutex, so I assume relaxed ordering is enough.
Makes sense. Let me double-check and confirm.
SetLastToBeWrittenSequence is used only during initialization, so its stronger ordering guarantee does not make much difference in performance. The newly added FetchAddLastToBeWrittenSequence could be accessed outside the mutex, and its ordering guarantees cannot be relaxed.
It might sound paranoid, but how about we stick to either memory_order_seq_cst or memory_order_relaxed, to make it easier to reason about correctness?
For this function I can increase it to memory_order_seq_cst, but for FetchAddLastToBeWrittenSequence it might affect performance. I can run a test and will increase it if the performance is not affected.
db/db_impl_write.cc
Outdated
  auto l2bws = versions_->LastToBeWrittenSequence();
  assert(l2bws >= last_sequence);
  versions_->SetLastToBeWrittenSequence(l2bws + count);
  last_sequence = l2bws;
How do you make sure that the entries are written to the WAL in sequence number order if the lock is released after this point? Should we get the sequence number under the same mutex just before writing to the WAL? Or maybe never release the mutex until then.
I moved this logic inside WriteToWAL, under the mutex.
db/db_impl_write.cc
Outdated
@@ -180,6 +206,9 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
  // that nobody else can be writing to these particular stats.
  // We're optimistic, updating the stats before we successfully
  // commit. That lets us release our leader status early.
  if (!batch_writes) {
    stat_mutex_.Lock();
Which stat introduces a data race risk?
AddDBStats is not atomic.
void AddDBStats(InternalDBStatsType type, uint64_t value) {
  auto& v = db_stats_[type];
  v.store(v.load(std::memory_order_relaxed) + value,
          std::memory_order_relaxed);
}
StatisticsImpl::recordTick seems to be fine (using fetch_add) and PERF_TIMER_* seem to be thread-local and thus ok.
db/db_impl.cc
Outdated
@@ -2520,6 +2543,9 @@ Status DBImpl::IngestExternalFile(

  // Stop writes to the DB
  WriteThread::Writer w;
  mutex_.Unlock();
  mem_mutex_.Lock();
To play safe, I suggest we only do this extra locking logic in a mode that is only enabled in MyRocks. I'm scared of potential deadlock here.
Yeah me too ;) In the past two days I have been fixing deadlocks in unit tests. Let me think further to see how we can simplify this.
db/db_impl.h
Outdated
@@ -791,6 +797,8 @@ class DBImpl : public DB {
  // NOTE: should never acquire options_file_mutex_ and mutex_ at the
  // same time.
  InstrumentedMutex options_files_mutex_;
  InstrumentedMutex mem_mutex_;
  InstrumentedMutex stat_mutex_;
Document the lock holding order of all those mutexes.
Done.
db/db_impl_write.cc
Outdated
  std::unique_ptr<InstrumentedMutexLock> serial_memtable_guarantee;
  if (lock_memtable) {
    serial_memtable_guarantee = std::unique_ptr<InstrumentedMutexLock>(
        new InstrumentedMutexLock(&mem_mutex_));
This needs to be avoided in the non-MyRocks case.
Also, consider avoiding this malloc. We could add a function to InstrumentedMutexLock like set_mutex().
In the new approach I am using a separate write group, and the mem_mutex_ is no longer needed.
Thanks for the feedback @yiwu-arbug. A couple of questions:
PS: My first impl had this split into two workflows. Then I realized there was much duplication between the two (stats, etc.) and figured that a single function would make keeping them consistent easier in the future. I am open to suggestions though.
@maysamyabandeh seems I didn't understand the patch very well. Will sync with you offline.
Thanks @yiwu-arbug for your comment. I enabled batching for preparers as you suggested and the throughput increased to 49k. I guess it is because it reduces competition with the serial commit over the shared logs.
@maysamyabandeh updated the pull request - view changes
@maysamyabandeh updated the pull request - view changes
@maysamyabandeh updated the pull request - view changes
Haven't deep dived into the logic; just some ideas for reorganizing the code:
- Move the new logic in WriteImpl() into WriteImpl2PCPrepare() and WriteImpl2PCCommit() to avoid mixing the normal write path with the 2PC write path. Also remove disable_memtable from WriteImpl(), since it is used by 2PC only.
- Have WriteImpl2PCCommit() (or even FlushWAL()) update the DB stats, to avoid the need for stat_mutex_.
- Not sure if it is a good idea, but you could reuse WriteThread::newest_memtable_writer_ to replace nonmem_write_thread_.
I think this diff makes sense. In the MyRocks case, we anyway need a way to give MyRocks fine control of WAL writing and WAL flushing. I believe this is a good change to have anyway.
From a code-organization point of view, this is much better than the previous version. I still think it can be made even better if we have a separate lightweight PrepareToWALBuffer() call and build it on top of @yiwu-arbug's diff.
db/db_impl_write.cc
Outdated
  total_log_size_ += log_entry.size();
  alive_log_files_.back().AddSize(log_entry.size());
  if (concurrent) {
I suggest we put lines 423 to 434 into a separate helper function, and then directly call it from a function like DBImpl::PrepareToWALBuffer() where the mutex is held. And we can do the same for lines 470-478.
done
db/db_impl_write.cc
Outdated
  WriteBatchInternal::SetSequence(merged_batch, sequence);

  Slice log_entry = WriteBatchInternal::Contents(merged_batch);
  log::Writer* log_writer = logs_.back().writer;
  status = log_writer->AddRecord(log_entry);
Maybe whether to write to the buffer should be passed in on a per-record basis. In that case, if someone issues a write without going through this prepare API, it will call AddRecord() with the flush option, so it should still work.
Can you explain more? We are currently enabling the WAL buffer with an option, which can be enabled independently from 2PC.
  // If true WAL is not flushed automatically after each write. Instead it
  // relies on manual invocation of FlushWAL to write the WAL buffer to its
  // file.
  bool manual_wal_flush = false;
Can we just use one option?
The only place I think we need an option is whether to grab the mutex for normal writes.
If we have a direct PrepareToWALBuffer() function, there isn't a second queue, so the extra queue logic can be avoided.
Two options are better. One might enable the WAL buffer without the 2PC commit improvements: https://groups.google.com/d/msg/rocksdb/ngNeSSQHfao/_AV_KIf1CAAJ
Can you explain more? The second queue was added after Yi's suggestion, which made the code much cleaner and even improved throughput a bit.
If we have a direct PrepareToWALBuffer() function, there isn't a second queue, so the extra queue logic can be avoided.
Why do we need a queue if we don't do group commit and don't care whether the output order follows the input order? I also don't understand why it can improve throughput. Can you explain it?
- In theory the other queue should help: without a queue, 128 prepare threads compete with the commit on wal_mutex_, but with a queue only one thread does that. In practice I have seen mixed results: in some runs it improved throughput by 3-4k.
- Without a queue, other parts of the code that want to be the sole writer would still need to hold a mutex. The combination of a mutex and entering the queue unbatched was a bit ugly. Instead they now enter two queues.
There is an existing use case for only WAL buffering: https://groups.google.com/d/msg/rocksdb/ngNeSSQHfao/_AV_KIf1CAAJ
This feature is going to stay even after we discard concurrent WAL writes in the future, so it does need its own config.
To me, the WAL buffer is not a performance config; it is more of a durability one. Normal users should not be bothered with playing around with it. Advanced users can make use of it only after changing their application to call FlushWAL at the right times (like MyRocks does). So I do not think we should count such options toward the tuning-complexity issue that we have.
Can this user also use the same option?
Then we would also use two write queues for this user, which is irrelevant to the WAL buffer feature they intended to use.
I do think the WAL buffer is something that will stay in the code base for a long time. The other feature is temporary and will be removed entirely once we push the commits to an earlier stage.
OK then.
As a separate thing, maybe it's a good time to split DBOptions into an advanced part and put it into advanced_options.h too, just like what we do for CFOptions.
Makes sense. Opening a task for it.
Force-pushed from bc8f85a to 77db97e
@maysamyabandeh updated the pull request - view changes
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@maysamyabandeh updated the pull request - view changes - changes since last import
Force-pushed from 6c3582d to 696484a
@maysamyabandeh updated the pull request - view changes - changes since last import
@maysamyabandeh updated the pull request - view changes - changes since last import
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Haven't finished reading the whole patch. Some minor comments.
@@ -607,6 +610,26 @@ int DBImpl::FindMinimumEmptyLevelFitting(ColumnFamilyData* cfd,
  return minimum_level;
}

Status DBImpl::FlushWAL(bool sync) {
move to db_impl_write.cc?
Wouldn't it make sense for it to be in the same file that SyncWAL is in?
I think SyncWAL can go to db_impl_write.cc too. But, yeah, let's leave them here.
db/db_impl.cc
Outdated
                   s.ToString().c_str());
  }
  if (!sync) {
    ROCKS_LOG_DEBUG(immutable_db_options_.info_log, "FlushWAL-nosync");
improve the logging message?
sure
db/db_impl.cc
Outdated
  WriteThread::Writer w;
  write_thread_.EnterUnbatched(&w, &mutex_);
  WriteThread::Writer nonmem_w;
  nonmem_write_thread_.EnterUnbatched(&nonmem_w, &mutex_);
Call this only for 2PC?
makes sense.
db/version_set.h
Outdated
    last_to_be_written_sequence_.store(s, std::memory_order_release);
  }

  uint64_t FetchAddLastToBeWrittenSequence(uint64_t s) {
Update PipelinedWriteImpl() to use this method?
PipelinedWriteImpl has a nice technique for avoiding the to_be_written_sequence_number. When we move forward with the next steps and move the write to an earlier phase, we might remove this function entirely. Then rolling back the changes from PipelinedWriteImpl would be difficult.
It seems to me the technique is the same. What's the difference?
I ended up adding an extra last_to_be_written_sequence_ that has to be set properly all over the code base: initialization, file ingestion, ... But PipelinedWriteImpl contains the complexity inside itself and does not expose the existence of this separate seq number to the rest of the code base.
I agree that these two counters should be consolidated. Both code paths have two counters: one for the next seq to write to the WAL and one for the seq visible to reads. We don't have a reason to keep two variables for the same thing.
@siying I guess you mean for long-term. For the purpose of this patch we do need two seq numbers.
db/db_impl_write.cc
Outdated
@@ -62,6 +62,11 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
                         WriteBatch* my_batch, WriteCallback* callback,
                         uint64_t* log_used, uint64_t log_ref,
                         bool disable_memtable) {
  // The current implementation does not support sync with concurrent writes
  assert(!concurrent_writes_ || !write_options.sync);
  if (concurrent_writes_ && write_options.sync) {
The name concurrent_writes_ confused me. It feels to me that if it is false, writes cannot be executed concurrently. Can we rename to something more specific, like buffer_wal_writes_in_prepare_ or something like that?
Sure. I am trying to avoid 2PC-specific names here. I will change it to concurrent_wal_writes_.
concurrent_wal_writes_ still feels confusing to me, as we don't actually make all WAL writes concurrent, but only prepare writes, or prepare writes with one commit write. We aren't making commit WAL writes concurrent. Maybe it can be something like concurrent_prepare_wal_writes_, or concurrent_prepare_.
db/db_impl_write.cc
Outdated
@@ -62,6 +62,11 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
                         WriteBatch* my_batch, WriteCallback* callback,
                         uint64_t* log_used, uint64_t log_ref,
                         bool disable_memtable) {
  // The current implementation does not support sync with concurrent writes
  assert(!concurrent_writes_ || !write_options.sync);
We don't tend to assert on user requests, unless they are used in a radically wrong way, like calling key() with !Valid(). The risk is that we crash a user's whole service over a per-request misconfiguration if they run in debug mode.
Sure.
db/db_impl_write.cc
Outdated
  write_thread_.EnterAsBatchGroupLeader(&w, &write_group);
  last_batch_group_size_.fetch_add(batch_group_size,
                                   std::memory_order_relaxed);
Can you explain this?
last_batch_group_size_ is now accessed concurrently, so I changed it to std::atomic. It is reset when it is read by the rate limiter.
For the 2pc-optimal case, you don't go through the throttling logic at all (if I understand correctly), so why keep this last_batch_group_size_ updated at all? For the non-2pc-optimal case, there is a performance regression in replacing a normal set with an atomic update.
I looked at it more. Since the throttling is done only in commit(), we should only update last_batch_group_size_ in commit too. The concurrent code may count the same write twice when calculating the write rate, once in prepare and once in commit, and this is wrong. We should only count it once. If we only update it in commit(), then we don't need this atomic update; a simple set would be good enough.
So if the last commit writes 1 byte while the last prepares have written 1 GB to the WAL, should we not throttle the writes to the WAL?
Correct. The throttle is to protect memtable and compaction. Writing to WAL is OK.
@@ -584,6 +716,38 @@ Status DBImpl::WriteToWAL(const WriteThread::WriteGroup& write_group,
  return status;
}

Status DBImpl::ConcurrentWriteToWAL(const WriteThread::WriteGroup& write_group,
The name confuses me. It's not really concurrent, it's just non-flushing, right?
It is actually concurrent. With the new changes we could have two threads trying to write to the WAL at the same time: the thread aggregating prepares and the thread writing the commit.
Not flushing the WAL buffer is a totally separate feature. It happens to be in the same patch since these two features together give the performance boost.
I don't get it. The WAL is one single file. How can you write to it concurrently?
The name is meant to say that this function can be called concurrently. Inside, the actual write to the WAL is serialized. I chose this name since WriteToWAL was previously assumed to be called by only one thread at a time; ConcurrentWriteToWAL is supposed to suggest that this function can now be called concurrently.
OK
db/db_impl_write.cc
Outdated
  assert(log_size != nullptr);
  Slice log_entry = WriteBatchInternal::Contents(&merged_batch);
  *log_size = log_entry.size();
  log::Writer* log_writer = logs_.back().writer;
For the normal path, is this inside the DB mutex? I moved it from here to inside the DB mutex to avoid a data race bug: 5dad9d6
Are we getting the bug back here?
Can you tell me more about the bug? If you assume that logs_.back() could change concurrently, how can you safely write through the writer pointer you got from logs_.back()? You could then write to a stale WAL.
No, I'm not assuming logs_.back() can change. Still, we got a segfault in stress tests. I am explaining in another comment.
db/db_impl_write.cc
Outdated
  if (!logs_.empty()) {
    // Always flush the buffer of the last log before switching to a new one
    log::Writer* cur_log_writer = logs_.back().writer;
    cur_log_writer->WriteBuffer();
I don't get why it isn't inside log_write_mutex_.
From the sync policy on logs_:
// - back() and items with getting_synced=true are not popped,
// - it follows that write thread with unlocked mutex_ can safely access
//   back() and items with getting_synced=true.
...
std::deque<LogWriterNumber> logs_;
@@ -645,6 +668,7 @@ Status DBImpl::SyncWAL() {
    need_log_dir_sync = !log_dir_synced_;
  }

  TEST_SYNC_POINT("DBWALTest::SyncWALNotWaitWrite:1");
I believe every access to logs_ should grab log_write_mutex_. In SyncWAL(), that is not the case.
The access policy for logs_ is a bit more fine-tuned:
904 // Log files that aren't fully synced, and the current log file.
905 // Synchronization:
906 // - push_back() is done from write thread with locked mutex_,
907 // - pop_front() is done from any thread with locked mutex_,
908 // - back() and items with getting_synced=true are not popped,
909 // - it follows that write thread with unlocked mutex_ can safely access
910 // back() and items with getting_synced=true.
911 // - When concurrent write threads is enabled, back() and push_back() must be
912 // called within log_write_mutex_
913 std::deque<LogWriterNumber> logs_;
We would potentially lose performance if we made it stricter than it already is in the code base.
@maysamyabandeh this may be old, but I realized later that back() needs to be protected by the same mutex as push_back() and pop_front() too. It actually caused a segfault in stress tests. I think the problem may be that the deque can be updated copy-on-write style, so that back() is not safe to call while push_back() or pop_front() is executing, even if they don't touch the element we are accessing with back(). I would continue to play it safe: everything should be protected by the same mutex. In the non-2pc-optimal case it is the DB mutex, and in the 2pc-optimal case it can be log_write_mutex_.
Sure. Will do.
Regarding the current sync guarantees, I believe the idea was that push_back() could not be called while back() is accessed (otherwise back() would be stale). pop_front() could be called, but according to our documentation it should be safe to call pop_front() and back() concurrently on a deque. If it is not, I will update our documentation.
The internet doesn't document any thread-safety guarantee for these operations, so I assume it is safer for every operation to be locked, unless the threads are only doing reads.
Yeah I could not find anything either. I updated the sync documentation of logs_ to clarify that.
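The locking discipline the thread converges on -- a single mutex guarding every deque operation, including back() -- can be sketched as below. This is a hypothetical, simplified stand-in (GuardedLogQueue is not RocksDB code); the mutex here plays the role of log_write_mutex_ and the deque the role of logs_.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>

// Every operation -- push_back(), pop_front(), and notably back() --
// takes the same mutex, since std::deque gives no guarantee that back()
// is safe while another thread mutates the container.
class GuardedLogQueue {
 public:
  void PushBack(int log_number) {
    std::lock_guard<std::mutex> guard(mu_);
    logs_.push_back(log_number);
  }
  void PopFront() {
    std::lock_guard<std::mutex> guard(mu_);
    if (!logs_.empty()) logs_.pop_front();
  }
  int Back() {
    std::lock_guard<std::mutex> guard(mu_);
    return logs_.empty() ? -1 : logs_.back();
  }

 private:
  std::mutex mu_;
  std::deque<int> logs_;
};

// Exercise back() concurrently with pop_front(); with one shared mutex
// this is safe. Pops iterations-1 elements, so the last one survives.
int StressBackWhilePopping(int iterations) {
  GuardedLogQueue q;
  for (int i = 0; i < iterations; ++i) q.PushBack(i);
  std::thread popper([&] {
    for (int i = 0; i + 1 < iterations; ++i) q.PopFront();
  });
  int last = -1;
  for (int i = 0; i < iterations; ++i) last = q.Back();
  (void)last;
  popper.join();
  return q.Back();
}
```

Under this scheme the unguarded fast path on back() is lost, which is exactly the performance trade-off debated above.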
@maysamyabandeh updated the pull request - view changes - changes since last import
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@maysamyabandeh updated the pull request - view changes - changes since last import
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@yiwu-arbug @siying any more comments?
Summary: RocksDB has recently added a FlushWAL API which will improve upon the performance of MySQL 2PC (more details here facebook/rocksdb#2345). This patch adds support for using the FlushWAL API in MyRocks and also matches flush_log_at_trx_commit with innodb_flush_log_at_trx_commit behaviour. Finally, it updates the submodule to include the removal of an unneeded assertion in the write path, which was tripped by this change. Test Plan: Sysbench testing is ongoing Reviewers: alexyang, yoshinori Reviewed By: yoshinori Subscribers: webscalesql-eng@fb.com, mcallaghan, sdong, myabandeh Differential Revision: https://phabricator.intern.facebook.com/D5503719 Tasks: 19690529 Signature: t1:5503719:1502127815:a1831a7c7ef1f9966952e4f78bf9fde7d3a53761
@maysamyabandeh - the test
@ajkr thanks. Let me take a look at it.
Upstream commit ID : fb-mysql-5.6.35/8b9734948e0023f6adeadb536375f22016d8e521 Summary: RocksDB has recently added a FlushWAL API which will improve upon the performance of MySQL 2PC (more details here facebook/rocksdb#2345). This patch adds support for using the FlushWAL API in MyRocks and also matches flush_log_at_trx_commit with innodb_flush_log_at_trx_commit behaviour. Finally, it updates the submodule to include the removal of an unneeded assertion in the write path, which was tripped by this change. Reviewed By: yoshinorim Differential Revision: D5503719 fbshipit-source-id: c29f0a2
Throughput: 46k tps in our sysbench settings (filling the details later)
The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.
Major design changes:
- The WAL internal buffer is no longer flushed after each write; instead it is flushed before critical operations (WAL copy via fs) or when FlushWAL is called by MySQL. Flushing the WAL buffer is also protected via mutex_.
- Two sequence numbers are tracked: last seq, which is the last visible sequence number for reads, and last seq for write, which is the next sequence number that should be used to write to WAL/memtable. This allows a memtable write to run in parallel with WAL writes.
- There can now be parallel writers, which changes a major assumption in the code base. To accommodate that, i) only one WriteImpl that intends to write to the memtable is allowed at a time, via mem_mutex_ -- which is fine since in 2PC almost all of the memtable writes come via the group commit phase, which is serial anyway; ii) all the parts of the code base that assumed to be the only writer (via EnterUnbatched) also acquire mem_mutex_; iii) stat updates are protected via a stat_mutex_.
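The two-sequence-number split described above can be illustrated with a minimal stand-alone sketch. This is not RocksDB's actual implementation -- the class and member names below are made up -- but it shows the idea: writers atomically claim a range from the write-side counter, do the WAL/memtable work, and only then advance the read-visible counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: last_write_seq_ is "last seq for write" (the next
// numbers to hand out to writers), last_visible_seq_ is "last seq" (what
// readers are allowed to see). Because visibility is published separately,
// a memtable insert can proceed in parallel with WAL writes that are not
// yet visible to reads.
class SequenceAllocator {
 public:
  // Claim `count` sequence numbers for a write; returns the first one.
  uint64_t AllocateForWrite(uint64_t count) {
    return last_write_seq_.fetch_add(count, std::memory_order_relaxed) + 1;
  }
  // After the WAL/memtable insertion completes, make the batch visible.
  void Publish(uint64_t last_seq_of_batch) {
    last_visible_seq_.store(last_seq_of_batch, std::memory_order_release);
  }
  uint64_t LastVisible() const {
    return last_visible_seq_.load(std::memory_order_acquire);
  }

 private:
  std::atomic<uint64_t> last_write_seq_{0};
  std::atomic<uint64_t> last_visible_seq_{0};
};
```

For example, a writer inserting a 3-entry batch gets sequence numbers 1..3, while readers keep seeing 0 until Publish(3) runs.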
Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get early feedback on the approach. If
we are ok with the approach, I will go ahead with these updates:
0) Rebase with Yi's pipelining changes
1) …consistent with all unit tests. Will make this optional via a config.
2) …serial commit of 2PC taken into account.
3) …releasing mutex_ beforehand (the same way EnterUnbatched does). This needs to be cleaned up.