
Optimize for serial commits in 2PC #2345

Closed

Conversation

maysamyabandeh
Contributor

Throughput: 46k tps in our sysbench settings (filling the details later)

The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.

Major design changes:

  1. The WAL file's internal buffer is not flushed after each write. Instead
    it is flushed before critical operations (WAL copy via fs) or when
    FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
    via mutex_.
  2. Use two sequence numbers: last seq, and last seq for write. Last seq
    is the last visible sequence number for reads. Last seq for write is the
    next sequence number that should be used to write to WAL/memtable. This
    allows a memtable write to be in parallel with WAL writes.
  3. BatchGroup is not used for writes. This means that we can have
    parallel writers, which changes a major assumption in the code base. To
    accommodate that, i) allow only one WriteImpl that intends to write to
    the memtable, via mem_mutex_--which is fine since in 2PC almost all of the
    memtable writes come via the group commit phase, which is serial anyway;
    ii) make all the parts of the code base that assumed being the only
    writer (via EnterUnbatched) also acquire mem_mutex_; iii) stat updates
    are protected via a stat_mutex_.
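The two-counter scheme in item 2 can be sketched as below. This is a minimal illustration, not the actual patch: the method names mirror the ones discussed in this PR (SetLastSequence, FetchAddLastToBeWrittenSequence), but the surrounding class is invented.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: one counter hands out sequence numbers to writers,
// the other publishes what readers may see.
class SequenceNumbers {
 public:
  // A writer reserves `count` sequence numbers for its WAL/memtable write;
  // returns the first reserved number.
  uint64_t FetchAddLastToBeWrittenSequence(uint64_t count) {
    return last_to_be_written_sequence_.fetch_add(count);
  }
  // After the memtable write lands, visibility is published for reads.
  void SetLastSequence(uint64_t s) {
    last_sequence_.store(s, std::memory_order_release);
  }
  // Readers only ever see fully written data.
  uint64_t LastSequence() const {
    return last_sequence_.load(std::memory_order_acquire);
  }

 private:
  std::atomic<uint64_t> last_sequence_{0};
  std::atomic<uint64_t> last_to_be_written_sequence_{0};
};
```

Because reservation and publication are separate, a memtable write can proceed while another thread is still appending to the WAL.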

Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get early feedback on the approach. If we
are ok with the approach I will go ahead with these updates:
0) Rebase with Yi's pipelining changes

  1. Currently batching is disabled by default to make sure that it will be
    consistent with all unit tests. Will make this optional via a config.
  2. A couple of unit tests are disabled. They need to be updated with the
    serial commit of 2PC taken into account.
  3. Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
    releasing mutex_ beforehand (the same way EnterUnbatched does). This
    needs to be cleaned up.

@yiwu-arbug
Contributor

Seems to me now all synchronization is mutex-based. Will we get more improvement if we do the following?

  • Prepare():
    • Use existing WriteImpl approach
  • Commit():
    • lock mem_mutex_
    • not responsible for triggering flush (i.e. do not call PreprocessWrite())
  • FlushWAL():
    • lock mem_mutex_ (to access commit marker buffer)
    • Use existing WriteImpl approach

And I personally prefer to have a separate WriteImpl method for the mutex-based write, to make the implementation clean. Not sure what @siying thinks.

Contributor

@siying siying left a comment

Great! I think this is a reasonable direction to go. My main comments are:
(1) have a clear separation between the new logic for MyRocks and the current behavior for other RocksDB users.
(2) think about the design of locking so that it is easier to understand and can perform better.

Also, just think about what can make the code easier to maintain going forward: patching the code everywhere, or having a new function and reusing code via helper functions. I'm worried that mixing the two locking approaches everywhere in the code may make it harder to maintain in the future.

@@ -395,9 +441,20 @@ Status DBImpl::WriteToWAL(const autovector<WriteThread::Writer*>& write_group,
WriteBatchInternal::SetSequence(merged_batch, sequence);

Slice log_entry = WriteBatchInternal::Contents(merged_batch);
if (concurrent) {
// We need to lock mutex_ since logs_ etc. might change concurrently
mutex_.Lock();
Contributor

Holding this lock in LogWriter::AddRecord() makes me nervous. This heavy duty lock can block reads and compactions. Even if not doing I/O, LogWriter::AddRecord() is still a relatively expensive operation. It calculates CRC for everything and makes several copies. Is there a way to define another mutex just to protect those WAL related data structures? You already defined two more mutexes. A third one doesn't feel like a big deal.

Contributor Author

Makes sense. It would require more work, since log_ (and its family) are assumed to be protected by mutex_ all over the code, but it is doable.

// Flush the WAL memory buffer to the file. If sync is true, it calls SyncWAL
// afterwards.
virtual Status FlushWAL(bool sync) {
throw std::runtime_error("FlushWAL not implemented");
Contributor

Should return Status::NotSupported() instead of a runtime error.

Contributor Author

Sure.
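The agreed-upon change can be sketched as below. This is a self-contained illustration: the Status type here is a stand-in (the real rocksdb::Status has a much richer API), and the base class is invented; the point is only that the default implementation reports NotSupported rather than throwing.

```cpp
#include <string>

// Stand-in for rocksdb::Status, just enough to express NotSupported.
struct Status {
  enum Code { kOk, kNotSupported };
  Code code;
  std::string msg;
  Status() : code(kOk) {}
  Status(Code c, const std::string& m) : code(c), msg(m) {}
  static Status NotSupported(const std::string& m) {
    return Status(kNotSupported, m);
  }
  bool IsNotSupported() const { return code == kNotSupported; }
};

// Hypothetical base class mirroring the quoted hunk above.
struct DBSketch {
  virtual ~DBSketch() = default;
  // Flush the WAL memory buffer to the file. If sync is true, it calls
  // SyncWAL afterwards. Default: not supported -- no exception is thrown,
  // so callers can handle the status normally.
  virtual Status FlushWAL(bool /*sync*/) {
    return Status::NotSupported("FlushWAL not implemented");
  }
};
```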

serial_memtable_guarantee = std::unique_ptr<InstrumentedMutexLock>(
new InstrumentedMutexLock(&mem_mutex_));
}
const bool batch_writes = false;
Contributor

What will the interface be eventually? Add one more parameter to WriteImpl()?

Contributor Author

Yeah I am thinking something like options.disable_group_batch and options.enable_wal_buffer

db/version_set.h Outdated
last_sequence_.store(s, std::memory_order_release);
}

void SetLastToBeWrittenSequence(uint64_t s) {
assert(s >= last_to_be_written_sequence_);
last_to_be_written_sequence_.store(s, std::memory_order_release);
Contributor

These are only accessed within DB mutex. So I assume relaxed order is enough.

Contributor Author

Makes sense. Let me double-check and confirm.

Contributor Author

SetLastToBeWrittenSequence is used only during initialization so its stronger ordering guarantee does not make much difference in performance. The newly added FetchAddLastToBeWrittenSequence could be accessed outside mutex and its ordering guarantees cannot be relaxed.

Contributor

It might sound paranoid but how about we stick to either memory_order_seq_cst or memory_order_relaxed to make it easier to reason about correctness.

Contributor Author

For this function I can increase it to memory_order_seq_cst, but for FetchAddLastToBeWrittenSequence it might affect performance. I will run a test and increase it if performance is not affected.

auto l2bws = versions_->LastToBeWrittenSequence();
assert(l2bws >= last_sequence);
versions_->SetLastToBeWrittenSequence(l2bws + count);
last_sequence = l2bws;
Contributor

How do you make sure in the WAL the entries are written in the sequence number order, if the lock is released after here? Should we get the ID in the same mutex before writing to WAL? Or maybe never release the mutex until then.

Contributor Author

I moved this logic to inside the WriteToWAL under mutex.
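The fix being described -- reserving the sequence numbers and appending to the WAL under the same mutex, so WAL entries appear in sequence-number order -- can be sketched as below. All names here are illustrative stand-ins, not the actual patch.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Illustrative globals: a mutex guarding both the counter and the WAL.
std::mutex wal_mutex_;
std::vector<uint64_t> wal_entries_;  // stand-in for WAL records (first seq)
uint64_t last_to_be_written_sequence_ = 0;

// Reserve `count` sequence numbers and append the record while still
// holding the mutex. Because no other writer can interleave between the
// reservation and the append, WAL order matches sequence order.
uint64_t WriteToWALSketch(uint64_t count) {
  std::lock_guard<std::mutex> guard(wal_mutex_);
  uint64_t seq = last_to_be_written_sequence_;  // reserve ...
  last_to_be_written_sequence_ += count;
  wal_entries_.push_back(seq);                  // ... and append, atomically
  return seq;
}
```

If the mutex were released between reserving the number and appending, a writer with a later sequence number could reach the WAL first, which is the reordering hazard raised in the comment above.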

@@ -180,6 +206,9 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
// that nobody else can be writing to these particular stats.
// We're optimistic, updating the stats before we successfully
// commit. That lets us release our leader status early.
if (!batch_writes) {
stat_mutex_.Lock();
Contributor

Which stat introduces a data race risk?

Contributor Author

AddDBStats is not atomic.

void AddDBStats(InternalDBStatsType type, uint64_t value) {
  auto& v = db_stats_[type];
  v.store(v.load(std::memory_order_relaxed) + value,
          std::memory_order_relaxed);
}

StatisticsImpl::recordTick seems to be fine (using fetch_add) and PERF_TIMER_* seem to be thread-local and thus ok.
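The non-atomicity of AddDBStats can be demonstrated in isolation: its load-then-store pattern is two separate atomic operations, so concurrent increments can be lost, whereas a single fetch_add (the pattern recordTick uses) cannot lose updates. The sketch below uses no RocksDB types.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct CounterResult {
  uint64_t racy;  // updated with load-then-store, like AddDBStats
  uint64_t safe;  // updated with a single fetch_add
};

CounterResult RunCounters(int num_threads, int iters) {
  std::atomic<uint64_t> racy{0};
  std::atomic<uint64_t> safe{0};
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < iters; ++i) {
        // Two separate atomic ops: another thread can increment between
        // the load and the store, so that increment is lost.
        racy.store(racy.load(std::memory_order_relaxed) + 1,
                   std::memory_order_relaxed);
        // One atomic read-modify-write: never loses an increment.
        safe.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& th : threads) th.join();
  return CounterResult{racy.load(), safe.load()};
}
```

The safe counter always ends at num_threads * iters; the racy one typically ends lower under contention, which is exactly why stat updates need stat_mutex_ (or fetch_add).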

db/db_impl.cc Outdated
@@ -2520,6 +2543,9 @@ Status DBImpl::IngestExternalFile(

// Stop writes to the DB
WriteThread::Writer w;
mutex_.Unlock();
mem_mutex_.Lock();
Contributor

To play safe, I suggest we only do this extra locking logic in a mode that is only enabled in MyRocks. I'm scared of potential deadlock here.

Contributor Author

Yeah me too ;) In the past two days I have been fixing deadlocks in unit tests. Let me think further to see how we can simplify this.

db/db_impl.h Outdated
@@ -791,6 +797,8 @@ class DBImpl : public DB {
// NOTE: should never acquire options_file_mutex_ and mutex_ at the
// same time.
InstrumentedMutex options_files_mutex_;
InstrumentedMutex mem_mutex_;
InstrumentedMutex stat_mutex_;
Contributor

Document the lock holding order of all those mutexes.

Contributor Author

Done.

std::unique_ptr<InstrumentedMutexLock> serial_memtable_guarantee;
if (lock_memtable) {
serial_memtable_guarantee = std::unique_ptr<InstrumentedMutexLock>(
new InstrumentedMutexLock(&mem_mutex_));
Contributor

This needs to be avoided in the non-MyRocks case.
Also, consider avoiding this malloc. We can add a function to InstrumentedMutexLock like set_mutex().

Contributor Author

In the new approach I am using a separate write group, and mem_mutex_ is no longer needed.

@maysamyabandeh
Contributor Author

maysamyabandeh commented May 22, 2017

Thanks for the feedback @yiwu-arbug. A couple of questions:

  1. Which overhead is expected to be reduced by the proposed approach?
  2. Commit also has WAL writes. How does the proposed approach ensure serial access to the WAL when commits do not join the group commit?
  3. How does the background process sync with memtable access in commits? Currently the background process acquires mem_mutex_ (in addition to EnterUnbatched). Do you suggest we keep that?
  4. I like the idea of removing PreprocessWrite from the commit phase. My experiments however did not suggest that it would improve throughput. I guess that is because the check is fast and the actual work of switching the memtable happens very rarely.
  5. Currently FlushWAL does not acquire mem_mutex_. Is that incorrect? Can you clarify why flushing the WAL buffer would require mem_mutex_?

PS: My first impl had this split into two workflows. Then I realized there was much duplication between the two (stats, etc.) and figured that a single function would make keeping them consistent easier in the future. I am open to suggestions though.

@yiwu-arbug
Contributor

@maysamyabandeh seems I didn't understand the patch very well. Will sync with you offline.

@maysamyabandeh
Contributor Author

Thanks @yiwu-arbug for your comment. I enabled batching for preparers as you suggested and the throughput increased to 49k. I guess it is because it reduces competition with the serial commit over shared logs.

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes


Contributor

@yiwu-arbug yiwu-arbug left a comment

Haven't deep dived into the logic yet; just some ideas for reorganizing the code:

  1. Move the new logic in WriteImpl() into WriteImpl2PCPrepare() and WriteImpl2PCCommit() to avoid mixing the normal write path and the 2pc write path. Also remove disable_memtable from WriteImpl() since it is used by 2pc only.
  2. Have WriteImpl2PCCommit() (or even FlushWAL()) update db stats, to avoid the need for stat_mutex_.
  3. Not sure if it is a good idea, but you could reuse WriteThread::newest_memtable_writer_ to replace nonmem_write_thread_.

Contributor

@siying siying left a comment

I think this diff makes sense. In the MyRocks case, we anyway need a way to allow MyRocks to have fine control of WAL writing and WAL flushing. I believe this is a good change to have anyway.

From the coding point of view, this is much better than the previous version. I still think it can be made even better if we have a separate lightweight PrepareToWALBuffer() call and build it on top of @yiwu-arbug's diff.

total_log_size_ += log_entry.size();
alive_log_files_.back().AddSize(log_entry.size());
if (concurrent) {
Contributor

I suggest we put lines 423 to 434 into a separate helper function, and then directly call it from a function like DBImpl::PrepareToWALBuffer() where the mutex is held. And we can do the same for lines 470-478.

Contributor Author

done

WriteBatchInternal::SetSequence(merged_batch, sequence);

Slice log_entry = WriteBatchInternal::Contents(merged_batch);
log::Writer* log_writer = logs_.back().writer;
status = log_writer->AddRecord(log_entry);
Contributor

Maybe whether to write to the buffer should be passed in on a per-record basis. In that case, if someone issues a write without going through this prepare API, it will call AddRecord() with the flush option, so it should still work.

Contributor Author

Can you explain more? We are currently enabling the WAL buffer with an option, which can be enabled independently from 2PC.
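One reading of the per-record suggestion above can be sketched as follows (an invented minimal LogWriter, not the RocksDB one): AddRecord takes a flush flag, so writers that bypass the prepare API still flush each record, while buffered callers flush later via WriteBuffer/FlushWAL.

```cpp
#include <string>

// Toy log writer: strings stand in for file contents and the in-memory
// WAL buffer.
class LogWriterSketch {
 public:
  // flush_after=true models the legacy path (flush every record);
  // flush_after=false models the buffered path introduced in this PR.
  void AddRecord(const std::string& entry, bool flush_after) {
    buffer_ += entry;
    if (flush_after) WriteBuffer();
  }
  // Moves buffered bytes to the "file"; in the patch this is what
  // FlushWAL would eventually trigger.
  void WriteBuffer() {
    file_ += buffer_;
    buffer_.clear();
  }
  const std::string& file() const { return file_; }

 private:
  std::string buffer_;
  std::string file_;
};
```

With this shape, mixing buffered and unbuffered writers stays correct: an unbuffered record flushes everything accumulated so far.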

// If true WAL is not flushed automatically after each write. Instead it
// relies on manual invocation of FlushWAL to write the WAL buffer to its
// file.
bool manual_wal_flush = false;
Contributor

Can we just use one option?
The only place I think we need an option is whether to grab mutex for normal writes.

If we have a direct PrepareToWALBuffer() function, there isn't a second queue, so the extra queue logic can be avoided.

Contributor Author

Two options are better. One might enable the WAL buffer without the improvements for 2PC commit: https://groups.google.com/d/msg/rocksdb/ngNeSSQHfao/_AV_KIf1CAAJ

Contributor Author

Can you explain more? The second queue was added after Yi's suggestion, which made the code much cleaner and even improved throughput a bit.

If we have a direct PrepareToWALBuffer() function, there isn't a second queue, so the extra queue logic can be avoided.

Contributor

Why do we need a queue if we don't do group commit and don't care whether the order out follows the order in? I also don't understand why it can improve throughput. Can you explain it?

Contributor Author

  1. In theory the other queue should help: without the queue, 128 prepare threads compete with commit on wal_mutex_, but with the queue only one thread does that. In practice I have seen mixed results: in some runs it improved throughput by 3-4k.
  2. Without the queue, other parts of the code that wanted to be the sole writer would still need to hold a mutex. The combination of a mutex and entering the queue unbatched was a bit ugly. Instead they now enter two queues.

Contributor Author

There is an existing use case for only WAL buffering: https://groups.google.com/d/msg/rocksdb/ngNeSSQHfao/_AV_KIf1CAAJ
This feature is going to stay even after we discard concurrent WAL writes in the future, so it does need its own config.

To me, the WAL buffer is not a performance config; it is more of a durability one. Normal users should not be bothered with playing around with it. Advanced users can make use of it only after changing their application to call FlushWAL at the right times (like MyRocks does). So I do not think we should include such options in the tuning complexity issue that we have.

Contributor

Can this user also use the same option?

Contributor Author

Then we would also use two write queues for this user, which is irrelevant to the WAL buffer feature that they intended to have.
I do think the WAL buffer is something that will stay in the code base for a long time. The other feature is temporary and will be removed entirely once we push the commits to an earlier stage.

Contributor

OK then.

As a separate thing, maybe it's a good time to separate DBOptions to an advanced part and put it to advanced_options.h too, just like what we do for CFOptions.

Contributor Author

Makes sense. Opening a task for it.

@maysamyabandeh maysamyabandeh force-pushed the buffered-commit-4github branch from bc8f85a to 77db97e Compare June 8, 2017 18:33
@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@maysamyabandeh maysamyabandeh force-pushed the buffered-commit-4github branch from 6c3582d to 696484a Compare June 11, 2017 20:59
@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import


@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@yiwu-arbug yiwu-arbug left a comment

Haven't finished reading the whole patch. Some minor comments.

@@ -607,6 +610,26 @@ int DBImpl::FindMinimumEmptyLevelFitting(ColumnFamilyData* cfd,
return minimum_level;
}

Status DBImpl::FlushWAL(bool sync) {
Contributor

move to db_impl_write.cc?

Contributor Author

Wouldn't it make sense for it to be in the same file that SyncWAL is in?

Contributor

I think SyncWAL can go to db_impl_write.cc too. But, yeah, let's leave them here.

db/db_impl.cc Outdated
s.ToString().c_str());
}
if (!sync) {
ROCKS_LOG_DEBUG(immutable_db_options_.info_log, "FlushWAL-nosync");
Contributor

improve the logging message?

Contributor Author

sure

db/db_impl.cc Outdated
WriteThread::Writer w;
write_thread_.EnterUnbatched(&w, &mutex_);
WriteThread::Writer nonmem_w;
nonmem_write_thread_.EnterUnbatched(&nonmem_w, &mutex_);
Contributor

call only for 2pc?

Contributor Author

makes sense.


db/version_set.h Outdated
last_to_be_written_sequence_.store(s, std::memory_order_release);
}

uint64_t FetchAddLastToBeWrittenSequence(uint64_t s) {
Contributor

Update PipelinedWriteImpl() to use this method?

Contributor Author

PipelinedWriteImpl has a nice technique for avoiding the to_be_written_sequence_number. When we move forward with the next steps and move the write to an earlier phase, we might remove this function entirely. Then rolling back the changes from PipelinedWriteImpl would be difficult.

Contributor

It seems to me the technique is the same. What's the difference?

Contributor Author

I ended up adding an extra last_to_be_written_sequence_ that has to be set properly all over the code base: initialization, file injection, ... But PipelinedWriteImpl contains the complexity inside itself and does not expose the existence of this separate seq number to the rest of the code base.

Contributor

I agree that these two counters should be consolidated. Both of the code paths have two counters, one next seq for WAL and one seq visible to reads. We don't have a reason to keep two variables for the same thing.

Contributor Author

@siying I guess you mean for long-term. For the purpose of this patch we do need two seq numbers.

@@ -62,6 +62,11 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
WriteBatch* my_batch, WriteCallback* callback,
uint64_t* log_used, uint64_t log_ref,
bool disable_memtable) {
// The current implementation does not support sync with concurrenet writes
assert(!concurrent_writes_ || !write_options.sync);
if (concurrent_writes_ && write_options.sync) {
Contributor

The name concurrent_writes_ confused me. It feels to me that if it is false, writes cannot be executed concurrently. Can we rename to something more specific, like buffer_wal_writes_in_prepare_ or something like that?

Contributor Author

sure. I am trying to avoid 2pc specific names here. I will change it to concurrent_wal_writes_.

Contributor

concurrent_wal_writes_ still feels confusing to me, as we don't actually make all WAL writes concurrent, but only prepare writes or prepare writes with one commit write. We aren't making commit WAL writes concurrent. Maybe it can be something like concurrent_prepare_wal_writes_, or concurrent_prepare_.

@@ -62,6 +62,11 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
WriteBatch* my_batch, WriteCallback* callback,
uint64_t* log_used, uint64_t log_ref,
bool disable_memtable) {
// The current implementation does not support sync with concurrenet writes
assert(!concurrent_writes_ || !write_options.sync);
Contributor

We don't tend to assert on user requests, unless they are used in a radically wrong way, like calling key() with !Valid(). The risk is that we crash users' whole service on a per-request misconfig if they run in debug mode.

Contributor Author

Sure.

write_thread_.EnterAsBatchGroupLeader(&w, &write_group);
last_batch_group_size_.fetch_add(batch_group_size,
std::memory_order_relaxed);
Contributor

Can you explain this?

Contributor Author

last_batch_group_size_ is now accessed concurrently, so I changed it to std::atomic. It is reset when it is read by the rate limiter.

Contributor

For the 2pc-optimal case, you don't go through the throttling logic at all (if I understand correctly), so why keep last_batch_group_size_ updated at all? For the non-2pc-optimal case, there is a performance regression from replacing a normal set with an atomic update.

Contributor

I looked at it more. Since the throttling is done only in commit(), we should only update last_batch_group_size_ in commit too. Maybe the concurrent code counts the same write twice when calculating the write rate, once in prepare and once in commit, and this is wrong. We should only count it once. If we only update it in commit(), then we don't need this atomic update; a simple set would be good enough.

Contributor Author

So if the last commit writes 1 byte while the last prepares have written 1GB to the WAL, shouldn't we throttle the writes to the WAL?

Contributor

Correct. The throttle is to protect memtable and compaction. Writing to WAL is OK.

@@ -584,6 +716,38 @@ Status DBImpl::WriteToWAL(const WriteThread::WriteGroup& write_group,
return status;
}

Status DBImpl::ConcurrentWriteToWAL(const WriteThread::WriteGroup& write_group,
Contributor

The name confuses me. It's not really about being concurrent, but about not flushing, right?

Contributor Author

It is actually concurrent. With the new changes we could have two threads trying to write to the WAL at the same time: the thread aggregating prepares and the thread writing the commit.
Not flushing the WAL buffer is a totally separate feature. It happens to be in the same patch since these two features together give the performance boost.

Contributor

I don't get it. WAL is one single file. How you can write to it concurrently?

Contributor Author

It was meant to say that this function can be called concurrently. Inside, the actual write to the WAL is serialized. I chose this name since previously WriteToWAL was assumed to be called only by one thread at a time. ConcurrentWriteToWAL is supposed to suggest that this function can now be called concurrently.
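The naming point -- callable from several threads at once, while the actual append is serialized internally -- can be sketched as below. log_write_mutex_ and the vector standing in for the single WAL file are illustrative assumptions, not the patch's code.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex log_write_mutex_;
std::vector<std::string> wal_;  // stand-in for the single WAL file

// Safe to *call* concurrently (e.g. the prepare group leader and the
// commit writer) ...
void ConcurrentWriteToWALSketch(const std::string& log_entry) {
  std::lock_guard<std::mutex> guard(log_write_mutex_);
  // ... but only one thread appends at a time, keeping the file
  // internally consistent.
  wal_.push_back(log_entry);
}

// Drive the function from two threads, mimicking prepare vs. commit.
size_t WriteFromTwoThreads(int per_thread) {
  std::thread prepare([&] {
    for (int i = 0; i < per_thread; ++i) ConcurrentWriteToWALSketch("prepare");
  });
  std::thread commit([&] {
    for (int i = 0; i < per_thread; ++i) ConcurrentWriteToWALSketch("commit");
  });
  prepare.join();
  commit.join();
  return wal_.size();
}
```

Every append survives despite the concurrent callers, which is the property the function name is trying to advertise.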

Contributor

OK

assert(log_size != nullptr);
Slice log_entry = WriteBatchInternal::Contents(&merged_batch);
*log_size = log_entry.size();
log::Writer* log_writer = logs_.back().writer;
Contributor

For the normal path, is it inside the DB mutex? I moved it from here to inside the DB mutex to avoid a data race bug: 5dad9d6

Are we getting the bug back here?

Contributor Author

Can you tell me more about the bug? If you assume that logs_.back() could change concurrently, how can you safely write through the last pointer you got from logs_.back()? You could end up writing to a stale WAL.

Contributor

No, I'm not assuming logs_.back() can change. Still, we got a segfault in stress tests. I am explaining in another comment.

if (!logs_.empty()) {
// Always flush the buffer of the last log before switching to a new one
log::Writer* cur_log_writer = logs_.back().writer;
cur_log_writer->WriteBuffer();
Contributor

I don't get why it isn't inside log_write_mutex_.

Contributor Author

From the sync policy on logs_:

  //  - back() and items with getting_synced=true are not popped,
  //  - it follows that write thread with unlocked mutex_ can safely access
  //    back() and items with getting_synced=true.
...
  std::deque<LogWriterNumber> logs_;

@@ -645,6 +668,7 @@ Status DBImpl::SyncWAL() {
need_log_dir_sync = !log_dir_synced_;
}

TEST_SYNC_POINT("DBWALTest::SyncWALNotWaitWrite:1");
Contributor

I believe every access to logs_ should grab log_write_mutex_. In SyncWAL(), it is not the case.

Contributor Author

The access policy for logs_ is a bit more fine-tuned:

  // Log files that aren't fully synced, and the current log file.
  // Synchronization:
  //  - push_back() is done from write thread with locked mutex_,
  //  - pop_front() is done from any thread with locked mutex_,
  //  - back() and items with getting_synced=true are not popped,
  //  - it follows that write thread with unlocked mutex_ can safely access
  //    back() and items with getting_synced=true.
  //  - When concurrent write threads are enabled, back() and push_back() must
  //    be called within log_write_mutex_
  std::deque<LogWriterNumber> logs_;

We would potentially lose performance if we make it stricter than it already is in the code base.

Contributor

@maysamyabandeh this may be old, but I realized later that back() needs to be protected by the same mutex as push_back() and pop_front() too. It actually caused a segfault in stress tests. I think the problem may be that the deque gets updated copy-on-write internally, so back() is not safe to call while push_back() or pop_front() is executing, even if they don't touch the element we are accessing with back(). I would continue playing it safe: everything should be protected by the same mutex. In the non-2pc-optimal case, it is the DB mutex, and in the 2pc-optimal case, it can be log_write_mutex_.

Contributor Author

Sure. Will do.
Regarding the current sync guarantees, I believe the idea was that push_back() could not be called while back() is accessed (otherwise back() would be stale). pop_front() could be called, but according to our documentation it should be safe to call pop_front() and back() concurrently on a deque. If it is not, I will update our documentation.

Contributor

The Internet doesn't document any thread-safety guarantees for them, so I assume it is safer for every operation to be locked, unless both threads are only doing reads.

Contributor Author

Yeah I could not find anything either. I updated the sync documentation of logs_ to clarify that.

@facebook-github-bot

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh

@yiwu-arbug @siying any more comments?

samcmuldroch pushed a commit to samcmuldroch/mysql-5.6 that referenced this pull request Sep 6, 2017
Summary:
RocksDB has recently added a FlushWAL API which will improve upon the
performance of MySQL 2PC (more details here facebook/rocksdb#2345).
This patch adds support for using the FlushWAL API in MyRocks and also matches flush_log_at_trx_commit with innodb_flush_log_at_trx_commit behaviour.

Finally, it updates the submodule to include the removal of an unneeded
assertion in the write path, which was tripped by this change.

Test Plan: Sysbench testing is ongoing

Reviewers: alexyang, yoshinori

Reviewed By: yoshinori

Subscribers: webscalesql-eng@fb.com, mcallaghan, sdong, myabandeh

Differential Revision: https://phabricator.intern.facebook.com/D5503719

Tasks: 19690529

Signature: t1:5503719:1502127815:a1831a7c7ef1f9966952e4f78bf9fde7d3a53761
@ajkr

ajkr commented Sep 12, 2017

@maysamyabandeh - the test WriteCallbackTest.WriteWithCallbackTest isn't working with concurrent_prepare = true. The problem was exposed by #2864, where adding initialization in DBOptions(const Options&) caused this test to fail.

@maysamyabandeh

@ajkr thanks. let me take a look at it.

@maysamyabandeh

@ajkr fix is here: #2875

The same MyRocks commit (summary as above; Differential Revision: D5503719) was subsequently mirrored by commits referencing this pull request in: george-lorch/percona-server (Oct 4, 2017), facebook/mysql-5.6 (Dec 23, 2019; Mar 8, 2021; Jan 10, 2022), inikep/mysql-5.6 (repeatedly between Aug 12, 2020 and Jun 14, 2023), ldonoso/percona-server (Mar 15, 2022), laurynas-biveinis/mysql-5.6 (Aug 11, 2022), and inikep/percona-server (Apr 9, 2024).