storage: Use batches for direct RocksDB mutations #55708

Merged (1 commit) on Oct 20, 2020

itsbilal
Member

Currently, doing direct mutations on a RocksDB instance bypasses
custom batching / syncing logic that we've built on top of it.
This, or something internal to RocksDB, started leading to some bugs
when all direct mutations started passing in WriteOptions.sync = true
(see #55240 for when this change went in).

In this change, direct mutations still commit the batch with sync=true
to guarantee WAL syncing, but they go through the batch commit pipeline
too, just like the vast majority of operations already do.

Fixes #55362.

Release note: None.
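
To illustrate the shape of the change described above, here is a minimal, self-contained sketch. It is not the code from pkg/storage/rocksdb.go: the `engine`, `batch`, and `op` types, the in-memory map, and the commit counters are all hypothetical stand-ins. Only the pattern (create a batch, apply the single mutation to it, then `Commit(true /* sync */)`) mirrors what the PR description and the reviewed snippet show.

```go
package main

import (
	"errors"
	"fmt"
)

// op is one buffered mutation (hypothetical; for illustration only).
type op struct {
	key, value string
	del        bool
}

// engine is a hypothetical in-memory stand-in for the RocksDB wrapper.
// The counters exist only so the sketch can show which path was taken.
type engine struct {
	data          map[string]string
	commits       int
	syncedCommits int
}

// batch buffers mutations until Commit is called.
type batch struct {
	eng       *engine
	ops       []op
	committed bool
}

func newEngine() *engine { return &engine{data: map[string]string{}} }

func (e *engine) NewBatch() *batch { return &batch{eng: e} }

func (b *batch) Put(key, value string) error {
	if key == "" {
		return errors.New("empty key")
	}
	b.ops = append(b.ops, op{key: key, value: value})
	return nil
}

func (b *batch) Clear(key string) error {
	if key == "" {
		return errors.New("empty key")
	}
	b.ops = append(b.ops, op{key: key, del: true})
	return nil
}

// Commit applies every buffered op through the one commit pipeline.
// In this sketch, sync only bumps a counter; a real engine would sync the WAL.
func (b *batch) Commit(sync bool) error {
	if b.committed {
		return errors.New("batch already committed")
	}
	b.committed = true
	for _, o := range b.ops {
		if o.del {
			delete(b.eng.data, o.key)
		} else {
			b.eng.data[o.key] = o.value
		}
	}
	b.eng.commits++
	if sync {
		b.eng.syncedCommits++
	}
	return nil
}

// Put is a "direct" mutation that is routed through a one-op batch
// instead of writing to the underlying store on its own path.
func (e *engine) Put(key, value string) error {
	b := e.NewBatch()
	if err := b.Put(key, value); err != nil {
		return err
	}
	return b.Commit(true /* sync */)
}

// Clear follows the same pattern for deletions.
func (e *engine) Clear(key string) error {
	b := e.NewBatch()
	if err := b.Clear(key); err != nil {
		return err
	}
	return b.Commit(true /* sync */)
}

func main() {
	e := newEngine()
	if err := e.Put("k", "v"); err != nil {
		fmt.Println("put failed:", err)
		return
	}
	fmt.Println("commits:", e.commits, "synced:", e.syncedCommits)
}
```

The point of the pattern is that single-key writes and batched writes share one commit path, so any batching or syncing logic layered on top of the store is applied uniformly.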

@itsbilal itsbilal self-assigned this Oct 19, 2020
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@petermattis petermattis left a comment

Presuming this fixes the bug, let's disable the DBImpl::{Put,Merge,Delete,SingleDelete,DeleteRange} code paths, or at least revert the use of sync = true there.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, @petermattis, and @tbg)


pkg/storage/rocksdb.go, line 516 at r1 (raw file):

		return err
	}
	return b.Commit(true)

Let's annotate these trues with Commit(true /* sync */).

@itsbilal itsbilal force-pushed the rocksdb-use-batch-ops branch 2 times, most recently from 935c1cd to e0a0e7a Compare October 19, 2020 21:25
Member Author

@itsbilal itsbilal left a comment

TFTR!

Removed the use of sync = true in those code paths. Can't remove the code paths entirely, as some tests depend on them.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens, @petermattis, and @tbg)


pkg/storage/rocksdb.go, line 516 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Let's annotate these trues with Commit(true /* sync */).

Done.

Member Author

@itsbilal itsbilal left a comment

Re: bugfix, I've run engine/switch/nodes=3 ~50 times with no repro, and engine/switch/encrypted ~20 times. Given the latter was failing approx. 50% of the time before this change, I'm pretty confident this fixes it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens, @petermattis, and @tbg)

Collaborator

@petermattis petermattis left a comment

:lgtm:

Were you able to reproduce the engine/switch/nodes=3 failure without this PR?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @tbg)


pkg/storage/rocksdb.go, line 561 at r2 (raw file):

// It is safe to modify the contents of the arguments after ApplyBatchRepr
// returns.
func (r *RocksDB) ApplyBatchRepr(repr []byte, sync bool) error {

Can we get rid of the sync argument here? I don't think it is ever used. Or perhaps we should just ignore this, as we're removing RocksDB shortly anyway.

@tbg
Member

tbg commented Oct 20, 2020

LGTM

@itsbilal
Member Author

Yes, I was able to repro it after ~50 runs. It's a lot less frequent than the engine/switch/encrypted reproduction, which was almost 1 in 2. It does seem like this change fixes both.

@itsbilal
Member Author

TFTRs!

bors r+

@petermattis
Collaborator

We'll want to backport this PR to 20.2, 20.1, and 19.2.

@craig
Contributor

craig bot commented Oct 20, 2020

Build succeeded:

@craig craig bot merged commit d58b0dc into cockroachdb:master Oct 20, 2020
Development

Successfully merging this pull request may close these issues.

roachtest: engine/switch/encrypted/nodes=3 failed
4 participants