libroach: enable rocksdb WAL recycling #35591
Conversation
Some kv0 results (summary: 6% higher throughput).
Commands:
Results before this PR:
Results after:
The kv95 results show decent improvement too. Maybe the frequent metadata/journal writes were slowing down reads. Commands:
Results before:
Results after:
That's a nice little perf boost! I recall
It looks like the last known bug related to this feature was fixed 2.5 years ago (facebook/rocksdb@1b7af5f), so I believe it is pretty stable. Ceph Bluestore appears to be using it as well. There is an infinitesimally small chance of a wrong record being replayed during recovery: a user key or value written to an old WAL could contain bytes that form a valid entry for the recycled WAL, and those bytes would have to immediately follow the final entry written to the recycled WAL. One thing we should do to be safer is enable this feature in db_crashtest.py so FB's CI runs crash-recovery correctness checks with WAL recycling enabled.
I imagine this is because we're not zeroing the log file when it is being reused. It strikes me, though, that there shouldn't be any danger due to reuse, because we can terminate recovery when we hit the old portion of the log. The entries in the log need to occur in sequence-number order (and I believe every batch contains a sequence number), so it seems straightforward to stop log recovery if we encounter a batch with a lower sequence number than the previous batch. Is this not currently done? Am I missing something about why this wouldn't work?

I'm mildly anxious about enabling this feature at this point in the release cycle, especially due to the handful of ongoing stability investigations. My preference would be to hold off on merging this until early in the 19.2 cycle (i.e. when the 19.1 release branch is cut). @nvanbenschoten, @tbg any additional thoughts?
Also, we'll need to verify this is copacetic with our encryption at rest implementation. That should be as easy as running
Same here.
Hmm, thinking more about the scenario you described, I don't think my suggestion is sufficient. And perhaps it is already done by RocksDB. I was thinking that you were worried about a previously valid WAL entry, but your comment was about a key or value that was formatted like a WAL entry.
Right, there is already a similar solution to what you described. Instead of using the seqnum, it writes the WAL number as part of each entry (and the WAL number changes each time the WAL is recycled): https://github.com/facebook/rocksdb/blob/8a1ecd1982341cfe073924d36717e11446cbe492/db/log_writer.h#L60-L64. The case I described shouldn't happen by accident, particularly due to the checksum in the record. It could be an attack vector, but it also requires the ability to make cockroach crash at a particular point, read access to the WAL file, and write access to the cockroach instance. I don't know if these conditions are realistic to worry about, or if we even expect multitenant use cases.
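For intuition, here is a minimal sketch of the idea, not the actual RocksDB log_reader code; the struct and function names are made up for illustration. Stamping each record with its WAL number lets recovery stop as soon as it reads into the stale tail of a recycled file:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record shape: the WAL number is part of every entry's header.
struct Record {
  uint32_t log_number;        // WAL number stamped when the entry was written
  std::vector<char> payload;  // batch bytes (protected separately by a checksum)
};

// Replay entries until one carries a different (older) WAL number, which
// means we have read past the new data into the recycled file's stale tail.
std::vector<Record> Replay(const std::vector<Record>& entries,
                           uint32_t current_log_number) {
  std::vector<Record> replayed;
  for (const Record& rec : entries) {
    if (rec.log_number != current_log_number) {
      break;  // leftover entry from the previous life of this file
    }
    replayed.push_back(rec);
  }
  return replayed;
}
```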
Ran kv0 and kv95 for 4KB row size. All results so far:
I suspect the experiments for smaller row sizes are bottlenecked trying to sync the same page over and over again, whereas with 4KB rows we will usually be syncing different pages each time.
Why is the
I was also surprised to get sub-3ms latencies, which I hadn't seen before. Will investigate how that happened.
Nice find @ajkr! I agree with others that we should wait on this until 19.2, but let's get it in shortly after the branch opens. I'm also curious about the improved performance for 4KB writes with kv95.
I reran 4KB vs. 2B kv95 today and observed 2B is faster than 4KB, as expected. Not sure what happened last time; probably a mistake. Anyway, should we land this now?
Definitely good to land, just a small question as to whether `1` is the right number.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajkr and @petermattis)
c-deps/libroach/options.cc, line 145 at r1 (raw file):
// readily available to recycle.
//
// We could pick a higher value if we see memtable flush backing up, or if we
Did you experiment with higher numbers? I've been able to cause memtable flushing to back up with Pebble, but perhaps RocksDB is different here. I think a kv0 workload is the worst case scenario.
Let me try kv0 with 4KB. If that doesn't back up flushes, we can probably say it won't happen in common cases. It is easy to back up flushes when running the KV store directly, but when run under cockroach I haven't seen it happen yet.
I watched for ~10 minutes and ~12GB of data written. It seemed able to recycle the WAL successfully every time.
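For reference, a minimal sketch of how this option is set through the stock RocksDB C++ API; the surrounding setup (database path, the extra write) is illustrative, not CockroachDB's actual options.cc:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Keep one obsolete WAL file around so the next WAL can reuse it (and its
  // already-written blocks) instead of creating a fresh file; 0 disables
  // recycling. Whether 1 is enough is the question raised above.
  options.recycle_log_file_num = 1;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal-recycle-demo", &db);
  if (!s.ok()) return 1;
  db->Put(rocksdb::WriteOptions(), "k", "v");
  delete db;
  return 0;
}
```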
bors r+
👎 Rejected by PR status
This avoids frequent inode writeback during `fdatasync()` from the database's third WAL onwards, at least on XFS and ext4.

Release note: None
bors r+
Build failed (retrying...)
35591: libroach: enable rocksdb WAL recycling r=ajkr a=ajkr

This avoids frequent inode writeback during `fdatasync()` from the database's third WAL onwards. It helps on filesystems that preallocate unwritten extents, like XFS and ext4, and also filesystems that don't support `fallocate()`, like ext2 and ext3.

Release note: None

37571: exec: check for unset output columnTypes r=asubiotto a=asubiotto

Future planning PRs need these to be set, so this commit introduces a check for unset columnTypes that will return an error.

Release note: None

Co-authored-by: Andrew Kryczka <andrew.kryczka2@gmail.com>
Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com>
Build succeeded
This avoids frequent inode writeback during `fdatasync()` from the database's third WAL onwards. It helps on filesystems that preallocate unwritten extents, like XFS and ext4, and also filesystems that don't support `fallocate()`, like ext2 and ext3.

Release note: None
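To make the inode-writeback point concrete, here is a small illustrative sketch (the file name and sizes are made up, and it assumes the log file already exists from an earlier WAL): overwriting already-written blocks in place leaves the file size and extents untouched, so `fdatasync()` only has to flush data, whereas appending to a brand-new file grows the file and forces metadata writeback on every sync.

```cpp
#include <fcntl.h>
#include <unistd.h>

int main() {
  char buf[4096] = {};  // stand-in for a WAL record
  // Hypothetical recycled WAL: the file and its blocks already exist because
  // a previous log wrote them, so we only overwrite them in place.
  int fd = open("000007.log", O_WRONLY);
  if (fd < 0) return 1;
  off_t off = 0;
  for (int i = 0; i < 3; ++i) {
    pwrite(fd, buf, sizeof(buf), off);  // no size or extent change
    fdatasync(fd);                      // data-only flush; inode stays clean
    off += sizeof(buf);
  }
  close(fd);
  return 0;
}
```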