libroach: enable rocksdb WAL recycling #35591
Conversation
Some kv0 results (summary: 6% higher throughput).
Commands:
Results before this PR:
Results after:
The kv95 results show decent improvement too. Maybe the frequent metadata/journal writes were slowing down reads. Commands:
Results before:
Results after:
That's a nice little perf boost! I recall
It looks like the last known bug related to this feature was fixed 2.5 years ago (facebook/rocksdb@1b7af5f), so I believe it is pretty stable. Ceph Bluestore appears to be using it as well. There is an infinitesimally small chance of a wrong record being replayed during recovery: a user key or value written to an old WAL could contain bytes that form a valid entry for the recycled WAL, and those bytes would have to immediately follow the final entry written to the recycled WAL. One thing we should do to be safer is enable this feature in db_crashtest.py so FB's CI runs crash-recovery correctness checks with WAL recycling enabled.
I imagine this is because we're not zeroing the log file when it is being reused. It strikes me, though, that there shouldn't be any danger due to reuse, because we can terminate recovery when we hit the old portion of the log. The entries in the log need to occur in sequence-number order (and I believe every batch contains a sequence number), so it seems straightforward to stop log recovery if we encounter a batch with a lower sequence number than the previous batch. Is this not currently done? Am I missing something about why this wouldn't work?

I'm mildly anxious about enabling this feature at this point in the release cycle, especially due to the handful of ongoing stability investigations. My preference would be to hold off on merging this until early in the 19.2 cycle (i.e. when the 19.1 release branch is cut). @nvanbenschoten, @tbg any additional thoughts?
Also, we'll need to verify this is copacetic with our encryption at rest implementation. That should be as easy as running
Same here.
Hmm, thinking more about the scenario you described, I don't think my suggestion is sufficient. And perhaps it is already done by RocksDB. I was thinking that you were worried about a previously valid WAL entry, but your comment was about a key or value that was formatted like a WAL entry.
Right, there is already a similar solution to what you described. Instead of using the seqnum, it writes the WAL number as part of each entry (and the WAL number changes each time the WAL is recycled): https://github.com/facebook/rocksdb/blob/8a1ecd1982341cfe073924d36717e11446cbe492/db/log_writer.h#L60-L64. The case I described shouldn't happen by accident, particularly due to the checksum in the record. It could be an attack vector, but it also requires the ability to make cockroach crash at a particular point, read access to the WAL file, and write access to the cockroach instance. I don't know if these conditions are realistic to worry about, or if we even expect multitenant use cases.
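For intuition, here is a minimal sketch of the idea, not the actual RocksDB log_reader code; the struct and function names are made up for illustration. Stamping each record with its WAL number lets recovery stop as soon as it reads into the stale tail of a recycled file:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record shape: the WAL number is part of every entry's header.
struct Record {
  uint32_t log_number;        // WAL number stamped when the entry was written
  std::vector<char> payload;  // batch bytes (protected separately by a checksum)
};

// Replay entries until one carries a different (older) WAL number, which
// means we have read past the new data into the recycled file's stale tail.
std::vector<Record> Replay(const std::vector<Record>& entries,
                           uint32_t current_log_number) {
  std::vector<Record> replayed;
  for (const Record& rec : entries) {
    if (rec.log_number != current_log_number) {
      break;  // leftover entry from the previous life of this file
    }
    replayed.push_back(rec);
  }
  return replayed;
}
```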
Ran kv0 and kv95 for 4KB row size. All results so far:
I suspect the experiments for smaller row sizes are bottlenecked trying to sync the same page over and over again, whereas with 4KB rows we will usually be syncing different pages each time.
Why is the
I was also surprised to get sub-3ms latencies, which I hadn't seen before. Will investigate how that happened.
Nice find @ajkr! I agree with others that we should wait on this until 19.2, but let's get it in shortly after the branch opens. I'm also curious about the improved performance for 4KB writes with kv95.
I reran 4KB vs. 2B kv95 today and observed 2B is faster than 4KB, as expected. Not sure what happened last time; probably a mistake. Anyway, should we land this now?
Definitely good to land, just a small question as to whether `1` is the right number.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajkr and @petermattis)
c-deps/libroach/options.cc, line 145 at r1 (raw file):
// readily available to recycle.
//
// We could pick a higher value if we see memtable flush backing up, or if we
Did you experiment with higher numbers? I've been able to cause memtable flushing to back up with Pebble, but perhaps RocksDB is different here. I think a kv0 workload is the worst case scenario.
Let me try kv0 with 4KB. If that doesn't back up flushes, we can probably say it won't happen in common cases. It is easy to back up flushes when running the KV store directly, but when run under cockroach I haven't seen it happen yet.
I watched for ~10 minutes and ~12GB of data written. It seemed able to recycle the WAL successfully every time.
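For reference, a minimal sketch of how this option is set through the stock RocksDB C++ API; the surrounding setup (database path, the extra write) is illustrative, not CockroachDB's actual options.cc:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Keep one obsolete WAL file around so the next WAL can reuse it (and its
  // already-written blocks) instead of creating a fresh file; 0 disables
  // recycling. Whether 1 is enough is the question raised above.
  options.recycle_log_file_num = 1;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal-recycle-demo", &db);
  if (!s.ok()) return 1;
  db->Put(rocksdb::WriteOptions(), "k", "v");
  delete db;
  return 0;
}
```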
bors r+
👎 Rejected by PR status
This avoids frequent inode writeback during `fdatasync()` from the database's third WAL onwards, at least on XFS and ext4.

Release note: None
bors r+
Build failed (retrying...)
35591: libroach: enable rocksdb WAL recycling r=ajkr a=ajkr

This avoids frequent inode writeback during `fdatasync()` from the database's third WAL onwards. It helps on filesystems that preallocate unwritten extents, like XFS and ext4, and also filesystems that don't support `fallocate()`, like ext2 and ext3.

Release note: None

37571: exec: check for unset output columnTypes r=asubiotto a=asubiotto

Future planning PRs need these to be set, so this commit introduces a check for unset columnTypes that will return an error.

Release note: None

Co-authored-by: Andrew Kryczka <andrew.kryczka2@gmail.com>
Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com>
Build succeeded
This avoids frequent inode writeback during `fdatasync()` from the database's third WAL onwards. It helps on filesystems that preallocate unwritten extents, like XFS and ext4, and also filesystems that don't support `fallocate()`, like ext2 and ext3.

Release note: None
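To make the inode-writeback point concrete, here is a small illustrative sketch (the file name and sizes are made up, and it assumes the log file already exists from an earlier WAL): overwriting already-written blocks in place leaves the file size and extents untouched, so `fdatasync()` only has to flush data, whereas appending to a brand-new file grows the file and forces metadata writeback on every sync.

```cpp
#include <fcntl.h>
#include <unistd.h>

int main() {
  char buf[4096] = {};  // stand-in for a WAL record
  // Hypothetical recycled WAL: the file and its blocks already exist because
  // a previous log wrote them, so we only overwrite them in place.
  int fd = open("000007.log", O_WRONLY);
  if (fd < 0) return 1;
  off_t off = 0;
  for (int i = 0; i < 3; ++i) {
    pwrite(fd, buf, sizeof(buf), off);  // no size or extent change
    fdatasync(fd);                      // data-only flush; inode stays clean
    off += sizeof(buf);
  }
  close(fd);
  return 0;
}
```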