
libroach: enable rocksdb WAL recycling #35591

Merged: 1 commit into cockroachdb:master on May 22, 2019

Conversation

ajkr (Contributor) commented Mar 10, 2019

This avoids frequent inode writeback during fdatasync() from the
database's third WAL onwards. It helps on filesystems that preallocate
unwritten extents, like XFS and ext4, and also filesystems that don't
support fallocate(), like ext2 and ext3.

Release note: None
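For context, the change boils down to setting RocksDB's recycle_log_file_num option in libroach's RocksDB options (c-deps/libroach/options.cc). A minimal sketch of that kind of configuration, assuming the standard rocksdb::Options API and with an illustrative function name:

```cpp
#include <rocksdb/options.h>

// Sketch only, not the exact libroach code: WAL recycling is enabled by
// setting recycle_log_file_num on the options used to open the database.
rocksdb::Options OptionsWithWALRecycling() {
  rocksdb::Options options;
  // Keep one obsolete WAL file around for reuse as the next WAL, so that
  // fdatasync() no longer has to write back the inode once the recycled
  // file has already been written at full length.
  options.recycle_log_file_num = 1;
  return options;
}
```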

ajkr requested a review from a team on March 10, 2019 17:40
cockroach-teamcity (Member) commented:

This change is Reviewable

ajkr (Contributor, Author) commented Mar 10, 2019

Some kv0 results (summary: 6% higher throughput):

Commands:

$ bin/roachprod create andrewk-bench -n 4 --gce-machine-type=n1-standard-8 --local-ssd-no-ext4-barrier
$ ./bin/roachtest --user andrewk --cockroach ./cockroach-linux-2.6.32-gnu-amd64.before --workload ./bin/workload run '^kv0/enc=false/nodes=3$$' --roachprod ./bin/roachprod --cluster andrewk-bench
$ ./bin/roachtest --user andrewk --cockroach ./cockroach-linux-2.6.32-gnu-amd64.after --workload ./bin/workload run '^kv0/enc=false/nodes=3$$' --roachprod ./bin/roachprod --cluster andrewk-bench

Results before this PR:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0        8484142        14140.2     13.6     11.5     32.5     46.1    151.0

Results after:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0        8993425        14989.0     12.8     11.0     30.4     44.0    906.0

ajkr (Contributor, Author) commented Mar 10, 2019

The kv95 results show decent improvement too. Maybe the frequent metadata/journal writes were slowing down reads.

Commands:

$ bin/roachprod create andrewk-bench -n 4 --gce-machine-type=n1-standard-8 --local-ssd-no-ext4-barrier
$ for b in ./cockroach-linux-2.6.32-gnu-amd64.before ./cockroach-linux-2.6.32-gnu-amd64.after ; do ./bin/roachtest --user andrewk --cockroach "$b" --workload ./bin/workload run '^kv95/enc=false/nodes=3$$' --roachprod ./bin/roachprod --cluster andrewk-bench |& tee "$b.out" ; done

Results before:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0       19803815        33006.2      5.0      4.1     15.2     25.2    226.5  read
  600.0s        0        1041768         1736.3     14.6     13.1     30.4     41.9    234.9  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0       20845583        34742.5      5.5      4.5     16.8     27.3    234.9

Results after:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0       20543710        34239.4      4.9      4.1     14.2     24.1    134.2  read
  600.0s        0        1081037         1801.7     13.4     12.1     29.4     41.9    151.0  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0       21624747        36041.1      5.3      4.2     16.3     26.2    151.0

petermattis (Collaborator) commented:

That's a nice little perf boost! I recall recycle_log_file_num had some complex code associated with it. Am I misremembering that? Perhaps a better question is: how stable is recycle_log_file_num? Is there any risk to enabling this feature?

ajkr (Contributor, Author) commented Mar 11, 2019

It looks like the last known bug related to this feature was 2.5 years ago (facebook/rocksdb@1b7af5f), so I believe it is pretty stable. Looks like Ceph Bluestore is using it.

There appears to be an infinitesimally small chance of a wrong record being replayed during recovery: a user key or value written to an old WAL could contain bytes that form a valid entry for the recycled WAL, and those bytes would have to immediately follow the final entry written to the recycled WAL.

One thing we should do to be safer is enable this feature in db_crashtest.py so FB's CI runs crash-recovery correctness checks with WAL recycling enabled.

petermattis (Collaborator) commented:

> There appears to be an infinitesimally small chance of a wrong record being replayed during recovery: a user key or value written to an old WAL could contain bytes that form a valid entry for the recycled WAL, and those bytes would have to immediately follow the final entry written to the recycled WAL.

I imagine this is because we're not zeroing the log file when it is being reused. It strikes me, though, that there shouldn't be any danger due to reuse because we can terminate recovery when we hit the old portion of the log. The entries in the log need to occur in sequence number order (and I believe every batch contains a sequence number), so it seems straightforward to stop log recovery if we encounter a batch with a lower sequence number than the previous batch. Is this not currently done? Am I missing anything about why this wouldn't work?
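A minimal sketch of the cutoff being proposed here, with illustrative names and assuming each replayed batch exposes its sequence number (this is not RocksDB's actual recovery code):

```cpp
#include <cstdint>
#include <optional>

// Replay stops as soon as a batch's sequence number fails to increase, on the
// assumption that such a batch is leftover data from the file's previous life.
bool ShouldStopReplay(std::optional<uint64_t>& last_seqnum,
                      uint64_t batch_seqnum) {
  if (last_seqnum.has_value() && batch_seqnum <= *last_seqnum) {
    return true;  // non-increasing sequence number: treat as end of log
  }
  last_seqnum = batch_seqnum;
  return false;
}
```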

I'm mildly anxious about enabling this feature at this point in the release cycle, especially due to the handful of ongoing stability investigations. My preference would be to hold off on merging this until early in the 19.2 cycle (i.e. when the 19.1 release branch is cut).

@nvanbenschoten, @tbg any additional thoughts?

petermattis (Collaborator) commented:

Also, we'll need to verify this is copacetic with our encryption-at-rest implementation. That should be as easy as running `roachtest run kv0/enc=true/nodes=1`, though you might need to bump the running time to make sure WAL file reuse takes place. I'm guessing that none of our unit tests write enough data to hit the WAL file reuse code path.

tbg (Member) commented Mar 11, 2019

> My preference would be to hold off on merging this until early in the 19.2 cycle (i.e. when the 19.1 release branch is cut).

Same here.

petermattis (Collaborator) commented:

> Am I missing anything about why this wouldn't work?

Hmm, thinking about the scenario you described more, I don't think my suggestion is sufficient. And perhaps it is already done by RocksDB. I was thinking that you were worried about a previously valid WAL entry, but your comment was about a key or value that was formatted like a WAL entry.

ajkr (Contributor, Author) commented Mar 11, 2019

> Hmm, thinking about the scenario you described more, I don't think my suggestion is sufficient. And perhaps it is already done by RocksDB. I was thinking that you were worried about a previously valid WAL entry, but your comment was about a key or value that was formatted like a WAL entry.

Right, there is already a similar solution to what you described. Instead of using the seqnum, it writes the WAL number as part of each entry (and the WAL number changes each time the WAL is recycled): https://github.com/facebook/rocksdb/blob/8a1ecd1982341cfe073924d36717e11446cbe492/db/log_writer.h#L60-L64.
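A minimal sketch of that idea, with illustrative names (the linked log_writer.h describes the actual recyclable record format):

```cpp
#include <cstdint>

// With recyclable records, each record header carries the number of the WAL
// it was written for. A record whose log number differs from the file
// currently being read is leftover content from that file's previous life
// and is not replayed.
bool IsLeftoverFromRecycledFile(uint32_t record_log_number,
                                uint32_t current_log_number) {
  return record_log_number != current_log_number;
}
```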

The case I described shouldn't happen by accident, particularly due to the checksum in the record. It could be an attack vector, but it also requires the ability to make cockroach crash at a particular point, read access to the WAL file, and write access to the cockroach instance. I don't know if these conditions are realistic to worry about, or if we even expect multitenant use cases.

ajkr (Contributor, Author) commented Mar 11, 2019

Ran kv0 and kv95 for 4KB row size. All results so far:

| benchmark | row size | recycle_log_file_num | avg latency(ms) | throughput(ops/sec) |
|-----------|----------|----------------------|-----------------|---------------------|
| kv0       | 2        | 0                    | 13.6            | 14140.2             |
| kv0       | 2        | 1                    | 12.8            | 14989.0             |
| kv0       | 4096     | 0                    | 40.1            | 4786.5              |
| kv0       | 4096     | 1                    | 29.6            | 6472.3              |
| kv95      | 2        | 0                    | 5.5             | 34742.5             |
| kv95      | 2        | 1                    | 5.3             | 36041.1             |
| kv95      | 4096     | 0                    | 2.6             | 73006.1             |
| kv95      | 4096     | 1                    | 2.4             | 80775.8             |

I suspect the experiments for smaller row size are bottlenecked trying to sync the same page over and over again, whereas with 4KB rows we usually will be syncing different pages each time.

petermattis (Collaborator) commented:

Why is the kv95 throughput so much higher for 4k rows? Shouldn't it be lower than for 2 byte rows?

ajkr (Contributor, Author) commented Mar 11, 2019

I was also surprised to get sub-3ms latencies, which I hadn't seen before. Will investigate how it happened.

nvanbenschoten (Member) commented:

Nice find @ajkr! I agree with others that we should wait on this until 19.2, but let's get it in shortly after the branch opens.

I'm also curious about the improved performance for 4KB writes with kv95.

ajkr (Contributor, Author) commented May 22, 2019

I reran 4KB vs. 2B kv95 today and observed that 2B is faster than 4KB, as expected. Not sure what happened last time; probably a mistake. Anyway, should we land this now?

| block bytes | WAL recycling | throughput (ops/sec) | avg latency (ms) |
|-------------|---------------|----------------------|------------------|
| 2           | 0             | 30309.6              | 6.3              |
| 2           | 1             | 33067.8              | 5.8              |
| 4096        | 0             | 25427.1              | 7.5              |
| 4096        | 1             | 27478.8              | 7.0              |

ajkr requested a review from petermattis on May 22, 2019 19:05
petermattis (Collaborator) left a comment


:lgtm: Definitely good to land, just a small question as to whether 1 is the right number.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajkr and @petermattis)


c-deps/libroach/options.cc, line 145 at r1 (raw file):

  // readily available to recycle.
  //
  // We could pick a higher value if we see memtable flush backing up, or if we

Did you experiment with higher numbers? I've been able to cause memtable flushing to back up with Pebble, but perhaps RocksDB is different here. I think a kv0 workload is the worst case scenario.

ajkr (Contributor, Author) commented May 22, 2019

Let me try kv0 with 4KB. If that doesn't cause flushes to back up, we can probably say it won't happen in common cases. It is easy to make flushes back up when running the KV store directly, but I haven't seen it happen yet when running under cockroach.

ajkr (Contributor, Author) commented May 22, 2019

I watched for ~10 minutes, during which ~12GB of data was written. It was able to recycle the WAL successfully every time.

ajkr (Contributor, Author) commented May 22, 2019

bors r+

craig bot (Contributor) commented May 22, 2019

👎 Rejected by PR status

This avoids frequent inode writeback during `fdatasync()` from the
database's third WAL onwards, at least on XFS and ext4.

Release note: None
ajkr (Contributor, Author) commented May 22, 2019

bors r+

craig bot (Contributor) commented May 22, 2019

Build failed (retrying...)

craig bot pushed a commit that referenced this pull request May 22, 2019
35591: libroach: enable rocksdb WAL recycling r=ajkr a=ajkr

This avoids frequent inode writeback during `fdatasync()` from the
database's third WAL onwards. It helps on filesystems that preallocate
unwritten extents, like XFS and ext4, and also filesystems that don't
support `fallocate()`, like ext2 and ext3.

Release note: None

37571: exec: check for unset output columnTypes r=asubiotto a=asubiotto

Future planning PRs need these to be set, so this commit introduces a
check for unset columnTypes that will return an error.

Release note: None

Co-authored-by: Andrew Kryczka <andrew.kryczka2@gmail.com>
Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com>
craig bot (Contributor) commented May 22, 2019

Build succeeded

craig bot merged commit 2e843cd into cockroachdb:master on May 22, 2019