Skip to content

Comments

kv: mark subsume as isWrite#136649

Closed
kvoli wants to merge 1 commit intocockroachdb:masterfrom
kvoli:241203.make-subsume-a-write
Closed

kv: mark subsume as isWrite#136649
kvoli wants to merge 1 commit intocockroachdb:masterfrom
kvoli:241203.make-subsume-a-write

Conversation

@kvoli
Copy link
Contributor

@kvoli kvoli commented Dec 4, 2024

Now that Subsume is required to force flush all replication streams up
to the lease applied index it froze the range at (#136321),
SubsumeRequest also needs to be marked as a write in order to pass
through the replication code path, which updates the range controller
force flush index and replicates the force flush index.

Part of: #132614
Release note: None

@kvoli kvoli self-assigned this Dec 4, 2024
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: to the non-kvnemesis change

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)

@kvoli kvoli requested review from arulajmani and removed request for arulajmani December 4, 2024 09:55
@kvoli
Copy link
Contributor Author

kvoli commented Dec 4, 2024

Following some discussion in our weekly meeting, we agreed that this change requires some further thought, as merge is rather subtle. I'm going to conduct further research into potential hazards this change has.

Now that `Subsume` is required to force flush all replication streams up
to the lease applied index it froze the range at (cockroachdb#136321),
`SubsumeRequest` also needs to be marked as a write in order to pass
through the replication code path, which updates the range controller
force flush index and replicates the force flush index.

Part of: cockroachdb#132614
Release note: None
@kvoli kvoli force-pushed the 241203.make-subsume-a-write branch from 21976f3 to 46f793d Compare December 4, 2024 22:18
@blathers-crl
Copy link

blathers-crl bot commented Dec 4, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kvoli
Copy link
Contributor Author

kvoli commented Dec 4, 2024

I've removed the kvnemesis change for now. I'm going to make sure the failure that cropped up (https://mesolite.cluster.engflow.com/invocations/default/b6237d65-083d-40e1-afa5-a87f92ea6a83) is well understood before making any testing changes there.

kvoli added a commit to kvoli/cockroach that referenced this pull request Dec 9, 2024
Add a new rac2 flow control integration test,
`TestFlowControlSendQueueRangeSplitMerge`.

This test takes the following steps:

```sql
-- We will exhaust the tokens across all streams while admission is blocked on
-- n3, using a single 4 MiB (deduction, the write itself is small) write. Then,
-- we will write a 1 MiB put to the range, split it, write a 1 MiB put to the
-- LHS range, merge the ranges, and write a 1 MiB put to the merged range. We
-- expect that at each stage where a send queue develops n1->s3, the send queue
-- will be flushed by the range merge and range split range operations.``sql
```

Note that the RHS is not written to post-split, pre-merge. See the
relevant comments, this will be resolved via cockroachdb#136649, or some variation,
which enforces the timely replication on subsume requests.

Part of: cockroachdb#132614
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this pull request Dec 10, 2024
Add a new rac2 flow control integration test,
`TestFlowControlSendQueueRangeSplitMerge`.

This test takes the following steps:

```sql
-- We will exhaust the tokens across all streams while admission is blocked on
-- n3, using a single 4 MiB (deduction, the write itself is small) write. Then,
-- we will write a 1 MiB put to the range, split it, write a 1 MiB put to the
-- LHS range, merge the ranges, and write a 1 MiB put to the merged range. We
-- expect that at each stage where a send queue develops n1->s3, the send queue
-- will be flushed by the range merge and range split range operations.``sql
```

Note that the RHS is not written to post-split, pre-merge. See the
relevant comments, this will be resolved via cockroachdb#136649, or some variation,
which enforces the timely replication on subsume requests.

Also note that epoch leases are used in the test, the added todo
describes why this is currently necessary and an intention to re-enable
them.

Part of: cockroachdb#132614
Release note: None
craig bot pushed a commit that referenced this pull request Dec 10, 2024
136258: kvserver: add TestFlowControlSendQueueRangeSplitMerge test  r=sumeerbhola a=kvoli

Add a new rac2 flow control integration test,
`TestFlowControlSendQueueRangeSplitMerge`.

This test takes the following steps:

```sql
-- We will exhaust the tokens across all streams while admission is blocked on
-- n3, using a single 4 MiB (deduction, the write itself is small) write. Then,
-- we will write a 1 MiB put to the range, split it, write a 1 MiB put to the
-- LHS range, merge the ranges, and write a 1 MiB put to the merged range. We
-- expect that at each stage where a send queue develops n1->s3, the send queue
-- will be flushed by the range merge and range split range operations.``sql
```

Note that the RHS is not written to post-split, pre-merge. See the
relevant comments, this will be resolved via #136649, or some variation,
which enforces the timely replication on subsume requests.

Part of: #132614
Release note: None

136648: rpc: reuse gRPC streams across unary BatchRequest RPCs r=tbg a=nvanbenschoten

Closes #136572.

This commit introduces pooling of gRPC streams that are used to send requests and receive corresponding responses in a manner that mimics unary RPC invocation. Pooling these streams allows for reuse of gRPC resources across calls, as opposed to native unary RPCs, which create a new stream and throw it away for each request (see grpc.invoke).

The new pooling mechanism is used for the Internal/Batch RPC method, which is the dominant RPC method used to communicate between the KV client and KV server. A new Internal/BatchStream RPC method is introduced to allow a client to send and receive BatchRequest/BatchResponse pairs over a long-lived, pooled stream. A pool of these streams is then maintained alongside each gRPC connection. The pool grows and shrinks dynamically based on demand.

The change demonstrates a large performance improvement in both microbenchmarks and full system benchmarks, which reveals just how expensive the gRPC stream setup on each unary RPC is.

### Microbenchmarks:
```
name                                            old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_point_select-10     45.9µs ± 1%    28.8µs ± 2%  -37.31%  (p=0.000 n=9+8)
Sysbench/KV/1node_remote/oltp_read_only-10         958µs ± 6%     709µs ± 1%  -26.00%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       3.65ms ± 6%    2.81ms ± 8%  -23.06%  (p=0.000 n=8+9)
Sysbench/KV/1node_remote/oltp_read_write-10       1.77ms ± 5%    1.38ms ± 1%  -22.09%  (p=0.000 n=10+8)
Sysbench/KV/1node_remote/oltp_write_only-10        688µs ± 4%     557µs ± 1%  -19.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_point_select-10     181µs ± 8%     159µs ± 2%  -12.10%  (p=0.000 n=8+9)
Sysbench/SQL/1node_remote/oltp_write_only-10      2.16ms ± 4%    1.92ms ± 3%  -11.08%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_write-10      5.89ms ± 2%    5.36ms ± 1%   -8.89%  (p=0.000 n=9+9)

name                                            old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10     16.3kB ± 0%     6.4kB ± 0%  -60.70%  (p=0.000 n=8+10)
Sysbench/KV/1node_remote/oltp_write_only-10        359kB ± 1%     256kB ± 1%  -28.92%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_write_only-10       748kB ± 0%     548kB ± 1%  -26.78%  (p=0.000 n=8+10)
Sysbench/SQL/1node_remote/oltp_point_select-10    40.9kB ± 0%    30.8kB ± 0%  -24.74%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_write-10       1.11MB ± 1%    0.88MB ± 1%  -21.17%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_write-10      2.00MB ± 0%    1.65MB ± 0%  -17.60%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_only-10         790kB ± 0%     655kB ± 0%  -17.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       1.33MB ± 0%    1.19MB ± 0%  -10.97%  (p=0.000 n=10+9)

name                                            old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10        210 ± 0%        61 ± 0%  -70.95%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10         3.98k ± 0%     1.88k ± 0%  -52.68%  (p=0.019 n=6+8)
Sysbench/KV/1node_remote/oltp_read_write-10        7.10k ± 0%     3.47k ± 0%  -51.07%  (p=0.000 n=10+9)
Sysbench/KV/1node_remote/oltp_write_only-10        3.10k ± 0%     1.58k ± 0%  -48.89%  (p=0.000 n=10+9)
Sysbench/SQL/1node_remote/oltp_write_only-10       6.73k ± 0%     3.82k ± 0%  -43.30%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_read_write-10       14.4k ± 0%      9.2k ± 0%  -36.29%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_point_select-10       429 ± 0%       277 ± 0%  -35.46%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_only-10        7.52k ± 0%     5.37k ± 0%  -28.60%  (p=0.000 n=10+10)
```

### Roachtests:
```
name                                            old queries/s  new queries/s  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64     17.6k ± 7%     19.2k ± 2%  +9.22%  (p=0.008 n=5+5)

name                                            old avg_ms/op  new avg_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64      72.9 ± 7%      66.6 ± 2%  -8.57%  (p=0.008 n=5+5)

name                                            old p95_ms/op  new p95_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64       116 ± 8%       106 ± 3%  -9.02%  (p=0.016 n=5+5)
```

### Manual tests:
Running in a similar configuration to `sysbench/oltp_read_write/nodes=3/cpu=8/conc=64`, but with a benchmarking related cluster settings (before and after) to reduce variance.
```
-- Before
Mean: 19771.03
Median: 19714.22
Standard Deviation: 282.96
Coefficient of variance: .0143

-- After
Mean: 21908.23
Median: 21923.03
Standard Deviation: 200.88
Coefficient of variance: .0091
```

----

Release note (performance improvement): gRPC streams are now pooled across unary intra-cluster RPCs, allowing for reuse of gRPC resources to reduce the cost of remote key-value layer access. This pooling can be disabled using the `rpc.batch_stream_pool.enabled` cluster setting.

137019: roachtest: increase the token return time with disk bandwidth limit r=kvoli a=andrewbaptist

Previously the test would wait 10m for tokens to be returned. Without the disk bandwidth limit set, they typically return almost immediately but with a limit they can take ~30m to return in some cases even after the workload is stopped and the system is idle. This change fixes some of the perturbation/metamorphic/* tests that are hitting this slow token return.

Epic: none
Fixes: #136982
Fixes: #136553
Informs: #137017

Release note: None

137044: kvserver: deflake TestConsistencyQueueRecomputeStats r=miraradeva a=miraradeva

The test manually adds voters and expects a leaseholder to be established before forcing a stats re-computation (which runs on the leaseholder). With leader leases, it might take an extra election timeout for the leader lease to be established after adding the new voters, so the test flaked if the re-computation ran (and failed) before the leaseholder was ready.

This commit makes the test retry the re-computation until a leasholder is established.

Fixes: #136596

Release note: None

137059: catalog/lease: deflake TestDescriptorRefreshOnRetry r=rafiss a=rafiss

The test was flaky since the background thread to refresh leases could run and cause the acquisition counts to be off.

fixes #137033
Release note: None

137099: kvcoord: deflake TestDistSenderReplicaStall r=miraradeva a=miraradeva

The test runs with expiration leases but when fortification is enabled the lease doesn't move off of the stalled replica because the deadlocked leader doesn't step down while it's receiving store liveness support.

This commit ensures fortification is off when expiration leases are used for the test.

Fixes: #136564

Release note: None

137118: crosscluster/logical: update udf test to expect at-least-once r=dt a=dt

We don't provide exactly once so we don't want to test for it.

Release note: none.
Epic: none.

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
Co-authored-by: Mira Radeva <mira@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@kvoli kvoli closed this Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants