kv: internal operations, especially writes, should not set deadline #65191

Closed
sumeerbhola opened this issue May 14, 2021 · 3 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. no-issue-activity X-stale

Comments

@sumeerbhola
Collaborator

sumeerbhola commented May 14, 2021

(this came up when investigating a production issue, though it may not be the root cause)

GCRequest, and perhaps other internal operations, set a deadline. The GCRequest deadline is 1min, which according to @aayushshah15 originates here:

var queueGuaranteedProcessingTimeBudget = settings.RegisterDurationSetting(
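For reference, a hedged sketch of what that registration plausibly looks like; the setting key, description text, and 1-minute default here are assumptions inferred from the deadline mentioned above, not a verbatim copy of the CockroachDB source:

```go
import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// Sketch only: the key, description, and default value are assumptions.
var queueGuaranteedProcessingTimeBudget = settings.RegisterDurationSetting(
	"kv.queue.process.guaranteed_time_budget",
	"the guaranteed duration before which the processing of a queue may time out",
	1*time.Minute, // matches the 1min GCRequest deadline mentioned above
)
```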

Deadlines on internal work, especially costly work, can lead to instability: some situation, however rare or unexpected, and typically load-related, can cause the deadline to be exceeded. The work that was already done is then wasted and needs to be repeated, which adds more load to the system and makes things worse.
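To make the failure mode concrete, here is a minimal, generic sketch (not CockroachDB code; the function and parameters are illustrative) of how a deadline on a multi-batch internal operation discards completed work and forces a full retry:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// processRange is a stand-in for a costly internal operation (e.g. GC)
// that proceeds in many small batches.
func processRange(ctx context.Context, batches int) error {
	for i := 0; i < batches; i++ {
		select {
		case <-ctx.Done():
			// All progress made so far is thrown away; the caller will
			// typically re-enqueue the range and redo the same batches.
			return ctx.Err()
		case <-time.After(10 * time.Millisecond): // simulated batch of work
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// Under load (slow batches), the deadline fires partway through and the
	// work already done is wasted, adding load when the operation is retried.
	if err := processRange(ctx, 100); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("deadline exceeded: partial work discarded, will retry")
	}
}
```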

When discussing this internally, @andreimatei mentioned:

I have sympathy for Sumeer's view that redoing work is not good. I also have sympathy for the desire to not have an unbounded number of goroutines using memory for arbitrarily slow operations (also keep in mind unavailable ranges where ops just block forever). I've never reached great enlightenment about what the right way to structure this kind of async work is, and we have the same discussion currently in the context of intent cleanups.

I'd like to understand:

  • What is the list of internal operations that currently have a deadline (perhaps it is all of them)?
  • What is the concern regarding arbitrarily slow operations? If slowness is because of CPU (the aforementioned GCRequests were probably consuming large amounts of CPU because the LSM had high read amplification), then admission control in the 21.2 timeframe will likely queue the work prior to executing it, and we can have a hard rejection path that limits the number of server-side goroutines (a sketch of such a rejection path follows this list). The hard rejection would also end up limiting the number of client-side goroutines that are waiting on the RPC to complete.
  • I am assuming slow processing does not prevent range rebalancing, splits, or merges, and that the slow operation will observe such a change in a timely manner and fail if those happen. Is that correct? I think it is acceptable to repeat work in that case.
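The "hard rejection path" mentioned above could look roughly like the following sketch (generic Go, not the actual admission-control code; the semaphore size and error are made up): a server-side budget that rejects work outright when full, instead of queueing, so neither server-side nor client-side goroutines pile up waiting.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRejected is returned when the server is at its goroutine budget; the
// client can back off and retry rather than blocking on the RPC.
var ErrRejected = errors.New("server overloaded: request rejected")

// hardLimiter bounds concurrent server-side work. Unlike a queue, Acquire
// never blocks: when the budget is exhausted the request is rejected.
type hardLimiter struct{ slots chan struct{} }

func newHardLimiter(n int) *hardLimiter {
	return &hardLimiter{slots: make(chan struct{}, n)}
}

func (l *hardLimiter) Acquire() error {
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return ErrRejected
	}
}

func (l *hardLimiter) Release() { <-l.slots }

func main() {
	lim := newHardLimiter(2)
	for i := 0; i < 3; i++ {
		if err := lim.Acquire(); err != nil {
			fmt.Println("request", i, "->", err)
			continue
		}
		fmt.Println("request", i, "admitted")
		// A real handler would defer lim.Release() once the work finishes.
	}
}
```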

Jira issue: CRDB-7474

@sumeerbhola sumeerbhola added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-client Relating to the KV client and the KV interface. labels May 14, 2021
@nvanbenschoten
Member

what is the list of internal operations that currently have a deadline (perhaps it is all).

All queues run with a deadline. So that includes splits, merges, snapshots, rebalance operations, GC, etc. We've recently started making some of these timeouts dynamic based on the projected amount of work we expect the queue to have to perform.
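As an illustration of the "dynamic timeout" idea, here is a sketch under assumptions (the rate parameter, 2x slack factor, and floor are made up, not the queue code itself): the budget is derived from the projected amount of work divided by a guaranteed minimum processing rate, rather than being a fixed constant.

```go
package main

import (
	"fmt"
	"time"
)

// dynamicTimeout scales the deadline with the projected amount of work,
// so a large range is not held to the same budget as a small one.
// minRate is the slowest processing rate we are willing to tolerate.
func dynamicTimeout(workBytes, minRate int64 /* bytes/sec */, floor time.Duration) time.Duration {
	d := time.Duration(float64(workBytes)/float64(minRate)*float64(time.Second)) * 2 // 2x slack
	if d < floor {
		return floor
	}
	return d
}

func main() {
	// A 512 MiB range at a guaranteed 8 MiB/s gets ~128s (with 2x slack),
	// rather than a fixed 1-minute budget.
	fmt.Println(dynamicTimeout(512<<20, 8<<20, time.Minute))
}
```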

what is the concern regarding arbitrarily slow operations? If slowness is because of CPU (the aforementioned GCRequests were probably consuming large amount of cpu because the LSM had high read amplification), then admission control in the 21.2 timeframe will likely queue it prior to executing, and we can have a hard rejection path that limits the number of server-side goroutines. The hard rejection would also end up limiting the number of client-side goroutines that are waiting on the rpc to complete.

The best defense I can make for timeouts on internal operations is that timeouts are often coupled with concurrency limits, and as such, they prevent a failure in one part of the system from indefinitely stalling progress elsewhere in the system. For instance, if we only initiate 2 range splits at a time from a given store, we wouldn't want two unavailable ranges to indefinitely stall that store from initiating any other splits.
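A minimal sketch of why that coupling matters (generic Go; the limit of 2, the timeout, and the helper names are illustrative): with a fixed number of workers per store, the per-task timeout is what frees a worker slot when a range is unavailable, so the remaining replicas still get processed.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// processReplica returns quickly for healthy ranges and blocks until the
// context expires for unavailable ones; the timeout is what frees the slot.
func processReplica(ctx context.Context, available bool) error {
	if available {
		return nil
	}
	<-ctx.Done() // an unavailable range would otherwise block forever
	return ctx.Err()
}

func main() {
	const workers = 2 // concurrency limit: only 2 splits in flight per store
	sem := make(chan struct{}, workers)

	for i, available := range []bool{false, false, true, true} {
		sem <- struct{}{} // take a worker slot
		go func(id int, ok bool) {
			defer func() { <-sem }()
			ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
			defer cancel()
			// Without the timeout, the two unavailable replicas would hold
			// both slots forever and replicas 2 and 3 would never run.
			fmt.Printf("replica %d: %v\n", id, processReplica(ctx, ok))
		}(i, available)
	}
	time.Sleep(300 * time.Millisecond) // crude wait, sufficient for the sketch
}
```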

@sumeerbhola
Collaborator Author

(for the record, including some other parts of the offline conversation)

The best defense I can make for timeouts on internal operations is that timeouts are often coupled with concurrency limits, and as such, they prevent a failure in one part of the system from indefinitely stalling progress elsewhere in the system.

Makes sense.
We also discussed

some situation, however rare or unexpected, and typically load related, can cause the deadline to be exceeded, and the work that was already done is wasted

The above concern about doing partial work and then exceeding the deadline would be less pressing once the prototyped epoch-LIFO scheme is completed, and provided the work passes through admission control. However, there are gaps in what is subject to admission control, as indicated in #65957.

@github-actions

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
