kv: internal operations, especially writes, should not set deadline #65191

Closed
sumeerbhola opened this issue May 14, 2021 · 3 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. no-issue-activity X-stale

Comments

@sumeerbhola
Collaborator

sumeerbhola commented May 14, 2021

(this came up when investigating a production issue, though it may not be the root cause)

GCRequest, and perhaps other internal operations, set a deadline. The GCRequest deadline is 1min, which according to @aayushshah15 originates here:

var queueGuaranteedProcessingTimeBudget = settings.RegisterDurationSetting(
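For reference, a hedged sketch of what that registration plausibly looks like; the setting key, description text, and 1-minute default here are assumptions inferred from the deadline mentioned above, not a verbatim copy of the CockroachDB source:

```go
import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// Sketch only: the key, description, and default value are assumptions.
var queueGuaranteedProcessingTimeBudget = settings.RegisterDurationSetting(
	"kv.queue.process.guaranteed_time_budget",
	"the guaranteed duration before which the processing of a queue may time out",
	1*time.Minute, // matches the 1min GCRequest deadline mentioned above
)
```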

Deadlines on internal work, especially costly work, can lead to instability: some situation, however rare or unexpected, and typically load-related, can cause the deadline to be exceeded. The work that was already done is then wasted and needs to be repeated, which adds more load to the system and makes things worse.
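To make the failure mode concrete, here is a minimal, generic sketch (not CockroachDB code; the function and parameters are illustrative) of how a deadline on a multi-batch internal operation discards completed work and forces a full retry:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// processRange is a stand-in for a costly internal operation (e.g. GC)
// that proceeds in many small batches.
func processRange(ctx context.Context, batches int) error {
	for i := 0; i < batches; i++ {
		select {
		case <-ctx.Done():
			// All progress made so far is thrown away; the caller will
			// typically re-enqueue the range and redo the same batches.
			return ctx.Err()
		case <-time.After(10 * time.Millisecond): // simulated batch of work
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// Under load (slow batches), the deadline fires partway through and the
	// work already done is wasted, adding load when the operation is retried.
	if err := processRange(ctx, 100); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("deadline exceeded: partial work discarded, will retry")
	}
}
```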

When discussing this internally, @andreimatei mentioned:

I have sympathy for Sumeer's view that redoing work is not good. I also have sympathy for the desire to not have an unbounded number of goroutines using memory for arbitrarily slow operations (also keep in mind unavailable ranges where ops just block forever). I've never reached great enlightenment about what the right way to structure this kind of async work is, and we have the same discussion currently in the context of intent cleanups.

I'd like to understand:

  • What is the list of internal operations that currently have a deadline (perhaps it is all of them)?
  • What is the concern regarding arbitrarily slow operations? If slowness is because of CPU (the aforementioned GCRequests were probably consuming large amounts of CPU because the LSM had high read amplification), then admission control in the 21.2 timeframe will likely queue the work prior to executing it, and we can have a hard rejection path that limits the number of server-side goroutines (a sketch of such a rejection path follows this list). The hard rejection would also end up limiting the number of client-side goroutines that are waiting on the RPC to complete.
  • I am assuming slow processing does not prevent range rebalancing, splits, or merges, and that the slow operation will observe such a change in a timely manner and fail if those happen. Is that correct? I think it is acceptable to repeat work in that case.
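The "hard rejection path" mentioned above could look roughly like the following sketch (generic Go, not the actual admission-control code; the semaphore size and error are made up): a server-side budget that rejects work outright when full, instead of queueing, so neither server-side nor client-side goroutines pile up waiting.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRejected is returned when the server is at its goroutine budget; the
// client can back off and retry rather than blocking on the RPC.
var ErrRejected = errors.New("server overloaded: request rejected")

// hardLimiter bounds concurrent server-side work. Unlike a queue, Acquire
// never blocks: when the budget is exhausted the request is rejected.
type hardLimiter struct{ slots chan struct{} }

func newHardLimiter(n int) *hardLimiter {
	return &hardLimiter{slots: make(chan struct{}, n)}
}

func (l *hardLimiter) Acquire() error {
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return ErrRejected
	}
}

func (l *hardLimiter) Release() { <-l.slots }

func main() {
	lim := newHardLimiter(2)
	for i := 0; i < 3; i++ {
		if err := lim.Acquire(); err != nil {
			fmt.Println("request", i, "->", err)
			continue
		}
		fmt.Println("request", i, "admitted")
		// A real handler would defer lim.Release() once the work finishes.
	}
}
```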

Jira issue: CRDB-7474

@sumeerbhola sumeerbhola added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-client Relating to the KV client and the KV interface. labels May 14, 2021
@nvanbenschoten
Member

what is the list of internal operations that currently have a deadline (perhaps it is all).

All queues run with a deadline. So that includes splits, merges, snapshots, rebalance operations, GC, etc. We've recently started making some of these timeouts dynamic based on the projected amount of work we expect the queue to have to perform.
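As an illustration of the "dynamic timeout" idea, here is a sketch under assumptions (the rate parameter, 2x slack factor, and floor are made up, not the queue code itself): the budget is derived from the projected amount of work divided by a guaranteed minimum processing rate, rather than being a fixed constant.

```go
package main

import (
	"fmt"
	"time"
)

// dynamicTimeout scales the deadline with the projected amount of work,
// so a large range is not held to the same budget as a small one.
// minRate is the slowest processing rate we are willing to tolerate.
func dynamicTimeout(workBytes, minRate int64 /* bytes/sec */, floor time.Duration) time.Duration {
	d := time.Duration(float64(workBytes)/float64(minRate)*float64(time.Second)) * 2 // 2x slack
	if d < floor {
		return floor
	}
	return d
}

func main() {
	// A 512 MiB range at a guaranteed 8 MiB/s gets ~128s (with 2x slack),
	// rather than a fixed 1-minute budget.
	fmt.Println(dynamicTimeout(512<<20, 8<<20, time.Minute))
}
```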

what is the concern regarding arbitrarily slow operations? If slowness is because of CPU (the aforementioned GCRequests were probably consuming large amount of cpu because the LSM had high read amplification), then admission control in the 21.2 timeframe will likely queue it prior to executing, and we can have a hard rejection path that limits the number of server-side goroutines. The hard rejection would also end up limiting the number of client-side goroutines that are waiting on the rpc to complete.

The best defense I can make for timeouts on internal operations is that timeouts are often coupled with concurrency limits, and as such, they prevent a failure in one part of the system from indefinitely stalling progress elsewhere in the system. For instance, if we only initiate 2 range splits at a time from a given store, we wouldn't want two unavailable ranges to indefinitely stall that store from initiating any other splits.
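A minimal sketch of why that coupling matters (generic Go; the limit of 2, the timeout, and the helper names are illustrative): with a fixed number of workers per store, the per-task timeout is what frees a worker slot when a range is unavailable, so the remaining replicas still get processed.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// processReplica returns quickly for healthy ranges and blocks until the
// context expires for unavailable ones; the timeout is what frees the slot.
func processReplica(ctx context.Context, available bool) error {
	if available {
		return nil
	}
	<-ctx.Done() // an unavailable range would otherwise block forever
	return ctx.Err()
}

func main() {
	const workers = 2 // concurrency limit: only 2 splits in flight per store
	sem := make(chan struct{}, workers)

	for i, available := range []bool{false, false, true, true} {
		sem <- struct{}{} // take a worker slot
		go func(id int, ok bool) {
			defer func() { <-sem }()
			ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
			defer cancel()
			// Without the timeout, the two unavailable replicas would hold
			// both slots forever and replicas 2 and 3 would never run.
			fmt.Printf("replica %d: %v\n", id, processReplica(ctx, ok))
		}(i, available)
	}
	time.Sleep(300 * time.Millisecond) // crude wait, sufficient for the sketch
}
```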

@sumeerbhola
Collaborator Author

(for the record, including some other parts of the offline conversation)

The best defense I can make for timeouts on internal operations is that timeouts are often coupled with concurrency limits, and as such, they prevent a failure in one part of the system from indefinitely stalling progress elsewhere in the system.

Makes sense.
We also discussed

some situation, however rare or unexpected, and typically load related, can cause the deadline to be exceeded, and the work that was already done is wasted

The above concern about doing partial work and then exceeding the deadline would be less pressing once the prototyped epoch-LIFO scheme is completed, and provided the work passes through admission control. However, there are gaps in what is subject to admission control, as indicated in #65957.

@github-actions

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
