kv: internal operations, especially writes, should not set deadline #65191
Comments
All queues run with a deadline. So that includes splits, merges, snapshots, rebalance operations, GC, etc. We've recently started making some of these timeouts dynamic based on the projected amount of work we expect the queue to have to perform.
The best defense I can make for timeouts on internal operations is that timeouts are often coupled with concurrency limits, and as such, they prevent a failure in one part of the system from indefinitely stalling progress elsewhere. For instance, if we only initiate 2 range splits at a time from a given store, we wouldn't want two unavailable ranges to stall that store from initiating any other splits indefinitely.
(for the record, including some other parts of the offline conversation)
Makes sense.
The above concern about partial work and then exceeding deadline would be less of a concern when the prototyped epoch-LIFO scheme is completed, and if the work passes through admission control. However, there are gaps in what is subject to admission control, as indicated in #65957.
We have marked this issue as stale because it has been inactive for
(this came up when investigating a production issue, though may not be the root cause)
GCRequest, and perhaps other internal operations, are setting a deadline. The GCRequest deadline is 1min, which according to @aayushshah15 originates here: cockroach/pkg/kv/kvserver/queue.go, line 53 in ea9074b.
Deadlines on internal work, especially costly work, can lead to instability: some situation, however rare or unexpected, and typically load related, can cause the deadline to be exceeded. The work that was already done is then wasted and needs to be repeated. This adds more work to the system, which makes things worse.
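To make the wasted-work concern concrete, here is a deliberately simplified model. It assumes that an attempt which exceeds its deadline discards all partial progress and that every retry starts from scratch; `runWithDeadline` and its abstract "units" of work are hypothetical and not drawn from the GC queue.

```go
package main

import "fmt"

// runWithDeadline models retrying a task that needs `needed` units of work
// when each attempt's deadline only allows `budget` units. It returns the
// total units spent and whether the task ever completed.
func runWithDeadline(needed, budget, maxAttempts int) (unitsSpent int, completed bool) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if budget >= needed {
			// The attempt fits within the deadline and finishes.
			return unitsSpent + needed, true
		}
		// The deadline fires first: the partial work is thrown away,
		// yet the system still paid for it.
		unitsSpent += budget
	}
	return unitsSpent, false
}

func main() {
	// Under load the deadline covers only 80 of the 100 units needed.
	spent, ok := runWithDeadline(100, 80, 3)
	fmt.Println("tight deadline:", spent, ok)

	// With enough headroom the same task costs exactly its own work.
	spent, ok = runWithDeadline(100, 120, 3)
	fmt.Println("loose deadline:", spent, ok)
}
```

In this model the tight deadline spends 240 units across three attempts and completes nothing, whereas the same task with headroom costs 100 units and finishes.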
When discussing this internally, @andreimatei mentioned:
I'd like to understand:
Jira issue: CRDB-7474