
kv: Fix a race in catchup scan completion that may send an error. #77696

Open

miretskiy opened this issue Mar 11, 2022 · 9 comments · May be fixed by #77724
Labels: A-cdc Change Data Capture · C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

miretskiy (Contributor) commented Mar 11, 2022

As noticed by @nvanbenschoten, it is possible that while the catch-up scan is running,
the buffer for raft events may overflow. In that case, when the catch-up scan completes,
we may send an overflow error instead of a checkpoint event. We should prioritize sending
the checkpoint event.

CRDB-8666

Jira issue: CRDB-13720
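For illustration, here is a minimal Go sketch of the ordering being asked for. The types and names below (registration, onCatchUpDone, errSlowConsumer) are hypothetical stand-ins, not the actual rangefeed code:

```go
// A minimal sketch of the ordering problem: if the raft-event buffer
// overflows while the catch-up scan is still running, the overflow error
// must not be surfaced before the checkpoint that "locks in" the catch-up
// scan's progress. All types and names here are hypothetical.
package main

import (
	"errors"
	"fmt"
)

var errSlowConsumer = errors.New("REASON_SLOW_CONSUMER: buffer overflowed")

type registration struct {
	overflowed bool        // set by the raft-event producer when the buffer fills
	events     chan string // bounded per-registration buffer
}

// onCatchUpDone is called when the catch-up scan completes. Publishing the
// checkpoint before consulting the overflowed flag is what this issue asks
// for; checking the flag first reproduces the race.
func (r *registration) onCatchUpDone(checkpointTS int64) error {
	// Prioritize the checkpoint so the client can advance its frontier
	// even if the buffer overflowed during the catch-up scan.
	r.events <- fmt.Sprintf("checkpoint@%d", checkpointTS)
	if r.overflowed {
		return errSlowConsumer // client retries, but from the newer timestamp
	}
	return nil
}

func main() {
	r := &registration{overflowed: true, events: make(chan string, 1)}
	err := r.onCatchUpDone(42)
	fmt.Println(<-r.events, err)
}
```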

@miretskiy miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc T-kv KV Team labels Mar 11, 2022
blathers-crl bot commented Mar 11, 2022

cc @cockroachdb/cdc

@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Mar 11, 2022
nvanbenschoten (Member) commented:

> We should prioritize sending the checkpoint event.

Specifically, we should prioritize sending a checkpoint event that contains a non-zero timestamp after a catch-up scan completes and before the registration consults its overflowed flag. In doing so, we ensure progress[1] of the rangefeed across per-range retries due to REASON_SLOW_CONSUMER errors that are often tied to large catch-up scans. Without this, we have no guarantee that a large catch-up scan's progress will be locked in and that it won't retry indefinitely.

Ensuring this is going to be a little tricky, because a non-zero checkpoint timestamp depends on the completion of the initResolvedTSScan. Now that this scan only consults the separated lock table (as of #71295), it seems reasonable to make tailing the raft log wait on its completion. Further care will need to be taken with the resolvedTimestamp to avoid clobbering the initial closed timestamp (as of the point of registration) so that the resolved timestamp after the catch-up scan is applicable to the current state of the registration and not too advanced.

Footnotes

[1] "progress" meaning that on a reconnection, the per-range "cursor" timestamp is larger than before. The client-side interaction points between checkpoints and retry timestamps are here, here, and here.
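To illustrate the footnote's notion of progress, here is a rough model of the client-side retry behaviour; the function and names below are made up, not the actual DistSender code:

```go
package main

import "fmt"

// runWithRetries models the client-side retry loop the footnote refers to
// (hypothetical names, not the real DistSender code). The per-range cursor
// only advances when a checkpoint newer than the cursor was received before
// the error, which is why an empty (zero-timestamp) checkpoint does not
// count as "progress".
func runWithRetries(startTS int64, attempts int, connect func(cursor int64) (checkpointTS int64, err error)) int64 {
	cursor := startTS
	for i := 0; i < attempts; i++ {
		checkpointTS, err := connect(cursor)
		if checkpointTS > cursor {
			cursor = checkpointTS // progress: the next retry starts from a later timestamp
		}
		if err == nil {
			break
		}
		// With no checkpoint newer than the cursor, the same catch-up span
		// is re-scanned on the next attempt.
	}
	return cursor
}

func main() {
	// Stub connection that always fails with a slow-consumer error but
	// manages to deliver a checkpoint 10 units ahead of the cursor first.
	connect := func(cursor int64) (int64, error) {
		return cursor + 10, fmt.Errorf("REASON_SLOW_CONSUMER")
	}
	fmt.Println(runWithRetries(100, 3, connect)) // prints 130
}
```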

nvanbenschoten (Member) commented:

Making guarantees here seems difficult due to the shared resolvedTimestamp tracking (and associated unresolvedTxnHeap), which lives above the level of individual registration buffers. This means that even when a single registration's buffer overflows, the shared resolvedTimestamp tracking continues to advance. As a result, once an individual registration's buffer overflows, it often has not published, and can no longer publish, the incremental changes necessary to make use of the shared resolved timestamp.

An option that avoids this issue is to allow each individual registration to perform its own initResolvedTSScan at its original log index (i.e. with an iterator captured under the raftMu when it began tailing the raft log). The registration could combine this with the closed timestamp from the original log index to construct a checkpoint to publish before erroring out. To avoid this costing too much under steady-state operation, we could limit this extra scan to only registrations that hit REASON_SLOW_CONSUMER errors before they publish their first non-zero checkpoint.
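A hand-wavy sketch of that per-registration option follows; the helper below is hypothetical, not a real cockroachdb/cockroach API, and only shows how the closed timestamp and the minimum intent timestamp from the registration's own ad-hoc scan might combine into one final checkpoint:

```go
package main

import "fmt"

// finalCheckpointTS sketches the per-registration option described above
// (hypothetical helper, not a real cockroachdb/cockroach API): an overflowed
// registration that never published a non-zero checkpoint computes one last
// resolved timestamp from the closed timestamp and the minimum intent
// timestamp observed by its own ad-hoc scan, both taken at the log index
// where it began tailing the raft log.
func finalCheckpointTS(publishedNonZero bool, closedTS, minIntentTS int64) (int64, bool) {
	if publishedNonZero {
		// Steady-state registrations skip the extra scan entirely.
		return 0, false
	}
	resolved := closedTS
	if minIntentTS != 0 && minIntentTS <= resolved {
		// The resolved timestamp cannot advance past an outstanding intent.
		resolved = minIntentTS - 1
	}
	return resolved, true
}

func main() {
	ts, ok := finalCheckpointTS(false /* publishedNonZero */, 100 /* closedTS */, 80 /* minIntentTS */)
	fmt.Println(ts, ok) // 79 true: a checkpoint at 79 is published before the error
}
```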

@nvanbenschoten nvanbenschoten self-assigned this Mar 11, 2022
nvanbenschoten (Member) commented:

I think the original premise for this issue was moderately incorrect. We only return a REASON_SLOW_CONSUMER error when the registration's buffer hits a size limit after draining the buffer. So the initial empty checkpoint will be published in this case.

However, the later realization still stands: an empty checkpoint is insufficient to ensure forward progress. There's no guarantee that any non-empty checkpoint will land in the buffer before it reaches capacity.
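As a toy illustration of this drain-then-error behaviour (hypothetical code, not the real registration implementation):

```go
package main

import (
	"errors"
	"fmt"
)

var errSlowConsumer = errors.New("REASON_SLOW_CONSUMER")

// buffer is a toy bounded buffer illustrating the behaviour described above:
// events that fit, such as the initial empty checkpoint, are drained and
// delivered before the error is returned, but a non-empty checkpoint
// produced after the buffer filled up never makes it in.
type buffer struct {
	events     []string
	cap        int
	overflowed bool
}

func (b *buffer) publish(ev string) {
	if len(b.events) == b.cap {
		b.overflowed = true // event dropped; only remembered as an overflow
		return
	}
	b.events = append(b.events, ev)
}

// drain delivers everything that fit, then reports the overflow, if any.
func (b *buffer) drain() ([]string, error) {
	out := b.events
	b.events = nil
	if b.overflowed {
		return out, errSlowConsumer
	}
	return out, nil
}

func main() {
	b := &buffer{cap: 2}
	b.publish("checkpoint@0 (empty)")
	b.publish("write@5")
	b.publish("checkpoint@10") // dropped: the buffer is already full
	fmt.Println(b.drain())     // [checkpoint@0 (empty) write@5] REASON_SLOW_CONSUMER
}
```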

I have a prototype that implements the "allow each individual registration to perform its own initResolvedTSScan" approach, but it is moderately complex. I'd be interested in determining how frequently we hit this case of only publishing an empty checkpoint before returning a REASON_SLOW_CONSUMER error. @miretskiy do we have any tests that demonstrate this starvation behavior, combining large catch-up scans with heavy foreground traffic?

miretskiy (Contributor, Author) commented:

No, I don't think we have that. What we know is that the catch-up scan started from a timestamp far in the past, and that the data included large JSONB blobs, so the consumer could not keep up.

nvanbenschoten (Member) commented:

#77724 prototypes a fix for this issue.

As I mentioned in the PR, I don't think we should try to merge it right away. Instead, I think we should add instrumentation (logging or a metric) to help us determine whether this is a real problem in these cases. There is a race here, but it is less severe than previously believed, so it's not clear whether this is actually leading to starvation or not.

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Mar 18, 2022
@miretskiy miretskiy removed the T-cdc label Mar 23, 2022
miretskiy (Contributor, Author) commented:

The customer enabled vlogging in DistSender, and we appear to be hitting slow consumer errors very frequently. I think the ability to make forward progress upon catch-up completion is very important.

miretskiy (Contributor, Author) commented:

@nvanbenschoten just want to make sure you saw the comment above -- do you think we should resurrect your PR?

nvanbenschoten (Member) commented:

@miretskiy thanks for letting me know. If we are seeing the theorized starvation in practice then I do think it's worthwhile to resurrect the PR. In the logs, did we repeatedly see the REASON_SLOW_CONSUMER error before a non-empty checkpoint?

I can take this as an action item for v23.1. Do we need it sooner?

> do we have any tests that demonstrate this starvation behavior, combining large catch-up scans with heavy foreground traffic?

I think this kind of test would be worthwhile to demonstrate this behavior. It would also force us to express the relationship between catch-up scan throughput per range, new write throughput per range, and rangefeed-level buffering per range. It would be nice to be able to say that a rangefeed (with retry loop) will always be able to eventually catch up and establish a stable connection if catch_up_rate > new_write_rate, regardless of buffer size.
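A toy model of that convergence condition, with purely illustrative numbers and names (not CockroachDB code):

```go
package main

import "fmt"

// retriesToConverge models the claim above: each retry scans the backlog
// accumulated so far at catchUpRate while new writes keep arriving at
// writeRate, so the next backlog is roughly backlog * (writeRate /
// catchUpRate). If catchUpRate > writeRate, the backlog shrinks
// geometrically and the rangefeed converges regardless of buffer size;
// otherwise it never does.
func retriesToConverge(backlogMB, catchUpRate, writeRate, doneThresholdMB float64, maxRetries int) (int, bool) {
	for retry := 1; retry <= maxRetries; retry++ {
		scanSeconds := backlogMB / catchUpRate
		backlogMB = scanSeconds * writeRate // writes accumulated during the scan
		if backlogMB <= doneThresholdMB {
			return retry, true
		}
	}
	return maxRetries, false
}

func main() {
	// 1024 MB behind, scanning at 32 MB/s against 8 MB/s of new writes:
	// the backlog shrinks 4x per retry and the rangefeed converges quickly.
	fmt.Println(retriesToConverge(1024, 32, 8, 1, 100)) // 5 true
	// With writes faster than the catch-up scan, it never converges.
	fmt.Println(retriesToConverge(1024, 8, 32, 1, 100)) // 100 false
}
```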

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 23, 2022
…MER error

Fixes cockroachdb#77696.

This commit ensures progress of per-range rangefeeds in the presence of large
catch-up scans by guaranteeing that a non-empty checkpoint is published before
returning a REASON_SLOW_CONSUMER error to the client. This ensures that the
client doesn't spin in `DistSender` on the same catch-up span without advancing
its frontier. It does so by having rangefeed registrations perform an ad-hoc
resolved timestamp computation in cases where the registration's buffer hits a
memory limit before it succeeds in publishing a non-empty checkpoint.

In doing so, we can make a loose guarantee (assuming timely closed timestamp
progression) that a rangefeed with a client-side retry loop will always be able
to catch-up and converge towards a stable connection as long as its rate of
consumption is greater than the rate of production on the table. In other words,
if `catch_up_scan_rate > new_write_rate`, the retry loop will make forward
progress and eventually stop hitting REASON_SLOW_CONSUMER errors.

A nearly viable alternative to this ad-hoc scan is to ensure that the
processor-wide resolved timestamp tracker publishes at least one non-zero
checkpoint on each registration before the registration is allowed to fail. This
runs into the issues described in cockroachdb#77696 (comment).
Specifically, because this tracker is shared with other registrations, it
continues to advance even after the stream of events has been broken to an
overflowing registration. That means that a checkpoint computed after the
registration has overflowed cannot be published without violating the ordering
contract of rangefeed checkpoints ("all prior events have been seen"). The
checkpoint published after the initial catch-up scan needs to be coherent with
the state of the range that the catch-up scan saw.

This change should be followed up with system-level testing that exercises a
changefeed's ability to be unpaused after a long amount of time on a table with
a high rate of writes. That is an example of the kind of situation that this
change aims to improve.

Release justification: None. Wait on this.

Release note (enterprise change): Changefeeds are now guaranteed to make forward
progress while performing a large catch-up scan on a table with a high rate of
writes.
@nvanbenschoten nvanbenschoten changed the title kv: Fix a race in catchup scan completion that may send an error. kv: Fix a race in catchup scan completion that may send an error Sep 27, 2022
@exalate-issue-sync exalate-issue-sync bot changed the title kv: Fix a race in catchup scan completion that may send an error kv: Fix a race in catchup scan completion that may send an error. Feb 20, 2024