
kv: Fix a race in catchup scan completion that may send an error. #77696

Open

miretskiy opened this issue Mar 11, 2022 · 9 comments · May be fixed by #77724
Labels: A-cdc Change Data Capture · C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

miretskiy (Contributor) commented Mar 11, 2022

As noticed by @nvanbenschoten, it is possible that while the catch-up scan is running,
the buffer for raft events may overflow. In that case, when the catch-up scan completes,
we may send an overflow error instead of a checkpoint event. We should prioritize sending
the checkpoint event.

CRDB-8666

Jira issue: CRDB-13720
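For illustration, here is a minimal Go sketch of the ordering being asked for. The types and names below (registration, onCatchUpDone, errSlowConsumer) are hypothetical stand-ins, not the actual rangefeed code:

```go
// A minimal sketch of the ordering problem: if the raft-event buffer
// overflows while the catch-up scan is still running, the overflow error
// must not be surfaced before the checkpoint that "locks in" the catch-up
// scan's progress. All types and names here are hypothetical.
package main

import (
	"errors"
	"fmt"
)

var errSlowConsumer = errors.New("REASON_SLOW_CONSUMER: buffer overflowed")

type registration struct {
	overflowed bool        // set by the raft-event producer when the buffer fills
	events     chan string // bounded per-registration buffer
}

// onCatchUpDone is called when the catch-up scan completes. Publishing the
// checkpoint before consulting the overflowed flag is what this issue asks
// for; checking the flag first reproduces the race.
func (r *registration) onCatchUpDone(checkpointTS int64) error {
	// Prioritize the checkpoint so the client can advance its frontier
	// even if the buffer overflowed during the catch-up scan.
	r.events <- fmt.Sprintf("checkpoint@%d", checkpointTS)
	if r.overflowed {
		return errSlowConsumer // client retries, but from the newer timestamp
	}
	return nil
}

func main() {
	r := &registration{overflowed: true, events: make(chan string, 1)}
	err := r.onCatchUpDone(42)
	fmt.Println(<-r.events, err)
}
```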

@miretskiy miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc T-kv KV Team labels Mar 11, 2022
blathers-crl bot commented Mar 11, 2022

cc @cockroachdb/cdc

@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Mar 11, 2022
nvanbenschoten (Member) commented:

> We should prioritize sending the checkpoint event.

Specifically, we should prioritize sending a checkpoint event that contains a non-zero timestamp after a catch-up scan completes and before the registration consults its overflowed flag. In doing so, we ensure progress[1] of the rangefeed across per-range retries due to REASON_SLOW_CONSUMER errors that are often tied to large catch-up scans. Without this, we have no guarantee that a large catch-up scan's progress will be locked in and that it won't retry indefinitely.

Ensuring this is going to be a little tricky, because a non-zero checkpoint timestamp depends on the completion of the initResolvedTSScan. Now that this scan only consults the separated lock table (as of #71295), it seems reasonable to make tailing the raft log wait on its completion. Further care will need to be taken with the resolvedTimestamp to avoid clobbering the initial closed timestamp (as of the point of registration) so that the resolved timestamp after the catch-up scan is applicable to the current state of the registration and not too advanced.

Footnotes

[1] "progress" meaning that on a reconnection, the per-range "cursor" timestamp is larger than before. The client-side interaction points between checkpoints and retry timestamps are here, here, and here.
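To illustrate the footnote's notion of progress, here is a rough model of the client-side retry behaviour; the function and names below are made up, not the actual DistSender code:

```go
package main

import "fmt"

// runWithRetries models the client-side retry loop the footnote refers to
// (hypothetical names, not the real DistSender code). The per-range cursor
// only advances when a checkpoint newer than the cursor was received before
// the error, which is why an empty (zero-timestamp) checkpoint does not
// count as "progress".
func runWithRetries(startTS int64, attempts int, connect func(cursor int64) (checkpointTS int64, err error)) int64 {
	cursor := startTS
	for i := 0; i < attempts; i++ {
		checkpointTS, err := connect(cursor)
		if checkpointTS > cursor {
			cursor = checkpointTS // progress: the next retry starts from a later timestamp
		}
		if err == nil {
			break
		}
		// With no checkpoint newer than the cursor, the same catch-up span
		// is re-scanned on the next attempt.
	}
	return cursor
}

func main() {
	// Stub connection that always fails with a slow-consumer error but
	// manages to deliver a checkpoint 10 units ahead of the cursor first.
	connect := func(cursor int64) (int64, error) {
		return cursor + 10, fmt.Errorf("REASON_SLOW_CONSUMER")
	}
	fmt.Println(runWithRetries(100, 3, connect)) // prints 130
}
```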

nvanbenschoten (Member) commented:

Making guarantees here seems difficult due to the shared resolvedTimestamp tracking (and associated unresolvedTxnHeap), which lives above the level of individual registration buffers. This means that even when a single registration's buffer overflows, the shared resolvedTimestamp tracking continues to advance. As a result, once an individual registration's buffer overflows, it often has not published, and can no longer publish, the incremental changes necessary to make use of the shared resolved timestamp.

An option that avoids this issue is to allow each individual registration to perform its own initResolvedTSScan at its original log index (i.e. with an iterator captured under the raftMu when it began tailing the raft log). The registration could combine this with the closed timestamp from the original log index to construct a checkpoint to publish before erroring out. To avoid this costing too much under steady-state operation, we could limit this extra scan to only registrations that hit REASON_SLOW_CONSUMER errors before they publish their first non-zero checkpoint.
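A hand-wavy sketch of that per-registration option follows; the helper below is hypothetical, not a real cockroachdb/cockroach API, and only shows how the closed timestamp and the minimum intent timestamp from the registration's own ad-hoc scan might combine into one final checkpoint:

```go
package main

import "fmt"

// finalCheckpointTS sketches the per-registration option described above
// (hypothetical helper, not a real cockroachdb/cockroach API): an overflowed
// registration that never published a non-zero checkpoint computes one last
// resolved timestamp from the closed timestamp and the minimum intent
// timestamp observed by its own ad-hoc scan, both taken at the log index
// where it began tailing the raft log.
func finalCheckpointTS(publishedNonZero bool, closedTS, minIntentTS int64) (int64, bool) {
	if publishedNonZero {
		// Steady-state registrations skip the extra scan entirely.
		return 0, false
	}
	resolved := closedTS
	if minIntentTS != 0 && minIntentTS <= resolved {
		// The resolved timestamp cannot advance past an outstanding intent.
		resolved = minIntentTS - 1
	}
	return resolved, true
}

func main() {
	ts, ok := finalCheckpointTS(false /* publishedNonZero */, 100 /* closedTS */, 80 /* minIntentTS */)
	fmt.Println(ts, ok) // 79 true: a checkpoint at 79 is published before the error
}
```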

@nvanbenschoten nvanbenschoten self-assigned this Mar 11, 2022
nvanbenschoten (Member) commented:

I think the original premise for this issue was moderately incorrect. We only return a REASON_SLOW_CONSUMER error when the registration's buffer hits a size limit after draining the buffer. So the initial empty checkpoint will be published in this case.

However, the later realization still stands: an empty checkpoint is insufficient to ensure forward progress. There's no guarantee that any non-empty checkpoint will land in the buffer before it reaches capacity.
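As a toy illustration of this drain-then-error behaviour (hypothetical code, not the real registration implementation):

```go
package main

import (
	"errors"
	"fmt"
)

var errSlowConsumer = errors.New("REASON_SLOW_CONSUMER")

// buffer is a toy bounded buffer illustrating the behaviour described above:
// events that fit, such as the initial empty checkpoint, are drained and
// delivered before the error is returned, but a non-empty checkpoint
// produced after the buffer filled up never makes it in.
type buffer struct {
	events     []string
	cap        int
	overflowed bool
}

func (b *buffer) publish(ev string) {
	if len(b.events) == b.cap {
		b.overflowed = true // event dropped; only remembered as an overflow
		return
	}
	b.events = append(b.events, ev)
}

// drain delivers everything that fit, then reports the overflow, if any.
func (b *buffer) drain() ([]string, error) {
	out := b.events
	b.events = nil
	if b.overflowed {
		return out, errSlowConsumer
	}
	return out, nil
}

func main() {
	b := &buffer{cap: 2}
	b.publish("checkpoint@0 (empty)")
	b.publish("write@5")
	b.publish("checkpoint@10") // dropped: the buffer is already full
	fmt.Println(b.drain())     // [checkpoint@0 (empty) write@5] REASON_SLOW_CONSUMER
}
```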

I have a prototype that implements the "allow each individual registration to perform its own initResolvedTSScan" approach, but it is moderately complex. I'd be interested in determining how frequently we hit this case of only publishing an empty checkpoint before returning a REASON_SLOW_CONSUMER error. @miretskiy do we have any tests that demonstrate this starvation behavior, combining large catch-up scans with heavy foreground traffic?

miretskiy (Contributor, Author) commented:

No, I don't think we have that. What we know is that the catch-up scan started from a timestamp far in the past, and that the data included large JSONB blobs, so the consumer could not keep up.

nvanbenschoten (Member) commented:

#77724 prototypes a fix for this issue.

As I mentioned in the PR, I don't think we should try to merge it right away. Instead, I think we should add instrumentation (logging or a metric) to help us determine whether this is a real problem in these cases. There is a race here, but it is less severe than previously believed, so it's not clear whether this is actually leading to starvation or not.

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Mar 18, 2022
@miretskiy miretskiy removed the T-cdc label Mar 23, 2022
miretskiy (Contributor, Author) commented:

The customer enabled vlogging in DistSender, and we appear to be hitting slow consumer errors very frequently. I think the ability to make forward progress upon catch-up completion is very important.

miretskiy (Contributor, Author) commented:

@nvanbenschoten just want to make sure you saw the comment above -- do you think we should resurrect your PR?

nvanbenschoten (Member) commented:

@miretskiy thanks for letting me know. If we are seeing the theorized starvation in practice then I do think it's worthwhile to resurrect the PR. In the logs, did we repeatedly see the REASON_SLOW_CONSUMER error before a non-empty checkpoint?

I can take this as an action item for v23.1. Do we need it sooner?

> do we have any tests that demonstrate this starvation behavior, combining large catch-up scans with heavy foreground traffic?

I think this kind of test would be worthwhile to demonstrate this behavior. It would also force us to express the relationship between catch-up scan throughput per range, new write throughput per range, and rangefeed-level buffering per range. It would be nice to be able to say that a rangefeed (with retry loop) will always be able to eventually catch up and establish a stable connection if catch_up_rate > new_write_rate, regardless of buffer size.
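A toy model of that convergence condition, with purely illustrative numbers and names (not CockroachDB code):

```go
package main

import "fmt"

// retriesToConverge models the claim above: each retry scans the backlog
// accumulated so far at catchUpRate while new writes keep arriving at
// writeRate, so the next backlog is roughly backlog * (writeRate /
// catchUpRate). If catchUpRate > writeRate, the backlog shrinks
// geometrically and the rangefeed converges regardless of buffer size;
// otherwise it never does.
func retriesToConverge(backlogMB, catchUpRate, writeRate, doneThresholdMB float64, maxRetries int) (int, bool) {
	for retry := 1; retry <= maxRetries; retry++ {
		scanSeconds := backlogMB / catchUpRate
		backlogMB = scanSeconds * writeRate // writes accumulated during the scan
		if backlogMB <= doneThresholdMB {
			return retry, true
		}
	}
	return maxRetries, false
}

func main() {
	// 1024 MB behind, scanning at 32 MB/s against 8 MB/s of new writes:
	// the backlog shrinks 4x per retry and the rangefeed converges quickly.
	fmt.Println(retriesToConverge(1024, 32, 8, 1, 100)) // 5 true
	// With writes faster than the catch-up scan, it never converges.
	fmt.Println(retriesToConverge(1024, 8, 32, 1, 100)) // 100 false
}
```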

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 23, 2022
…MER error

Fixes cockroachdb#77696.

This commit ensures progress of per-range rangefeeds in the presence of large
catch-up scans by guaranteeing that a non-empty checkpoint is published before
returning a REASON_SLOW_CONSUMER error to the client. This ensures that the
client doesn't spin in `DistSender` on the same catch-up span without advancing
its frontier. It does so by having rangefeed registrations perform an ad-hoc
resolved timestamp computation in cases where the registration's buffer hits a
memory limit before it succeeds in publishing a non-empty checkpoint.

In doing so, we can make a loose guarantee (assuming timely closed timestamp
progression) that a rangefeed with a client-side retry loop will always be able
to catch-up and converge towards a stable connection as long as its rate of
consumption is greater than the rate of production on the table. In other words,
if `catch_up_scan_rate > new_write_rate`, the retry loop will make forward
progress and eventually stop hitting REASON_SLOW_CONSUMER errors.

A nearly viable alternative to this ad-hoc scan is to ensure that the
processor-wide resolved timestamp tracker publishes at least one non-zero
checkpoint on each registration before the registration is allowed to fail. This
runs into the issues described in cockroachdb#77696 (comment).
Specifically, because this tracker is shared with other registrations, it
continues to advance even after the stream of events has been broken to an
overflowing registration. That means that a checkpoint computed after the
registration has overflowed cannot be published without violating the ordering
contract of rangefeed checkpoints ("all prior events have been seen"). The
checkpoint published after the initial catch-up scan needs to be coherent with
the state of the range that the catch-up scan saw.

This change should be followed up with system-level testing that exercises a
changefeed's ability to be unpaused after a long amount of time on a table with
a high rate of writes. That is an example of the kind of situation that this
change aims to improve.

Release justification: None. Wait on this.

Release note (enterprise change): Changefeeds are now guaranteed to make forward
progress while performing a large catch-up scan on a table with a high rate of
writes.
@nvanbenschoten nvanbenschoten changed the title kv: Fix a race in catchup scan completion that may send an error. kv: Fix a race in catchup scan completion that may send an error Sep 27, 2022
@exalate-issue-sync exalate-issue-sync bot changed the title kv: Fix a race in catchup scan completion that may send an error kv: Fix a race in catchup scan completion that may send an error. Feb 20, 2024