kv/rangefeed: guarantee non-empty checkpoint before REASON_SLOW_CONSUMER error #77724
Conversation
Agreed re instrumentation; perhaps even less of a priority if we add #77725.
Force-pushed from 7ed5309 to d1be7cd.
87877: kvnemesis: simplify and document validation logic r=erikgrinaker a=tbg

It took me a while to fully understand `processOp` and `checkAtomic`. This commit simplifies them. It does so in a few ways:

- remove a map that only ever had one entry.
- avoid the need to use fake transaction IDs, and in particular clarify that nothing is really about txn IDs, it's about collecting atomic units (which might originate in a batch or a txn)
- thread the optional execution timestamp directly, as opposed to indirecting through an optional `*Transaction` proto.

The last point deserves a few more words. At its core, `kvnemesis` wants to figure out valid execution timestamps by relying on unique values coming in over the rangefeed stream. But deletion tombstones carry no value and thus aren't unique. There then needs to be some way to match up a deletion tombstone with an operation that might have written it. This requires knowledge of the timestamp at which the operation executed, and kvnemesis was, at least for `ClosureTxnOperation`s, using its knowledge of the commit timestamp for that purpose. We can actually get that timestamp for all operations, though, and we should switch `kvnemesis` to sort operations by their execution timestamp, and then verify that the observed MVCC history is congruent with that execution order[^1]. This commit doesn't quite do that, but it sets the stage by abstracting away from the txn commit timestamp.

This is related to #69642 in that this is the issue that prompted this refactor.

[^1]: which happens to have been something also envisioned by the original author: https://github.com/cockroachdb/cockroach/blob/7cde315da539fe3d790f546a1ddde6cc882fca6b/pkg/kv/kvnemesis/validator.go#L43-L46

Release note: None

88308: kv/rangefeed: reduce size of event struct from 200 bytes to 72 bytes r=nvanbenschoten a=nvanbenschoten

This commit restructures the event struct and reduces its size from 200 bytes to 72 bytes. This is accomplished primarily by pushing large, infrequently used struct fields into pointers.

This is mostly just a drive-by cleanup found while working on #77724.

Release justification: None. Don't merge yet.

Release note: None.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
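For illustration, here is a minimal, self-contained sketch of the struct-shrinking technique the 88308 message describes: large, infrequently populated members are moved behind a pointer so the common-case struct stays small. The field names are hypothetical stand-ins, not the actual rangefeed event fields.

```go
package main

import (
	"fmt"
	"unsafe"
)

// rareDetails stands in for a large member that most events never populate.
type rareDetails struct {
	payload [128]byte
}

// eventBefore embeds the large member directly, inflating every event.
type eventBefore struct {
	kind    uint8
	details rareDetails
}

// eventAfter keeps only a pointer, which stays nil in the common case.
type eventAfter struct {
	kind    uint8
	details *rareDetails
}

func main() {
	fmt.Println(unsafe.Sizeof(eventBefore{})) // 129 bytes: dominated by rareDetails
	fmt.Println(unsafe.Sizeof(eventAfter{}))  // 16 bytes on 64-bit platforms
}
```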
Force-pushed from d1be7cd to b8016e4.
kv/rangefeed: guarantee non-empty checkpoint before REASON_SLOW_CONSUMER error

Fixes cockroachdb#77696.

This commit ensures progress of per-range rangefeeds in the presence of large catch-up scans by guaranteeing that a non-empty checkpoint is published before returning a REASON_SLOW_CONSUMER error to the client. This ensures that the client doesn't spin in `DistSender` on the same catch-up span without advancing its frontier. It does so by having rangefeed registrations perform an ad-hoc resolved timestamp computation in cases where the registration's buffer hits a memory limit before it succeeds in publishing a non-empty checkpoint.

In doing so, we can make a loose guarantee (assuming timely closed timestamp progression) that a rangefeed with a client-side retry loop will always be able to catch up and converge towards a stable connection as long as its rate of consumption is greater than the rate of production on the table. In other words, if `catch_up_scan_rate > new_write_rate`, the retry loop will make forward progress and eventually stop hitting REASON_SLOW_CONSUMER errors.

A nearly viable alternative to this ad-hoc scan is to ensure that the processor-wide resolved timestamp tracker publishes at least one non-zero checkpoint on each registration before the registration is allowed to fail. This runs into the issues described in cockroachdb#77696 (comment). Specifically, because this tracker is shared with other registrations, it continues to advance even after the stream of events has been broken to an overflowing registration. That means that a checkpoint computed after the registration has overflowed cannot be published without violating the ordering contract of rangefeed checkpoints ("all prior events have been seen"). The checkpoint published after the initial catch-up scan needs to be coherent with the state of the range that the catch-up scan saw.

This change should be followed up with system-level testing that exercises a changefeed's ability to be unpaused after a long amount of time on a table with a high rate of writes. That is an example of the kind of situation that this change aims to improve.

Release justification: None. Wait on this.

Release note (enterprise change): Changefeeds are now guaranteed to make forward progress while performing a large catch-up scan on a table with a high rate of writes.
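To make the overflow path concrete, here is a hedged, self-contained sketch of the mechanism the commit message describes. Names such as `registration`, `resolvedTS`, and `send` are illustrative stand-ins, not the actual CockroachDB rangefeed code.

```go
package main

import (
	"errors"
	"fmt"
)

type timestamp int64 // stand-in for hlc.Timestamp; zero means "empty"

var errSlowConsumer = errors.New("REASON_SLOW_CONSUMER")

type registration struct {
	span           string    // stand-in for the registration's key span
	lastCheckpoint timestamp // zero until a non-empty checkpoint is published
	resolvedTS     func() timestamp
	send           func(span string, ts timestamp)
}

// disconnectSlowConsumer is invoked when the registration's event buffer hits
// its memory limit. Before returning the error, it guarantees that at least
// one non-empty checkpoint has been published so the client's DistSender retry
// loop can advance its frontier instead of re-scanning the same span forever.
func (r *registration) disconnectSlowConsumer() error {
	if r.lastCheckpoint == 0 {
		// Ad-hoc resolved timestamp computation over the registration's span,
		// coherent with the range state the catch-up scan observed.
		if rts := r.resolvedTS(); rts > 0 {
			r.send(r.span, rts)
			r.lastCheckpoint = rts
		}
	}
	return errSlowConsumer
}

func main() {
	r := &registration{
		span:       "/Table/50/{1-2}",
		resolvedTS: func() timestamp { return 1700000000 },
		send: func(span string, ts timestamp) {
			fmt.Printf("checkpoint %s @ %d\n", span, ts)
		},
	}
	fmt.Println(r.disconnectSlowConsumer())
}
```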
Force-pushed from b8016e4 to 45571ce.
This PR could use a new metric that tracks ad-hoc resolved timestamp scans.
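As a rough sketch of the kind of counter being asked for, using the generic Prometheus Go client purely for illustration (CockroachDB has its own metric framework, and the metric name here is hypothetical):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Counter bumped each time a registration performs an ad-hoc resolved
// timestamp scan before returning REASON_SLOW_CONSUMER.
var rangeFeedAdhocRTSScans = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "kv_rangefeed_adhoc_resolved_ts_scans",
	Help: "Ad-hoc resolved timestamp scans performed by rangefeed registrations",
})

func main() {
	prometheus.MustRegister(rangeFeedAdhocRTSScans)
	rangeFeedAdhocRTSScans.Inc() // call site would be the overflow path
	fmt.Println("metric registered and incremented")
}
```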
This looks workable. I have some nits regarding comments that would make it easier to understand.
// RangeFeedEventChanCap overrides the default value for
// rangefeed.Config.EventChanCap.
RangeFeedEventChanCap int
// RangeFeedEventChanCap overrides the default value for
s/RangeFeedEventChanCap/RangeFeedSkipInitResolvedTS/
@@ -574,9 +644,13 @@ func (r *registration) maybeConstructCatchUpIter() {
	catchUpIter := r.catchUpIterConstructor(r.span, r.catchUpTimestamp)
	r.catchUpIterConstructor = nil

	rtsIter := r.rtsIterConstructor(r.span)
This method now does more than just construct the catch-up iterator. Its name and comment should reflect that; otherwise it is asymmetric with respect to detachCatchUpIter and detachRTSIter.
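Something along these lines, as a simplified sketch of the suggested rename (the types and fields below are stand-ins, not the real rangefeed registration):

```go
package main

import "fmt"

type iterator struct{ span string }

type registration struct {
	span                   string
	catchUpTimestamp       int64
	catchUpIterConstructor func(span string, ts int64) *iterator
	rtsIterConstructor     func(span string) *iterator
	catchUpIter            *iterator
	rtsIter                *iterator
}

// maybeConstructIterators builds both the catch-up iterator and the resolved
// timestamp iterator, then clears the constructors so this happens only once.
// The name stays symmetric with detachCatchUpIter and detachRTSIter.
func (r *registration) maybeConstructIterators() {
	if r.catchUpIterConstructor != nil {
		r.catchUpIter = r.catchUpIterConstructor(r.span, r.catchUpTimestamp)
		r.catchUpIterConstructor = nil
	}
	if r.rtsIterConstructor != nil {
		r.rtsIter = r.rtsIterConstructor(r.span)
		r.rtsIterConstructor = nil
	}
}

func main() {
	r := &registration{
		span:                   "/Table/50/{1-2}",
		catchUpTimestamp:       1,
		catchUpIterConstructor: func(span string, ts int64) *iterator { return &iterator{span: span} },
		rtsIterConstructor:     func(span string) *iterator { return &iterator{span: span} },
	}
	r.maybeConstructIterators()
	fmt.Println(r.catchUpIter.span, r.rtsIter.span)
}
```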
// The task is abstracted so that it can also be used outside the context of a
// Processor, even though Processor is the primary consumer. This flexibility
// allows registrations to perform ad-hoc computations of the resolved timestamp
// at specific Raft log indexes in certain error cases.
Does "specific Raft log indexes" mean after an unsuccessful catch-up scan? We'd rather change the original comment to refer to processorRTSScanConsumer and give an example of the Processor being a consumer when running the initial scan, versus an ad-hoc synchronous consumer run by a registration when the initial scan can't complete.
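One possible wording along these lines, purely illustrative (the exact consumer name and call sites are assumptions based on the discussion above):

```go
// The task reports its result to a processorRTSScanConsumer. The Processor is
// the primary consumer and runs the task for its regular initial resolved
// timestamp scan; a registration may also act as an ad-hoc, synchronous
// consumer, running the task itself when its catch-up can't complete and it
// must publish a non-empty checkpoint before disconnecting.
```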
@@ -324,20 +313,26 @@ func (r *Replica) registerWithRangefeedRaftMuLocked(
	ctx context.Context,
	span roachpb.RSpan,
	startTS hlc.Timestamp, // exclusive
	catchUpIter rangefeed.CatchUpIteratorConstructor,
	catchUpIterFunc rangefeed.CatchUpIteratorConstructor,
makeCatchUpIter and the guy below? If that doesn't break any conventions.
Fixes #77696.

This commit ensures progress of per-range rangefeeds in the presence of large catch-up scans by guaranteeing that a non-empty checkpoint is published before returning a REASON_SLOW_CONSUMER error to the client. This ensures that the client doesn't spin in `DistSender` on the same catch-up span without advancing its frontier. It does so by having rangefeed registrations perform an ad-hoc resolved timestamp computation in cases where the registration's buffer hits a memory limit before it succeeds in publishing a non-empty checkpoint.

In doing so, we can make a loose guarantee (assuming timely closed timestamp progression) that a rangefeed with a client-side retry loop will always be able to catch up and converge towards a stable connection as long as its rate of consumption is greater than the rate of production on the table. In other words, if `catch_up_scan_rate > new_write_rate`, the retry loop will make forward progress and eventually stop hitting REASON_SLOW_CONSUMER errors.

A nearly viable alternative to this ad-hoc scan is to ensure that the processor-wide resolved timestamp tracker publishes at least one non-zero checkpoint on each registration before the registration is allowed to fail. This runs into the issues described in #77696 (comment). Specifically, because this tracker is shared with other registrations, it continues to advance even after the stream of events has been broken to an overflowing registration. That means that a checkpoint computed after the registration has overflowed cannot be published without violating the ordering contract of rangefeed checkpoints ("all prior events have been seen"). The checkpoint published after the initial catch-up scan needs to be coherent with the state of the range that the catch-up scan saw.

This change should be followed up with system-level testing that exercises a changefeed's ability to be unpaused after a long amount of time on a table with a high rate of writes. That is an example of the kind of situation that this change aims to improve.

Release justification: None. Wait on this.

Release note (enterprise change): Changefeeds are now guaranteed to make forward progress while performing a large catch-up scan on a table with a high rate of writes.
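The convergence argument in the description can be made concrete with a back-of-the-envelope model (all numbers below are made up for illustration): as long as each retry publishes a checkpoint for the span it managed to scan before overflowing, the remaining backlog shrinks whenever `catch_up_scan_rate > new_write_rate`.

```go
package main

import "fmt"

// Toy model of the retry loop: a backlog of pending rows is drained at
// catchUpScanRate while new writes arrive at newWriteRate, and each retry
// advances the frontier by whatever it scanned before the buffer overflowed.
func main() {
	const (
		catchUpScanRate = 10000.0 // rows/sec the catch-up scan can emit
		newWriteRate    = 2000.0  // rows/sec of new writes on the table
		scanWindow      = 5.0     // seconds before the buffer overflows
	)
	backlog := 200000.0 // rows between the published frontier and the present
	for retry := 1; backlog > 0; retry++ {
		scanned := catchUpScanRate * scanWindow
		produced := newWriteRate * scanWindow
		backlog -= scanned - produced // net progress per retry; positive only
		if backlog < 0 {              // if catchUpScanRate > newWriteRate
			backlog = 0
		}
		fmt.Printf("retry %d: backlog %.0f rows\n", retry, backlog)
	}
}
```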