
kv/rangefeed: guarantee non-empty checkpoint before REASON_SLOW_CONSUMER error #77724

Open · wants to merge 1 commit into master

Conversation

@nvanbenschoten (Member) commented on Mar 12, 2022:

Fixes #77696.

This commit ensures progress of per-range rangefeeds in the presence of large catch-up scans by guaranteeing that a non-empty checkpoint is published before returning a REASON_SLOW_CONSUMER error to the client. This ensures that the client doesn't spin in DistSender on the same catch-up span without advancing its frontier. It does so by having rangefeed registrations perform an ad-hoc resolved timestamp computation in cases where the registration's buffer hits a memory limit before it succeeds in publishing a non-empty checkpoint.
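To make the fallback concrete, here is a minimal sketch in Go of the shape of that path, assuming hypothetical `Registration`, `Event`, and `computeResolvedTS` names; this is an illustration, not the actual CockroachDB implementation:

```go
package rangefeedsketch

import "errors"

// Event is a minimal stand-in for a rangefeed event: a zero ResolvedTS means
// a value event, a non-zero ResolvedTS means a checkpoint.
type Event struct {
	ResolvedTS int64
}

// Registration models a single per-range rangefeed consumer.
type Registration struct {
	publishedCheckpoint bool
	// computeResolvedTS stands in for the ad-hoc resolved timestamp
	// computation over the registration's span.
	computeResolvedTS func() (int64, error)
	emit              func(Event)
}

var errSlowConsumer = errors.New("REASON_SLOW_CONSUMER")

// disconnectSlowConsumer is invoked when the registration's buffer hits its
// memory limit. Before returning the retryable error, it ensures at least one
// non-empty checkpoint has been published so that the client's frontier
// advances and the retry does not redo the same catch-up span from scratch.
func (r *Registration) disconnectSlowConsumer() error {
	if !r.publishedCheckpoint {
		if ts, err := r.computeResolvedTS(); err == nil && ts > 0 {
			r.emit(Event{ResolvedTS: ts})
			r.publishedCheckpoint = true
		}
	}
	return errSlowConsumer
}
```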

With this mechanism in place, we can make a loose guarantee (assuming timely closed timestamp progression) that a rangefeed with a client-side retry loop will always be able to catch up and converge towards a stable connection, as long as its rate of consumption is greater than the rate of production on the table. In other words, if `catch_up_scan_rate > new_write_rate`, the retry loop will make forward progress and eventually stop hitting REASON_SLOW_CONSUMER errors.
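As a rough illustration with made-up numbers: if each retry attempt publishes a checkpoint covering 50 seconds of history while only 5 seconds of new writes accumulate during that attempt, the un-scanned backlog shrinks by roughly 45 seconds per attempt and the rangefeed eventually reconnects without error; with the rates reversed, the backlog would grow on every attempt and the retry loop would never converge.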

A nearly viable alternative to this ad-hoc scan is to ensure that the processor-wide resolved timestamp tracker publishes at least one non-zero checkpoint on each registration before the registration is allowed to fail. This runs into the issues described in #77696 (comment). Specifically, because this tracker is shared with other registrations, it continues to advance even after the stream of events has been broken to an overflowing registration. That means that a checkpoint computed after the registration has overflowed cannot be published without violating the ordering contract of rangefeed checkpoints ("all prior events have been seen"). The checkpoint published after the initial catch-up scan needs to be coherent with the state of the range that the catch-up scan saw.
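For a concrete (made-up) timeline: suppose the registration delivered events at t1–t3, then overflowed and dropped the events at t4 and t5. If the shared tracker later advances the range's resolved timestamp to t5, publishing that checkpoint on this registration would implicitly claim that the t4 and t5 events were delivered, which they were not; only a resolved timestamp coherent with the point at which the registration's event stream broke off is safe to publish.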

This change should be followed up with system-level testing that exercises a changefeed's ability to be unpaused after a long period of time on a table with a high rate of writes. That is an example of the kind of situation that this change aims to improve.

Release justification: None. Wait on this.

Release note (enterprise change): Changefeeds are now guaranteed to make forward progress while performing a large catch-up scan on a table with a high rate of writes.


@miretskiy (Contributor) commented:

Agreed re instrumentation; perhaps even less of a priority if we add #77725

@nvanbenschoten force-pushed the nvanbenschoten/ensureRTSScan branch 2 times, most recently from 7ed5309 to d1be7cd on September 21, 2022 19:06
@nvanbenschoten marked this pull request as ready for review on September 21, 2022 19:07
@nvanbenschoten requested a review from a team as a code owner on September 21, 2022 19:07
craig bot pushed a commit that referenced this pull request Sep 22, 2022
87877: kvnemesis: simplify and document validation logic r=erikgrinaker a=tbg

It took me a while to fully understand `processOp` and `checkAtomic`.
This commit simplifies them. It does so in a few ways:

- remove a map that only ever had one entry.
- avoid the need to use fake transaction IDs, and in particular clarify
  that nothing is really about txn IDs; it's about collecting atomic
  units (which might originate in a batch or a txn)
- thread the optional execution timestamp directly, as opposed to
  indirecting through an optional `*Transaction` proto.

The last point deserves a few more words. At its core, `kvnemesis` wants
to figure out valid execution timestamps by relying on unique values
coming in over the rangefeed stream. But deletion tombstones carry no value
and thus aren't unique. There then needs to be some way to match up a
deletion tombstone with an operation that might have written it. This
requires knowledge of the timestamp at which the operation executed, and
kvnemesis was, at least for `ClosureTxnOperation`s, using its knowledge
of the commit timestamp for that purpose.

We can actually get that timestamp for all operations, though, and we
should switch `kvnemesis` to sort operations by their execution
timestamp, and then verify that the observed MVCC history is congruent
with that execution order[^1].

This commit doesn't quite do that but it
sets the stage by abstracting away from the txn commit timestamp.
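A rough sketch of the kind of congruence check this would enable, using hypothetical `Op` and `MVCCVersion` types rather than the real kvnemesis structures, and assuming each key's observed versions are already sorted by timestamp:

```go
package kvnemesissketch

import "sort"

// Op is a stand-in for an operation that wrote Value to Key at ExecTS.
type Op struct {
	Key    string
	Value  string
	ExecTS int64
}

// MVCCVersion is a stand-in for one observed version of a key.
type MVCCVersion struct {
	Value string
	TS    int64
}

// congruent checks, per key, that replaying the operations in execution
// timestamp order reproduces the observed MVCC versions in timestamp order.
func congruent(ops []Op, history map[string][]MVCCVersion) bool {
	sort.Slice(ops, func(i, j int) bool { return ops[i].ExecTS < ops[j].ExecTS })
	next := map[string]int{} // next expected version index per key
	for _, op := range ops {
		versions := history[op.Key]
		i := next[op.Key]
		if i >= len(versions) || versions[i].Value != op.Value || versions[i].TS != op.ExecTS {
			return false
		}
		next[op.Key] = i + 1
	}
	// Every observed version must be explained by some operation.
	for key, versions := range history {
		if next[key] != len(versions) {
			return false
		}
	}
	return true
}
```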

This is related to #69642, which is the issue that prompted this
refactor.

[^1]: which happens to have been something also envisioned by the original author: https://github.com/cockroachdb/cockroach/blob/7cde315da539fe3d790f546a1ddde6cc882fca6b/pkg/kv/kvnemesis/validator.go#L43-L46

Release note: None

88308: kv/rangefeed: reduce size of event struct from 200 bytes to 72 bytes r=nvanbenschoten a=nvanbenschoten

This commit restructures the event struct and reduces its size from 200 bytes to 72 bytes. This is accomplished primarily by pushing large, infrequently used struct fields into pointers. This is mostly just a drive-by cleanup found while working on #77724.
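As a general illustration of the technique (the struct names and field sizes below are made up, not the actual rangefeed `event` struct):

```go
package main

import (
	"fmt"
	"unsafe"
)

// fatEvent carries every possible payload inline, so every queued event pays
// for all of them even though only one is ever populated.
type fatEvent struct {
	kind    int
	payload [160]byte // stand-in for several large, rarely used inline structs
}

// slimEvent keeps only the common fields inline and pushes the large,
// infrequently used payload behind a pointer, which is allocated only for
// the rare events that actually need it.
type slimEvent struct {
	kind    int
	payload *[160]byte
}

func main() {
	fmt.Println(unsafe.Sizeof(fatEvent{}))  // 168 on a 64-bit platform
	fmt.Println(unsafe.Sizeof(slimEvent{})) // 16 on a 64-bit platform
}
```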

Release justification: None. Don't merge yet.

Release note: None.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten force-pushed the nvanbenschoten/ensureRTSScan branch from d1be7cd to b8016e4 on September 22, 2022 16:25
@nvanbenschoten force-pushed the nvanbenschoten/ensureRTSScan branch from b8016e4 to 45571ce on September 23, 2022 23:41
@nvanbenschoten (Member, Author) commented:

This PR could use a new metric that tracks ad-hoc resolved timestamp scans.
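A rough sketch of what such a counter might look like, following the existing `util/metric` pattern; the metric name, help text, and placement are assumptions, not part of this PR:

```go
package rangefeed

import "github.com/cockroachdb/cockroach/pkg/util/metric"

// metaRangeFeedAdHocRTSScans follows the pattern of other rangefeed metrics;
// the name and help text here are made up for illustration.
var metaRangeFeedAdHocRTSScans = metric.Metadata{
	Name:        "kv.rangefeed.adhoc_resolved_ts_scans",
	Help:        "Count of ad-hoc resolved timestamp scans performed by registrations before returning REASON_SLOW_CONSUMER",
	Measurement: "Scans",
	Unit:        metric.Unit_COUNT,
}

// The counter would be registered alongside the other rangefeed metrics and
// incremented from the registration's fallback path, e.g. counter.Inc(1).
var rangeFeedAdHocRTSScans = metric.NewCounter(metaRangeFeedAdHocRTSScans)
```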

@aliher1911 (Contributor) left a comment:

This looks workable. I have some nits regarding the comments that would make the code easier to understand.

// RangeFeedEventChanCap overrides the default value for
// rangefeed.Config.EventChanCap.
RangeFeedEventChanCap int
// RangeFeedEventChanCap overrides the default value for
Review comment (Contributor):

s/RangeFeedEventChanCap/RangeFeedSkipInitResolvedTS/

@@ -574,9 +644,13 @@ func (r *registration) maybeConstructCatchUpIter() {
catchUpIter := r.catchUpIterConstructor(r.span, r.catchUpTimestamp)
r.catchUpIterConstructor = nil

rtsIter := r.rtsIterConstructor(r.span)
Review comment (Contributor):

This method now does more than just construct the catch-up iterator. Its name and comment should reflect that; otherwise it is asymmetric with respect to detachCatchUpIter and detachRTSIter.

// The task is abstracted so that it can also be used outside the context of a
// Processor, even though Processor is the primary consumer. This flexibility
// allows registrations to perform ad-hoc computations of the resolved timestamp
// at specific Raft log indexes in certain error cases.
Review comment (Contributor):

Does "specific Raft log indexes" mean after an unsuccessful catch-up scan?
We should instead change the original comment to refer to processorRTSScanConsumer and give an example of the processor being a consumer when running the initial scan, or of the ad-hoc synchronous consumer run by the registration when the initial scan can't complete.

@@ -324,20 +313,26 @@ func (r *Replica) registerWithRangefeedRaftMuLocked(
ctx context.Context,
span roachpb.RSpan,
startTS hlc.Timestamp, // exclusive
catchUpIter rangefeed.CatchUpIteratorConstructor,
catchUpIterFunc rangefeed.CatchUpIteratorConstructor,
Review comment (Contributor):

makeCatchUpIter and the guy below? If that doesn't break any conventions.

Successfully merging this pull request may close these issues:

kv: Fix a race in catchup scan completion that may send an error.