kv/rangefeed: guarantee non-empty checkpoint before REASON_SLOW_CONSUMER error #77724
Conversation
Agreed re instrumentation; perhaps even less of a priority if we add #77725.
Force-pushed from 7ed5309 to d1be7cd.
87877: kvnemesis: simplify and document validation logic r=erikgrinaker a=tbg

It took me a while to fully understand `processOp` and `checkAtomic`. This commit simplifies them. It does so in a few ways:

- remove a map that only ever had one entry.
- avoid the need to use fake transaction IDs, and in particular clarify that nothing is really about txn IDs, it's about collecting atomic units (which might originate in a batch or a txn)
- thread the optional execution timestamp directly, as opposed to indirecting through an optional `*Transaction` proto.

The last point deserves a few more words. At its core, `kvnemesis` wants to figure out valid execution timestamps by relying on unique values coming in over the rangefeed stream. But deletion tombstones carry no value and thus aren't unique. There then needs to be some way to match up a deletion tombstone with an operation that might have written it. This requires knowledge of the timestamp at which the operation executed, and kvnemesis was, at least for `ClosureTxnOperation`s, using its knowledge of the commit timestamp for that purpose. We can actually get that timestamp for all operations, though, and we should switch `kvnemesis` to sort operations by their execution timestamp, and then verify that the observed MVCC history is congruent with that execution order[^1]. This commit doesn't quite do that, but it sets the stage by abstracting away from the txn commit timestamp.

This is related to #69642 in that this is the issue that prompted this refactor.

[^1]: which happens to have been something also envisioned by the original author: https://github.com/cockroachdb/cockroach/blob/7cde315da539fe3d790f546a1ddde6cc882fca6b/pkg/kv/kvnemesis/validator.go#L43-L46

Release note: None

88308: kv/rangefeed: reduce size of event struct from 200 bytes to 72 bytes r=nvanbenschoten a=nvanbenschoten

This commit restructures the event struct and reduces its size from 200 bytes to 72 bytes. This is accomplished primarily by pushing large, infrequently used struct fields into pointers.

This is mostly just a drive-by cleanup found while working on #77724.

Release justification: None. Don't merge yet.

Release note: None.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
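For illustration, here is a minimal, self-contained sketch of the struct-shrinking technique the 88308 message describes: large, infrequently populated members are moved behind a pointer so the common-case struct stays small. The field names are hypothetical stand-ins, not the actual rangefeed event fields.

```go
package main

import (
	"fmt"
	"unsafe"
)

// rareDetails stands in for a large member that most events never populate.
type rareDetails struct {
	payload [128]byte
}

// eventBefore embeds the large member directly, inflating every event.
type eventBefore struct {
	kind    uint8
	details rareDetails
}

// eventAfter keeps only a pointer, which stays nil in the common case.
type eventAfter struct {
	kind    uint8
	details *rareDetails
}

func main() {
	fmt.Println(unsafe.Sizeof(eventBefore{})) // 129 bytes: dominated by rareDetails
	fmt.Println(unsafe.Sizeof(eventAfter{}))  // 16 bytes on 64-bit platforms
}
```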
Force-pushed from d1be7cd to b8016e4.
kv/rangefeed: guarantee non-empty checkpoint before REASON_SLOW_CONSUMER error

Fixes cockroachdb#77696.

This commit ensures progress of per-range rangefeeds in the presence of large catch-up scans by guaranteeing that a non-empty checkpoint is published before returning a REASON_SLOW_CONSUMER error to the client. This ensures that the client doesn't spin in `DistSender` on the same catch-up span without advancing its frontier. It does so by having rangefeed registrations perform an ad-hoc resolved timestamp computation in cases where the registration's buffer hits a memory limit before it succeeds in publishing a non-empty checkpoint.

In doing so, we can make a loose guarantee (assuming timely closed timestamp progression) that a rangefeed with a client-side retry loop will always be able to catch up and converge towards a stable connection as long as its rate of consumption is greater than the rate of production on the table. In other words, if `catch_up_scan_rate > new_write_rate`, the retry loop will make forward progress and eventually stop hitting REASON_SLOW_CONSUMER errors.

A nearly viable alternative to this ad-hoc scan is to ensure that the processor-wide resolved timestamp tracker publishes at least one non-zero checkpoint on each registration before the registration is allowed to fail. This runs into the issues described in cockroachdb#77696 (comment). Specifically, because this tracker is shared with other registrations, it continues to advance even after the stream of events has been broken to an overflowing registration. That means that a checkpoint computed after the registration has overflowed cannot be published without violating the ordering contract of rangefeed checkpoints ("all prior events have been seen"). The checkpoint published after the initial catch-up scan needs to be coherent with the state of the range that the catch-up scan saw.

This change should be followed up with system-level testing that exercises a changefeed's ability to be unpaused after a long amount of time on a table with a high rate of writes. That is an example of the kind of situation that this change aims to improve.

Release justification: None. Wait on this.

Release note (enterprise change): Changefeeds are now guaranteed to make forward progress while performing a large catch-up scan on a table with a high rate of writes.
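To make the overflow path concrete, here is a hedged, self-contained sketch of the mechanism the commit message describes. Names such as `registration`, `resolvedTS`, and `send` are illustrative stand-ins, not the actual CockroachDB rangefeed code.

```go
package main

import (
	"errors"
	"fmt"
)

type timestamp int64 // stand-in for hlc.Timestamp; zero means "empty"

var errSlowConsumer = errors.New("REASON_SLOW_CONSUMER")

type registration struct {
	span           string    // stand-in for the registration's key span
	lastCheckpoint timestamp // zero until a non-empty checkpoint is published
	resolvedTS     func() timestamp
	send           func(span string, ts timestamp)
}

// disconnectSlowConsumer is invoked when the registration's event buffer hits
// its memory limit. Before returning the error, it guarantees that at least
// one non-empty checkpoint has been published so the client's DistSender retry
// loop can advance its frontier instead of re-scanning the same span forever.
func (r *registration) disconnectSlowConsumer() error {
	if r.lastCheckpoint == 0 {
		// Ad-hoc resolved timestamp computation over the registration's span,
		// coherent with the range state the catch-up scan observed.
		if rts := r.resolvedTS(); rts > 0 {
			r.send(r.span, rts)
			r.lastCheckpoint = rts
		}
	}
	return errSlowConsumer
}

func main() {
	r := &registration{
		span:       "/Table/50/{1-2}",
		resolvedTS: func() timestamp { return 1700000000 },
		send: func(span string, ts timestamp) {
			fmt.Printf("checkpoint %s @ %d\n", span, ts)
		},
	}
	fmt.Println(r.disconnectSlowConsumer())
}
```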
Force-pushed from b8016e4 to 45571ce.
This PR could use a new metric that tracks ad-hoc resolved timestamp scans.
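As a rough sketch of the kind of counter being asked for, using the generic Prometheus Go client purely for illustration (CockroachDB has its own metric framework, and the metric name here is hypothetical):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Counter bumped each time a registration performs an ad-hoc resolved
// timestamp scan before returning REASON_SLOW_CONSUMER.
var rangeFeedAdhocRTSScans = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "kv_rangefeed_adhoc_resolved_ts_scans",
	Help: "Ad-hoc resolved timestamp scans performed by rangefeed registrations",
})

func main() {
	prometheus.MustRegister(rangeFeedAdhocRTSScans)
	rangeFeedAdhocRTSScans.Inc() // call site would be the overflow path
	fmt.Println("metric registered and incremented")
}
```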
This looks workable. I have some nits regarding comments that would make it easier to understand.
// RangeFeedEventChanCap overrides the default value for
// rangefeed.Config.EventChanCap.
RangeFeedEventChanCap int
// RangeFeedEventChanCap overrides the default value for
s/RangeFeedEventChanCap/RangeFeedSkipInitResolvedTS/
@@ -574,9 +644,13 @@ func (r *registration) maybeConstructCatchUpIter() {
	catchUpIter := r.catchUpIterConstructor(r.span, r.catchUpTimestamp)
	r.catchUpIterConstructor = nil

	rtsIter := r.rtsIterConstructor(r.span)
This method now does more than just construct the catch-up iterator. Its name and comment should reflect that; otherwise it is asymmetric with respect to detachCatchUpIter and detachRTSIter.
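Something along these lines, as a simplified sketch of the suggested rename (the types and fields below are stand-ins, not the real rangefeed registration):

```go
package main

import "fmt"

type iterator struct{ span string }

type registration struct {
	span                   string
	catchUpTimestamp       int64
	catchUpIterConstructor func(span string, ts int64) *iterator
	rtsIterConstructor     func(span string) *iterator
	catchUpIter            *iterator
	rtsIter                *iterator
}

// maybeConstructIterators builds both the catch-up iterator and the resolved
// timestamp iterator, then clears the constructors so this happens only once.
// The name stays symmetric with detachCatchUpIter and detachRTSIter.
func (r *registration) maybeConstructIterators() {
	if r.catchUpIterConstructor != nil {
		r.catchUpIter = r.catchUpIterConstructor(r.span, r.catchUpTimestamp)
		r.catchUpIterConstructor = nil
	}
	if r.rtsIterConstructor != nil {
		r.rtsIter = r.rtsIterConstructor(r.span)
		r.rtsIterConstructor = nil
	}
}

func main() {
	r := &registration{
		span:                   "/Table/50/{1-2}",
		catchUpTimestamp:       1,
		catchUpIterConstructor: func(span string, ts int64) *iterator { return &iterator{span: span} },
		rtsIterConstructor:     func(span string) *iterator { return &iterator{span: span} },
	}
	r.maybeConstructIterators()
	fmt.Println(r.catchUpIter.span, r.rtsIter.span)
}
```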
// The task is abstracted so that it can also be used outside the context of a
// Processor, even though Processor is the primary consumer. This flexibility
// allows registrations to perform ad-hoc computations of the resolved timestamp
// at specific Raft log indexes in certain error cases.
Does "specific Raft log indexes" mean after an unsuccessful catch-up scan? We'd rather change the original comment to refer to processorRTSScanConsumer and give an example of the Processor being a consumer when running the initial scan, versus an ad-hoc synchronous consumer run by a registration when the initial scan can't complete.
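One possible wording along these lines, purely illustrative (the exact consumer name and call sites are assumptions based on the discussion above):

```go
// The task reports its result to a processorRTSScanConsumer. The Processor is
// the primary consumer and runs the task for its regular initial resolved
// timestamp scan; a registration may also act as an ad-hoc, synchronous
// consumer, running the task itself when its catch-up can't complete and it
// must publish a non-empty checkpoint before disconnecting.
```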
@@ -324,20 +313,26 @@ func (r *Replica) registerWithRangefeedRaftMuLocked(
	ctx context.Context,
	span roachpb.RSpan,
	startTS hlc.Timestamp, // exclusive
	catchUpIter rangefeed.CatchUpIteratorConstructor,
	catchUpIterFunc rangefeed.CatchUpIteratorConstructor,
makeCatchUpIter and the guy below? If that doesn't break any conventions.
Fixes #77696.

This commit ensures progress of per-range rangefeeds in the presence of large catch-up scans by guaranteeing that a non-empty checkpoint is published before returning a REASON_SLOW_CONSUMER error to the client. This ensures that the client doesn't spin in `DistSender` on the same catch-up span without advancing its frontier. It does so by having rangefeed registrations perform an ad-hoc resolved timestamp computation in cases where the registration's buffer hits a memory limit before it succeeds in publishing a non-empty checkpoint.

In doing so, we can make a loose guarantee (assuming timely closed timestamp progression) that a rangefeed with a client-side retry loop will always be able to catch up and converge towards a stable connection as long as its rate of consumption is greater than the rate of production on the table. In other words, if `catch_up_scan_rate > new_write_rate`, the retry loop will make forward progress and eventually stop hitting REASON_SLOW_CONSUMER errors.

A nearly viable alternative to this ad-hoc scan is to ensure that the processor-wide resolved timestamp tracker publishes at least one non-zero checkpoint on each registration before the registration is allowed to fail. This runs into the issues described in #77696 (comment). Specifically, because this tracker is shared with other registrations, it continues to advance even after the stream of events has been broken to an overflowing registration. That means that a checkpoint computed after the registration has overflowed cannot be published without violating the ordering contract of rangefeed checkpoints ("all prior events have been seen"). The checkpoint published after the initial catch-up scan needs to be coherent with the state of the range that the catch-up scan saw.

This change should be followed up with system-level testing that exercises a changefeed's ability to be unpaused after a long amount of time on a table with a high rate of writes. That is an example of the kind of situation that this change aims to improve.

Release justification: None. Wait on this.

Release note (enterprise change): Changefeeds are now guaranteed to make forward progress while performing a large catch-up scan on a table with a high rate of writes.
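The convergence argument in the description can be made concrete with a back-of-the-envelope model (all numbers below are made up for illustration): as long as each retry publishes a checkpoint for the span it managed to scan before overflowing, the remaining backlog shrinks whenever `catch_up_scan_rate > new_write_rate`.

```go
package main

import "fmt"

// Toy model of the retry loop: a backlog of pending rows is drained at
// catchUpScanRate while new writes arrive at newWriteRate, and each retry
// advances the frontier by whatever it scanned before the buffer overflowed.
func main() {
	const (
		catchUpScanRate = 10000.0 // rows/sec the catch-up scan can emit
		newWriteRate    = 2000.0  // rows/sec of new writes on the table
		scanWindow      = 5.0     // seconds before the buffer overflows
	)
	backlog := 200000.0 // rows between the published frontier and the present
	for retry := 1; backlog > 0; retry++ {
		scanned := catchUpScanRate * scanWindow
		produced := newWriteRate * scanWindow
		backlog -= scanned - produced // net progress per retry; positive only
		if backlog < 0 {              // if catchUpScanRate > newWriteRate
			backlog = 0
		}
		fmt.Printf("retry %d: backlog %.0f rows\n", retry, backlog)
	}
}
```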