-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
admission: support non-blocking {Store,}WorkQueue.Admit() #97599
Conversation
e1ef1ca
to
05ec485
Compare
@sumeerbhola, this is ready for review. It's still a bit incomplete -- I think I've broken epoch-LIFO for the below-raft admission queues. I'll fix + add tests while you review. |
05ec485
to
159aba1
Compare
I couldn't figure it out and I've confused myself further. I put my (partly unreadable) questions in I12 within kvflowcontrol/doc.go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In WorkQueue ordering -- for replicated writes below-raft, we ignore CreateTime/epoch-LIFO, and instead sort by priority and within a priority, sort by log position.
Log position is not a physically meaningful number since different ranges may be seeing new entries at different rates, and different ranges may have been created at different times so may be at different places in their raft log (maybe that is why the comparison code includes RangeID
?). Are we doing this to ensure that for a range, if an entry with priority p at position n is admitted that we can assume that for that same range every entry with priority p at position < n is also admitted?
If we need this invariant, can we assign the create time in a monotonic manner when the proposal is assigned a tentative log position after evaluation (i.e., the log position we are using here in admission control for flow control tokens).
If we really want epoch-lifo to work at this layer in the future we will need to use the txn create time, which means we can't have this invariant -- I need to think some more about that.
Reviewed 3 of 27 files at r1, 1 of 11 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/util/admission/work_queue.go
line 191 at r2 (raw file):
// ReplicatedWorkInfo groups everything needed to admit replicated writes, done // so asynchronously below-raft as part of replication admission control. ReplicatedWorkInfo ReplicatedWorkInfo
An alternative would be to make this a ReplicatedWorkInfo interface{}
and make it opaque to the AC package. We would allocate using a sync.Pool
.
pkg/util/admission/work_queue.go
line 205 at r2 (raw file):
// Origin is the node at which this work originated. It's used for // replication admission control to inform the origin of admitted work // (after which flow tokens are released, permitted more replicated
nit: permitting
pkg/util/admission/work_queue.go
line 214 at r2 (raw file):
// maintain accurate linear models for L0 growth due to ingests and // regular write batches. Ingested bool
StoreWorkDoneInfo
allowed both WriteBytes
and IngestedBytes
to be non-zero. Do we not need that here?
pkg/util/admission/work_queue.go
line 532 at r2 (raw file):
// admission control is enabled. AdmittedWorkDone must be called iff // enabled=true && err!=nil, and the WorkKind for this queue uses slots. func (q *WorkQueue) Admit(ctx context.Context, info WorkInfo) (enabled bool, err error) {
aside: before enabling replication admission control for user-facing traffic (which can have arbitrary concurrency, compared to our internal load like index backfills), I think we will need to work out the memory overhead of queueing each raft command and whether we need to do some coalescing.
We may want to track this somewhere so we don't forget.
pkg/util/admission/work_queue.go
line 1487 at r2 (raw file):
// LIFO, and the epoch is closed, so can simply use createTime. if (*wwh)[i].replicated.RangeID != (*wwh)[j].replicated.RangeID ||
why is RangeID relevant here?
I am not keen on changing the ordering function and would like to understand what motivates the change.
pkg/util/admission/work_queue.go
line 1807 at r2 (raw file):
// We use a per-request estimate only when no requested count is // provided. It's always provided for below-raft admission where we // already know the size of the work being admitted. Since it's async,
The "Since it's async, ..." is confusing. Our general pattern is to deduct tokens in the granter when admitting and then tell the requester about the admission. If the former is "upfront", since we still need to do that. And we will since info.RequestedCount
is non-zero. We will potentially under-deduct since we have not used any model in this granter deduction, and will fix things when WorkQueue.granted
calls q.onAdmittedReplicatedWork.admittedReplicatedWork
- which is also fine.
A longer more explicit comment specifying the exact control flow would be preferable.
pkg/util/admission/work_queue.go
line 1903 at r2 (raw file):
// upfront, and deduct what should be the right number of tokens. So why the // adjustment here? When deducting originally, how come we don't just apply // the linear models?
Regarding this TODO, this is partially a peculiarity of how the WorkQueue and granter interaction is setup to be very general and partly because there was no information about the actual size at admission time.
Let's focus on the former, since the latter is available for this replication admission control. The WorkQueue
has to handle both tokens and slots and does not know anything about linear models (which is a very specialized case of token based AC). It also does not know that kvStoreTokenGranter.storeWriteDone
has multiple resources hidden under it (disk bandwidth and L0 tokens). All of this complexity is handled via this side channel. I think its best to continue doing it this way -- there is no risk of over-admission since this adjustment is being done in the same goroutine that did the granting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
different ranges may have been created at different times so may be at different places in their raft log (maybe that is why the comparison code includes RangeID?).
Yes.
Are we doing this to ensure that for a range, if an entry with priority p at position n is admitted that we can assume that for that same range every entry with priority p at position < n is also admitted?
Yes, exactly.
If we need this invariant, can we assign the create time in a monotonic manner when the proposal is assigned a tentative log position after evaluation
Yes, this would work. I was effectively doing this by either using CreateTime in work queue orderings, or using log positions, but not both.
I need to think some more about that.
Do take a look at I12.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/util/admission/work_queue.go
line 191 at r2 (raw file):
Previously, sumeerbhola wrote…
An alternative would be to make this a
ReplicatedWorkInfo interface{}
and make it opaque to the AC package. We would allocate using async.Pool
.
I'll wait for the resolution of the other open threads in this review, around changes to the work queue ordering which ends up using the fields included here, and why I started off not using an opaque interface{}.
pkg/util/admission/work_queue.go
line 214 at r2 (raw file):
Previously, sumeerbhola wrote…
StoreWorkDoneInfo
allowed bothWriteBytes
andIngestedBytes
to be non-zero. Do we not need that here?
No, it's not needed. The check for whether we use Ingested=true
, i.e. use IngestedBytes
underneath, is only when the raft entry we're admitting uses the sideloaded encoding, which we use for AddSSTs. For everything else we'd just use WriteBytes
. For StoreWorkDoneInfo
, when are both things non-zero?
pkg/util/admission/work_queue.go
line 532 at r2 (raw file):
Previously, sumeerbhola wrote…
aside: before enabling replication admission control for user-facing traffic (which can have arbitrary concurrency, compared to our internal load like index backfills), I think we will need to work out the memory overhead of queueing each raft command and whether we need to do some coalescing.
We may want to track this somewhere so we don't forget.
Added a TODO. For things we need to flesh out as part #95563, I've been referencing that issue number in TODOs to make sure we don't lose track. I can convert them to issues if they feel larger.
pkg/util/admission/work_queue.go
line 1487 at r2 (raw file):
Previously, sumeerbhola wrote…
why is RangeID relevant here?
I am not keen on changing the ordering function and would like to understand what motivates the change.
I tried to explain the motivation and an alternative (which I thought would be slower to implement) in I12 in kvflowcontrol/doc.go, which I amended right as you were reviewing so perhaps you missed. Copying over here:
// - For below-raft work queue ordering, we ignore CreateTime when ordering work
// within the same range. Within a given <tenant,priority,range>, admission
// takes place in raft log order (i.e. entries with lower terms get admitted
// first, or lower indexes within the same term).
// - NB: Regarding "admission takes place in raft log order", we can implement
// this differently. We introduced log-position based ordering to simplify
// the implementation of token returns where we release tokens by specifying
// the log position up-to-which we want to release held tokens[^11]. But
// with additional tracking in the below-raft work queues, if we know that
// work W2 with log position L2 got admitted, and corresponded to F flow
// token deductions at the origin, and we also know that work W1 with log
// position L1 is currently queued, also corresponding to F flow token
// deductions at the origin, we could inform the origin node to return flow
// tokens up to L1 and still get what we want -- a return of F flow tokens
// when each work gets admitted.
// - To operate within cluster-wide FIFO ordering, we order by CreateTime when
// comparing work across different ranges. Writes for a single range, as
// observed by a given store below-raft (follower or otherwise) travel along
// a single stream. Consider the case where a single store S3 receives
// replication traffic for two ranges R1 and R2, originating from two separate
// nodes N1 and N2. If N1 is issuing writes with strictly older CreateTimes,
// when returning flow tokens we should prefer N1.
// - What about CreateTime ordering within a <tenant,priority,range>? Flow
// tokens deductions aren't tied to create times -- they're tied to work
// classes on the sender. So we still want priority-based ordering to
// release regular flow tokens before elastic ones, but releasing flow
// tokens for work with lower CreateTimes does not actually promote doing
// older work. Below-raft admission is all asynchronous. To get back to
// CreateTime ordering, we'd need to do it above-raft, by introducing a
// WorkQueue-like structure for requests waiting for flow tokens.
But I'm unsure about much of this and could use help. What's the purpose of CreateTime-based ordering below-raft given admission is asynchronous and flow tokens that we're returning weren't tied to CreateTimes at the sender nodes where they were originally deducted?
pkg/util/admission/work_queue.go
line 1807 at r2 (raw file):
Previously, sumeerbhola wrote…
The "Since it's async, ..." is confusing. Our general pattern is to deduct tokens in the granter when admitting and then tell the requester about the admission. If the former is "upfront", since we still need to do that. And we will since
info.RequestedCount
is non-zero. We will potentially under-deduct since we have not used any model in this granter deduction, and will fix things whenWorkQueue.granted
callsq.onAdmittedReplicatedWork.admittedReplicatedWork
- which is also fine.
A longer more explicit comment specifying the exact control flow would be preferable.
Edited this comment + the one below.
pkg/util/admission/work_queue.go
line 1903 at r2 (raw file):
Previously, sumeerbhola wrote…
Regarding this TODO, this is partially a peculiarity of how the WorkQueue and granter interaction is setup to be very general and partly because there was no information about the actual size at admission time.
Let's focus on the former, since the latter is available for this replication admission control. TheWorkQueue
has to handle both tokens and slots and does not know anything about linear models (which is a very specialized case of token based AC). It also does not know thatkvStoreTokenGranter.storeWriteDone
has multiple resources hidden under it (disk bandwidth and L0 tokens). All of this complexity is handled via this side channel. I think its best to continue doing it this way -- there is no risk of over-admission since this adjustment is being done in the same goroutine that did the granting.
Thanks for explaining, I understand now why we have this particular sequencing (i.e. why we deduct tokens corresponding to the request size first and then adjust for the linear model effects). I've removed the TODO and edited the comment above.
Aside: I sometimes find the very general purpose interactions we have with the WorkQueue and granter interface, coupled with the specialized cases for particular granter+requester pairs (kvStoreTokenGranter and the StoreWorkQueue, which also directly interacts holds references to the underlying granters for this particular post-admission token adjustment), hard to follow. I don't have suggestions on how to improve it, but sometimes I wish I had separate WorkQueue-like implementations with lots of code duplication, but purpose built for IO or CPU admission control. It could maybe simplify code at the expense of not using some exact shared interface across the implementations.
159aba1
to
175fbc5
Compare
175fbc5
to
c332c64
Compare
Part of cockroachdb#95563. Dispatch is a concrete implementation of the kvflowcontrol.Dispatch interface. It's used to dispatch information about admitted raft log entries to the specific nodes where (i) said entries originated, (ii) flow tokens were deducted and (iii) are waiting to be returned. This type is also used to read pending dispatches, which will be used in the raft transport layer when looking to piggyback information on traffic already bound to specific nodes. Since timely dispatching (read: piggybacking) is not guaranteed, we allow querying for all long-overdue dispatches. Internally it's able to coalesce dispatches bound for a given node. If dispatching admission information for two log entries with the same <RangeID,StoreID,WorkPriority> triple, with log positions L1 and L2 where L1 < L2, we can simply dispatch the one with L2. We leave the integration of this type with the {Store,}WorkQueue (cockroachdb#97599) + raft transport to future PRs. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/util/admission/work_queue.go
line 214 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
No, it's not needed. The check for whether we use
Ingested=true
, i.e. useIngestedBytes
underneath, is only when the raft entry we're admitting uses the sideloaded encoding, which we use for AddSSTs. For everything else we'd just useWriteBytes
. ForStoreWorkDoneInfo
, when are both things non-zero?
I suppose you are saying that both proposal.command.WriteBatch != nil
and proposal.command.ReplicatedEvalResult.AddSSTable
cannot be true. If so, I wasn't sure enough about the invariants to make that assumption when doing the corresponding integration into AC.
And we are using the same StoreWorkDoneInfo
in StoreWorkQueue.BypassedWorkDone
which includes many raft command applications via replicaAppBatch
, so presumably can include both AddSSTs and regular writes.
pkg/util/admission/work_queue.go
line 1903 at r2 (raw file):
Aside: I sometimes find the very general purpose interactions we have with the WorkQueue and granter interface, coupled with the specialized cases for particular granter+requester pairs (kvStoreTokenGranter and the StoreWorkQueue, which also directly interacts holds references to the underlying granters for this particular post-admission token adjustment), hard to follow.
I relate to that. There is a lot of shared code too, and arguably most of the algorithmic bits are shared, and I've not been able to think of an obviously better scheme.
Part of cockroachdb#95563. Dispatch is a concrete implementation of the kvflowcontrol.Dispatch interface. It's used to dispatch information about admitted raft log entries to the specific nodes where (i) said entries originated, (ii) flow tokens were deducted and (iii) are waiting to be returned. This type is also used to read pending dispatches, which will be used in the raft transport layer when looking to piggyback information on traffic already bound to specific nodes. Since timely dispatching (read: piggybacking) is not guaranteed, we allow querying for all long-overdue dispatches. Internally it's able to coalesce dispatches bound for a given node. If dispatching admission information for two log entries with the same <RangeID,StoreID,WorkPriority> triple, with log positions L1 and L2 where L1 < L2, we can simply dispatch the one with L2. We leave the integration of this type with the {Store,}WorkQueue (cockroachdb#97599) + raft transport to future PRs. Release note: None
97766: kvflowcontrol: implement kvflowcontrol.Dispatch r=irfansharif a=irfansharif Part of #95563. Dispatch is a concrete implementation of the kvflowcontrol.Dispatch interface. It's used to dispatch information about admitted raft log entries to the specific nodes where (i) said entries originated, (ii) flow tokens were deducted and (iii) are waiting to be returned. This type is also used to read pending dispatches, which will be used in the raft transport layer when looking to piggyback information on traffic already bound to specific nodes. Since timely dispatching (read: piggybacking) is not guaranteed, we allow querying for all long-overdue dispatches. Internally it's able to coalesce dispatches bound for a given node. If dispatching admission information for two log entries with the same <RangeID,StoreID,WorkPriority> triple, with log positions L1 and L2 where L1 < L2, we can simply dispatch the one with L2. We leave the integration of this type with the {Store,}WorkQueue (#97599) + raft transport to future PRs. Release note: None 97826: sql: introduce array_cat_agg aggregate builtin r=yuzefovich a=yuzefovich This commit introduces a new `array_cat_agg` aggregate builtin function that takes in an array type as its input, and then unnests each array and appends all its elements into a single result array. In other words, it behaves similar to `array_agg(unnest(array_column))`. This function doesn't have an analogue in Postgres. However, some of our SQL observability tools need this functionality, and the current workaround of using a LATERAL JOIN often results in slow apply joins, so this new builtin should speed things up significantly. In particular, `crdb_internal.statement_statistics` view is now refactored to use the new builtin which removes an apply join from it. The choice of this particular name comes from the fact that we have the `array_cat` builtin which concatenates two arrays. Fixes: #97502. Release note (sql change): New aggregate builtin function `array_cat_agg` is introduced. It behaves similar to how `array_agg(unnest(array_column))` would - namely, it takes arrays as its input, unnests them into the array elements which are then aggregated into a single result array (i.e. it's similar to concatenating all input arrays into a single one). 98079: generate-bazel-extra: add instructions to update timeouts list r=rickystewart a=healthy-pod This code change removes `pkg/ccl/backupccl` from the really enormous targets list because it's only there for CI stress purposes. It also adds instructions to follow when adding a new test target to the list of really enormous timeouts. Release note: None Epic: none 98113: roachtest: decommission roachtests conform to new output r=kvoli a=AlexTalks This change fixes decommission roachtests to properly be aware of the new output introduced by #96100, and to utilize the decommission pre-checks accordingly. In some places, the decommission pre-checks are skipped, especially because the decommission used in the test is not expected to be completeable. Fixes: #98026, #98018, #98017. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: healthy-pod <ahmad@cockroachlabs.com> Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
46fa705
to
214a606
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sumeerbhola PTAL. I undid the WorkQueue ordering changes to no longer be so closely coupled to log positions and range IDs, and instead have the caller (ab)use the CreateTime parameter by combining the "true" create time with the work's log position to generate a monotonic number that roughly follows log position ordering.
You might be saddened by where I put this logic -- in the StoreWorkQueue instead of somewhere outside of the admission package. You might also be saddened by me not using an opaque interface{} for ReplicatedWorkInfo. But for both of these I feel (very subjectively) like it reads a bit cleaner. StoreWorkQueue is concerned with admission control of store work, so is somewhat aware of this log position business. And interface{}s make things harder to read for me.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
ef6f23a
to
7b65ec7
Compare
We move kvflowsequencer.Sequencer and its use in kvflowhandle.Handle (above-raft) to admission.sequencer, now used by admission.StoreWorkQueue (below-raft). This variant appeared in an earlier revision of cockroachdb#97599 where we first introduced monotonically increasing CreateTimes for a given raft group. In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll observe that it's quite difficult to create sequencing CreateTimes[^1] above raft. This is because these sequence numbers are encoded as part of the raft proposal[^2], and at encode-time, we don't actually know what log position the proposal is going to end up in. It's hard to explicitly guarantee that a proposal with log-position P1 will get encoded before another with log position P2, where P1 < P2. Naively sequencing CreateTimes at proposal-encode-time could result in over-admission. This is because of how we return flow tokens -- up to some log index[^3], and how use these sequence numbers in below-raft WorkQueues. If P2 ends up with a lower sequence number/CreateTime, it would get admitted first, and when returning flow tokens by log position, in specifying up-to-P2, we'll early return P1's flow tokens despite it not being admitted. So we'd over-admit at the sender. This is all within a <tenant,priority> pair. [^1]: We use CreateTimes as "sequence numbers" in replication admission control. We want to assign each AC-queued work below-raft a "sequence number" for FIFO ordering within a <tenant,priority>. We ensure these timestamps are roughly monotonic with respect to log positions of replicated work by sequencing work in log position order. [^2]: In kvflowcontrolpb.RaftAdmissionMeta. [^3]: See kvflowcontrolpb.AdmittedRaftLogEntries. Release note: None
In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll set up the return of flow tokens from the receiver node back to the sender once log entries get (asynchronously) admitted[^1]. So we need to intercept the exact points at which the virtually enqueued work items get admitted, since it all happens asynchronously[^2]. To that end we introduce the following interface: // OnLogEntryAdmitted is used to observe the specific entries // (identified by rangeID + log position) that were admitted. Since // admission control for log entries is asynchronous/non-blocking, // this allows callers to do requisite post-admission // bookkeeping. type OnLogEntryAdmitted interface { AdmittedLogEntry( origin roachpb.NodeID, /* node where the entry originated */ pri admissionpb.WorkPriority, /* admission priority of the entry */ storeID roachpb.StoreID, /* store on which the entry was admitted */ rangeID roachpb.RangeID, /* identifying range for the log entry */ pos LogPosition, /* log position of the entry that was admitted*/ ) } For now we pass in a no-op implementation in production code, but this will change shortly. Seeing as how the asynchronous admit interface is going to be the primary once once we enable replication admission control by default, for IO control, we no longer need the storeWriteDone interfaces and corresponding types. It's being used by our current (and soon-to-be legacy) above-raft IO admission control to inform granters of when the write was actually done, post-admission. For above-raft IO control, at admit-time we do not have sizing info for the writes, so by intercepting these writes at write-done time we're able to make any outstanding token adjustments in the granter. To reflect this new world, we: - Rename setAdmittedDoneModels to setLinearModels. - Introduce a storeReplicatedWorkAdmittedInfo[^3]. It provides information about the size of replicated work once it's admitted (which happens asynchronously from the work itself). This lets us use the underlying linear models for L0 {writes,ingests} to deduct an appropriate number of tokens from the granter, for the admitted work size[^4]. - Rename the granterWithStoreWriteDone interface to granterWithStoreReplicatedWorkAdmitted. We'll still intercept the actual point of admission for some token adjustments, through the the storeReplicatedWorkAdmittedLocked API shown below. There are two callstacks through which this API gets invoked, one where the coord.mu is already held, and one where it isn't. We plumb this information through so the lock is acquired if not already held. The locking structure is unfortunate, but this was a minimally invasive diff. storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) While here, we also export an admission.TestingReverseWorkPriorityDict. There are at least three tests that have re-invented the wheel. [^1]: This will happen through the kvflowcontrol.Dispatch interface introduced back in cockroachdb#97766, after integrating it with the RaftTransport layer. [^2]: Introduced in cockroachdb#97599, for replicated write work. [^3]: Identical to the previous StoreWorkDoneInfo. [^4]: There's a peculiarity here in that at enqueuing-time we actually know the size of the write, so we could have deducted the right number of tokens upfront and avoid this post-admit granter token adjustment. We inherit this structure from earlier, and just leave a TODO for now. Release note: None
We move kvflowsequencer.Sequencer and its use in kvflowhandle.Handle (above-raft) to admission.sequencer, now used by admission.StoreWorkQueue (below-raft). This variant appeared in an earlier revision of cockroachdb#97599 where we first introduced monotonically increasing CreateTimes for a given raft group. In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll observe that it's quite difficult to create sequencing CreateTimes[^1] above raft. This is because these sequence numbers are encoded as part of the raft proposal[^2], and at encode-time, we don't actually know what log position the proposal is going to end up in. It's hard to explicitly guarantee that a proposal with log-position P1 will get encoded before another with log position P2, where P1 < P2. Naively sequencing CreateTimes at proposal-encode-time could result in over-admission. This is because of how we return flow tokens -- up to some log index[^3], and how use these sequence numbers in below-raft WorkQueues. If P2 ends up with a lower sequence number/CreateTime, it would get admitted first, and when returning flow tokens by log position, in specifying up-to-P2, we'll early return P1's flow tokens despite it not being admitted. So we'd over-admit at the sender. This is all within a <tenant,priority> pair. [^1]: We use CreateTimes as "sequence numbers" in replication admission control. We want to assign each AC-queued work below-raft a "sequence number" for FIFO ordering within a <tenant,priority>. We ensure these timestamps are roughly monotonic with respect to log positions of replicated work by sequencing work in log position order. [^2]: In kvflowcontrolpb.RaftAdmissionMeta. [^3]: See kvflowcontrolpb.AdmittedRaftLogEntries. Release note: None
In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll set up the return of flow tokens from the receiver node back to the sender once log entries get (asynchronously) admitted[^1]. So we need to intercept the exact points at which the virtually enqueued work items get admitted, since it all happens asynchronously[^2]. To that end we introduce the following interface: // OnLogEntryAdmitted is used to observe the specific entries // (identified by rangeID + log position) that were admitted. Since // admission control for log entries is asynchronous/non-blocking, // this allows callers to do requisite post-admission // bookkeeping. type OnLogEntryAdmitted interface { AdmittedLogEntry( origin roachpb.NodeID, /* node where the entry originated */ pri admissionpb.WorkPriority, /* admission priority of the entry */ storeID roachpb.StoreID, /* store on which the entry was admitted */ rangeID roachpb.RangeID, /* identifying range for the log entry */ pos LogPosition, /* log position of the entry that was admitted*/ ) } For now we pass in a no-op implementation in production code, but this will change shortly. Seeing as how the asynchronous admit interface is going to be the primary once once we enable replication admission control by default, for IO control, we no longer need the storeWriteDone interfaces and corresponding types. It's being used by our current (and soon-to-be legacy) above-raft IO admission control to inform granters of when the write was actually done, post-admission. For above-raft IO control, at admit-time we do not have sizing info for the writes, so by intercepting these writes at write-done time we're able to make any outstanding token adjustments in the granter. To reflect this new world, we: - Rename setAdmittedDoneModels to setLinearModels. - Introduce a storeReplicatedWorkAdmittedInfo[^3]. It provides information about the size of replicated work once it's admitted (which happens asynchronously from the work itself). This lets us use the underlying linear models for L0 {writes,ingests} to deduct an appropriate number of tokens from the granter, for the admitted work size[^4]. - Rename the granterWithStoreWriteDone interface to granterWithStoreReplicatedWorkAdmitted. We'll still intercept the actual point of admission for some token adjustments, through the the storeReplicatedWorkAdmittedLocked API shown below. There are two callstacks through which this API gets invoked, one where the coord.mu is already held, and one where it isn't. We plumb this information through so the lock is acquired if not already held. The locking structure is unfortunate, but this was a minimally invasive diff. storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) While here, we also export an admission.TestingReverseWorkPriorityDict. There are at least three tests that have re-invented the wheel. [^1]: This will happen through the kvflowcontrol.Dispatch interface introduced back in cockroachdb#97766, after integrating it with the RaftTransport layer. [^2]: Introduced in cockroachdb#97599, for replicated write work. [^3]: Identical to the previous StoreWorkDoneInfo. [^4]: There's a peculiarity here in that at enqueuing-time we actually know the size of the write, so we could have deducted the right number of tokens upfront and avoid this post-admit granter token adjustment. We inherit this structure from earlier, and just leave a TODO for now. Release note: None
We move kvflowsequencer.Sequencer and its use in kvflowhandle.Handle (above-raft) to admission.sequencer, now used by admission.StoreWorkQueue (below-raft). This variant appeared in an earlier revision of cockroachdb#97599 where we first introduced monotonically increasing CreateTimes for a given raft group. In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll observe that it's quite difficult to create sequencing CreateTimes[^1] above raft. This is because these sequence numbers are encoded as part of the raft proposal[^2], and at encode-time, we don't actually know what log position the proposal is going to end up in. It's hard to explicitly guarantee that a proposal with log-position P1 will get encoded before another with log position P2, where P1 < P2. Naively sequencing CreateTimes at proposal-encode-time could result in over-admission. This is because of how we return flow tokens -- up to some log index[^3], and how use these sequence numbers in below-raft WorkQueues. If P2 ends up with a lower sequence number/CreateTime, it would get admitted first, and when returning flow tokens by log position, in specifying up-to-P2, we'll early return P1's flow tokens despite it not being admitted. So we'd over-admit at the sender. This is all within a <tenant,priority> pair. [^1]: We use CreateTimes as "sequence numbers" in replication admission control. We want to assign each AC-queued work below-raft a "sequence number" for FIFO ordering within a <tenant,priority>. We ensure these timestamps are roughly monotonic with respect to log positions of replicated work by sequencing work in log position order. [^2]: In kvflowcontrolpb.RaftAdmissionMeta. [^3]: See kvflowcontrolpb.AdmittedRaftLogEntries. Release note: None
In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll set up the return of flow tokens from the receiver node back to the sender once log entries get (asynchronously) admitted[^1]. So we need to intercept the exact points at which the virtually enqueued work items get admitted, since it all happens asynchronously[^2]. To that end we introduce the following interface: // OnLogEntryAdmitted is used to observe the specific entries // (identified by rangeID + log position) that were admitted. Since // admission control for log entries is asynchronous/non-blocking, // this allows callers to do requisite post-admission // bookkeeping. type OnLogEntryAdmitted interface { AdmittedLogEntry( origin roachpb.NodeID, /* node where the entry originated */ pri admissionpb.WorkPriority, /* admission priority of the entry */ storeID roachpb.StoreID, /* store on which the entry was admitted */ rangeID roachpb.RangeID, /* identifying range for the log entry */ pos LogPosition, /* log position of the entry that was admitted*/ ) } For now we pass in a no-op implementation in production code, but this will change shortly. Seeing as how the asynchronous admit interface is going to be the primary once once we enable replication admission control by default, for IO control, we no longer need the storeWriteDone interfaces and corresponding types. It's being used by our current (and soon-to-be legacy) above-raft IO admission control to inform granters of when the write was actually done, post-admission. For above-raft IO control, at admit-time we do not have sizing info for the writes, so by intercepting these writes at write-done time we're able to make any outstanding token adjustments in the granter. To reflect this new world, we: - Rename setAdmittedDoneModels to setLinearModels. - Introduce a storeReplicatedWorkAdmittedInfo[^3]. It provides information about the size of replicated work once it's admitted (which happens asynchronously from the work itself). This lets us use the underlying linear models for L0 {writes,ingests} to deduct an appropriate number of tokens from the granter, for the admitted work size[^4]. - Rename the granterWithStoreWriteDone interface to granterWithStoreReplicatedWorkAdmitted. We'll still intercept the actual point of admission for some token adjustments, through the the storeReplicatedWorkAdmittedLocked API shown below. There are two callstacks through which this API gets invoked, one where the coord.mu is already held, and one where it isn't. We plumb this information through so the lock is acquired if not already held. The locking structure is unfortunate, but this was a minimally invasive diff. storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) While here, we also export an admission.TestingReverseWorkPriorityDict. There are at least three tests that have re-invented the wheel. [^1]: This will happen through the kvflowcontrol.Dispatch interface introduced back in cockroachdb#97766, after integrating it with the RaftTransport layer. [^2]: Introduced in cockroachdb#97599, for replicated write work. [^3]: Identical to the previous StoreWorkDoneInfo. [^4]: There's a peculiarity here in that at enqueuing-time we actually know the size of the write, so we could have deducted the right number of tokens upfront and avoid this post-admit granter token adjustment. We inherit this structure from earlier, and just leave a TODO for now. Release note: None
We move kvflowsequencer.Sequencer and its use in kvflowhandle.Handle (above-raft) to admission.sequencer, now used by admission.StoreWorkQueue (below-raft). This variant appeared in an earlier revision of cockroachdb#97599 where we first introduced monotonically increasing CreateTimes for a given raft group. In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll observe that it's quite difficult to create sequencing CreateTimes[^1] above raft. This is because these sequence numbers are encoded as part of the raft proposal[^2], and at encode-time, we don't actually know what log position the proposal is going to end up in. It's hard to explicitly guarantee that a proposal with log-position P1 will get encoded before another with log position P2, where P1 < P2. Naively sequencing CreateTimes at proposal-encode-time could result in over-admission. This is because of how we return flow tokens -- up to some log index[^3], and how use these sequence numbers in below-raft WorkQueues. If P2 ends up with a lower sequence number/CreateTime, it would get admitted first, and when returning flow tokens by log position, in specifying up-to-P2, we'll early return P1's flow tokens despite it not being admitted. So we'd over-admit at the sender. This is all within a <tenant,priority> pair. [^1]: We use CreateTimes as "sequence numbers" in replication admission control. We want to assign each AC-queued work below-raft a "sequence number" for FIFO ordering within a <tenant,priority>. We ensure these timestamps are roughly monotonic with respect to log positions of replicated work by sequencing work in log position order. [^2]: In kvflowcontrolpb.RaftAdmissionMeta. [^3]: See kvflowcontrolpb.AdmittedRaftLogEntries. Release note: None
In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll set up the return of flow tokens from the receiver node back to the sender once log entries get (asynchronously) admitted[^1]. So we need to intercept the exact points at which the virtually enqueued work items get admitted, since it all happens asynchronously[^2]. To that end we introduce the following interface: // OnLogEntryAdmitted is used to observe the specific entries // (identified by rangeID + log position) that were admitted. Since // admission control for log entries is asynchronous/non-blocking, // this allows callers to do requisite post-admission // bookkeeping. type OnLogEntryAdmitted interface { AdmittedLogEntry( origin roachpb.NodeID, /* node where the entry originated */ pri admissionpb.WorkPriority, /* admission priority of the entry */ storeID roachpb.StoreID, /* store on which the entry was admitted */ rangeID roachpb.RangeID, /* identifying range for the log entry */ pos LogPosition, /* log position of the entry that was admitted*/ ) } For now we pass in a no-op implementation in production code, but this will change shortly. Seeing as how the asynchronous admit interface is going to be the primary once once we enable replication admission control by default, for IO control, we no longer need the storeWriteDone interfaces and corresponding types. It's being used by our current (and soon-to-be legacy) above-raft IO admission control to inform granters of when the write was actually done, post-admission. For above-raft IO control, at admit-time we do not have sizing info for the writes, so by intercepting these writes at write-done time we're able to make any outstanding token adjustments in the granter. To reflect this new world, we: - Rename setAdmittedDoneModels to setLinearModels. - Introduce a storeReplicatedWorkAdmittedInfo[^3]. It provides information about the size of replicated work once it's admitted (which happens asynchronously from the work itself). This lets us use the underlying linear models for L0 {writes,ingests} to deduct an appropriate number of tokens from the granter, for the admitted work size[^4]. - Rename the granterWithStoreWriteDone interface to granterWithStoreReplicatedWorkAdmitted. We'll still intercept the actual point of admission for some token adjustments, through the the storeReplicatedWorkAdmittedLocked API shown below. There are two callstacks through which this API gets invoked, one where the coord.mu is already held, and one where it isn't. We plumb this information through so the lock is acquired if not already held. The locking structure is unfortunate, but this was a minimally invasive diff. storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) While here, we also export an admission.TestingReverseWorkPriorityDict. There are at least three tests that have re-invented the wheel. [^1]: This will happen through the kvflowcontrol.Dispatch interface introduced back in cockroachdb#97766, after integrating it with the RaftTransport layer. [^2]: Introduced in cockroachdb#97599, for replicated write work. [^3]: Identical to the previous StoreWorkDoneInfo. [^4]: There's a peculiarity here in that at enqueuing-time we actually know the size of the write, so we could have deducted the right number of tokens upfront and avoid this post-admit granter token adjustment. We inherit this structure from earlier, and just leave a TODO for now. Release note: None
We move kvflowsequencer.Sequencer and its use in kvflowhandle.Handle (above-raft) to admission.sequencer, now used by admission.StoreWorkQueue (below-raft). This variant appeared in an earlier revision of cockroachdb#97599 where we first introduced monotonically increasing CreateTimes for a given raft group. In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll observe that it's quite difficult to create sequencing CreateTimes[^1] above raft. This is because these sequence numbers are encoded as part of the raft proposal[^2], and at encode-time, we don't actually know what log position the proposal is going to end up in. It's hard to explicitly guarantee that a proposal with log-position P1 will get encoded before another with log position P2, where P1 < P2. Naively sequencing CreateTimes at proposal-encode-time could result in over-admission. This is because of how we return flow tokens -- up to some log index[^3], and how use these sequence numbers in below-raft WorkQueues. If P2 ends up with a lower sequence number/CreateTime, it would get admitted first, and when returning flow tokens by log position, in specifying up-to-P2, we'll early return P1's flow tokens despite it not being admitted. So we'd over-admit at the sender. This is all within a <tenant,priority> pair. [^1]: We use CreateTimes as "sequence numbers" in replication admission control. We want to assign each AC-queued work below-raft a "sequence number" for FIFO ordering within a <tenant,priority>. We ensure these timestamps are roughly monotonic with respect to log positions of replicated work by sequencing work in log position order. [^2]: In kvflowcontrolpb.RaftAdmissionMeta. [^3]: See kvflowcontrolpb.AdmittedRaftLogEntries. Release note: None
In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll set up the return of flow tokens from the receiver node back to the sender once log entries get (asynchronously) admitted[^1]. So we need to intercept the exact points at which the virtually enqueued work items get admitted, since it all happens asynchronously[^2]. To that end we introduce the following interface: // OnLogEntryAdmitted is used to observe the specific entries // (identified by rangeID + log position) that were admitted. Since // admission control for log entries is asynchronous/non-blocking, // this allows callers to do requisite post-admission // bookkeeping. type OnLogEntryAdmitted interface { AdmittedLogEntry( origin roachpb.NodeID, /* node where the entry originated */ pri admissionpb.WorkPriority, /* admission priority of the entry */ storeID roachpb.StoreID, /* store on which the entry was admitted */ rangeID roachpb.RangeID, /* identifying range for the log entry */ pos LogPosition, /* log position of the entry that was admitted*/ ) } For now we pass in a no-op implementation in production code, but this will change shortly. Seeing as how the asynchronous admit interface is going to be the primary once once we enable replication admission control by default, for IO control, we no longer need the storeWriteDone interfaces and corresponding types. It's being used by our current (and soon-to-be legacy) above-raft IO admission control to inform granters of when the write was actually done, post-admission. For above-raft IO control, at admit-time we do not have sizing info for the writes, so by intercepting these writes at write-done time we're able to make any outstanding token adjustments in the granter. To reflect this new world, we: - Rename setAdmittedDoneModels to setLinearModels. - Introduce a storeReplicatedWorkAdmittedInfo[^3]. It provides information about the size of replicated work once it's admitted (which happens asynchronously from the work itself). This lets us use the underlying linear models for L0 {writes,ingests} to deduct an appropriate number of tokens from the granter, for the admitted work size[^4]. - Rename the granterWithStoreWriteDone interface to granterWithStoreReplicatedWorkAdmitted. We'll still intercept the actual point of admission for some token adjustments, through the the storeReplicatedWorkAdmittedLocked API shown below. There are two callstacks through which this API gets invoked, one where the coord.mu is already held, and one where it isn't. We plumb this information through so the lock is acquired if not already held. The locking structure is unfortunate, but this was a minimally invasive diff. storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) While here, we also export an admission.TestingReverseWorkPriorityDict. There are at least three tests that have re-invented the wheel. [^1]: This will happen through the kvflowcontrol.Dispatch interface introduced back in cockroachdb#97766, after integrating it with the RaftTransport layer. [^2]: Introduced in cockroachdb#97599, for replicated write work. [^3]: Identical to the previous StoreWorkDoneInfo. [^4]: There's a peculiarity here in that at enqueuing-time we actually know the size of the write, so we could have deducted the right number of tokens upfront and avoid this post-admit granter token adjustment. We inherit this structure from earlier, and just leave a TODO for now. Release note: None
101786: workload: introduce timeout for pre-warming connection pool r=sean- a=sean- Interrupting target instances during prewarming shouldn't cause workload to proceed: introduce a timeout to prewarming connections. Connections will have 15s to 5min to warmup before the context will expire. Epic: none 101987: cli/sql: new option autocerts for TLS client cert auto-discovery r=rafiss a=knz Fixes #101986. See the release note below. An additional benefit not mentioned in the release note is that it simplifies switching from one tenant to another when using shared-process multitenancy. For example, this becomes possible: ``` > CREATE TENANT foo; > ALTER TENANT foo START SERVICE SHARED; > \c cluster:foo root - - autocerts ``` Alternatively, this can also be used to quickly switch from a non-root user in an app tenant to the root user in the system tenant: ``` > \c cluster:system root - - autocerts ``` This works because (currently) all tenant servers running side-by-side use the same TLS CA to validate SQL client certs. ---- Release note (cli change): The `\connect` client-side command for the SQL shell (included in `cockroach sql`, `cockroach demo`, `cockroach-sql`) now recognizes an option `autocerts` as last argument. When provided, `\c` will now try to discover a TLS client certificate and key in the same directory(ies) as used by the previous connection URL. This feature makes it easier to switch usernames when TLS client/key files are available for both the previous and the new username. 102382: c2c: deflake c2c/shutdown roachtests r=stevendanna a=msbutler c2c: deflake c2c/shutdown roachtests This patch addresses to roachtest failure modes: - Prevents roachtest failure if a query fails during a node shutdown. - Prevents the src cluster from returning a single node topology, which could cause the stream ingestion job to hang if the participating src node gets shut down. Longer term, automatic replanning will prevent this. Specifically, this patch changes the kv workload to split and scatter the kv table across the cluster before the c2c job begins. Fixes #101898 Fixes #102111 This patch also makes it easier to reproduce c2c roachtest failures by plumbing a random seed to several components of the roachtest driver. Release note: None c2c: rename completeStreamIngestion to applyCutoverTime Release note: none workload: add --scatter flag to kv workload The user can now run `./workload init kv --scatter ....` which scatters the kv table across the cluster after the initial data load. This flag is best used with `--splits`, `--max-block-bytes`, and `--insert-count`. Epic: none Release note: none 102819: admission: move CreateTime-sequencing below-raft r=irfansharif a=irfansharif These are already reviewed commits from #98308. Part of #95563. --- **admission: move CreateTime-sequencing below-raft** We move kvflowsequencer.Sequencer and its use in kvflowhandle.Handle (above-raft) to admission.sequencer, now used by admission.StoreWorkQueue (below-raft). This variant appeared in an earlier revision of #97599 where we first introduced monotonically increasing CreateTimes for a given raft group. In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll observe that it's quite difficult to create sequencing CreateTimes[^1] above raft. This is because these sequence numbers are encoded as part of the raft proposal[^2], and at encode-time, we don't actually know what log position the proposal is going to end up in. It's hard to explicitly guarantee that a proposal with log-position P1 will get encoded before another with log position P2, where P1 < P2. Naively sequencing CreateTimes at proposal-encode-time could result in over-admission. This is because of how we return flow tokens -- up to some log index[^3], and how use these sequence numbers in below-raft WorkQueues. If P2 ends up with a lower sequence number/CreateTime, it would get admitted first, and when returning flow tokens by log position, in specifying up-to-P2, we'll early return P1's flow tokens despite it not being admitted. So we'd over-admit at the sender. This is all within a <tenant,priority> pair. [^1]: We use CreateTimes as "sequence numbers" in replication admission control. We want to assign each AC-queued work below-raft a "sequence number" for FIFO ordering within a <tenant,priority>. We ensure these timestamps are roughly monotonic with respect to log positions of replicated work by sequencing work in log position order. [^2]: In kvflowcontrolpb.RaftAdmissionMeta. [^3]: See kvflowcontrolpb.AdmittedRaftLogEntries. --- **admission: add intercept points for when replicated work gets admitted** In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll set up the return of flow tokens from the receiver node back to the sender once log entries get (asynchronously) admitted[^4]. So we need to intercept the exact points at which the virtually enqueued work items get admitted, since it all happens asynchronously[^5]. To that end we introduce the following interface: ```go // OnLogEntryAdmitted is used to observe the specific entries // (identified by rangeID + log position) that were admitted. Since // admission control for log entries is asynchronous/non-blocking, // this allows callers to do requisite post-admission // bookkeeping. type OnLogEntryAdmitted interface { AdmittedLogEntry( origin roachpb.NodeID, /* node where the entry originated */ pri admissionpb.WorkPriority, /* admission priority of the entry */ storeID roachpb.StoreID, /* store on which the entry was admitted */ rangeID roachpb.RangeID, /* identifying range for the log entry */ pos LogPosition, /* log position of the entry that was admitted*/ ) } ``` For now we pass in a no-op implementation in production code, but this will change shortly. Seeing as how the asynchronous admit interface is going to be the primary once once we enable replication admission control by default, for IO control, we no longer need the storeWriteDone interfaces and corresponding types. It's being used by our current (and soon-to-be legacy) above-raft IO admission control to inform granters of when the write was actually done, post-admission. For above-raft IO control, at admit-time we do not have sizing info for the writes, so by intercepting these writes at write-done time we're able to make any outstanding token adjustments in the granter. To reflect this new world, we: - Rename setAdmittedDoneModels to setLinearModels. - Introduce a storeReplicatedWorkAdmittedInfo[^6]. It provides information about the size of replicated work once it's admitted (which happens asynchronously from the work itself). This lets us use the underlying linear models for L0 {writes,ingests} to deduct an appropriate number of tokens from the granter, for the admitted work size[^7]. - Rename the granterWithStoreWriteDone interface to granterWithStoreReplicatedWorkAdmitted. We'll still intercept the actual point of admission for some token adjustments, through the the storeReplicatedWorkAdmittedLocked API shown below. There are two callstacks through which this API gets invoked, one where the coord.mu is already held, and one where it isn't. We plumb this information through so the lock is acquired if not already held. The locking structure is unfortunate, but this was a minimally invasive diff. ```go storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) ``` While here, we also export an admission.TestingReverseWorkPriorityDict. There are at least three tests that have re-invented the wheel. [^4]: This will happen through the kvflowcontrol.Dispatch interface introduced back in #97766, after integrating it with the RaftTransport layer. [^5]: Introduced in #97599, for replicated write work. [^6]: Identical to the previous StoreWorkDoneInfo. [^7]: There's a peculiarity here in that at enqueuing-time we actually know the size of the write, so we could have deducted the right number of tokens upfront and avoid this post-admit granter token adjustment. We inherit this structure from earlier, and just leave a TODO for now. 103116: generate-logic-test: fix incorrect timeout in logictests template r=rickystewart a=healthy-pod In #102719, we changed the way we set `-test.timeout` but didn't update the logictests template. This code change updates the template. Release note: None Epic: none Co-authored-by: Sean Chittenden <sean@chittenden.org> Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Michael Butler <butler@cockroachlabs.com> Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com> Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
Fixes cockroachdb#105185. Fixes cockroachdb#105613. In cockroachdb#97599 we introduced a non-blocking admission interface for below-raft, replication admission control. When doing so, we unintentionally violated the 'requester' interface -- when 'requester.granted()' is invoked, the granter expects to admit a single queued request. The code layering made it so that after granting one, when doing post-hoc token adjustments, if we observed the granter was exhausted but now no longer was so, we'd try to grant again. This resulted in admitting work recursively, with a callstack as deep as the admit chain. Not only is that undesirable design-wise, it also triggered panics in the granter that wasn't expecting more than one request being admitted. Recursively we were incrementing the grant chain index, which overflowed (it was in int8, so happened readily with long enough admit chains), after which we panic-ed when using a negative index to access an array. We add a test that fails without the changes. The failure can also be triggered by applying the diff below (which reverts to the older, buggy behavior): dev test pkg/kv/kvserver -f TestFlowControlGranterAdmitOneByOne -v --show-logs diff --git i/pkg/util/admission/granter.go w/pkg/util/admission/granter.go index ba42213c375..7c526fbb3d8 100644 --- i/pkg/util/admission/granter.go +++ w/pkg/util/admission/granter.go @@ -374,7 +374,7 @@ func (cg *kvStoreTokenChildGranter) storeWriteDone( func (cg *kvStoreTokenChildGranter) storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) { - return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, false /* canGrantAnother */) + return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, true /* canGrantAnother */) } Release note: None
Fixes cockroachdb#105185. Fixes cockroachdb#105613. In cockroachdb#97599 we introduced a non-blocking admission interface for below-raft, replication admission control. When doing so, we unintentionally violated the 'requester' interface -- when 'requester.granted()' is invoked, the granter expects to admit a single queued request. The code layering made it so that after granting one, when doing post-hoc token adjustments, if we observed the granter was exhausted but now no longer was so, we'd try to grant again. This resulted in admitting work recursively, with a callstack as deep as the admit chain. Not only is that undesirable design-wise, it also triggered panics in the granter that wasn't expecting more than one request being admitted. Recursively we were incrementing the grant chain index, which overflowed (it was in int8, so happened readily with long enough admit chains), after which we panic-ed when using a negative index to access an array. We add a test that fails without the changes. The failure can also be triggered by applying the diff below (which reverts to the older, buggy behavior): dev test pkg/kv/kvserver -f TestFlowControlGranterAdmitOneByOne -v --show-logs diff --git i/pkg/util/admission/granter.go w/pkg/util/admission/granter.go index ba42213c375..7c526fbb3d8 100644 --- i/pkg/util/admission/granter.go +++ w/pkg/util/admission/granter.go @@ -374,7 +374,7 @@ func (cg *kvStoreTokenChildGranter) storeWriteDone( func (cg *kvStoreTokenChildGranter) storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) { - return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, false /* canGrantAnother */) + return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, true /* canGrantAnother */) } Release note: None
106236: admission: avoid recursive grant chain r=irfansharif a=irfansharif Fixes #105185. Fixes #105613. In #97599 we introduced a non-blocking admission interface for below-raft, replication admission control. When doing so, we unintentionally violated the 'requester' interface -- when 'requester.granted()' is invoked, the granter expects to admit a single queued request. The code layering made it so that after granting one, when doing post-hoc token adjustments, if we observed the granter was exhausted but now no longer was so, we'd try to grant again. This resulted in admitting work recursively, with a callstack as deep as the admit chain. Not only is that undesirable design-wise, it also triggered panics in the granter that wasn't expecting more than one request being admitted. Recursively we were incrementing the grant chain index, which overflowed (it was in int8, so happened readily with long enough admit chains), after which we panic-ed when using a negative index to access an array. We add a test that fails without the changes. The failure can also be triggered by applying the diff below (which reverts to the older, buggy behavior): ``` dev test pkg/kv/kvserver -f TestFlowControlGranterAdmitOneByOne -v --show-logs ``` ```diff diff --git i/pkg/util/admission/granter.go w/pkg/util/admission/granter.go index ba42213c375..7c526fbb3d8 100644 --- i/pkg/util/admission/granter.go +++ w/pkg/util/admission/granter.go `@@` -374,7 +374,7 `@@` func (cg *kvStoreTokenChildGranter) storeWriteDone( func (cg *kvStoreTokenChildGranter) storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) { - return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, false /* canGrantAnother */) + return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, true /* canGrantAnother */) } ``` Release note: None 106378: upgrades: fix txn retry bug in upgrade batching r=adityamaru a=stevendanna In #105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated. This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column. Here, we apply the same fix that we applied in #104752 for the same type of bug. Fixes #106347 Fixes #106246 Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result some job records with no job_type value. 106408: ci: remove `lint` job from GitHub CI r=rail a=rickystewart With `staticcheck` and `unused` working identically under `lint` in Bazel and `make` now, it's time! Delete this file so that GitHub CI lint stops running. This is the *last* GitHub CI job. :) Now only Bazel builds and tests will run on PR's. Epic: CRDB-15060 Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com> Co-authored-by: Steven Danna <danna@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Part of #95563. For end-to-end flow control of replicated writes, we want to enable below-raft admission control through the following API on kvadmission.Controller:
This serves as the integration point for log entries received below raft right as they're being written to stable storage. It's a non-blocking interface since we're below-raft and in the raft.Ready() loop. What it effectively does is enqueues a "virtual" work item in the underlying StoreWorkQueue mediating all store IO. This virtual work item is what later gets dequeued once the IO granter informs the work queue of newly available IO tokens. When enqueueing the virtual work item, we still update the StoreWorkQueue's physically-accounted-for bytes since the actual write is not deferred, and timely statistic updates improves accuracy for the underlying linear models (that map between accounted-for writes and observed L0 growth, using it to inform IO token generation rates).
For each of the arguments above:
We use the above to populate the following fields on a per-(replicated write)work basis:
Since admission is happening below-raft where the size of the write is known, we no longer need per-work estimates for upfront IO token deductions. Since admission is asynchronous, we also don't use the AdmittedWorkDone interface which was to make token adjustments (without blocking) given the upfront estimates. We still want to intercept the exact point when some write work gets admitted in order to inform the origin node so it can release flow tokens. We do so through the following interface satisfied by the StoreWorkQueue:
Release note: None
Footnotes
See kvflowcontrolpb.AdmittedRaftLogEntries introduced in kvflowcontrol,raftlog: interfaces for replication control #95637. ↩ ↩2
See kvflowcontrol.Handle.{ReturnTokensUpto,DeductTokensFor} introduced in kvflowcontrol,raftlog: interfaces for replication control #95637. Token deductions and returns are tied to raft log positions. ↩
See raftlog.EntryEncoding{Standard,Sideloaded}WithAC introduced in raftlog: introduce EntryEncoding{Standard,Sideloaded}WithAC #95748. ↩
See kvflowcontrolpb.RaftAdmissionMeta introduced in kvflowcontrol,raftlog: interfaces for replication control #95637. ↩ ↩2