admission: support non-blocking {Store,}WorkQueue.Admit() #97599

irfansharif · 2023-02-23T23:29:56Z

Part of #95563. For end-to-end flow control of replicated writes, we want to enable below-raft admission control through the following API on kvadmission.Controller:

  // AdmitRaftEntry informs admission control of a raft log entry being
  // written to storage (for the given tenant, the specific range, and
  // on the named store).
  AdmitRaftEntry(
    context.Context, roachpb.TenantID,
    roachpb.StoreID, roachpb.RangeID, raftpb.Entry,
  )

This serves as the integration point for log entries received below raft, right as they're being written to stable storage. It's a non-blocking interface since we're below raft and in the raft.Ready() loop. It effectively enqueues a "virtual" work item in the underlying StoreWorkQueue mediating all store IO. This virtual work item is what later gets dequeued once the IO granter informs the work queue of newly available IO tokens. When enqueueing the virtual work item, we still update the StoreWorkQueue's physically-accounted-for bytes since the actual write is not deferred, and timely statistic updates improve accuracy for the underlying linear models (which map between accounted-for writes and observed L0 growth, and inform IO token generation rates).

For each of the arguments above:

- The roachpb.TenantID is plumbed to find the right tenant heap to queue it under (for inter-tenant isolation).
- The roachpb.StoreID is used to find the right store work queue on multi-store nodes. We'll also use the StoreID when informing the origin node of log entries being admitted¹.
- We pass in the roachpb.RangeID on behalf of which work is being admitted. This, alongside the raftpb.Entry.{Term,Index} for the replicated write, uniquely identifies where the write is to end up. We use these identifiers to return flow tokens on the origin node¹².
- For standard work queue ordering, our work item needs to include the CreateTime and AdmissionPriority, details that are passed down using dedicated raft log entry encodings³⁴ as part of the raftpb.Entry parameter above.
- Since the raftpb.Entry encodes within it its origin node⁴, it will be used post-admission to dispatch flow tokens to the right node. This integration is left to future PRs.

We use the above to populate the following fields on a per-(replicated write)work basis:

    // ReplicatedWorkInfo groups everything needed to admit replicated
    // writes, done so asynchronously below-raft as part of replication
    // admission control.
    type ReplicatedWorkInfo struct {
      RangeID roachpb.RangeID
      Origin roachpb.NodeID
      LogPosition LogPosition
      Ingested bool
    }
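
The non-blocking enqueue described earlier can be sketched as follows. This is a toy model, not the StoreWorkQueue implementation: the names storeQueue, virtualWork, Admit and Grant are hypothetical, FIFO order stands in for the real tenant/priority ordering, and tokens are assumed to be counted in bytes. It only shows that the physical bytes are accounted for at enqueue time while admission itself is deferred until tokens arrive.

```go
package main

import "fmt"

// virtualWork is a hypothetical stand-in for a queued below-raft work item.
type virtualWork struct {
	rangeID int64
	index   uint64
	bytes   int64
}

// storeQueue is a toy model of a store work queue (assumed names throughout).
type storeQueue struct {
	physicalBytes int64 // bytes already written, accounted for at enqueue time
	tokens        int64 // available IO tokens; may go negative after a grant
	pending       []virtualWork
	admitted      []virtualWork
}

// Admit never blocks: it records the physical write immediately and queues a
// virtual item to be admitted later.
func (q *storeQueue) Admit(w virtualWork) {
	q.physicalBytes += w.bytes
	q.pending = append(q.pending, w)
}

// Grant adds tokens and dequeues pending items in FIFO order while the token
// count is positive. Each admitted item deducts its full size.
func (q *storeQueue) Grant(tokens int64) {
	q.tokens += tokens
	for len(q.pending) > 0 && q.tokens > 0 {
		w := q.pending[0]
		q.pending = q.pending[1:]
		q.tokens -= w.bytes
		q.admitted = append(q.admitted, w)
	}
}

func main() {
	q := &storeQueue{}
	q.Admit(virtualWork{rangeID: 1, index: 10, bytes: 100})
	q.Admit(virtualWork{rangeID: 1, index: 11, bytes: 50})
	fmt.Println("accounted bytes:", q.physicalBytes, "admitted:", len(q.admitted))
	q.Grant(100)
	fmt.Println("admitted:", len(q.admitted), "pending:", len(q.pending))
}
```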

Since admission happens below raft, where the size of the write is known, we no longer need per-work estimates for upfront IO token deductions. Since admission is asynchronous, we also don't use the AdmittedWorkDone interface, which existed to make token adjustments (without blocking) given the upfront estimates. We still want to intercept the exact point at which some write work gets admitted, in order to inform the origin node so it can release flow tokens. We do so through the following interface, satisfied by the StoreWorkQueue:

  // onAdmittedReplicatedWork is used to intercept the
  // point-of-admission for replicated writes.
  type onAdmittedReplicatedWork interface {
    admittedReplicatedWork(
      tenantID roachpb.TenantID,
      pri admissionpb.WorkPriority,
      rwi ReplicatedWorkInfo,
      requestedTokens int64,
    )
  }
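
As a rough illustration of how such a hook might be consumed, the sketch below uses simplified stand-in types (plain integers instead of roachpb and admissionpb types) and a hypothetical tokenReturner that tallies, per origin node, the tokens to hand back. The grant helper is also an assumption; it merely simulates the queue invoking the hook at the point of admission.

```go
package main

import "fmt"

// Simplified stand-ins for the real types; all names here are assumptions.
type logPosition struct{ Term, Index uint64 }

type replicatedWorkInfo struct {
	RangeID     int64
	Origin      int32
	LogPosition logPosition
	Ingested    bool
}

// onAdmitted mirrors the shape of the interception interface.
type onAdmitted interface {
	admittedReplicatedWork(tenantID uint64, pri int8, rwi replicatedWorkInfo, requestedTokens int64)
}

// tokenReturner accumulates, per origin node, the tokens to return.
type tokenReturner struct {
	toReturn map[int32]int64
}

// Compile-time check that tokenReturner satisfies the interface.
var _ onAdmitted = (*tokenReturner)(nil)

func (t *tokenReturner) admittedReplicatedWork(
	_ uint64, _ int8, rwi replicatedWorkInfo, requestedTokens int64,
) {
	t.toReturn[rwi.Origin] += requestedTokens
}

// grant simulates the queue admitting one queued item and notifying the hook.
func grant(cb onAdmitted, tenantID uint64, pri int8, rwi replicatedWorkInfo, tokens int64) {
	cb.admittedReplicatedWork(tenantID, pri, rwi, tokens)
}

func main() {
	r := &tokenReturner{toReturn: map[int32]int64{}}
	rwi := replicatedWorkInfo{RangeID: 7, Origin: 1, LogPosition: logPosition{Term: 3, Index: 42}}
	grant(r, 1, 0, rwi, 128)
	fmt.Println("tokens to return to n1:", r.toReturn[1])
}
```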

Release note: None

¹ See kvflowcontrolpb.AdmittedRaftLogEntries, introduced in "kvflowcontrol,raftlog: interfaces for replication control" (#95637).
² See kvflowcontrol.Handle.{ReturnTokensUpto,DeductTokensFor}, introduced in "kvflowcontrol,raftlog: interfaces for replication control" (#95637). Token deductions and returns are tied to raft log positions.
³ See raftlog.EntryEncoding{Standard,Sideloaded}WithAC, introduced in "raftlog: introduce EntryEncoding{Standard,Sideloaded}WithAC" (#95748).
⁴ See kvflowcontrolpb.RaftAdmissionMeta, introduced in "kvflowcontrol,raftlog: interfaces for replication control" (#95637).

irfansharif · 2023-02-25T11:42:39Z

@sumeerbhola, this is ready for review. It's still a bit incomplete -- I think I've broken epoch-LIFO for the below-raft admission queues. I'll fix + add tests while you review.

irfansharif · 2023-02-27T16:34:56Z

I think I've broken epoch-LIFO for the below-raft admission queues. I'll fix + add tests while you review.

I couldn't figure it out and I've confused myself further. I put my (partly unreadable) questions in I12 within kvflowcontrol/doc.go.

sumeerbhola

In WorkQueue ordering -- for replicated writes below-raft, we ignore CreateTime/epoch-LIFO, and instead sort by priority and within a priority, sort by log position.

Log position is not a physically meaningful number since different ranges may be seeing new entries at different rates, and different ranges may have been created at different times so may be at different places in their raft log (maybe that is why the comparison code includes RangeID?). Are we doing this to ensure that for a range, if an entry with priority p at position n is admitted that we can assume that for that same range every entry with priority p at position < n is also admitted?
If we need this invariant, can we assign the create time in a monotonic manner when the proposal is assigned a tentative log position after evaluation (i.e., the log position we are using here in admission control for flow control tokens)?
If we really want epoch-LIFO to work at this layer in the future we will need to use the txn create time, which means we can't have this invariant -- I need to think some more about that.
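
A minimal sketch of the monotonic create-time idea, under stated assumptions: the sequencer, item, and workHeap types are hypothetical, create times are plain integers, and higher pri values are assumed to be admitted first. If create times never decrease along a range's log, ordering by (priority, create time) alone preserves the per-range, per-priority invariant without comparing RangeIDs or log positions.

```go
package main

import (
	"container/heap"
	"fmt"
)

type item struct {
	pri        int
	rangeID    int64
	index      uint64
	createTime int64
}

// sequencer hands out create times that strictly increase per range. It must
// be called in log-position order for each range.
type sequencer struct{ last map[int64]int64 }

func (s *sequencer) next(rangeID, now int64) int64 {
	if prev, ok := s.last[rangeID]; ok && now <= prev {
		now = prev + 1 // the clock did not advance; bump past the last value
	}
	s.last[rangeID] = now
	return now
}

// workHeap orders by priority (higher first), then by create time.
type workHeap []item

func (h workHeap) Len() int      { return len(h) }
func (h workHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h workHeap) Less(i, j int) bool {
	if h[i].pri != h[j].pri {
		return h[i].pri > h[j].pri
	}
	return h[i].createTime < h[j].createTime
}
func (h *workHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *workHeap) Pop() interface{} {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// enqueue pushes items in the given (possibly scrambled) order.
func enqueue(h *workHeap, items ...item) {
	for _, it := range items {
		heap.Push(h, it)
	}
}

// drain pops everything and returns the log indexes in admission order.
func drain(h *workHeap) []uint64 {
	var out []uint64
	for h.Len() > 0 {
		out = append(out, heap.Pop(h).(item).index)
	}
	return out
}

func main() {
	seq := &sequencer{last: map[int64]int64{}}
	a := item{rangeID: 1, index: 5, createTime: seq.next(1, 100)}
	b := item{rangeID: 1, index: 6, createTime: seq.next(1, 90)} // clock regressed
	h := &workHeap{}
	enqueue(h, b, a)
	fmt.Println(drain(h))
}
```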

Reviewed 3 of 27 files at r1, 1 of 11 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)

pkg/util/admission/work_queue.go line 191 at r2 (raw file):

	// ReplicatedWorkInfo groups everything needed to admit replicated writes, done
	// so asynchronously below-raft as part of replication admission control.
	ReplicatedWorkInfo ReplicatedWorkInfo

An alternative would be to make this a ReplicatedWorkInfo interface{} and make it opaque to the AC package. We would allocate using a sync.Pool.
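
A sketch of what that alternative could look like, assuming hypothetical names (workInfo, replicatedInfo, newInfo, release); only sync.Pool and its Get/Put methods are real APIs here. The admission package would carry the value as an opaque interface{} and the caller would type-assert it back after admission.

```go
package main

import (
	"fmt"
	"sync"
)

// replicatedInfo is owned by the caller; the admission package never inspects it.
type replicatedInfo struct {
	rangeID int64
	index   uint64
}

// workInfo models the admission-side struct carrying an opaque payload.
type workInfo struct {
	ReplicatedWorkInfo interface{}
}

var infoPool = sync.Pool{
	New: func() interface{} { return &replicatedInfo{} },
}

// newInfo fetches a (possibly recycled) struct and overwrites every field.
func newInfo(rangeID int64, index uint64) *replicatedInfo {
	ri := infoPool.Get().(*replicatedInfo)
	*ri = replicatedInfo{rangeID: rangeID, index: index}
	return ri
}

// release zeroes the struct and returns it to the pool. The caller must not
// use ri afterwards.
func release(ri *replicatedInfo) {
	*ri = replicatedInfo{}
	infoPool.Put(ri)
}

func main() {
	w := workInfo{ReplicatedWorkInfo: newInfo(7, 42)}
	// After admission, the caller recovers its own type.
	if ri, ok := w.ReplicatedWorkInfo.(*replicatedInfo); ok {
		fmt.Println("admitted range", ri.rangeID, "index", ri.index)
		release(ri)
	}
}
```

The trade-off is a type assertion on the caller's side and an explicit release step, in exchange for keeping replication-specific fields out of the admission package.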

pkg/util/admission/work_queue.go line 205 at r2 (raw file):

	// Origin is the node at which this work originated. It's used for
	// replication admission control to inform the origin of admitted work
	// (after which flow tokens are released, permitted more replicated

nit: permitting

pkg/util/admission/work_queue.go line 214 at r2 (raw file):

	// maintain accurate linear models for L0 growth due to ingests and
	// regular write batches.
	Ingested bool

StoreWorkDoneInfo allowed both WriteBytes and IngestedBytes to be non-zero. Do we not need that here?

pkg/util/admission/work_queue.go line 532 at r2 (raw file):

// admission control is enabled. AdmittedWorkDone must be called iff
// enabled=true && err!=nil, and the WorkKind for this queue uses slots.
func (q *WorkQueue) Admit(ctx context.Context, info WorkInfo) (enabled bool, err error) {

aside: before enabling replication admission control for user-facing traffic (which can have arbitrary concurrency, compared to our internal load like index backfills), I think we will need to work out the memory overhead of queueing each raft command and whether we need to do some coalescing.
We may want to track this somewhere so we don't forget.

pkg/util/admission/work_queue.go line 1487 at r2 (raw file):

			// LIFO, and the epoch is closed, so can simply use createTime.

			if (*wwh)[i].replicated.RangeID != (*wwh)[j].replicated.RangeID ||

why is RangeID relevant here?
I am not keen on changing the ordering function and would like to understand what motivates the change.

pkg/util/admission/work_queue.go line 1807 at r2 (raw file):

		// We use a per-request estimate only when no requested count is
		// provided. It's always provided for below-raft admission where we
		// already know the size of the work being admitted. Since it's async,

The "Since it's async, ..." is confusing. Our general pattern is to deduct tokens in the granter when admitting and then tell the requester about the admission. If the former is what "upfront" refers to, we still need to do that, and we will, since info.RequestedCount is non-zero. We will potentially under-deduct since we have not used any model in this granter deduction, and will fix things when WorkQueue.granted calls q.onAdmittedReplicatedWork.admittedReplicatedWork, which is also fine.
A longer more explicit comment specifying the exact control flow would be preferable.

pkg/util/admission/work_queue.go line 1903 at r2 (raw file):

	// upfront, and deduct what should be the right number of tokens. So why the
	// adjustment here? When deducting originally, how come we don't just apply
	// the linear models?

Regarding this TODO, this is partially a peculiarity of how the WorkQueue and granter interaction is setup to be very general and partly because there was no information about the actual size at admission time.
Let's focus on the former, since the latter is available for this replication admission control. The WorkQueue has to handle both tokens and slots and does not know anything about linear models (which is a very specialized case of token based AC). It also does not know that kvStoreTokenGranter.storeWriteDone has multiple resources hidden under it (disk bandwidth and L0 tokens). All of this complexity is handled via this side channel. I think it's best to continue doing it this way -- there is no risk of over-admission since this adjustment is being done in the same goroutine that did the granting.
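
The two-step flow described here (deduct the requested count at grant time, then correct via the model in the side channel) can be sketched as below. The granter, linearModel, deduct, and adjust names are hypothetical, and the model form (multiplier times bytes plus a constant) is an assumption made for illustration.

```go
package main

import "fmt"

// linearModel maps accounted-for bytes to modeled token cost (assumed form).
type linearModel struct {
	multiplier float64
	constant   int64
}

func (m linearModel) apply(bytes int64) int64 {
	return int64(m.multiplier*float64(bytes)) + m.constant
}

// granter is a toy token bucket; it knows nothing about models when granting.
type granter struct {
	available int64
}

// deduct is the model-free deduction made at grant time.
func (g *granter) deduct(requested int64) {
	g.available -= requested
}

// adjust is the side-channel correction: it deducts (or returns, if negative)
// the difference between the modeled cost and what was already deducted.
func (g *granter) adjust(requested int64, m linearModel) int64 {
	additional := m.apply(requested) - requested
	g.available -= additional
	return additional
}

func main() {
	g := &granter{available: 1000}
	g.deduct(100) // at grant time, using only the requested count
	extra := g.adjust(100, linearModel{multiplier: 1.5, constant: 10})
	fmt.Println("additional tokens:", extra, "available:", g.available)
}
```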

irfansharif

different ranges may have been created at different times so may be at different places in their raft log (maybe that is why the comparison code includes RangeID?).

Yes.

Are we doing this to ensure that for a range, if an entry with priority p at position n is admitted that we can assume that for that same range every entry with priority p at position < n is also admitted?

Yes, exactly.

If we need this invariant, can we assign the create time in a monotonic manner when the proposal is assigned a tentative log position after evaluation

Yes, this would work. I was effectively doing this by either using CreateTime in work queue orderings, or using log positions, but not both.

I need to think some more about that.

Do take a look at I12.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)

pkg/util/admission/work_queue.go line 191 at r2 (raw file):