admission: byte tokens for store admission #79092

Closed
sumeerbhola opened this issue Mar 30, 2022 · 3 comments · Fixed by #85059

sumeerbhola (Collaborator) commented Mar 30, 2022

Admission control is not aware of the bytes that will be written by each work item. It does compute tokens based on the bytes being written to L0 and compacted out of L0, but it turns those into a per-work-item token by using the mean bytes written per work item. Estimates are fine for small writes, but proper accounting for larger writes (like AddSSTable, and in the future range snapshot application) is preferable, since it avoids spikes of over-admission that are later compensated for by under-admission.
We should also properly account for how many of the ingested bytes were added to L0.

We plan to:

  • Enhance the admission control logic to compute byte tokens for store writes. Requests that provide their byte size (which should be all large requests) will consume that many tokens. Estimates will be used for two purposes: (a) for requests that don't provide their byte size, the estimate decides how many tokens to consume; (b) for computing the fraction of an ingest request that will end up in L0 (to adjust the token consumption for an ingest request). Like the token estimation today, these estimates are continually adjusted based on stats (at each 15s interval).
  • After the admitted work is done, each ingest request will also provide information on how many bytes were added to L0 (this will need a small Pebble change, pebble#1600: db: make DB.Ingest return bytes ingested into L0), so that the token consumption can be corrected and we have data for future estimates.

#75120 contains a WIP PR with this change.
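
To make the planned token arithmetic concrete, here is a minimal Go sketch of the admit/adjust flow; all names are hypothetical, and this illustrates the idea rather than the CockroachDB implementation:

```
package main

import "fmt"

// storeTokens is a hypothetical byte-token pool for one store.
type storeTokens struct {
	available        int64   // byte tokens remaining in the current interval
	perWorkEstimate  int64   // smoothed estimate for requests without a size
	ingestL0Fraction float64 // estimated fraction of an ingest landing in L0
}

// admit deducts byte tokens for one work item. writeBytes <= 0 means the
// request did not declare its size, so the smoothed estimate is used.
func (st *storeTokens) admit(writeBytes int64, isIngest bool) int64 {
	tokens := writeBytes
	if tokens <= 0 {
		tokens = st.perWorkEstimate
	} else if isIngest {
		// Only the portion expected to land in L0 is charged.
		tokens = int64(float64(tokens) * st.ingestL0Fraction)
	}
	st.available -= tokens
	return tokens
}

// adjustAfterDone reconciles an ingest once the store reports how many
// bytes actually went into L0 (what the pebble#1600 change enables).
func (st *storeTokens) adjustAfterDone(consumed, actualL0Bytes int64) {
	st.available -= actualL0Bytes - consumed // refund or charge the difference
}

func main() {
	st := &storeTokens{available: 1 << 20, perWorkEstimate: 4096, ingestL0Fraction: 0.5}
	consumed := st.admit(1<<18, true)   // 256 KiB ingest, half expected in L0
	st.adjustAfterDone(consumed, 1<<16) // only 64 KiB actually landed in L0
	fmt.Println(st.available)           // tokens reflect the corrected charge
}
```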

Deficiencies:
The overload threshold at which we want to start constraining writes could differ between user-facing and background operations (like index backfills): by sharing a single queue and a single source of tokens for that queue, we also share the same overload thresholds. This is probably not an immediate problem, since rocksdb.ingest_backpressure.l0_file_count_threshold and admission.l0_sub_level_count_overload_threshold both default to 20. One way to address this in the future is a hierarchical token bucket scheme: the admission.ioLoadListener would produce high_overload_tokens and low_overload_tokens, where background operations have to consume both while foreground operations only use the former. (We are trying to do something similar with CPU slots for background activities like concurrent compression threads in Pebble compactions.)
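
A minimal Go sketch of that hierarchical scheme, with hypothetical names (background work draws from both buckets, foreground only from the high-overload one, so background work is throttled first as load rises):

```
package sketch

// tokenBuckets is a hypothetical two-level bucket.
type tokenBuckets struct {
	highOverloadTokens int64 // gates all work once severely overloaded
	lowOverloadTokens  int64 // additionally gates background work
}

// tryAdmit reports whether the work can be admitted, consuming tokens if so.
func (tb *tokenBuckets) tryAdmit(tokens int64, background bool) bool {
	if tb.highOverloadTokens < tokens || (background && tb.lowOverloadTokens < tokens) {
		return false
	}
	tb.highOverloadTokens -= tokens
	if background {
		tb.lowOverloadTokens -= tokens
	}
	return true
}
```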

This issue is split off from the broader #75066.

Jira issue: CRDB-14556

Epic CRDB-14607

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control A-storage Relating to our storage engine (Pebble) on-disk storage. labels Mar 30, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Mar 30, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 25, 2022
This switch to byte tokens will result in better accounting for
large writes, including ingests based on whether their bytes land
in L0 or elsewhere. It is also a precursor to taking into account
flush capacity (in bytes).

The store write admission control path now uses a StoreWorkQueue
which wraps a WorkQueue and provides additional functionality:
- Work can specify WriteBytes and whether it is an IngestRequest.
  This is used to decide how many byte tokens to consume.
- Done work specifies how many bytes were ingested into L0, so
  token consumption can be adjusted.

The main framework change is that a single work item can consume
multiple (byte) tokens, which ripples through the various
interfaces including requester, granter. There is associated
cleanup: kvGranter that was handling both slots and tokens is
eliminated since in practice it was only doing one or the other.
Instead, for the slot case the slotGranter is reused. For the token
case there is a new kvStoreTokenGranter.
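
As a rough illustration of that ripple, the granter side can be paraphrased as follows (hypothetical Go; the actual interfaces in the admission package differ in detail):

```
package sketch

// granter paraphrases the post-change shape: every method carries a token
// count, where previously a grant implicitly meant one slot or one token.
type granter interface {
	tryGet(count int64) bool           // consume count byte tokens if available
	returnGrant(count int64)           // hand tokens back, e.g. cancelled work
	tookWithoutPermission(count int64) // record tokens taken by unqueued work
}
```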

The main logic change is in ioLoadListener which computes byte
tokens and various estimates. The change is (mostly) neutral if
no write provides WriteBytes, since the usual estimation will
take over. The integration changes in this PR are superficial in
that the requests don't provide WriteBytes. Improvements to the
integration, along with experimental results, will happen in
future PRs.

Informs cockroachdb#79092
Informs cockroachdb#77357

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 28, 2022
craig bot pushed a commit that referenced this issue May 3, 2022
80480: admission: change store write admission control to use byte tokens r=ajwerner a=sumeerbhola


80892: hlc: move ParseHLC / DecimalToHLC to util/hlc r=ajwerner a=otan

util/hlc already needs util/log, and ParseHLC/DecimalToHLC don't need to
live in tree. To remove tree's dependency on util/log, move those
functions to util/hlc.

Release note: None

80898: util/cache: remove dependency on util/log r=ajwerner a=otan

Use a hook in IntervalCache to log any errors instead.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
sumeerbhola (Collaborator, Author) commented

The changes to the admission control package are done. The remaining work is:

  • integration changes that provide the bytes being ingested and how many of those bytes went into L0.
  • a roachtest that has (a) a high rate of SST ingests (which already integrate with admission control), say a secondary index backfill, that often land in L0, and (b) a high rate of higher-priority (than the SST ingests) normal (small) writes. We want to observe that with this change, (b) is not unfairly charged the bytes for the ingests, and so suffers less queueing in admission control.
    @dt do you have any suggestions on existing roachtests or workload combinations that I could use as a starting point?

dt (Member) commented May 4, 2022

I don't think bulk has any roachtests with real concurrent SQL writes alongside bulk ingest; the ingest ones (IMPORT, RESTORE) have the normal system-driven writes (liveness, job progress, etc.), but that's it. Schema might have something, though, that adds an index while running a workload? @ajwerner?

ajwerner (Contributor) commented May 4, 2022

There are a few, but I don't know much about them. See here for schemachange/index/tpcc/w=%d:

func makeIndexAddTpccTest(
	spec spec.ClusterSpec, warehouses int, length time.Duration,
) registry.TestSpec {
	return registry.TestSpec{
		Name:    fmt.Sprintf("schemachange/index/tpcc/w=%d", warehouses),
		Owner:   registry.OwnerSQLSchema,
		Cluster: spec,
		Timeout: length * 3,
		Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
			runTPCC(ctx, t, c, tpccOptions{
				Warehouses: warehouses,
				// We limit the number of workers because the default results in a lot
				// of connections which can lead to OOM issues (see #40566).
				ExtraRunArgs: fmt.Sprintf("--wait=false --tolerate-errors --workers=%d", warehouses),
				During: func(ctx context.Context) error {
					return runAndLogStmts(ctx, t, c, "addindex", []string{
						`CREATE UNIQUE INDEX ON tpcc.order (o_entry_d, o_w_id, o_d_id, o_carrier_id, o_id);`,
						`CREATE INDEX ON tpcc.order (o_carrier_id);`,
						`CREATE INDEX ON tpcc.customer (c_last, c_first);`,
					})
				},
				Duration:  length,
				SetupType: usingImport,
			})
		},
	}
}

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 26, 2022
The existing scheme for byte token estimation simply looked
at the total bytes added to L0 and divided it among the number
of requests. This was because (a) the parameters to provide
better size information for the request were not populated by
kvserver, (b) the basic estimation approach was flawed since
it assumed that regular writes would be roughly equal sized,
and assumed that ingests would tell what fraction went into L0.

The kvserver-side plumbing for improving (a) was done in a
preceding PR. This one completes that plumbing to pass on
admission.StoreWorkDoneInfo to the admission control package.
In this scheme the {WriteBytes,IngestedBytes} are provided
post-proposal evaluation, and the IngestedBytes is for the
whole LSM. This PR makes changes to the plumbing in the
admission package: specifically, the post-work-done token
adjustments are performed via the granterWithStoreWriteDone
interface and the addition to granterWithIOTokens. The former
also returns the token adjustments to StoreWorkQueue so
that the per-tenant fairness accounting in WorkQueue can be
updated.

The main changes in this PR are in the byte token
estimation logic in the admission package, where the
estimation now uses a linear model y = a.x + b, where
x is the bytes provided in admission.StoreWorkDoneInfo,
and y is the bytes added to L0 via write or ingestion.
If we consider regular writes, one can expect that even
with many differently sized workloads concurrently
active on a node, we should be able to fit a model where
a is roughly 2 and b is tiny -- this is because x is the
bytes written to the raft log and does not include the
subsequent state machine application. Similarly, one can
expect the a term to be in the interval [0,1] for ingested
work. The linear model is meant to fix flaw (b) mentioned
earlier. The current linear model fitting in
store_token_estimation.go is very simple and can be
independently improved in the future -- there are code
comments outlining this. Additionally, all the byte token
estimation logic in granter.go has been removed, which
is better from a code readability perspective.
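
An illustrative sketch (not the actual store_token_estimation.go code) of one way to fit an interval's aggregates to y = a.x + b with a 1-byte floor on b and then exponentially smooth the terms; the smoothing factor is an assumption:

```
package sketch

// linearModel predicts bytes added to L0 as y = a*x + b per request.
type linearModel struct{ a, b float64 }

const alpha = 0.5 // assumed smoothing factor

// fitAndSmooth takes the previous smoothed model and one interval's
// aggregates: accounted bytes x, bytes added to L0 y, and the number of
// admitted requests n. It pins b at its 1-byte-per-request floor, solves
// for a, and exponentially smooths both terms.
func fitAndSmooth(prev linearModel, x, y, n int64) linearModel {
	cur := linearModel{b: 1}
	if x > 0 {
		cur.a = float64(y-n) / float64(x) // y minus n requests * 1 byte each
	}
	return linearModel{
		a: alpha*cur.a + (1-alpha)*prev.a,
		b: alpha*cur.b + (1-alpha)*prev.b,
	}
}
```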

This change was evaluated with a single node that first
saw a kv0 workload that writes 64KB blocks, then
additionally a kv0 workload that writes 4KB blocks, and
finally a third workload that starts doing an index
backfill due to creating an index on the v column in
the kv table.

Here are snippets from a sequence of log statements when
only the first workload (64KB writes) was running:
```
write-model 1.46x+1 B (smoothed 1.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 78 KiB
write-model 1.37x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 80 KiB
write-model 1.50x+1 B (smoothed 1.43x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 79 KiB
write-model 1.39x+1 B (smoothed 1.30x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 77 KiB
```
Note that the parameter a in a.x does fluctuate. The
additive term b stays at its minimum of 1 byte, which
is desirable. There is no change to the starting ingest
model since there are no ingestions.

After both the 4KB and 64KB writes are active the log
statements look like:
```
write-model 1.85x+1 B (smoothed 1.78x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 59 KiB
write-model 1.23x+1 B (smoothed 1.51x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 47 KiB
write-model 1.21x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 40 KiB
```
Note that the b value stays at 1 byte. The tokens consumed
at admission time are evenly divided among requests, so
the value has dropped.

When the index backfill is also running, the sstables are
ingested into L5 and L6, so the x value in the ingested
model is high, but what is ingested into L0 is low, which
means a becomes very small for the ingested-model -- see
the smoothed 0.00x+1 B below. There is choppiness in this
experiment with respect to the write model and the
at-admission-tokens, caused by a high number of write stalls.
This was not planned for, and is a side effect of huge Pebble
manifests caused by 64KB keys, so ignore those values in
the following log statements.
```
write-model 1.93x+1 B (smoothed 1.56x+2 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 120 KiB
write-model 2.34x+1 B (smoothed 1.95x+1 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 157 KiB
```

Fixes cockroachdb#79092
Informs cockroachdb#82536

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 6, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 8, 2022
craig bot pushed a commit that referenced this issue Aug 9, 2022
84761: schematelemetry,eventpb: add schema telemetry r=postamar a=postamar

This commit adds:
  - the event definitions and logic for generating them,
  - the scheduling and jobs boilerplate to periodically log them.

Care is taken to redact all strings present in descriptors that might
unintentionally leak PII.

The event generation logic is tested on the schema of a bootstrapped
test cluster: the test checks that the events match expectations.

Fixes #84284.

Release note (general change): CRDB will now collect schema info if
phoning home is enabled. This schema info is added to the telemetry log
by a built-in scheduled job which runs on a weekly basis by default.
This recurrence can be changed via the sql.schema.telemetry.recurrence
cluster setting.  The schedule can also be paused via PAUSE SCHEDULE
followed by its ID, which can be retrieved by querying
SELECT * FROM [SHOW SCHEDULES] WHERE label = 'sql-schema-telemetry'.

85059: admission,kvserver: improved byte token estimation for writes r=irfansharif,tbg a=sumeerbhola


Co-authored-by: Marius Posta <marius@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
craig bot closed this as completed in c7396ee on Aug 9, 2022