Conversation

@jbowens
Collaborator

@jbowens jbowens commented Sep 4, 2025

When inserting the initial --insert-count keys, use the configured key schema to
generate them. This ensures that a subsequent call passing an appropriate
--write-seq flag will read the keys inserted through --insert-count.

Fix #107874.
Epic: none
Release note: none

Refactor the KV workload's key generator interface to reduce the interface
surface area. The keyGenerator is now a struct, rather than an interface, and a
more constrained keyMapper interface encapsulates the logic necessary for
mapping the sequential index to the appropriate key. Key generator
configuration is held in a keyGeneratorConfig struct, while state that requires
mutual exclusion is held in a keyGeneratorState struct.

Epic: none
Release note: none
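For readers following along, the refactored shape described above might look roughly like the sketch below. The type names come from the commit message; the fields and method signatures are illustrative assumptions, not the actual pkg/workload/kv code.

```go
package kv

import "sync"

// keyMapper is the narrower interface described above: it only knows how to
// map a sequential index to the appropriate key.
type keyMapper interface {
	keyAt(idx int64) int64
}

// keyGeneratorConfig holds the immutable key-generation configuration.
type keyGeneratorConfig struct {
	insertCount int64
	writeSeq    int64
	mapper      keyMapper
}

// keyGeneratorState holds the state that requires mutual exclusion.
type keyGeneratorState struct {
	mu  sync.Mutex
	seq int64
}

// keyGenerator is a concrete struct rather than an interface; the mapping
// strategy is the only pluggable piece.
type keyGenerator struct {
	cfg   keyGeneratorConfig
	state *keyGeneratorState
}

// writeKey claims the next sequential index and maps it to a key.
func (g *keyGenerator) writeKey() int64 {
	g.state.mu.Lock()
	idx := g.state.seq
	g.state.seq++
	g.state.mu.Unlock()
	return g.cfg.mapper.keyAt(idx)
}
```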
@jbowens jbowens requested review from a team and RaduBerinde September 4, 2025 13:45
@jbowens jbowens requested a review from a team as a code owner September 4, 2025 13:45
@jbowens jbowens requested review from DarrylWong and golgeek and removed request for a team September 4, 2025 13:45
@cockroach-teamcity
Member

This change is Reviewable

Member

@RaduBerinde RaduBerinde left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong and @golgeek)

@jbowens
Collaborator Author

jbowens commented Sep 4, 2025

Hrm, I'm realizing there's an issue. --insert-count uses IMPORT, so there must not be duplicate keys. The keyGenerator can generate duplicate keys at large insert counts with the zipfian or default hashing key generation.
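For a rough sense of scale (my own back-of-envelope arithmetic, not from the PR): for the default generator, which hashes sequential indexes into a 64-bit key space, the birthday bound puts the expected number of colliding pairs at roughly n(n-1)/2^65, so collisions become plausible somewhere around a few billion rows; the zipfian generator repeats keys by design, so duplicates appear far sooner there.

```go
package main

import (
	"fmt"
	"math"
)

// expectedCollisions is a birthday-bound estimate of the expected number of
// colliding pairs when hashing n distinct indexes uniformly into a 64-bit
// key space: n*(n-1)/2^65.
func expectedCollisions(n float64) float64 {
	return n * (n - 1) / math.Exp2(65)
}

func main() {
	for _, n := range []float64{1e8, 1e9, 1e10} {
		fmt.Printf("n = %.0e  expected colliding pairs ≈ %.4f\n", n, expectedCollisions(n))
	}
}
```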

@RaduBerinde
Member

Can we use a pseudorandom way of generating permutations? https://chatgpt.com/share/68b9b11a-18dc-8010-9fe1-990d5c3e731a
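One common way to do that (a sketch of the general technique, not code from this PR) is a Feistel network over the 64-bit key space: it is a bijection, so feeding it the sequential indexes 0..n-1 produces pseudorandom-looking keys with no duplicates.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// permute64 applies a 4-round Feistel network to a 64-bit value. A Feistel
// network is invertible regardless of the round function, so this is a
// pseudorandom permutation: distinct inputs always map to distinct outputs.
func permute64(x, seed uint64) uint64 {
	l, r := uint32(x>>32), uint32(x)
	for round := uint64(0); round < 4; round++ {
		var buf [12]byte
		binary.LittleEndian.PutUint64(buf[:8], seed^round)
		binary.LittleEndian.PutUint32(buf[8:], r)
		h := fnv.New32a()
		h.Write(buf[:])
		l, r = r, l^h.Sum32()
	}
	return uint64(l)<<32 | uint64(r)
}

func main() {
	// Sequential indexes come out scattered across the keyspace, duplicate-free.
	for i := uint64(0); i < 5; i++ {
		fmt.Printf("index %d -> key %#016x\n", i, permute64(i, 0xdeadbeef))
	}
}
```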

Previously, the BatchedTuples used to load a workload's initial data could not
contain duplicate keys, since duplicates would violate uniqueness constraints.
Some of the workload key generators will inevitably generate duplicates when
hashing keys, because of the relatively small key space of 64-bit integers.

This commit updates the BatchedTuples type so that the workload can indicate
the possible existence of duplicates, and updates the data loader to use INSERT
... ON CONFLICT DO NOTHING queries to perform the insertion, effectively
ignoring the duplicates.

When inserting the initial --insert-count keys, use the configured key schema to
generate them. This ensures that a subsequent call passing an appropriate
--write-seq flag will read the keys inserted through --insert-count.

Fix cockroachdb#107874.
Epic: none
Release note: none
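Concretely, the approach the commit message describes amounts to something like the following sketch, assuming the kv workload's usual (k, v) table schema; it is illustrative, not the actual data loader code.

```go
package loader

import (
	"database/sql"
	"fmt"
	"strings"
)

// insertBatchIgnoringDuplicates builds a multi-row INSERT with
// ON CONFLICT DO NOTHING so that any keys the generator happened to emit
// twice are silently skipped instead of failing the initial data load.
func insertBatchIgnoringDuplicates(db *sql.DB, keys []int64, payload []byte) error {
	placeholders := make([]string, len(keys))
	args := make([]interface{}, 0, 2*len(keys))
	for i, k := range keys {
		placeholders[i] = fmt.Sprintf("($%d, $%d)", 2*i+1, 2*i+2)
		args = append(args, k, payload)
	}
	stmt := fmt.Sprintf(
		"INSERT INTO kv (k, v) VALUES %s ON CONFLICT DO NOTHING",
		strings.Join(placeholders, ", "),
	)
	_, err := db.Exec(stmt, args...)
	return err
}
```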
@jbowens
Collaborator Author

jbowens commented Sep 4, 2025

Can we use a pseudorandom way of generating permutations? https://chatgpt.com/share/68b9b11a-18dc-8010-9fe1-990d5c3e731a

interesting... I think maybe for the uniform generator, yes, but I'm not sure how to adapt the zipfian generator. For now I adapted the data loading to use the insert data loader when there's a possibility of duplicates, and to configure the insert data loader to use ON CONFLICT DO NOTHING.

@RaduBerinde
Member

interesting... I think maybe for the uniform generator, yes, but I'm not sure how to adapt the zipfian generator

Indeed... a zipfian write pattern necessarily implies we are overwriting keys.

@jbowens
Collaborator Author

jbowens commented Sep 4, 2025

I guess ideally for a zipfian key generator, --insert-count would insert the n keys with the highest probability of being surfaced by the zipfian distribution (rather than pulling n keys out of a zipfian distribution)

@RaduBerinde
Member

I guess ideally for a zipfian key generator, --insert-count would insert the n keys with the highest probability of being surfaced by the zipfian distribution (rather than pulling n keys out of a zipfian distribution)

Looking at the zipfian generator code, that just means keys 0 through n-1. BTW I don't think this pattern (keys are ordered by decreasing frequency) is realistic. We should pass the zipf output through a random permutation.

@jbowens
Collaborator Author

jbowens commented Sep 4, 2025

Oof, I didn't even notice that. The ycsb workload will hash the index afterwards, ensuring keys are still randomly distributed across the keyspace.
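For concreteness, roughly the pattern being referred to (a minimal sketch, not the actual ycsb code): draw a rank from the zipfian distribution, then hash it so that the hot keys are scattered across the keyspace rather than clustered at 0..n-1.

```go
package main

import (
	"fmt"
	"math/rand"
)

// fnv1a64 hashes a rank with 64-bit FNV-1a; any fixed hash that spreads
// values across the keyspace serves the same purpose.
func fnv1a64(x uint64) uint64 {
	const prime = 1099511628211
	h := uint64(14695981039346656037)
	for i := 0; i < 8; i++ {
		h ^= (x >> (8 * i)) & 0xff
		h *= prime
	}
	return h
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// Zipfian ranks over [0, 1<<20): small ranks are drawn most often.
	zipf := rand.NewZipf(rng, 1.1, 1, 1<<20)
	for i := 0; i < 5; i++ {
		rank := zipf.Uint64()
		fmt.Printf("rank %7d -> key %#016x\n", rank, fnv1a64(rank))
	}
}
```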

TFTR!

bors r=RaduBerinde

@craig
Contributor

craig bot commented Sep 4, 2025

@craig craig bot merged commit b66aafa into cockroachdb:master Sep 4, 2025
32 of 33 checks passed
@jbowens jbowens deleted the workload-kv branch September 8, 2025 15:04
jbowens added a commit to jbowens/cockroach that referenced this pull request Sep 8, 2025
In 1041bf0 (cockroachdb#152979), we converted ycsb and kv workloads to use INSERT
statements for initial data loading when hashing keys.

Hashing keys can cause key collisions. At higher row counts, these collisions
are inevitable. Since IMPORTs require imported data to be unique, an IMPORT
with colliding keys would fail. Before cockroachdb#152979, only the ycsb workload was
susceptible to this failure mode, because the kv workload did not use the same
key generator for data loaded through --insert-count (cockroachdb#107874). Fixing cockroachdb#107874
made the kv workload susceptible to this type of failure.

The change to use INSERTs instead of IMPORTs ran afoul of KV message size
limits (see cockroachdb#153086). For now we revert that change so that IMPORT is
used automatically again, and instead warn about the possibility of collisions.

Epic: none
Release note: none
craig bot pushed a commit that referenced this pull request Sep 10, 2025
153066: cursor: add user-level CLAUDE.md to rule r=benbardin a=benbardin

Useful for user-level preferences like gh to access Github, etc.

Epic: none
Release note: None

153171: workload: use IMPORT by default, but warn against uniqueness violations r=tbg a=jbowens

In 1041bf0 (#152979), we converted ycsb and kv workloads to use INSERT statements for initial data loading when hashing keys.

Hashing keys can cause key collisions. At higher row counts, these collisions are inevitable. Since IMPORTs require imported data to be unique, an IMPORT with colliding keys would fail. Before #152979, only the ycsb workload was susceptible to this failure mode, because the kv workload did not use the same key generator for data loaded through --insert-count (#107874). Fixing #107874 made the kv workload susceptible to this type of failure.

The change to use INSERTs instead of IMPORTs ran afoul of KV message size limits (see #153086). For now we revert that change so that IMPORT is used automatically again, and instead warn about the possibility of collisions.

Epic: none
Release note: none

153206: pkg/cmd/roachtests: fix disk bandwidth passed to cluster setting r=nicktrav a=nicktrav

As of #135019, the disk bandwidth variable has units of bytes, but the string passed to the cluster setting has units of `MiB`. This can result in an effective bandwidth that is orders of magnitude too high.

Change the former to match the units of the latter.

Release note: None.

Epic: None.

153293: storage: bump pebble, add new blob rewrite cluster setting r=RaduBerinde a=jbowens

Changes:

 * [`342d2e8b`](cockroachdb/pebble@342d2e8b) db: add high-priority blob file rewrite compactions
 * [`fff7d3ad`](cockroachdb/pebble@fff7d3ad) db: fix MarkedForCompaction flake
 * [`d86f6dab`](cockroachdb/pebble@d86f6dab) scripts: more improvements to stress.sh

Release note: none.
Epic: none.

Add a new cluster setting for configuring high-priority blob file rewrite
compactions. One of these compactions is allowed to run ahead of a default
compaction if the amount of garbage exceeds the configured percentage.

Co-authored-by: Ben Bardin <bardin@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>