Conversation

@jbowens
Collaborator

@jbowens jbowens commented Sep 4, 2025

When inserting the initial --insert-count keys, use the configured key schema to
generate them. This ensures that a subsequent call passing an appropriate
--write-seq flag will read the keys inserted through --insert-count.

Fix #107874.
Epic: none
Release note: none

Refactor the KV workload's key generator interface to reduce the interface
surface area. The keyGenerator is now a struct, rather than an interface, and a
more constrained keyMapper interface encapsulates the logic necessary for
mapping the sequential index to the appropriate key. Key generator
configuration is held in a keyGeneratorConfig struct, while state that requires
mutual exclusion is held in a keyGeneratorState struct.

Epic: none
Release note: none
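For readers following along, the refactored shape described above might look roughly like the sketch below. The type names come from the commit message; the fields and method signatures are illustrative assumptions, not the actual pkg/workload/kv code.

```go
package kv

import "sync"

// keyMapper is the narrower interface described above: it only knows how to
// map a sequential index to the appropriate key.
type keyMapper interface {
	keyAt(idx int64) int64
}

// keyGeneratorConfig holds the immutable key-generation configuration.
type keyGeneratorConfig struct {
	insertCount int64
	writeSeq    int64
	mapper      keyMapper
}

// keyGeneratorState holds the state that requires mutual exclusion.
type keyGeneratorState struct {
	mu  sync.Mutex
	seq int64
}

// keyGenerator is a concrete struct rather than an interface; the mapping
// strategy is the only pluggable piece.
type keyGenerator struct {
	cfg   keyGeneratorConfig
	state *keyGeneratorState
}

// writeKey claims the next sequential index and maps it to a key.
func (g *keyGenerator) writeKey() int64 {
	g.state.mu.Lock()
	idx := g.state.seq
	g.state.seq++
	g.state.mu.Unlock()
	return g.cfg.mapper.keyAt(idx)
}
```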
@jbowens jbowens requested review from a team and RaduBerinde September 4, 2025 13:45
@jbowens jbowens requested a review from a team as a code owner September 4, 2025 13:45
@jbowens jbowens requested review from DarrylWong and golgeek and removed request for a team September 4, 2025 13:45
@cockroach-teamcity
Member

This change is Reviewable

Member

@RaduBerinde RaduBerinde left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong and @golgeek)

@jbowens
Collaborator Author

jbowens commented Sep 4, 2025

Hrm, I'm realizing there's an issue. --insert-count uses IMPORT, so there must not be duplicate keys. The keyGenerator can generate duplicate keys at large insert counts with the zipfian or default hashing key generation.
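For a rough sense of scale (my own back-of-envelope arithmetic, not from the PR): for the default generator, which hashes sequential indexes into a 64-bit key space, the birthday bound puts the expected number of colliding pairs at roughly n(n-1)/2^65, so collisions become plausible somewhere around a few billion rows; the zipfian generator repeats keys by design, so duplicates appear far sooner there.

```go
package main

import (
	"fmt"
	"math"
)

// expectedCollisions is a birthday-bound estimate of the expected number of
// colliding pairs when hashing n distinct indexes uniformly into a 64-bit
// key space: n*(n-1)/2^65.
func expectedCollisions(n float64) float64 {
	return n * (n - 1) / math.Exp2(65)
}

func main() {
	for _, n := range []float64{1e8, 1e9, 1e10} {
		fmt.Printf("n = %.0e  expected colliding pairs ≈ %.4f\n", n, expectedCollisions(n))
	}
}
```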

@RaduBerinde
Member

Can we use a pseudorandom way of generating permutations? https://chatgpt.com/share/68b9b11a-18dc-8010-9fe1-990d5c3e731a
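One common way to do that (a sketch of the general technique, not code from this PR) is a Feistel network over the 64-bit key space: it is a bijection, so feeding it the sequential indexes 0..n-1 produces pseudorandom-looking keys with no duplicates.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// permute64 applies a 4-round Feistel network to a 64-bit value. A Feistel
// network is invertible regardless of the round function, so this is a
// pseudorandom permutation: distinct inputs always map to distinct outputs.
func permute64(x, seed uint64) uint64 {
	l, r := uint32(x>>32), uint32(x)
	for round := uint64(0); round < 4; round++ {
		var buf [12]byte
		binary.LittleEndian.PutUint64(buf[:8], seed^round)
		binary.LittleEndian.PutUint32(buf[8:], r)
		h := fnv.New32a()
		h.Write(buf[:])
		l, r = r, l^h.Sum32()
	}
	return uint64(l)<<32 | uint64(r)
}

func main() {
	// Sequential indexes come out scattered across the keyspace, duplicate-free.
	for i := uint64(0); i < 5; i++ {
		fmt.Printf("index %d -> key %#016x\n", i, permute64(i, 0xdeadbeef))
	}
}
```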

Previously, the BatchedTuples used to load a workload's initial data could not
contain duplicate keys, since duplicates would violate uniqueness constraints.
Some of the workload key generators will inevitably generate duplicates when
hashing keys, because of the relatively small key space of 64-bit integers.

This commit updates the BatchedTuples type so that the workload can indicate
the possible existence of duplicates, and updates the data loader to use INSERT
... ON CONFLICT DO NOTHING queries to perform the insertion, effectively
ignoring the duplicates.

When inserting the initial --insert-count keys, use the configured key schema to
generate them. This ensures that a subsequent call passing an appropriate
--write-seq flag will read the keys inserted through --insert-count.

Fix cockroachdb#107874.
Epic: none
Release note: none
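Concretely, the approach the commit message describes amounts to something like the following sketch, assuming the kv workload's usual (k, v) table schema; it is illustrative, not the actual data loader code.

```go
package loader

import (
	"database/sql"
	"fmt"
	"strings"
)

// insertBatchIgnoringDuplicates builds a multi-row INSERT with
// ON CONFLICT DO NOTHING so that any keys the generator happened to emit
// twice are silently skipped instead of failing the initial data load.
func insertBatchIgnoringDuplicates(db *sql.DB, keys []int64, payload []byte) error {
	placeholders := make([]string, len(keys))
	args := make([]interface{}, 0, 2*len(keys))
	for i, k := range keys {
		placeholders[i] = fmt.Sprintf("($%d, $%d)", 2*i+1, 2*i+2)
		args = append(args, k, payload)
	}
	stmt := fmt.Sprintf(
		"INSERT INTO kv (k, v) VALUES %s ON CONFLICT DO NOTHING",
		strings.Join(placeholders, ", "),
	)
	_, err := db.Exec(stmt, args...)
	return err
}
```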
@jbowens
Collaborator Author

jbowens commented Sep 4, 2025

Can we use a pseudorandom way of generating permutations? https://chatgpt.com/share/68b9b11a-18dc-8010-9fe1-990d5c3e731a

interesting... I think maybe for the uniform generator, yes, but I'm not sure how to adapt the zipfian generator. For now I adapted the data loading to use the insert data loader when there's a possibility of duplicates, and to configure the insert data loader to use ON CONFLICT DO NOTHING.

@RaduBerinde
Member

interesting... I think maybe for the uniform generator, yes, but I'm not sure how to adapt the zipfian generator

Indeed... a zipfian write pattern necessarily implies we are overwriting keys.

@jbowens
Collaborator Author

jbowens commented Sep 4, 2025

I guess ideally for a zipfian key generator, --insert-count would insert the n keys with the highest probability of being surfaced by the zipfian distribution (rather than pulling n keys out of a zipfian distribution)

@RaduBerinde
Member

I guess ideally for a zipfian key generator, --insert-count would insert the n keys with the highest probability of being surfaced by the zipfian distribution (rather than pulling n keys out of a zipfian distribution)

Looking at the zipfian generator code, that just means keys 0 through n-1. BTW I don't think this pattern (keys are ordered by decreasing frequency) is realistic. We should pass the zipf output through a random permutation.

@jbowens
Collaborator Author

jbowens commented Sep 4, 2025

Oof, I didn't even notice that. The ycsb workload will hash the index afterwards, ensuring keys are still randomly distributed across the keyspace.
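For concreteness, roughly the pattern being referred to (a minimal sketch, not the actual ycsb code): draw a rank from the zipfian distribution, then hash it so that the hot keys are scattered across the keyspace rather than clustered at 0..n-1.

```go
package main

import (
	"fmt"
	"math/rand"
)

// fnv1a64 hashes a rank with 64-bit FNV-1a; any fixed hash that spreads
// values across the keyspace serves the same purpose.
func fnv1a64(x uint64) uint64 {
	const prime = 1099511628211
	h := uint64(14695981039346656037)
	for i := 0; i < 8; i++ {
		h ^= (x >> (8 * i)) & 0xff
		h *= prime
	}
	return h
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// Zipfian ranks over [0, 1<<20): small ranks are drawn most often.
	zipf := rand.NewZipf(rng, 1.1, 1, 1<<20)
	for i := 0; i < 5; i++ {
		rank := zipf.Uint64()
		fmt.Printf("rank %7d -> key %#016x\n", rank, fnv1a64(rank))
	}
}
```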

TFTR!

bors r=RaduBerinde

@craig
Contributor

craig bot commented Sep 4, 2025

@craig craig bot merged commit b66aafa into cockroachdb:master Sep 4, 2025
32 of 33 checks passed
@jbowens jbowens deleted the workload-kv branch September 8, 2025 15:04
jbowens added a commit to jbowens/cockroach that referenced this pull request Sep 8, 2025
In 1041bf0 (cockroachdb#152979), we converted ycsb and kv workloads to use INSERT
statements for initial data loading when hashing keys.

Hashing keys can cause key collisions. At higher row counts, these collisions
are inevitable. Since IMPORTs require imported data to be unique, an IMPORT
with colliding keys would fail. Before cockroachdb#152979, only the ycsb workload was
susceptible to this failure mode, because the kv workload did not use the same
key generator for data loaded through --insert-count (cockroachdb#107874). Fixing cockroachdb#107874
made the kv workload susceptible to this type of failure.

The change to use INSERTs instead of IMPORTs ran afoul of KV message size
limits (see cockroachdb#153086). For now we revert that change so that IMPORT is
used automatically again, and instead warn about the possibility of collisions.

Epic: none
Release note: none
craig bot pushed a commit that referenced this pull request Sep 10, 2025
153066: cursor: add user-level CLAUDE.md to rule r=benbardin a=benbardin

Useful for user-level preferences like gh to access Github, etc.

Epic: none
Release note: None

153171: workload: use IMPORT by default, but warn against uniqueness violations r=tbg a=jbowens

In 1041bf0 (#152979), we converted ycsb and kv workloads to use INSERT statements for initial data loading when hashing keys.

Hashing keys can cause key collisions. At higher row counts, these collisions are inevitable. Since IMPORTs require imported data to be unique, an IMPORT with colliding keys would fail. Before #152979, only the ycsb workload was susceptible to this failure mode, because the kv workload did not use the same key generator for data loaded through --insert-count (#107874). Fixing #107874 made the kv workload susceptible to this type of failure.

The change to use INSERTs instead of IMPORTs ran afoul of KV message size limits (see #153086). For now we revert that change so that IMPORT is used automatically again, and instead warn about the possibility of collisions.

Epic: none
Release note: none

153206: pkg/cmd/roachtests: fix disk bandwidth passed to cluster setting r=nicktrav a=nicktrav

As of #135019, the disk bandwidth variable has units of bytes, but the string passed to the cluster setting has units of `MiB`. This can result in an effective bandwidth that is orders of magnitude too high.

Change the former to match the units of the latter.

Release note: None.

Epic: None.

153293: storage: bump pebble, add new blob rewrite cluster setting r=RaduBerinde a=jbowens

Changes:

 * [`342d2e8b`](cockroachdb/pebble@342d2e8b) db: add high-priority blob file rewrite compactions
 * [`fff7d3ad`](cockroachdb/pebble@fff7d3ad) db: fix MarkedForCompaction flake
 * [`d86f6dab`](cockroachdb/pebble@d86f6dab) scripts: more improvements to stress.sh

Release note: none.
Epic: none.

Add a new cluster setting for configuring high-priority blob file rewrite
compactions. One of these compactions is allowed to run ahead of a default
compaction if the amount of garbage exceeds the configured percentage.

Co-authored-by: Ben Bardin <bardin@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>