@tbg tbg commented Jan 7, 2026

Backport 2/2 commits from #160615 on behalf of @tbg.


When deciding whether a store should shed load, computeCandidatesForReplicaTransfer
was checking both sls (store-level) and nls (node-level) load summaries. If nls
indicated overload but sls did not, shedding would proceed anyway. This caused a
panic in sortTargetCandidateSetAndPick which requires loadThreshold > loadNoChange.

The fix is to only check sls when deciding if a store should shed. If sls <=
loadNoChange, the store itself isn't overloaded relative to candidates that can
receive the load. High nls with low sls means other stores on the node are
causing node-level overload, so shedding from this store wouldn't help.
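The change in the shedding check can be illustrated with a minimal Go sketch. This is not the code from the PR: the `loadSummary` type, every level name other than `loadNoChange`, and the functions `shouldShedBefore`, `shouldShedAfter`, and `checkThreshold` are hypothetical stand-ins, and the invariant is modeled as a returned error rather than a panic.

```go
package main

import "fmt"

// loadSummary is a hypothetical ordered load level. Only loadNoChange is
// named in the PR description; the other levels are assumed for illustration.
type loadSummary int

const (
	loadLow loadSummary = iota
	loadNormal
	loadNoChange
	overloadSlow
	overloadUrgent
)

// shouldShedBefore models the old behavior: shedding proceeds if either the
// store-level (sls) or node-level (nls) summary indicates overload.
func shouldShedBefore(sls, nls loadSummary) bool {
	return sls > loadNoChange || nls > loadNoChange
}

// shouldShedAfter models the fix: only the store-level summary decides.
// nls is kept in the signature to show that it is deliberately ignored.
func shouldShedAfter(sls, nls loadSummary) bool {
	return sls > loadNoChange
}

// checkThreshold stands in for the invariant that the PR attributes to
// sortTargetCandidateSetAndPick: loadThreshold must exceed loadNoChange.
// The real function panics; this sketch returns an error instead.
func checkThreshold(loadThreshold loadSummary) error {
	if loadThreshold <= loadNoChange {
		return fmt.Errorf("loadThreshold %d must be > loadNoChange %d",
			loadThreshold, loadNoChange)
	}
	return nil
}

func main() {
	// A store that is not overloaded itself, on a node that is.
	sls, nls := loadNormal, overloadSlow

	if shouldShedBefore(sls, nls) {
		// The old check lets shedding proceed with sls as the threshold,
		// which violates the invariant.
		fmt.Println("before fix:", checkThreshold(sls))
	}
	if !shouldShedAfter(sls, nls) {
		fmt.Println("after fix: store does not shed, picker is never called")
	}
}
```

With the store-only check, the picker is reached only when `sls > loadNoChange`, so the threshold it receives always satisfies the invariant.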

Fixes #160569


Release justification: low risk fix for panics in mmaprototype

tbg added 2 commits January 7, 2026 12:23
Add a datadriven test that reproduces the panic in sortTargetCandidateSetAndPick
when sls <= loadNoChange but nls > loadNoChange. The scenario involves a node
with two stores where one store (s2) contributes heavily to node CPU overload
while another (s1) does not. When s1 tries to shed based on node-level overload,
the candidate set excludes s2 (due to refusing disposition), making s1's
store-level load look normal, but sortTargetCandidateSetAndPick receives sls as
the loadThreshold which violates its invariant.

Also add panic recovery to TestClusterState so panics are captured as expected
output rather than failing the test, enabling regression tests for panic bugs.
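One way to capture a panic as test output is a deferred `recover` that converts the panic value into a string. The sketch below shows that general Go pattern only; `runCaptured` and the `"panic: "` prefix are hypothetical and are not taken from TestClusterState.

```go
package main

import "fmt"

// runCaptured runs fn and returns its output. If fn panics, the panic value
// is recovered and returned as text, so a data-driven test could compare it
// against expected output instead of crashing.
func runCaptured(fn func() string) (out string) {
	defer func() {
		if r := recover(); r != nil {
			out = fmt.Sprintf("panic: %v", r)
		}
	}()
	return fn()
}

func main() {
	// A command that completes normally keeps its own output.
	fmt.Println(runCaptured(func() string { return "ok" }))

	// A command that panics yields the panic message as its output.
	fmt.Println(runCaptured(func() string {
		panic("loadThreshold must be > loadNoChange")
	}))
}
```

Because the named return value is assigned inside the deferred function, the caller sees the panic message as ordinary output, which is what makes a regression test for a panic bug expressible as expected text.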

Informs cockroachdb#160569
When deciding whether a store should shed load, computeCandidatesForReplicaTransfer
was checking both sls (store-level) and nls (node-level) load summaries. If nls
indicated overload but sls did not, shedding would proceed anyway. This caused a
panic in sortTargetCandidateSetAndPick which requires loadThreshold > loadNoChange.

The fix is to only check sls when deciding if a store should shed. If sls <=
loadNoChange, the store itself isn't overloaded relative to candidates that can
receive the load. High nls with low sls means other stores on the node are
causing node-level overload, so shedding from this store wouldn't help.

Fixes cockroachdb#160569
@tbg tbg requested review from a team as code owners January 7, 2026 16:07
@blathers-crl blathers-crl bot added the blathers-backport and O-robot labels Jan 7, 2026
@blathers-crl blathers-crl bot requested review from sumeerbhola and wenyihu6 January 7, 2026 16:07

blathers-crl bot commented Jan 7, 2026

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

@blathers-crl blathers-crl bot added the backport and T-kv labels Jan 7, 2026
@cockroach-teamcity

This change is Reviewable

@wenyihu6 wenyihu6 merged commit 799b473 into cockroachdb:release-26.1 Jan 8, 2026
17 checks passed

Labels

backport, blathers-backport, O-robot, T-kv, target-release-26.1.0