
kv/kvnemesis: TestKVNemesisMultiNode failed #119681

Closed
cockroach-teamcity opened this issue Feb 27, 2024 · 9 comments · Fixed by #131093
Assignees
Labels
A-testing Testing tools and infrastructure branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). no-test-failure-activity O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team
Milestone

Comments


cockroach-teamcity commented Feb 27, 2024

kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on release-23.1 @ 5d02fdf2851038279f7544e6271a08e1036a2966:

=== RUN   TestKVNemesisMultiNode
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/logTestKVNemesisMultiNode681426315
    test_log_scope.go:79: use -show-logs to present logs inline
    kvnemesis_test.go:180: seed: 8246618713428840787
    kvnemesis_test.go:124: kvnemesis logging to /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis2506896689
    kvnemesis.go:165: error applying x.ScanForUpdate(ctx, tk(7627554184111334918), tk(11297944047384909700), 0) // WriteTooOldError: write for key /Table/100/"83f343408bfd2cf2" at timestamp 1709031554.741791117,0 too old; wrote at 1709031554.784361740,1: {<nil> 0 {0xc002fbb670} <nil> 1709031554.919155299,0}
    kvnemesis.go:185: failures(verbose): /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis2506896689/failures
        repro steps: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis2506896689/repro.go
        rangefeed KVs: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis2506896689/kvs-rangefeed.txt
        scan KVs: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis2506896689/kvs-scan.txt
    kvnemesis_test.go:207: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/pkg/kv/kvnemesis/kvnemesis_test.go:207
        	            				github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/pkg/kv/kvnemesis/kvnemesis_test.go:160
        	Error:      	Should be zero, but was 1
        	Test:       	TestKVNemesisMultiNode
        	Messages:   	kvnemesis detected failures
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/logTestKVNemesisMultiNode681426315
--- FAIL: TestKVNemesisMultiNode (19.24s)

Parameters:

  • TAGS=bazel,gss
Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-36279

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Feb 27, 2024
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 27, 2024
@arulajmani

The failed operation was:

  before: 1709031554.731652557,0
  3 OP  db1.ScanForUpdate(ctx, tk(7627554184111334918), tk(11297944047384909700), 0) // WriteTooOldError: write for key /Table/100/"83f343408bfd2cf2" at timestamp 1709031554.741791117,0 too old; wrote at 1709031554.784361740,1
  after: 1709031554.919480615,0

and the trace is:

w3_step17_trace.txt

@arulajmani arulajmani removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 29, 2024
@arulajmani

It looks like we're getting an unexpected ReadWithinUncertaintyIntervalError for a locking request. I think this is coming from range keys, as the only path where a ReadWithinUncertaintyIntervalError is possible for a locking request is:

// Check if any of the range keys are in the uncertainty interval.
if p.checkUncertainty {
	for _, version := range p.curRangeKeys.Versions {
		if version.Timestamp.LessEq(p.ts) {
			break
		}
		var value MVCCValue
		var simple bool
		value, simple, p.err = tryDecodeSimpleMVCCValue(version.Value)
		if !simple && p.err == nil {
			value, p.err = decodeExtendedMVCCValue(version.Value)
		}
		if p.err != nil {
			return false
		}
		localTS := value.GetLocalTimestamp(version.Timestamp)
		if p.uncertainty.IsUncertain(version.Timestamp, localTS) {
			return p.uncertaintyError(version.Timestamp, localTS)
		}
	}
}
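The predicate that decides uncertainty above (p.uncertainty.IsUncertain) can be illustrated with a simplified model. The sketch below is an assumption-laden approximation: it uses plain int64 timestamps instead of CockroachDB's hlc.Timestamp, and the names uncertainty, isUncertain, readTS, globalLimit, and localLimit are illustrative stand-ins, not the real API:

```go
package main

import "fmt"

// uncertainty is a simplified model of a reader's uncertainty
// interval: values written after readTS but at or below globalLimit
// may causally precede the read and are therefore "uncertain".
type uncertainty struct {
	readTS      int64 // the transaction's read timestamp
	globalLimit int64 // global uncertainty limit (read timestamp + max clock offset)
	localLimit  int64 // observed-timestamp bound for this node; 0 = unset
}

// isUncertain reports whether a value at valueTS (with the given
// local timestamp) falls in the uncertainty interval.
func (u uncertainty) isUncertain(valueTS, localTS int64) bool {
	if valueTS <= u.readTS {
		return false // at or below the read timestamp: simply visible, never uncertain
	}
	if u.localLimit != 0 && localTS > u.localLimit {
		return false // observed timestamp proves the write happened after the read began
	}
	return valueTS <= u.globalLimit
}

func main() {
	u := uncertainty{readTS: 100, globalLimit: 150}
	fmt.Println(u.isUncertain(120, 120)) // true: inside the interval
	fmt.Println(u.isUncertain(200, 200)) // false: beyond the global limit
}
```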

We only attempt a server-side retry while holding latches once. Server-side retries can happen for both ReadWithinUncertaintyIntervalErrors and WriteTooOldErrors. We used up our one and only retry on the (unexpected) ReadWithinUncertaintyIntervalError, so the second error, a WriteTooOldError, was bubbled back up to the client.
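To make the retry budget concrete, here is a minimal, hypothetical sketch of the "one server-side retry while holding latches" policy described above. evaluateWithRetry, errKind, and the error constants are illustrative stand-ins, not CockroachDB code:

```go
package main

import "fmt"

// errKind stands in for the two retryable error categories mentioned
// above, plus success.
type errKind int

const (
	noErr errKind = iota
	writeTooOld
	readWithinUncertainty
)

func (e errKind) String() string {
	switch e {
	case writeTooOld:
		return "WriteTooOldError"
	case readWithinUncertainty:
		return "ReadWithinUncertaintyIntervalError"
	}
	return "ok"
}

// evaluateWithRetry models the policy: evaluation may be retried at
// most once while latches are held; a second retryable error is
// returned to the client instead of being retried again.
func evaluateWithRetry(eval func() errKind) errKind {
	const maxRetries = 1
	for retries := 0; ; retries++ {
		err := eval()
		if err == noErr || retries >= maxRetries {
			return err // success, or retry budget exhausted
		}
		// Retryable error with budget remaining: loop to re-evaluate
		// (in the real system, at a bumped timestamp).
	}
}

func main() {
	// Mirror the failure: the first attempt hits the unexpected
	// uncertainty error, consuming the only retry; the second attempt
	// hits WriteTooOldError, which bubbles up to the client.
	attempts := []errKind{readWithinUncertainty, writeTooOld, noErr}
	i := 0
	res := evaluateWithRetry(func() errKind { e := attempts[i]; i++; return e })
	fmt.Println(res) // prints "WriteTooOldError"
}
```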

We seem to be breaking the invariants described in:

However, for this specific case, we're just missing a potential opportunity to perform a server-side refresh, so I don't think this qualifies as a release blocker. That said, let's discuss this a bit more once you're back, @nvanbenschoten, as you've got the most context here.

@arulajmani

Thanks for helping debug this btw @miraradeva!

@arulajmani arulajmani added the P-3 Issues/test failures with no fix SLA label Mar 1, 2024
@nvanbenschoten nvanbenschoten added A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Mar 11, 2024

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.

@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
@tbg tbg reopened this Sep 20, 2024
@tbg tbg assigned tbg and unassigned miraradeva Sep 20, 2024
tbg added a commit to tbg/cockroach that referenced this issue Sep 20, 2024
@github-actions github-actions bot removed the X-stale label Sep 23, 2024
craig bot pushed a commit that referenced this issue Sep 26, 2024
131093: storage: disable checkUncertainty on failOnMoreRecent in scanner r=tbg a=tbg

It was possible for reads with failOnMoreRecent to hit a
ReadWithinUncertaintyIntervalError instead of the desired
WriteTooOldError. This commit disables uncertainty checks when
failOnMoreRecent is active, as the latter is a stronger check anyway.

Fixes #119681.
Fixes #131005.

Epic: none
Release note: None
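The fix can be summarized as gating the uncertainty check on failOnMoreRecent. A minimal sketch, using hypothetical names modeled on the flags the commit message describes (not the actual pebbleMVCCScanner code):

```go
package main

import "fmt"

// scanOpts models the two scanner flags discussed in the commit
// message: checkUncertainty requests uncertainty-interval checks, and
// failOnMoreRecent is set for locking reads that must observe a
// WriteTooOldError on any more recent write.
type scanOpts struct {
	checkUncertainty bool
	failOnMoreRecent bool
}

// effectiveCheckUncertainty applies the fix: when failOnMoreRecent is
// set, the stronger WriteTooOldError check subsumes the uncertainty
// check, so the uncertainty check is disabled.
func effectiveCheckUncertainty(o scanOpts) bool {
	return o.checkUncertainty && !o.failOnMoreRecent
}

func main() {
	lockingRead := scanOpts{checkUncertainty: true, failOnMoreRecent: true}
	plainRead := scanOpts{checkUncertainty: true, failOnMoreRecent: false}
	fmt.Println(effectiveCheckUncertainty(lockingRead)) // false: locking reads get WriteTooOldError only
	fmt.Println(effectiveCheckUncertainty(plainRead))   // true: normal reads still check uncertainty
}
```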

131384: roachtest: admission-control/disk-bandwidth-limiter test improvements r=sumeerbhola a=aadityasondhi

This patch fixes a few things in this test:
- Runs the first step longer to have a fuller LSM to induce block and page cache misses to have some disk reads.
- Reduces the throughput of the foreground workload since it was causing saturation on its own.
- Assert on total bandwidth since the disk bandwidth limiter should be accounting for reads when determining tokens.

Fixes #129534.

Release note: None

131395: crosscluster/producer: modify lastEmitWait and lastProduceWait computation r=dt a=msbutler

This patch modifies the lastEmitWait and lastProduceWait in the crdb_internal.cluster_replication_node streams vtable to be either the current wait or previous wait, if the event stream is currently waiting on that given state.

Epic: none

Release note: none

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
@craig craig bot closed this as completed in 10664ba Sep 26, 2024
cthumuluru-crdb pushed a commit to cthumuluru-crdb/cockroach that referenced this issue Oct 1, 2024

blathers-crl bot commented Nov 11, 2024

Based on the specified backports for linked PR #131093, I applied the following new label(s) to this issue: branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 label Nov 11, 2024
Copy link

blathers-crl bot commented Nov 11, 2024

Based on the specified backports for linked PR #131093, I applied the following new label(s) to this issue: branch-release-24.1. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.


@blathers-crl blathers-crl bot added the branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 label Nov 11, 2024
blathers-crl bot pushed a commit that referenced this issue Nov 11, 2024
blathers-crl bot pushed a commit that referenced this issue Nov 11, 2024

blathers-crl bot commented Dec 4, 2024

Based on the specified backports for linked PR #131093, I applied the following new label(s) to this issue: branch-release-23.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.


@blathers-crl blathers-crl bot added the branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 label Dec 4, 2024