tidy up some correctness issues reported by go vet #1
Conversation
Wow, that's pretty smart of it.
How do I set up a git presubmit hook?
Take a look at https://github.com/andybons/hipchat, specifically https://github.com/andybons/hipchat/blob/master/githooks/pre-commit and https://github.com/andybons/hipchat/blob/master/initrepo.sh. I’ll include something like this when I send my next pull request fixing the lint issues.
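For anyone who lands here with the same question, here is a minimal sketch of what such a hook could look like; the `.git/hooks/pre-commit` path and the `./...` package pattern are generic choices, not taken from the linked repo:

```sh
#!/bin/sh
# Minimal pre-commit hook sketch: run go vet over the repository and
# refuse the commit if it reports any issues.
set -e

if ! go vet ./...; then
  echo "pre-commit: go vet reported issues; commit aborted" >&2
  exit 1
fi
```

Making the script executable and dropping it into `.git/hooks/pre-commit` is the only other step; the linked `initrepo.sh` presumably handles installing the hooks in a fresh clone.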
```
go test -benchtime 1s -run - -bench TrackChoices1280_Cockroach -timeout 5m ./sql -benchmem
10000     127583 ns/op    27432 B/op    145 allocs/op

go test -benchtime 10s -run - -bench TrackChoices1280_Cockroach -timeout 5m ./sql -benchmem
100000    181085 ns/op    27421 B/op    144 allocs/op

go test -benchtime 20s -run - -bench TrackChoices1280_Cockroach -timeout 5m ./sql -benchmem
300000    166564 ns/op    27673 B/op    142 allocs/op
```

out.base:

```
2016/05/10 15:31:53.900328  0.025948  node 1
15:31:53.900331  .      3  ... node 1
15:31:53.900338  .      7  ... read has no clock uncertainty
15:31:53.900426  .     87  ... executing 1282 requests
15:31:53.900706  .    280  ... read-write path
15:31:53.900707  .      1  ... command queue
15:31:53.901148  .    441  ... left command queue
15:31:53.901151  .      3  ... request leader lease (attempt #1)
15:31:53.901392  .    240  ... prep for ts cache
15:31:53.904028  .   2637  ... applied ts cache
15:31:53.904613  .    584  ... proposed to Raft
15:31:53.905698  .   1085  ... applying batch
15:31:53.905769  .     72  ... checked aborted txn
15:31:53.905959  .    189  ... checked leadership
15:31:53.906425  .    466  ... 1280 blind puts
15:31:53.914285  .   7860  ... executed batch
15:31:53.914340  .     55  ... prep for commit
15:31:53.915122  .    782  ... committed
15:31:53.915177  .     55  ... processed async intents
15:31:53.915178  .      1  ... applied batch
15:31:53.915186  .      9  ... response obtained
```
The insert workload is randomized, so the blind puts were computed but nothing was actually written blindly.

| benchmark | old ns/op | new ns/op | delta |
|-----------|-----------|-----------|-------|
| BenchmarkTrackChoices1280_Cockroach-8 | 171532 | 65505 | -61.81% |

| benchmark | old allocs | new allocs | delta |
|-----------|------------|------------|-------|
| BenchmarkTrackChoices1280_Cockroach-8 | 142 | 130 | -8.45% |

| benchmark | old bytes | new bytes | delta |
|-----------|-----------|-----------|-------|
| BenchmarkTrackChoices1280_Cockroach-8 | 27241 | 26018 | -4.49% |

```
2016/05/10 20:15:44.886244  0.006944  node 1
20:15:44.886246  .      3  ... node 1
20:15:44.886253  .      6  ... read has no clock uncertainty
20:15:44.886334  .     81  ... executing 1282 requests
20:15:44.886543  .    209  ... read-write path
20:15:44.886544  .      1  ... command queue
20:15:44.886963  .    420  ... left command queue
20:15:44.886966  .      3  ... request leader lease (attempt #1)
20:15:44.887265  .    298  ... prep for ts cache
20:15:44.887266  .      1  ... applied ts cache
20:15:44.887268  .      2  ... start marshal
20:15:44.887772  .    504  ... end marshal
20:15:44.887835  .     63  ... begin prop
20:15:44.887852  .     17  ... done prop
20:15:44.887855  .      3  ... proposed to Raft
20:15:44.888949  .   1094  ... applying batch
20:15:44.889015  .     67  ... checked aborted txn
20:15:44.889197  .    181  ... checked leadership
20:15:44.889199  .      2  ... executing as 1PC txn
20:15:44.889644  .    445  ... 1280 blind puts
20:15:44.892245  .   2601  ... executeBatch returns
20:15:44.892308  .     63  ... executed batch
20:15:44.892360  .     52  ... prep for commit
20:15:44.893101  .    741  ... committed
20:15:44.893154  .     53  ... processed async intents
20:15:44.893155  .      1  ... applied batch
20:15:44.893173  .     18  ... response obtained
20:15:44.893174  .      1  ... endCmds begins
20:15:44.893175  .      1  ... begin update tsCache
20:15:44.893176  .      1  ... end update tsCache
20:15:44.893177  .      2  ... removed from cmdQ
20:15:44.893178  .      1  ... endCmds ends
20:15:44.893180  .      1  ... Send returns
```
…perator bulkmerge: create bulkmerge and merge processor scaffolding
This commit fixes two race conditions in the index split operation:

1. After setting the state of the left sub-partition to Ready, say that the split unexpectedly fails. Now say that the left sub-partition itself splits and is deleted. When the original split resumes, it will not be able to get the centroid for the left sub-partition, which is needed to run the K-means clustering algorithm.
2. As described by #1, it's possible that a splitting partition references target sub-partitions that are now missing from the index. This will trigger PartitionNotFound errors in insert code paths.

The fixes are:

1. Update the logic so that vectors are first copied to the left and right sub-partitions before either sub-partition's state is updated from Updating to Ready. Only Ready sub-partitions can be split, so this should prevent race condition #1.
2. Update the insert logic so that searches of non-root partitions return multiple results, to make it extremely likely that a suitable insert partition will be found. For root partitions, check split target sub-partitions instead, since the splitting partition does not share a parent with its sub-partitions.

Epic: CRDB-42943

Release note: None
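To make the ordering in fix (1) concrete, here is a small, self-contained Go sketch; the `partition` type and the `finishSplit`/`copyVectors` helpers are illustrative stand-ins, not the actual vector-index code:

```go
package main

import "fmt"

type state int

const (
	stateUpdating state = iota
	stateReady
)

type partition struct {
	name    string
	state   state
	vectors []float64
}

// copyVectors moves the source partition's vectors into the two new
// sub-partitions (here, trivially split in half).
func copyVectors(source, left, right *partition) {
	mid := len(source.vectors) / 2
	left.vectors = append(left.vectors, source.vectors[:mid]...)
	right.vectors = append(right.vectors, source.vectors[mid:]...)
}

// finishSplit illustrates the ordering in fix (1): copy vectors first, and
// only then flip both sub-partitions from Updating to Ready. Since only
// Ready partitions may split again, neither sub-partition can be split and
// deleted before the original split has copied what it needs.
func finishSplit(source, left, right *partition) {
	copyVectors(source, left, right)
	left.state = stateReady
	right.state = stateReady
}

func main() {
	src := &partition{name: "p0", state: stateReady, vectors: []float64{1, 2, 3, 4}}
	l := &partition{name: "p1", state: stateUpdating}
	r := &partition{name: "p2", state: stateUpdating}
	finishSplit(src, l, r)
	fmt.Println(l.vectors, r.vectors)
}
```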
144781: roachtest: add operation to probe ranges r=noahstho a=noahstho

Since SRE uses crdb_internal.probe_ranges to test for prod cluster health, we would like to add this as a roach operation to make the DRT cluster as realistic as possible, and to test for potential issues with crdb_internal.probe_ranges so that we know ASAP if our alerting coverage drops.

**Background**

crdb_internal.probe_ranges is a virtual table that quickly probes the entire keyspace of the KV layer and returns a table with the schema (range_id | error | end_to_end_latency_ms). It has minimal dependencies, so it functions even when a cluster is quite broken. And since it probes the entire keyspace, it is useful for narrowing down an issue to specific ranges once something has already gone wrong (a schematic client query is sketched below, after these PR descriptions).

**What can this roach operation catch?**

If this roach operation fails, either there is a bug in `crdb_internal.probe_ranges`, leaving SRE short a critical tool, or there is a serious bug in the KV layer that the KV team will need to know about ASAP. Ideally SRE would be the first to know if there is an issue, and can hand off to KV if necessary.

**Testing PR**

Tested that it works on a roachtest cluster with `roachtest run-operation noahthompsoncockroachlabscom-test probe-ranges`, and also verified that it fails as expected by forcing a range error in the DB:

```
./bin/roachtest run-operation noahthompsoncockroachlabscom-test2 probe-ranges
Running operation probe-ranges on noahthompsoncockroachlabscom-test2.
2025/04/29 20:00:46 run_operation.go:145: [1] operation status: checking if operation probe-ranges/read dependencies are met
2025/04/29 20:00:47 run_operation.go:145: [1] operation status: running operation probe-ranges/read with run id 12821170976295052991
2025/04/29 20:00:47 probe_ranges.go:92: [1] operation status: executing crdb_internal.probe-ranges read statement against node 3
2025/04/29 20:00:47 probe_ranges.go:92: [1] operation status: found 1 errors while executing crdb_internal.probe-ranges read statement against node 3
2025/04/29 20:00:47 probe_ranges.go:92: [1] operation status: error on node 3 on range 4: test range error
2025/04/29 20:00:47 operation_impl.go:138: [1] operation failure #1: Found range errors when probing via crdb_internal.probe-ranges read statement against node 3
2025/04/29 20:00:47 run_operation.go:229: recovered from panic: o.Fatal() was called
```

**Future Work**

We would like to also enable the KVProber cluster setting to test this from a different angle; this should be a very easy change.

Fixes: #102034

Release note: None

Epic: None

145578: ttljob: add cluster setting to control concurrency r=rafiss a=rafiss

Each processor of the TTL job creates a number of goroutines that operate concurrently to scan for expired rows and delete them. Previously, the concurrency was always equal to GOMAXPROCS. This new setting allows it to be overridden. Once this is merged, we should update support runbooks to discuss this setting.

Informs: https://github.com/cockroachlabs/support/issues/3284

Epic: None

Release note: None

Co-authored-by: Noah Thompson <noah.thompson@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
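Returning to 144781: the check the operation performs boils down to querying the probe table and flagging any row with a non-empty error. Below is an illustrative client-side sketch, not the roachtest operation's actual code; the connection string is a placeholder, and the exact `crdb_internal.probe_ranges` invocation (timeout and probe-type arguments) should be verified against the CockroachDB version in use:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

func main() {
	// Placeholder connection string; point it at any node of the cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Probe the keyspace and report any range that returned an error.
	rows, err := db.Query(
		`SELECT range_id, error FROM crdb_internal.probe_ranges(INTERVAL '1s', 'read') WHERE error != ''`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	errCount := 0
	for rows.Next() {
		var rangeID int64
		var rangeErr string
		if err := rows.Scan(&rangeID, &rangeErr); err != nil {
			log.Fatal(err)
		}
		errCount++
		fmt.Printf("range %d: %s\n", rangeID, rangeErr)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("found %d range errors\n", errCount)
}
```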
149479: roachtest: exit with failure on github post errors r=herkolategan,DarrylWong a=williamchoe3

Fixes #147116

### Changes

#### High-level Changes

Added a new failure path by:

* adding a new counter to the `testRunner` struct, which gets incremented when `github.MaybePost()` (called in `testRunner.runWorkers()` and `testRunner.runTests()`) returns an error. When this count > 0, `testRunner.Run()` will return a new error `errGithubPostFailed`, and when `main()` sees that error, it will return a new exit code `12`, which will fail the pipeline (unlike exit codes 10 and 11)
* ^ very similar to how provisioning errors are tracked and returned to `main()`
* does not trigger the test short-circuiting mechanism, because `testRunner.runWorkers()` doesn't return an error

```
type testRunner struct {
	...
	// numGithubPostErrs counts GitHub post errors across all workers.
	numGithubPostErrs int32
	...
}

...

issue, err := github.MaybePost(t, issueInfo, l, output, params) // TODO add cluster specific args here
if err != nil {
	shout(ctx, l, stdout, "failed to post issue: %s", err)
	atomic.AddInt32(&r.numGithubPostErrs, 1)
}
```

#### Design

In order to do verification via unit tests, I'm used to using something like Python's MagicMock, but that's not available in Go, so I opted for a dependency injection approach. (This was the best I could come up with; I wanted to avoid "if unit test, do this" logic. If anyone has other approaches / suggestions, let me know!)

I made a new interface `GithubPoster` such that the original `githubIssues` implements the new interface. I then pass this interface in function signatures all the way from `Run()` to `runTests()`. In the unit tests, I can then pass a different implementation of `GithubPoster` whose `MaybePost()` always fails.

`github.go`

```
type GithubPoster interface {
	MaybePost(
		t *testImpl, issueInfo *githubIssueInfo, l *logger.Logger, message string,
		params map[string]string,
	) (*issues.TestFailureIssue, error)
}
```

Another issue with this approach is that the original `githubIssues` holds information that is cluster specific, but because of dependency injection it is now a shared struct among all the workers, so it doesn't make sense to store certain fields that are worker dependent. For the fields that are worker specific, I created a new struct `githubIssueInfo` that is created in `runWorkers()`, similar to how `githubIssues` used to be created there.

Note: I don't love the name `githubIssueInfo`, but I wanted to stick with a naming convention similar to `githubIssues`; open to name suggestions.

```
// Original githubIssues
type githubIssues struct {
	disable      bool
	cluster      *clusterImpl
	vmCreateOpts *vm.CreateOpts
	issuePoster  func(context.Context, issues.Logger, issues.IssueFormatter,
		issues.PostRequest, *issues.Options) (*issues.TestFailureIssue, error)
	teamLoader   func() (team.Map, error)
}

// New githubIssues
type githubIssues struct {
	disable     bool
	issuePoster func(context.Context, issues.Logger, issues.IssueFormatter,
		issues.PostRequest, *issues.Options) (*issues.TestFailureIssue, error)
	teamLoader  func() (team.Map, error)
}
```

All this was very verbose, and I didn't love that I had to change all the function signatures to do this; open to other ways to do verification.
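As a standalone illustration of the dependency-injection idea above, here is a minimal, self-contained Go sketch; `poster`, `realPoster`, `failingPoster`, and `runner` are simplified stand-ins for `GithubPoster`, `githubIssues`, the test double, and `testRunner`, not the actual roachtest types:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// poster is a simplified stand-in for the GithubPoster interface: anything
// that can attempt to post an issue and report an error.
type poster interface {
	MaybePost(message string) error
}

// realPoster stands in for the production implementation.
type realPoster struct{}

func (realPoster) MaybePost(message string) error {
	fmt.Println("posting issue:", message)
	return nil
}

// failingPoster is the test double: MaybePost always fails, which lets a
// unit test drive the new "GitHub post error" failure path.
type failingPoster struct{}

func (failingPoster) MaybePost(string) error { return errors.New("mocking") }

// runner mirrors the idea of testRunner: it counts post errors and turns a
// nonzero count into a distinct failure at the end of the run.
type runner struct {
	numGithubPostErrs int32
}

func (r *runner) runOne(p poster, msg string) {
	if err := p.MaybePost(msg); err != nil {
		atomic.AddInt32(&r.numGithubPostErrs, 1)
	}
}

func (r *runner) run(p poster) error {
	r.runOne(p, "test failure #1")
	if atomic.LoadInt32(&r.numGithubPostErrs) > 0 {
		return errors.New("failed to POST to GitHub")
	}
	return nil
}

func main() {
	fmt.Println("real poster:", (&runner{}).run(realPoster{}))
	fmt.Println("failing poster:", (&runner{}).run(failingPoster{}))
}
```

The point of the pattern is that production code and tests share only the small interface, so a test can force the error path without any "if unit test, do this" branching.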
### Misc

Also, this is my first time writing Go in about three years, so I'm very open to general Go feedback on semantics, best practices, and design patterns.

### Verification

Diff of the binary I used to manually confirm, in case you want to see where I hardcoded it to return errors: 611adcc

#### Manual Test Logs

> ➜ cockroach git:(wchoe/147116-github-err-will-fail-pipeline) ✗ tmp/roachtest run acceptance/build-info --cockroach /Users/wchoe/work/cockroachdb/cockroach/bin_linux/cockroach
> ...
> Running tests which match regex "acceptance/build-info" and are compatible with cloud "gce".
>
> fallback runner logs in: artifacts/roachtest.crdb.log
> 2025/07/09 00:51:48 run.go:386: test runner logs in: artifacts/_runner-logs/test_runner-1752022308.log
> test runner logs in: artifacts/_runner-logs/test_runner-1752022308.log
> HTTP server listening on port 56238 on localhost: http://localhost:56238/
> 2025/07/09 00:51:48 run.go:148: global random seed: 1949199437086051249
> 2025/07/09 00:51:48 test_runner.go:398: test_run_id: will.choe-1752022308
> test_run_id: will.choe-1752022308
> [w0] 2025/07/09 00:51:48 work_pool.go:198: Acquired quota for 16 CPUs
> [w0] 2025/07/09 00:51:48 cluster.go:3204: Using randomly chosen arch="amd64", acceptance/build-info
> [w0] 2025/07/09 00:51:48 test_runner.go:798: Unable to create (or reuse) cluster for test acceptance/build-info due to: mocking.
> Unable to create (or reuse) cluster for test acceptance/build-info due to: mocking.
> 2025/07/09 00:51:48 test_impl.go:478: test failure #1: full stack retained in failure_1.log: (test_runner.go:873).func4: mocking [owner=test-eng]
> 2025/07/09 00:51:48 test_impl.go:200: Runtime assertions disabled
> [w0] 2025/07/09 00:51:48 test_runner.go:883: failed to post issue: mocking
> failed to post issue: mocking
> [w0] 2025/07/09 00:51:48 test_runner.go:1019: test failed: acceptance/build-info (run 1)
> [w0] 2025/07/09 00:51:48 test_runner.go:732: Releasing quota for 16 CPUs
> [w0] 2025/07/09 00:51:48 test_runner.go:744: No work remaining; runWorker is bailing out...
> No work remaining; runWorker is bailing out...
> [w0] 2025/07/09 00:51:48 test_runner.go:643: Worker exiting; no cluster to destroy.
> 2025/07/09 00:51:48 test_runner.go:460: PASS
> PASS
> 2025/07/09 00:51:48 test_runner.go:465: 1 clusters could not be created and 1 errors occurred while posting to github
> 1 clusters could not be created and 1 errors occurred while posting to github
> 2025/07/09 00:51:48 run.go:200: runTests destroying all clusters
> Error: some clusters could not be created
> failed to POST to GitHub
> ➜ cockroach git:(wchoe/147116-github-err-will-fail-pipeline) ✗ echo $?
> 12

149913: crosscluster/physical: persist standby poller progress r=dt a=msbutler

This patch sets the standby poller job's resolved time to the system time that standby descriptors have been updated to. This allows a reader tenant user to easily check that the poller job is running smoothly via SHOW JOB.

Epic: none

Release note: none

Co-authored-by: William Choe <williamchoe3@gmail.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
156830: storeliveness: smear storeliveness heartbeat messages to reduce goroutine spikes at heartbeat interval tick r=miraradeva,iskettaneh a=dodeca12

This PR introduces heartbeat smearing logic that batches and smears Store Liveness heartbeat sends across destination nodes to prevent a thundering herd of goroutine spikes.

### Changes

Core changes are within these files:

```sh
pkg/kv/kvserver/storeliveness
├── support_manager.go # Rename SendAsync→EnqueueMessage, add smearing settings
└── transport.go       # Add smearing sender goroutine to transport.go which takes care of smearing when enabled
```

### Background

Previously, all stores in a cluster sent heartbeats immediately at each heartbeat interval tick. In large clusters with multi-store nodes, this created synchronized bursts of goroutine spikes that caused issues in other parts of the running CRDB process.

### Commits

**Commit: Introduce heartbeat smearing**

- Adds a smearing sender goroutine to `transport.go` that batches enqueued messages
- Smears send signals across queues using `taskpacer` to spread traffic over time
- Configurable via cluster settings (default: enabled)

**How it works:**

1. Messages are enqueued via `EnqueueMessage()` into per-node queues
2. When `SendAllEnqueuedMessages()` is called, the transport's smearing sender goroutine waits briefly to batch messages
3. The transport's smearing sender goroutine uses `taskpacer` to pace signaling to each queue over a smear duration
4. Each `processQueue` goroutine drains its queue and sends when signalled

(A schematic sketch of this flow follows the PR description below.)

### New Cluster Settings

- `kv.store_liveness.heartbeat_smearing.enabled` (default: true) - Enable/disable smearing
- `kv.store_liveness.heartbeat_smearing.refresh` (default: 10ms) - Batching window duration
- `kv.store_liveness.heartbeat_smearing.smear` (default: 1ms) - Time to spread sends across queues

### Backward Compatibility

- The feature is disabled by setting `kv.store_liveness.heartbeat_smearing.enabled=false`
- When disabled, messages are sent immediately via the existing path (non-smearing mode)

### Testing

- Added comprehensive unit tests verifying:
  - Messages batch correctly across multiple destinations
  - Smearing spreads signals over the configured time window
  - Smearing mode vs. immediate mode behaviour
  - Message ordering and reliability

All existing tests were updated to call `SendAllEnqueuedMessages()` after enqueuing when smearing is enabled.

#### Roachprod testing

##### Prototype #1

- Ran a prototype with a [similar design](#154942) (TL;DR of the prototype: heartbeats were smeared by putting `SupportManager` goroutines to sleep; the current design ensures that `SupportManager` goroutines do not get blocked) on a 150-node roachprod cluster to verify that smearing works.

Before changes (current behaviour on master) vs. after changes (prototype): [goroutine-count graphs omitted]

##### Current changes

- Ran a roachprod test with the current changes but without the check for empty queues (more info - https://reviewable.io/reviews/cockroachdb/cockroach/156378#-). This check did end up proving vital, as the test results didn't show the expected smearing behaviour.
- Ran a mini-roachprod test on [this prototype commit](https://github.com/cockroachdb/cockroach/pull/155317/files#diff-9282b4b1d9a2fe32fae81e5776eb081e58069b4bc7db76718873b75f026e16c1) (where the only real difference from my changes is the inclusion of the length check on the queues that have messages on that commit), which showed the expected smearing behaviour. [goroutine-count graph omitted]

Fixes: #148210

Release note: None

Co-authored-by: Swapneeth Gorantla <swapneeth.gorantla@cockroachlabs.com>
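To make the batching-and-smearing flow in the "How it works" list concrete, here is a self-contained Go sketch; the `smearingSender` type and its `enqueue`/`sendAll` methods are illustrative stand-ins (the real implementation lives in the storeliveness `Transport`, uses `taskpacer`, and is driven by the cluster settings listed above):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// smearingSender batches messages into per-node queues and, when flushed,
// spreads the per-queue sends over a smear window instead of releasing
// every queue at once.
type smearingSender struct {
	mu     sync.Mutex
	queues map[int][]string // per-destination-node message queues
	smear  time.Duration    // total window over which sends are spread
}

func newSmearingSender(smear time.Duration) *smearingSender {
	return &smearingSender{queues: make(map[int][]string), smear: smear}
}

// enqueue mirrors EnqueueMessage: it only appends to the destination queue.
func (s *smearingSender) enqueue(node int, msg string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.queues[node] = append(s.queues[node], msg)
}

// sendAll mirrors SendAllEnqueuedMessages: rather than draining every queue
// immediately, it paces the drains evenly across the smear window.
func (s *smearingSender) sendAll() {
	s.mu.Lock()
	nodes := make([]int, 0, len(s.queues))
	for n := range s.queues {
		if len(s.queues[n]) > 0 { // skip empty queues (the "length check")
			nodes = append(nodes, n)
		}
	}
	s.mu.Unlock()
	if len(nodes) == 0 {
		return
	}
	interval := s.smear / time.Duration(len(nodes))
	for _, n := range nodes {
		s.mu.Lock()
		msgs := s.queues[n]
		s.queues[n] = nil
		s.mu.Unlock()
		fmt.Printf("%s: sending %d heartbeats to node %d\n",
			time.Now().Format("15:04:05.000"), len(msgs), n)
		time.Sleep(interval) // pacing stand-in for taskpacer
	}
}

func main() {
	s := newSmearingSender(30 * time.Millisecond)
	for node := 1; node <= 3; node++ {
		s.enqueue(node, "heartbeat")
	}
	s.sendAll()
}
```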
159877: kvserver: deflake TestReadLoadMetricAccounting r=tbg a=tbg

`TestReadLoadMetricAccounting` has a history of flaking due to lease-related writes interfering with load metric measurements. Issue #141716 (and #141586) identified the same failure signature:

```
Error: Max difference between 0 and 85 allowed is 4, but difference was -85
```

The root cause was identified by `@pav-kv`: an "unexpected" leader lease upgrade write was interfering with the test's write bytes measurements. PR #141843 added `tc.MaybeWaitForLeaseUpgrade()` to wait for lease upgrades before starting measurements.

**The fix from #141843 IS present** in the failing SHA. However, the test still flaked with the same error signature (85 write bytes when expecting 0). The logs show:

1. AddSSTableRequest evaluated (test setup)
2. Many LeaseInfoRequest polls (from MaybeWaitForLeaseUpgrade)
3. RequestLeaseRequest (the lease upgrade write)
4. More LeaseInfoRequest polls
5. "lease is now of type: LeaseLeader" - **upgrade complete**
6. "test #1" - test loop begins
7. GetRequest evaluated (the actual test request)
8. **Assertion fails** - 85 write bytes observed

The race condition is subtle: `MaybeWaitForLeaseUpgrade()` waits until `FindRangeLeaseEx()` reports the lease is upgraded, but it does **not** guarantee that the write bytes have been recorded to load stats. This is because stats are recorded "awkwardly late" on the client goroutine (`SendWithWriteBytes`).

The fix (see the sketch after this description):

1. Wraps each test case iteration in `SucceedsSoon`
2. Resets load stats, sends the request, checks results
3. If any stat doesn't match (due to background activity like lease upgrades), returns an error to trigger retry
4. Adds a comment noting that test cases must be idempotent (they are—all reads)

## Related Issues/PRs

| Issue/PR | Status | Relevance |
|----------|--------|-----------|
| #159719 | OPEN | Current failure |
| #141716 | CLOSED | Duplicate, Feb 2025 |
| #141586 | CLOSED | Original issue, Feb 2025 |
| #141843 | MERGED | Deflake attempt (wait for lease upgrade) |
| #141599 | MERGED | Added logging to help debug |
| #141905 | CLOSED | Duplicate |
| #134799 | CLOSED | Older occurrence |

This is more robust than trying to synchronize with specific background operations because it handles **any** source of interference, not just lease upgrades.

Epic: none

Closes #159719.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
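A condensed Go sketch of the retry structure described above; `succeedsSoon` is a local stand-in for CockroachDB's `testutils.SucceedsSoon`, and `resetLoadStats`/`runTestCase` are placeholders for the test's actual setup and assertion, not the real test code:

```go
// Sketch of wrapping an idempotent test case in a SucceedsSoon-style retry
// loop so that interference from background writes (e.g. lease upgrades)
// triggers a retry instead of a hard failure. Helper names are placeholders.
package loadstats

import (
	"fmt"
	"testing"
	"time"
)

// succeedsSoon is a stand-in for testutils.SucceedsSoon: it retries fn until
// it returns nil or a deadline expires.
func succeedsSoon(t *testing.T, fn func() error) {
	t.Helper()
	deadline := time.Now().Add(45 * time.Second)
	for {
		err := fn()
		if err == nil {
			return
		}
		if time.Now().After(deadline) {
			t.Fatal(err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func TestReadLoadMetricAccountingSketch(t *testing.T) {
	for i, tc := range []struct{ expectedWriteBytes int64 }{{0}} {
		// Each test case must be idempotent (all reads), since it may run
		// several times before background activity quiesces.
		succeedsSoon(t, func() error {
			resetLoadStats()          // placeholder: zero out the replica's load stats
			observed := runTestCase() // placeholder: send the read and collect stats
			if observed != tc.expectedWriteBytes {
				return fmt.Errorf("case %d: expected %d write bytes, got %d",
					i, tc.expectedWriteBytes, observed)
			}
			return nil
		})
	}
}

// Placeholder helpers so the sketch compiles on its own.
func resetLoadStats()    {}
func runTestCase() int64 { return 0 }
```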
I use go vet (http://golang.org/cmd/go/#hdr-Run_go_tool_vet_on_packages) as a presubmit hook on almost all of my go code. It definitely helps with finding correctness problems.