stress: nightly CI job sometimes exits 2 #107779

Closed
tbg opened this issue Jul 28, 2023 · 2 comments · Fixed by #107802
Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), T-dev-inf

Comments


tbg commented Jul 28, 2023

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-dev-inf labels Jul 28, 2023

tbg commented Jul 28, 2023

Also, e.g. https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_StressBazel/11098339?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildProblemsSection=true

failed parent test "TestCancelQuery" (no subtests)
  consolidating failed subtest "reject_cancel_from_wrong_client_IP" into parent test "TestCancelQuery"
  panic: assignment to entry in nil map
  goroutine 1 [running]:
  github.com/cockroachdb/cockroach/pkg/cmd/bazci/githubpost.getIssueFilerForFormatter.func1({0xdd5f10, 0xc00003a038}, {{0xc0003366f0, 0x27}, {0xc00012a0c0, 0x34}, {0xc00012a240, 0xf}, {0xc0002f6380, 0x34b}})
    github.com/cockroachdb/cockroach/pkg/cmd/bazci/githubpost/githubpost.go:104 +0x16a
  github.com/cockroachdb/cockroach/pkg/cmd/bazci/githubpost.processFailures({0xdd5f10, 0xc00003a038}, 0xc000280080, 0xb549a0?)
    github.com/cockroachdb/cockroach/pkg/cmd/bazci/githubpost/githubpost.go:558 +0x98c
  github.com/cockroachdb/cockroach/pkg/cmd/bazci/githubpost.listFailuresFromTestXML({0xdd5f10, 0xc00003a038}, {0xdcdcc0?, 0xc000014048?}, 0x0?)
    github.com/cockroachdb/cockroach/pkg/cmd/bazci/githubpost/githubpost.go:521 +0x1c7
  github.com/cockroachdb/cockroach/pkg/cmd/bazci/githubpost.PostFromTestXML({0xca27a5?, 0xbe?}, {0xdcdcc0, 0xc000014048})
    github.com/cockroachdb/cockroach/pkg/cmd/bazci/githubpost/githubpost.go:129 +0x52
  main.processTestXmls({0xc0002245c0, 0x1, 0xcb6ae3?})
    main/pkg/cmd/bazci/bazci.go:486 +0x2ad
  main.bazciImpl(0x12a5500?, {0xc0001b7c20?, 0x11, 0x12})
    main/pkg/cmd/bazci/bazci.go:384 +0x1290
  github.com/spf13/cobra.(*Command).execute(0x12a5500, {0xc000002290, 0x12, 0x13})
    github.com/spf13/cobra/external/com_github_spf13_cobra/command.go:856 +0x67c
  github.com/spf13/cobra.(*Command).ExecuteC(0x12a5500)
    github.com/spf13/cobra/external/com_github_spf13_cobra/command.go:974 +0x3bd
  github.com/spf13/cobra.(*Command).Execute(...)
    github.com/spf13/cobra/external/com_github_spf13_cobra/command.go:902
  main.main()
    main/pkg/cmd/bazci/main.go:42 +0x7b
  Process exited with code 2
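
For readers unfamiliar with this panic: Go allows reads from a nil map but panics on writes. The sketch below reproduces the failure and shows the usual guard. The `issueRequest` type, its `ExtraParams` field, and both helper functions are hypothetical stand-ins; the actual code at `githubpost.go:104` is not shown in this thread.

```go
package main

import "fmt"

// issueRequest is a hypothetical stand-in for a struct with a map field.
// It is NOT the real githubpost type.
type issueRequest struct {
	ExtraParams map[string]string
}

// setParam writes to the map, allocating it first if it is nil.
// Without the nil check, the assignment panics with
// "assignment to entry in nil map".
func setParam(req *issueRequest, key, value string) {
	if req.ExtraParams == nil {
		req.ExtraParams = make(map[string]string)
	}
	req.ExtraParams[key] = value
}

// unsafeSetPanics performs the unguarded assignment and reports
// whether it panicked.
func unsafeSetPanics(req *issueRequest, key, value string) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	req.ExtraParams[key] = value
	return false
}

func main() {
	var req issueRequest // ExtraParams is nil: reads work, writes panic
	fmt.Println("unguarded write panicked:", unsafeSetPanics(&req, "k", "v"))
	setParam(&req, "k", "v")
	fmt.Println("guarded write stored:", req.ExtraParams["k"])
}
```

The guard is cheap and keeps the zero value of the struct usable, which matters when the struct is built by a caller that never initializes the map.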

@tbg tbg mentioned this issue Jul 28, 2023
@rickystewart (Collaborator) commented

> Also, e.g.

This one seems unrelated. Might open another issue for this.

@rickystewart rickystewart self-assigned this Jul 28, 2023
rickystewart added a commit to rickystewart/cockroach that referenced this issue Jul 28, 2023
This makes no sense, so skip these cases.

Closes: cockroachdb#107779
Closes: cockroachdb#107781

Epic: none
Release note: None
rickystewart added a commit to rickystewart/cockroach that referenced this issue Jul 28, 2023
Go is a really good language.

Informs: cockroachdb#107779

Epic: none
Release note: None
craig bot pushed a commit that referenced this issue Jul 28, 2023
…107752 #107802 #107803

106508: util/must: add runtime assertion API r=erikgrinaker a=erikgrinaker

For details and usage examples, see the [package documentation](https://github.com/erikgrinaker/cockroach/blob/must/pkg/util/must/must.go).

---

This patch adds a convenient and canonical API for runtime assertions, inspired by the Testify package used for Go test assertions. It is intended to encourage liberal use of runtime assertions throughout the code base, by making it as easy as possible to write assertions that follow best practices. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure.

Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate via `log.Fatalf`, but this should be the exception rather than the norm.

It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds.

This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings:

* `log.Fatalf`: kills the node even in release builds, which can cause severe disruption over often minor issues.

* `errors.AssertionFailedf`: only suitable when we have an error return path, does not fatal in non-release builds, and does not always notify us in release builds.

* `logcrash.ReportOrPanic`: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available.

* `buildutil.CrdbTestBuild`: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters.

* `util.RaceEnabled`: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector.

For more details and examples, see the `must` package documentation.

Resolves #94986.
Epic: none
Release note: None
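
To illustrate the split behavior described for #106508 (fail loudly in non-release builds, return an error in release builds), here is a minimal sketch. It is not the `must` package API: `releaseBuild`, `errAssertion`, `assertTrue`, and `applyDelta` are hypothetical names, and a `panic` stands in for the fatal log so the sketch stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// releaseBuild is a hypothetical flag standing in for a build-time setting.
var releaseBuild = true

// errAssertion marks errors produced by failed assertions.
var errAssertion = errors.New("assertion failed")

// assertTrue is a hypothetical helper, NOT the must package API.
// Non-release builds: panic (stand-in for a fatal log) so the failure is noticed.
// Release builds: return an error for the caller to propagate.
func assertTrue(cond bool, format string, args ...interface{}) error {
	if cond {
		return nil
	}
	err := fmt.Errorf("%w: %s", errAssertion, fmt.Sprintf(format, args...))
	if !releaseBuild {
		panic(err)
	}
	// A real implementation would also log and report the failure here.
	return err
}

// applyDelta shows the calling convention: check, then propagate.
func applyDelta(balance, delta int) (int, error) {
	next := balance + delta
	if err := assertTrue(next >= 0, "negative balance %d", next); err != nil {
		return balance, err
	}
	return next, nil
}

func main() {
	_, err := applyDelta(5, -10)
	fmt.Println(err)
}
```

The point of the shape is that the caller writes one check and gets both behaviors, rather than implementing separate handling for release builds.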

107094: streamingest: unskip TestTenantStreamingUnavailableStreamAddress r=lidorcarmel a=lidorcarmel

Changing a few things to get this test to pass under stress:
- use 50 ranges instead of 10, because there are already 50-ish system ranges,
  so if we write only 10 more ranges those might not get distributed on all
  servers.
- avoid reading from the source cluster after stopping a node, it's flaky,
  see #107499 for more info.

Epic: none
Fixes: #107023
Fixes: #106865

Release note: None

107717: server/profiler: remove `server.cpu_profile.enabled` setting r=xinhaoz a=xinhaoz

CPU profiling can be enabled by setting the cluster setting `server.cpu_profile.cpu_usage_combined_threshold`. This makes `server.cpu_profile.enabled` redundant and makes it more difficult and confusing to enable CPU profiling. This commit removes the `server.cpu_profile.enabled` setting entirely. Note that the default values of both cluster settings turn profiling off.

Closes: #102024

Release note (sql change): The cluster setting
`server.cpu_profile.enabled` has been removed.
`server.cpu_profile.cpu_usage_combined_threshold` can enable and disable cpu profiling.

107720: cli: add probe_range in debug.zip r=j82w a=j82w

PR #79546 introduces `crdb_internal.probe_range`. This PR adds `crdb_internal.probe_range` to the debug.zip. The LIMIT gives a very approximate target of ~1000ms*100 (about 100 seconds) on how long this can take, so that running debug.zip against an unavailable cluster won't take too long.

closes: #80360

Release note (cli change): The debug.zip now includes the `crdb_internal.probe_range` table with a limit of 100 rows to keep the query from taking too long.

107727: server: deflake TestServerShutdownReleasesSession r=rafiss a=rafiss

The tenant was not being fully stopped, so the test could encounter flakes.

fixes #107592
Release note: None

107742: ui: show txn fingerprint details page with unspecified app r=xinhaoz a=xinhaoz

Previously, when the app was not specified in the URL search params for the txn fingerprint details page, the page would fail to load. This commit allows the page to load when no app is specified but the payload contains a fingerprint id that matches the requested page. The first matching fingerprint id is loaded.

Additionally, the TransactionDetailsLink will not include the appNames search param unless the provided prop is non-nullish.

Fixes: #107731

Release note (bug fix): Txn fingerprint details page in the console UI should load with the fingerprint details even if no app is specified in the URL.

Demo:
https://www.loom.com/share/810308d3dcd74ca888c42287ebafaecf

107745: kvserver: fix test merge queue when grunning unsupported r=irfansharif a=kvoli

`TestMergeQueue/load-based-merging/switch...below-threshold` asserts that switching the split objective between CPU and QPS will not cause ranges to merge, even if their pre-switch load qualified them for merging.

This test was broken when `grunning` was unsupported, as the objective never actually switches to anything other than QPS.

Add a check for `grunning` support, and assert that a merge occurs if unsupported.

Fixes: #106937
Epic: none
Release note: None

107749: opt: add enable_durable_locking_for_serializable session variable r=DrewKimball,nvanbenschoten a=michae2

Follow-up from #105857

This commit amends 6a3e43d to add a session variable to control whether guaranteed-durable locks are used under serializable isolation.

Informs: #100194

Epic: CRDB-25322

Release note (sql change): Add a new session variable, `enable_durable_locking_for_serializable`, which controls locking durability under serializable isolation. With this set to true, SELECT FOR UPDATE locks, SELECT FOR SHARED locks, and constraint check locks (e.g. locks acquired during foreign key checks if `enable_implicit_fk_locking_for_serializable` is set to true) will be guaranteed-durable under serializable isolation, meaning they will always be held to transaction commit. (These locks are always guaranteed-durable under weaker isolation levels.)

By default, under serializable isolation these locks are best-effort rather than guaranteed-durable, meaning in some cases (e.g. leaseholder transfer, node loss, etc.) they could be released before transaction commit. Serializable isolation does not rely on locking for correctness, only using it to improve performance under contention, so this default is a deliberate choice to avoid the performance overhead of lock replication.

107752: changefeedccl: prevent deadlock in TestChangefeedKafkaMessageTooLarge r=miretskiy a=jayshrivastava

Previously, this test would deadlock because Kafka retried messages too many times. These messages are stored in a buffer of size 1024 created by the CDC testing infra: https://github.com/cockroachdb/cockroach/blob/5c3f96d38cdc3a2d953ca3ffb1e39e97d7e5110e/pkg/ccl/changefeedccl/testfeed_test.go#L1819

The test asserts that 2000 messages pass through the buffer. When the test finishes, it stops reading from the buffer. The problem is that, due to retries, more messages may be sent to the buffer than are read out of it. Even after the 2000 messages are read and the test is shutting down, the sink may be blocked trying to put resolved messages (plus retries) in the buffer. If this happens, the changefeed resumer (same goroutine as the Kafka sink) gets blocked and does not terminate when the job is cancelled at the end of the test.

This change caps the number of retries at 200 for this test, so there should be no more than 200 extra messages plus a few resolved messages during this test. This is far less than the buffer size of 1024.

See detailed explanation in #107591.

Fixes: #107591
Epic: none
Release note: None
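
The buffer arithmetic in #107752 can be sketched with a buffered channel. This is an illustration under stated assumptions, not the test infrastructure itself: `extraMessagesFit` is a hypothetical helper, and "a few resolved messages" is arbitrarily taken to be 10.

```go
package main

import "fmt"

// extraMessagesFit reports whether a producer can send `extra` messages
// into a buffer of size bufSize after the consumer has stopped reading,
// without ever blocking. A blocking send on a full, unread channel would
// hang forever, which is the deadlock described above.
func extraMessagesFit(bufSize, extra int) bool {
	buf := make(chan int, bufSize)
	for i := 0; i < extra; i++ {
		select {
		case buf <- i:
			// The buffer still had room.
		default:
			// The buffer is full and nobody is reading.
			return false
		}
	}
	return true
}

func main() {
	// Retries capped at 200, plus an assumed 10 resolved messages.
	fmt.Println(extraMessagesFit(1024, 200+10)) // true
	// Uncapped retries can exceed the buffer size.
	fmt.Println(extraMessagesFit(1024, 5000)) // false
}
```

Capping retries bounds the number of unread messages below the buffer capacity, so the sending goroutine can always finish and observe cancellation.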

107802: teamcity-trigger: don't start a job for an empty target r=healthy-pod a=rickystewart

This makes no sense, so skip these cases.

Closes: #107779
Closes: #107780
Closes: #107781

Epic: none
Release note: None
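
The commit message for #107802 does not show the code, so the following is only a sketch of the general idea of skipping empty targets before starting jobs. `nonEmptyTargets` and the target names are hypothetical, and the assumption that empty entries come from splitting newline-terminated output is mine, not the thread's.

```go
package main

import (
	"fmt"
	"strings"
)

// nonEmptyTargets returns the targets worth starting a job for,
// dropping entries that are empty or whitespace only.
func nonEmptyTargets(targets []string) []string {
	var out []string
	for _, t := range targets {
		t = strings.TrimSpace(t)
		if t == "" {
			continue // nothing to run for an empty target
		}
		out = append(out, t)
	}
	return out
}

func main() {
	// Splitting newline-terminated text leaves a trailing empty entry.
	raw := "//pkg/example:example_test\n//pkg/other:other_test\n"
	for _, t := range nonEmptyTargets(strings.Split(raw, "\n")) {
		fmt.Println("would start job for", t)
	}
}
```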

107803: githubpost: set `map` field if `null` r=healthy-pod a=rickystewart

Go is a really good language.

Informs: #107779

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Lidor Carmel <lidor@cockroachlabs.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: j82w <jwilley@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
@craig craig bot closed this as completed in #107802 Jul 28, 2023
blathers-crl bot pushed a commit that referenced this issue Jul 28, 2023
This makes no sense, so skip these cases.

Closes: #107779
Closes: #107781

Epic: none
Release note: None
rickystewart added a commit that referenced this issue Jul 31, 2023
This makes no sense, so skip these cases.

Closes: #107779
Closes: #107781

Epic: none
Release note: None
rickystewart added a commit that referenced this issue Aug 1, 2023
This makes no sense, so skip these cases.

Closes: #107779
Closes: #107781

Epic: none
Release note: None