roachtest: kv/quiescence/nodes=3 failed #97232
This test verifies that ranges still quiesce after killing a node. However, at the end of the test it restarts the node to appease the dead node monitor: cockroach/pkg/cmd/roachtest/tests/kv.go Line 449 in eec5a47
It appears that this runs into a race with the scheduled backup injector that registers scheduled backups on cluster/node startup: cockroach/pkg/roachprod/install/cockroach.go Lines 798 to 806 in 399a278
Races aside, we don't really need to be restarting the node here, we can use |
@msbutler The scheduled backup injection needs to be a bit more robust. In this case it seems to have tried connecting to the restarted node too quickly and timed out; the node likely took more than 10 seconds (the default statement timeout) to fully start back up.
97360: roachtest: don't restart node in `kv/quiescence/nodes=3` r=erikgrinaker a=erikgrinaker

This roachtest verifies that ranges still quiesce when a node dies. However, it restarted the node at the end to appease the dead node monitor. This in turn caused flake with the scheduled backup injection on node startup, which could race with the node startup time. This patch instead uses `Monitor.ExpectDeath()`.

Touches #97232.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
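To illustrate the idea behind telling a monitor that a death is expected rather than restarting the node, here is a minimal sketch. The `nodeMonitor` type and its `OnNodeDeath` method are hypothetical stand-ins written for this example; they are not the roachtest `Monitor` implementation, and only the `ExpectDeath` name is taken from the patch description above.

```go
package main

import "fmt"

// nodeMonitor is a hypothetical, simplified stand-in for a dead node monitor.
// It is NOT the roachtest Monitor; it only illustrates the concept.
type nodeMonitor struct {
	expectedDeaths int // deaths the test has announced in advance
}

// ExpectDeath records that the next observed node death is intentional.
func (m *nodeMonitor) ExpectDeath() { m.expectedDeaths++ }

// OnNodeDeath (hypothetical) is invoked when a node is observed to be dead.
// It returns an error only if the death was not announced beforehand.
func (m *nodeMonitor) OnNodeDeath(node int) error {
	if m.expectedDeaths > 0 {
		m.expectedDeaths--
		return nil
	}
	return fmt.Errorf("unexpected death of n%d", node)
}

func main() {
	m := &nodeMonitor{}

	// The test kills n3 on purpose, so it announces the death first.
	m.ExpectDeath()
	fmt.Println("announced death:", m.OnNodeDeath(3))

	// A second, unannounced death is still reported as a failure.
	fmt.Println("unannounced death:", m.OnNodeDeath(3))
}
```

With this approach the test never needs to bring the node back up, so the startup-time backup injection is not triggered a second time.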
@erikgrinaker thanks for the initial triage on this. I've seen this failure mode before and I am puzzled by it. Just asked test-eng a couple questions here.
Previously, several roachtests failed during a cluster restart because a node serving the default scheduled backup command was not ready to serve requests. At this time, when roachprod start returns, not every node may be ready to serve requests. To prevent this failure mode, this patch changes the scheduled backup cmd during roachprod.Start() to run with infinite timeout and only on the first node in the cluster. Fixes cockroachdb#97010, cockroachdb#97232 Release note: None Epic: none
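The effect of the timeout can be sketched with a small simulation. Everything here is an assumption made for illustration: `execOnNode` and `runInit` are hypothetical helpers, the durations are arbitrary, and the sketch does not model roachprod itself or the choice of the first node. It only shows why a statement issued with a short timeout can fail against a node that is still starting, while the same statement with no timeout eventually succeeds.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// execOnNode is a hypothetical stand-in for running a statement against a
// node. It blocks until the node has finished starting up, or until ctx is
// done, whichever comes first.
func execOnNode(ctx context.Context, startup time.Duration) error {
	select {
	case <-time.After(startup):
		return nil // node became ready and served the request
	case <-ctx.Done():
		return ctx.Err() // gave up before the node was ready
	}
}

// runInit (hypothetical) runs the init statement against a node that needs
// startupMs milliseconds to become ready. A timeoutMs of 0 means no timeout.
func runInit(startupMs, timeoutMs int) error {
	ctx := context.Background()
	if timeoutMs > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, time.Duration(timeoutMs)*time.Millisecond)
		defer cancel()
	}
	return execOnNode(ctx, time.Duration(startupMs)*time.Millisecond)
}

func main() {
	// The node takes 50ms to start; a 5ms timeout expires first.
	fmt.Println("short timeout:", runInit(50, 5))
	// Without a timeout, the call waits until the node is ready.
	fmt.Println("no timeout:", runInit(50, 0))
}
```

The trade-off of removing the timeout is that a node that never becomes ready would block the call indefinitely, so this is only reasonable where another mechanism bounds the overall run.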
96914: sql: add spaces to delimiter in uniqueness violation error r=rharding6373 a=michae2

Whenever a mutation statement violates a uniqueness constraint we return a specific error which shows the affected row. This error is formatted to match the corresponding error in Postgres. We create this error in two places. In one place (`pkg/sql/opt/exec/execbuilder/mutation.go`) we were using ", " as the delimiter between values, which matches Postgres. But in the other place (`pkg/sql/row/errors.go`) we were using "," (without a space). This patch adds the space.

Epic: None
Release note (bug fix): Fix formatting of uniqueness violation errors to match the corresponding errors from PostgreSQL.

97065: schemachanger: Support dropping index cascade with dependent inbound FK r=healthy-pod a=Xiang-Gu

Dropping a unique index cascade will need to drop any inbound FK constraint if this index is the only uniqueness provider. This commit enables the declarative schema changer to support this behavior.

Fixes: #96731
Epic: None

97142: allocator: replace read amp with io thresh r=irfansharif a=kvoli

We previously checked stores' L0-sublevels to exclude IO overloaded stores from being allocation targets (#78608). This commit replaces the signal with the normalized IO overload score instead, which also factors in the L0-filecount. We started gossiping this value as of #83720. We continue gossiping L0-sublevels for mixed-version compatibility; we can stop doing this in 23.2.

Resolves: #85084
Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce

The pair of them were used to control rebalancing and upreplication behavior in the face of IO overloaded stores. This has now been replaced by other internal mechanisms.

97495: roachprod: run scheduled backup init without timeout r=renatolabs a=msbutler

Previously, several roachtests failed during a cluster restart because a node serving the default scheduled backup command was not ready to serve requests. At this time, when roachprod start returns, not every node may be ready to serve requests. To prevent this failure mode, this patch changes the scheduled backup cmd during roachprod.Start() to run with infinite timeout and only on the first node in the cluster.

Fixes #97010, #97232
Release note: None
Epic: none

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Resolved by #97495.
roachtest.kv/quiescence/nodes=3 failed with artifacts on master @ a0ab818e89508ca0b65926a4faac4c563d114acf:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-24581