roachtest: kv/quiescence/nodes=3 failed #97232

cockroach-teamcity · 2023-02-16T06:59:12Z

roachtest.kv/quiescence/nodes=3 failed with artifacts on master @ a0ab818e89508ca0b65926a4faac4c563d114acf:

test artifacts and logs in: /artifacts/kv/quiescence/nodes=3/run_1
(cluster.go:1867).Start: parallel execution failure: ~ ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' "-e
CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-8725593-1676528162-07-n4cpu4/1676530610503793144?AUTH=implicit' RECURRING '*/15 * * * *'
FULL BACKUP '@hourly'
WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
timeout: context deadline exceeded
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_065650.503891105_n3_run-sql.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-24581

erikgrinaker · 2023-02-20T14:19:53Z

This test verifies that ranges still quiesce after killing a node. However, at the end of the test it restarts the node to appease the dead node monitor:

cockroach/pkg/cmd/roachtest/tests/kv.go

Line 449 in eec5a47

    c.Start(ctx, t.L(), option.DefaultStartOpts(), install.MakeClusterSettings(), c.Node(nodes)) // satisfy dead node detector, even if test fails below

It appears that this runs into a race with the scheduled backup injector that registers scheduled backups on cluster/node startup:

cockroach/pkg/roachprod/install/cockroach.go

Lines 798 to 806 in 399a278

    // createFixedBackupSchedule creates a cluster backup schedule which, by
    // default, runs an incremental every 15 minutes and a full every hour. On
    // `roachprod create`, the user can provide a different recurrence using the
    // 'schedule-backup-args' flag. If roachprod is local, the backups get stored in
    // nodelocal, and otherwise in 'gs://cockroachdb-backup-testing'.
    // This cmd also ensures that only one schedule will be created for the cluster.
    func (c *SyncedCluster) createFixedBackupSchedule(
    	ctx context.Context, l *logger.Logger, scheduledBackupArgs string,
    ) error {

Races aside, we don't really need to be restarting the node here, we can use Monitor.ExpectDeath() instead. Will submit a PR.
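To illustrate the idea behind replacing the restart with `Monitor.ExpectDeath()`, here is a toy model of a dead-node monitor that tolerates deaths announced in advance. The `toyMonitor` type and its `onNodeDeath` method are hypothetical and are not the roachtest implementation, which is not shown in this thread; only the name `ExpectDeath` is taken from the comment above.

```go
package main

import "fmt"

// toyMonitor is a hypothetical stand-in for a dead-node monitor.
// It is NOT the roachtest Monitor; it only models the "expected death" idea.
type toyMonitor struct {
	expectedDeaths int // deaths announced in advance and not yet observed
}

// ExpectDeath records that one upcoming node death is intentional.
func (m *toyMonitor) ExpectDeath() { m.expectedDeaths++ }

// onNodeDeath is called when a node exits. It returns an error only if
// the death was not announced beforehand via ExpectDeath.
func (m *toyMonitor) onNodeDeath(node int) error {
	if m.expectedDeaths > 0 {
		m.expectedDeaths--
		return nil
	}
	return fmt.Errorf("unexpected death of node %d", node)
}

func main() {
	m := &toyMonitor{}
	fmt.Println(m.onNodeDeath(3)) // not announced, so the monitor reports an error

	m.ExpectDeath()
	fmt.Println(m.onNodeDeath(3)) // announced beforehand, so the death is tolerated
}
```

With this shape, a test that kills a node on purpose does not have to bring it back just to keep the monitor quiet, which is what removes the restart and the race that came with it.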

erikgrinaker · 2023-02-20T14:46:24Z

@msbutler The scheduled backup injection needs to be a bit more robust. In this case it seems to have tried connecting to the restarted node too quickly and timed out: the node likely took more than 10 seconds (the default statement timeout) to fully start back up.

97360: roachtest: don't restart node in `kv/quiescence/nodes=3` r=erikgrinaker a=erikgrinaker

This roachtest verifies that ranges still quiesce when a node dies. However, it restarted the node at the end to appease the dead node monitor. This in turn caused flake with the scheduled backup injection on node startup, which could race with the node startup time.

This patch instead uses `Monitor.ExpectDeath()`.

Touches #97232.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

msbutler · 2023-02-21T14:11:36Z

@erikgrinaker thanks for the initial triage on this. I've seen this failure mode before and I am puzzled by it. Just asked test-eng a couple questions here.

Previously, several roachtests failed during a cluster restart because a node serving the default scheduled backup command was not ready to serve requests. At this time, when roachprod start returns, not every node may be ready to serve requests. To prevent this failure mode, this patch changes the scheduled backup cmd during roachprod.Start() to run with infinite timeout and only on the first node in the cluster.

Fixes cockroachdb#97010, cockroachdb#97232

Release note: None
Epic: none

96914: sql: add spaces to delimiter in uniqueness violation error r=rharding6373 a=michae2

Whenever a mutation statement violates a uniqueness constraint we return a specific error which shows the affected row. This error is formatted to match the corresponding error in Postgres. We create this error in two places. In one place (`pkg/sql/opt/exec/execbuilder/mutation.go`) we were using ", " as the delimiter between values, which matches Postgres. But in the other place (`pkg/sql/row/errors.go`) we were using "," (without a space). This patch adds the space.

Epic: None
Release note (bug fix): Fix formatting of uniqueness violation errors to match the corresponding errors from PostgreSQL.

97065: schemachanger: Support dropping index cascade with dependent inbound FK r=healthy-pod a=Xiang-Gu

Dropping a unique index cascade will need to drop any inbound FK constraint if this index is the only uniqueness provider. This commit enables the declarative schema changer to support this behavior.

Fixes: #96731
Epic: None

97142: allocator: replace read amp with io thresh r=irfansharif a=kvoli

We previously checked stores' L0-sublevels to exclude IO overloaded stores from being allocation targets (#78608). This commit replaces the signal with the normalized IO overload score instead, which also factors in the L0-filecount. We started gossiping this value as of #83720. We continue gossiping L0-sublevels for mixed-version compatibility; we can stop doing this in 23.2.

Resolves: #85084
Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce.
The pair of them were used to control rebalancing and upreplication behavior in the face of IO overloaded stores. This has now been replaced by other internal mechanisms.

97495: roachprod: run scheduled backup init without timeout r=renatolabs a=msbutler

Previously, several roachtests failed during a cluster restart because a node serving the default scheduled backup command was not ready to serve requests. At this time, when roachprod start returns, not every node may be ready to serve requests. To prevent this failure mode, this patch changes the scheduled backup cmd during roachprod.Start() to run with infinite timeout and only on the first node in the cluster.

Fixes #97010, #97232

Release note: None
Epic: none

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>

msbutler · 2023-03-13T18:20:28Z

resolved by #97495

cockroach-teamcity added this to the 23.1 milestone Feb 16, 2023

blathers-crl bot added the T-kv KV Team label Feb 16, 2023

erikgrinaker self-assigned this Feb 20, 2023

erikgrinaker mentioned this issue Feb 20, 2023

roachtest: don't restart node in kv/quiescence/nodes=3 #97360

Merged

erikgrinaker assigned msbutler and unassigned erikgrinaker Feb 20, 2023

msbutler mentioned this issue Feb 22, 2023

roachprod: run scheduled backup init without timeout #97495

Merged

msbutler closed this as completed Mar 13, 2023
