Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: kv/quiescence/nodes=3 failed #97232

Closed
cockroach-teamcity opened this issue Feb 16, 2023 · 4 comments
Closed

roachtest: kv/quiescence/nodes=3 failed #97232

cockroach-teamcity opened this issue Feb 16, 2023 · 4 comments
Assignees
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Feb 16, 2023

roachtest.kv/quiescence/nodes=3 failed with artifacts on master @ a0ab818e89508ca0b65926a4faac4c563d114acf:

test artifacts and logs in: /artifacts/kv/quiescence/nodes=3/run_1
(cluster.go:1867).Start: parallel execution failure: ~ ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' "-e
CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-8725593-1676528162-07-n4cpu4/1676530610503793144?AUTH=implicit' RECURRING '*/15 * * * *'
FULL BACKUP '@hourly'
WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
timeout: context deadline exceeded
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_065650.503891105_n3_run-sql.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-24581

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 16, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 16, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Feb 16, 2023
@erikgrinaker
Copy link
Contributor

This test verifies that ranges still quiesce after killing a node. However, at the end of the test it restarts the node to appease the dead node monitor:

c.Start(ctx, t.L(), option.DefaultStartOpts(), install.MakeClusterSettings(), c.Node(nodes)) // satisfy dead node detector, even if test fails below

It appears that this runs into a race with the scheduled backup injector that registers scheduled backups on cluster/node startup:

// createFixedBackupSchedule creates a cluster backup schedule which, by
// default, runs an incremental every 15 minutes and a full every hour. On
// `roachprod create`, the user can provide a different recurrence using the
// 'schedule-backup-args' flag. If roachprod is local, the backups get stored in
// nodelocal, and otherwise in 'gs://cockroachdb-backup-testing'.
// This cmd also ensures that only one schedule will be created for the cluster.
func (c *SyncedCluster) createFixedBackupSchedule(
ctx context.Context, l *logger.Logger, scheduledBackupArgs string,
) error {

Races aside, we don't really need to be restarting the node here, we can use Monitor.ExpectDeath() instead. Will submit a PR.

@erikgrinaker erikgrinaker self-assigned this Feb 20, 2023
@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 20, 2023
@erikgrinaker
Copy link
Contributor

@msbutler The scheduled backup injection needs to be a bit more robust. In this case it seemed like it tried connecting to the restarted node too quickly and timed out, the node likely took more than 10 seconds to fully start back up (the default statement timeout).

craig bot pushed a commit that referenced this issue Feb 20, 2023
97360: roachtest: don't restart node in `kv/quiescence/nodes=3` r=erikgrinaker a=erikgrinaker

This roachtest verifies that ranges still quiesce when a node dies. However, it restarted the node at the end to appease the dead node monitor. This in turn caused flake with the scheduled backup injection on node startup, which could race with the node startup time.

This patch instead uses `Monitor.ExpectDeath()`.

Touches #97232.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@msbutler
Copy link
Collaborator

@erikgrinaker thanks for the initial triage on this. I've seen this failure mode before and I am puzzled by it. Just asked test-eng a couple questions here.

msbutler added a commit to msbutler/cockroach that referenced this issue Feb 22, 2023
Previously, several roachtests failed during a cluster restart because a node
serving the default scheduled backup command was not ready to serve requests.
At this time, when roachprod start returns, not every node may be ready to
serve requests.

To prevent this failure mode, this patch changes the scheduled backup cmd
during roachprod.Start() to run with infinite timeout and only on the the first
node in the cluster.

Fixes cockroachdb#97010, cockroachdb#97232

Release note: None

Epic: none
msbutler added a commit to msbutler/cockroach that referenced this issue Feb 22, 2023
Previously, several roachtests failed during a cluster restart because a node
serving the default scheduled backup command was not ready to serve requests.
At this time, when roachprod start returns, not every node may be ready to
serve requests.

To prevent this failure mode, this patch changes the scheduled backup cmd
during roachprod.Start() to run with infinite timeout and only on the the first
node in the cluster.

Fixes cockroachdb#97010, cockroachdb#97232

Release note: None

Epic: none
craig bot pushed a commit that referenced this issue Feb 22, 2023
96914: sql: add spaces to delimiter in uniqueness violation error r=rharding6373 a=michae2

Whenever a mutation statement violates a uniqueness constraint we return a specific error which shows the affected row. This error is formatted to match the corresponding error in Postgres.

We create this error in two places. In one place (`pkg/sql/opt/exec/execbuilder/mutation.go`) we were using ", " as the delimter between values, which matches Postgres. But in the other place (`pkg/sql/row/errors.go`) we were using "," (without a space).

This patch adds the space.

Epic: None

Release note (bug fix): Fix formatting of uniqueness violation errors to match the corresponding errors from PostgreSQL.

97065: schemachanger: Support dropping index cascade with dependent inbound FK r=healthy-pod a=Xiang-Gu

Dropping a unique index cascade will need to drop any inbound FK constraint 
if this index is the only uniqueness provider.

This commit enables the declarative schema changer to support this behavior.

Fixes: #96731
Epic: None

97142: allocator: replace read amp with io thresh r=irfansharif a=kvoli

We previously checked stores' L0-sublevels to exclude IO overloaded
stores from being allocation targets (#78608). This commit replaces the signal
with the normalized IO overload score instead, which also factors in the
L0-filecount. We started gossiping this value as of #83720. We continue
gossiping L0-sublevels for mixed-version compatibility; we can stop doing this
in 23.2.

Resolves: #85084

Release note (ops change): We've deprecated two cluster settings:
- kv.allocator.l0_sublevels_threshold
- kv.allocator.l0_sublevels_threshold_enforce.
The pair of them were used to control rebalancing and upreplication behavior in
the face of IO overloaded stores. This has been now been replaced by other
internal mechanisms.

97495: roachprod: run scheduled backup init without timeout r=renatolabs a=msbutler

Previously, several roachtests failed during a cluster restart because a node serving the default scheduled backup command was not ready to serve requests. At this time, when roachprod start returns, not every node may be ready to serve requests.

To prevent this failure mode, this patch changes the scheduled backup cmd during roachprod.Start() to run with infinite timeout and only on the the first node in the cluster.

Fixes #97010, #97232

Release note: None

Epic: none

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
@msbutler
Copy link
Collaborator

resolved by #97495

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

3 participants