roachtest: disk-stalled/dmsetup failed #97705

Closed
cockroach-teamcity opened this issue Feb 27, 2023 · 4 comments
Labels
branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 27, 2023

roachtest.disk-stalled/dmsetup failed with artifacts on release-22.2 @ c9b71bafa2857ebdc5ffea629fb9546937f10ba9:

test artifacts and logs in: /artifacts/disk-stalled/dmsetup/run_1
(test_impl.go:292).Fatalf: post-stall TPS 796.68 is less than 50% of pre-stall TPS 1597.30
(test_impl.go:286).Fatal: cluster.RunE: context canceled
(test_impl.go:286).Fatal: cluster.RunE: context canceled
(test_impl.go:286).Fatal: output in run_113327.047449009_n4_cockroach_workload_run_kv: ./cockroach workload run kv --read-percent 50 --duration 10m --concurrency 256 --max-rate 2048 --tolerate-errors  --min-block-bytes=512 --max-block-bytes=512 {pgurl:1-3} returned: context canceled
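
The first failure line is a threshold check: the run fails when post-stall throughput falls below half of the pre-stall throughput. A minimal Go sketch of that arithmetic follows. The function name `tpsRecovered` and its signature are hypothetical and are not taken from the roachtest source; only the two TPS figures and the 50% threshold come from the message above.

```go
package main

import "fmt"

// tpsRecovered reports whether postStall is at least minFraction of preStall.
// Hypothetical helper for illustration; not the roachtest implementation.
func tpsRecovered(preStall, postStall, minFraction float64) bool {
	return postStall >= minFraction*preStall
}

func main() {
	// Figures copied from the failure message.
	const pre, post = 1597.30, 796.68
	fmt.Printf("ratio=%.4f recovered=%t\n", post/pre, tpsRecovered(pre, post, 0.5))
}
```

With these figures the cutoff is 798.65 ops/sec, so the run missed it by roughly 2 ops/sec.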

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-24836

@cockroach-teamcity cockroach-teamcity added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 27, 2023
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Feb 27, 2023
@blathers-crl blathers-crl bot added the T-storage Storage Team label Feb 27, 2023
@nicktrav
Collaborator

This one seems similar to #97013.

I think there's something we could do to improve the comparison of the stats before / after.

Looking at the stats from this test, we ended with:

598.0s   192596          668.9          834.0      0.9      1.2      1.6      2.2 read
598.0s   192596         1027.9          996.1      2.1      5.2      7.6     10.0 write

And started with:

1.0s        0          926.1         1023.1      1.0      2.0     26.2     83.9 read
1.0s        0          925.1         1022.1      2.6      5.5     25.2     83.9 write

This isn't a release blocker. Will remove the label.
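
For comparing the two snapshots quoted above, a rough Go sketch that parses the workload rows and sums throughput across operations is shown here. The column layout (elapsed, errors, ops/sec(inst), ops/sec(cum), p50, p95, p99, pMax, operation) is an assumption based on the usual `workload run` header, which is not shown in this thread, and `parseRow` and `totalInst` are hypothetical names. These single-second samples are not necessarily the values the test itself compared.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sample keeps the throughput columns of one workload output row.
// Assumed columns: elapsed, errors, ops/sec(inst), ops/sec(cum),
// p50, p95, p99, pMax, operation name.
type sample struct {
	op            string
	instOpsPerSec float64
	cumOpsPerSec  float64
}

// parseRow splits a row on whitespace and extracts the two ops/sec columns.
func parseRow(line string) (sample, error) {
	f := strings.Fields(line)
	if len(f) != 9 {
		return sample{}, fmt.Errorf("expected 9 fields, got %d", len(f))
	}
	inst, err := strconv.ParseFloat(f[2], 64)
	if err != nil {
		return sample{}, err
	}
	cum, err := strconv.ParseFloat(f[3], 64)
	if err != nil {
		return sample{}, err
	}
	return sample{op: f[8], instOpsPerSec: inst, cumOpsPerSec: cum}, nil
}

// totalInst sums the instantaneous ops/sec over all rows (read + write).
func totalInst(lines []string) (float64, error) {
	var total float64
	for _, l := range lines {
		s, err := parseRow(l)
		if err != nil {
			return 0, err
		}
		total += s.instOpsPerSec
	}
	return total, nil
}

func main() {
	start := []string{
		"1.0s        0          926.1         1023.1      1.0      2.0     26.2     83.9 read",
		"1.0s        0          925.1         1022.1      2.6      5.5     25.2     83.9 write",
	}
	end := []string{
		"598.0s   192596          668.9          834.0      0.9      1.2      1.6      2.2 read",
		"598.0s   192596         1027.9          996.1      2.1      5.2      7.6     10.0 write",
	}
	s, err := totalInst(start)
	if err != nil {
		panic(err)
	}
	e, err := totalInst(end)
	if err != nil {
		panic(err)
	}
	fmt.Printf("start=%.1f end=%.1f ratio=%.2f\n", s, e, e/s)
}
```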

@nicktrav nicktrav removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 27, 2023
@jbowens
Collaborator

jbowens commented Feb 27, 2023

@sumeerbhola if you have a chance to look at it, otherwise no worries.

@sumeerbhola
Collaborator

Unassigning, since I didn't get a chance to look.

@sumeerbhola sumeerbhola removed their assignment Mar 13, 2023
jbowens added a commit to jbowens/cockroach that referenced this issue Apr 6, 2023
Previously, the post-stall TPS calculation included the time that the node was
stalled but before the stall triggered the node's exit. During this period,
overall TPS drops until the gray failure is converted into a hard failure. This
commit adjusts the post-stall TPS calculation to exclude the stalled time when
TPS is expected to tank.

Epic: None
Informs: cockroachdb#97705.
Release note: None
craig bot pushed a commit that referenced this issue Apr 6, 2023
99958: jobs,server: graceful shutdown for secondary tenant servers r=stevendanna a=knz

Epic: CRDB-23559
Fixes #92523.

All commits but the last are from #100436.

This change ensures that tenant servers managed by the server
controller receive a graceful drain request as part of the graceful
drain process of the surrounding KV node.

This change, in turn, ensures that SQL clients connected to these
secondary tenant servers benefit from the same guarantees (and
graceful periods) as clients to the system tenant.


100726: upgrades: use TestingBinaryMinSupportedVersion in tests r=rafiss a=rafiss

As described in #100552, it's important for this API to use TestingBinaryMinSupportedVersion in order to correctly bootstrap on the older version.

informs #100552 
Release note: None

100741: contextutil: teach TimeoutError to redact only the operation name r=andreimatei a=andreimatei

Before this patch, the whole message of TimeoutError was redacted in logs. Now, only the operation name is.

Release note: None
Epic: None

100778: norm: update prune cols to match PruneJoinLeftCols/PruneJoinRightCols r=msirek a=msirek

In #90599 adjustments were made to the PruneJoinLeftCols and PruneJoinRightCols
normalization rules to avoid pruning columns which might be needed when
deriving new predicates based on foreign key constraints for lookup join.

However, this caused a problem where rules might sometimes fire in an
infinite loop because the same columns to prune keep getting added as
PruneCols in calls to DerivePruneCols. The logic in prune_cols.opt and
DerivePruneCols must be kept in sync to avoid such problems, and this
PR brings it back in sync.

Epic: none
Fixes: #100478

Release note: None

100821: cmd/roachtest: adjust disk-stalled roachtests TPS calculation r=itsbilal a=jbowens

Previously, the post-stall TPS calculation included the time that the node was stalled but before the stall triggered the node's exit. During this period, overall TPS drops until the gray failure is converted into a hard failure. This commit adjusts the post-stall TPS calculation to exclude the stalled time when TPS is expected to tank.

Epic: None
Informs: #97705.
Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Mark Sirek <sirek@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Apr 6, 2023
Previously, the post-stall TPS calculation included the time that the node was
stalled but before the stall triggered the node's exit. During this period,
overall TPS drops until the gray failure is converted into a hard failure. This
commit adjusts the post-stall TPS calculation to exclude the stalled time when
TPS is expected to tank.

Epic: None
Informs: #97705.
Release note: None
@nicktrav
Collaborator

This was fixed via #100821.

@jbowens jbowens added this to Storage Jun 4, 2024
@jbowens jbowens moved this to Done in Storage Jun 4, 2024

4 participants