
Health checks take more than 2 hours running on 60 DB nodes setup after each nemesis, even skipped ones #9547

Open
vponomaryov opened this issue Dec 13, 2024 · 4 comments
@vponomaryov
Contributor

Packages

Scylla version: 2024.3.0~dev-20241209.b5f1d87f3e83 with build-id a322e4f0d7b174dd5052eb3992c8e459d1a03b7a

Kernel Version: 6.8.0-1019-aws

Issue description

Running a 5-DC, 60 DB node test, the health checks take more than 2 hours after each nemesis:

2024-12-09 21:19:27,050 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1
...
2024-12-09 22:21:44,490 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-30
...
2024-12-09 23:27:39,576 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60
...
...
2024-12-09 23:29:50,603 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1
...
2024-12-10 01:38:26,928 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60
...
...
2024-12-10 01:40:40,444 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1
...
2024-12-10 02:43:18,741 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-30
...
2024-12-10 03:49:01,819 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60

Moreover, the redundant health check cycle runs even if the nemesis was skipped.

See Argus screenshot:
Image
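
For illustration, a minimal sketch of running the per-node checks concurrently instead of one node at a time; the check function is a hypothetical stand-in for whatever per-node verification sdcm.utils.health_checker performs, and this is not SCT's current implementation:

```python
# Minimal sketch, not SCT's actual implementation: run the per-node health
# checks concurrently instead of sequentially. `check_fn` is a hypothetical
# stand-in for the existing per-node verification.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_health_checks(nodes, check_fn, max_workers=16):
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(check_fn, node): node for node in nodes}
        for future in as_completed(futures):
            node = futures[future]
            try:
                future.result()
            except Exception as exc:  # collect failures instead of aborting the whole pass
                failures[node] = exc
    return failures
```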

Impact

Significant waste of test run time.

How frequently does it reproduce?

1/1

Installation details

Cluster size: 60 nodes (i3en.large)

Scylla Nodes used in this run:

  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-9 (3.78.242.164 | 10.11.4.26) (shards: -1)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-8 (18.184.154.54 | 10.11.4.150) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-7 (18.195.241.190 | 10.11.6.161) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60 (18.204.213.63 | 10.12.4.51) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-6 (18.156.122.174 | 10.11.7.240) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-59 (52.202.210.200 | 10.12.5.121) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-58 (18.212.49.124 | 10.12.5.222) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-57 (18.205.27.190 | 10.12.4.178) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-56 (3.86.145.120 | 10.12.4.208) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-55 (3.86.180.35 | 10.12.4.253) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-54 (54.172.31.83 | 10.12.4.133) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-53 (54.152.61.101 | 10.12.7.65) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-52 (3.84.51.77 | 10.12.7.208) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-51 (3.80.131.177 | 10.12.5.152) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-50 (44.203.148.120 | 10.12.7.175) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-5 (3.69.165.13 | 10.11.4.30) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-49 (54.83.154.123 | 10.12.4.235) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-48 (35.178.205.245 | 10.3.7.244) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-47 (13.41.185.36 | 10.3.4.203) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-46 (18.133.242.167 | 10.3.6.76) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-45 (13.40.198.57 | 10.3.4.39) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-44 (3.10.208.246 | 10.3.7.212) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-43 (18.175.190.20 | 10.3.6.14) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-42 (35.179.106.42 | 10.3.7.1) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-41 (13.41.163.90 | 10.3.7.29) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-40 (13.41.188.217 | 10.3.5.48) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-4 (3.122.228.171 | 10.11.4.86) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-39 (13.41.200.137 | 10.3.6.155) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-38 (35.179.176.200 | 10.3.6.185) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-37 (13.41.110.64 | 10.3.6.18) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-36 (63.35.194.131 | 10.4.7.229) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-35 (3.255.87.99 | 10.4.5.166) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-34 (3.249.14.49 | 10.4.7.12) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-33 (3.253.9.164 | 10.4.5.192) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-32 (34.242.177.119 | 10.4.4.33) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-31 (54.171.1.234 | 10.4.4.31) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-30 (52.214.62.37 | 10.4.6.18) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-3 (18.194.139.3 | 10.11.6.105) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-29 (18.201.245.171 | 10.4.7.137) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-28 (34.254.160.253 | 10.4.7.15) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-27 (18.203.68.253 | 10.4.4.56) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-26 (63.34.163.154 | 10.4.6.155) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-25 (63.33.207.112 | 10.4.7.164) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-24 (13.48.44.219 | 10.0.4.135) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-23 (13.60.215.14 | 10.0.6.184) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-22 (13.60.54.137 | 10.0.6.194) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-21 (16.171.40.7 | 10.0.4.248) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-20 (13.53.40.125 | 10.0.7.239) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-2 (35.159.65.51 | 10.11.5.164) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-19 (16.171.2.245 | 10.0.6.220) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-18 (16.171.34.56 | 10.0.5.197) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-17 (16.171.249.49 | 10.0.6.143) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-16 (13.60.58.60 | 10.0.5.149) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-15 (51.20.3.175 | 10.0.5.233) (shards: -1)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-14 (13.51.56.253 | 10.0.7.178) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-13 (13.60.174.157 | 10.0.7.116) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-12 (18.184.139.25 | 10.11.4.225) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-11 (3.72.233.226 | 10.11.7.207) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-10 (3.78.190.183 | 10.11.5.110) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1 (3.66.222.194 | 10.11.6.134) (shards: 2)

OS / Image: ami-09e71469bd2c21908 ami-0da21ef58bb231de7 ami-0201515e28dca41b1 ami-0fd8175c8145eb79f ami-03ff16ab9428aadda (aws: eu-central-1, eu-north-1, eu-west-1, eu-west-2, us-east-1)

Test: vp-longevity-aws-custom-d2-workload1-multidc-big
Test id: 54f56c8f-465d-4f59-8ba8-4829871ccff3
Test name: scylla-staging/valerii/vp-longevity-aws-custom-d2-workload1-multidc-big
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 54f56c8f-465d-4f59-8ba8-4829871ccff3
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 54f56c8f-465d-4f59-8ba8-4829871ccff3

Logs:

Jenkins job URL
Argus

@fruch
Contributor

fruch commented Dec 15, 2024

  • Since there are health check calls before and after each nemesis, even skipped nemeses still run the "before" check.
    We might be able to drop the following health check, assuming the skipped nemesis already did it.

  • Regardless, for the 60-node case we should disable it completely, since it's going to take more time than the actual nemesis runs (a rough sketch of both ideas follows below).
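
A rough sketch of how both points could be wired into the nemesis flow; the threshold, flag, and method names are assumptions for illustration, not existing SCT options or APIs:

```python
# Rough sketch only; the threshold, flag, and method names below are
# assumptions for illustration, not existing SCT options or APIs.
HEALTH_CHECK_MAX_NODES = 30  # above this, a full per-node pass outlasts the nemesis itself


def maybe_run_health_check(cluster, nemesis_was_skipped):
    if nemesis_was_skipped:
        # The next nemesis runs its own "before" check, so the "after"
        # check of a skipped nemesis adds no information.
        return
    if len(cluster.nodes) > HEALTH_CHECK_MAX_NODES:
        # On very large clusters (e.g. the 60-node setup here) skip the
        # full check or replace it with a cheaper one.
        return
    cluster.check_cluster_health()  # hypothetical hook for the existing check
```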

@soyacz
Contributor

soyacz commented Dec 16, 2024

  • Since there are health check calls before and after each nemesis, even skipped nemeses still run the "before" check.
    We might be able to drop the following health check, assuming the skipped nemesis already did it.
  • Regardless, for the 60-node case we should disable it completely, since it's going to take more time than the actual nemesis runs.

Maybe we could introduce a 'fast healthcheck'? Shouldn't it be quick using Raft?

@fruch
Contributor

fruch commented Dec 16, 2024

  • Since there are health check calls before and after each nemesis, even skipped nemeses still run the "before" check.
    We might be able to drop the following health check, assuming the skipped nemesis already did it.
  • Regardless, for the 60-node case we should disable it completely, since it's going to take more time than the actual nemesis runs.

Maybe we could introduce a 'fast healthcheck'? Shouldn't it be quick using Raft?

I don't know what that means: not use nodetool? Just examine group0 on one node? Or on multiple nodes?

@soyacz
Contributor

soyacz commented Dec 16, 2024

  • Since there are health check calls before and after each nemesis, even skipped nemeses still run the "before" check.
    We might be able to drop the following health check, assuming the skipped nemesis already did it.
  • Regardless, for the 60-node case we should disable it completely, since it's going to take more time than the actual nemesis runs.

Maybe we could introduce a 'fast healthcheck'? Shouldn't it be quick using Raft?

I don't know what that means: not use nodetool? Just examine group0 on one node? Or on multiple nodes?

Something like that: just check group0, or do it in parallel on all nodes. @aleksbykov can you suggest a fast and reliable way of using Raft to quickly verify cluster health?
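
One possible interpretation of a 'fast healthcheck', as a rough sketch only: ask a single node (or a few nodes in parallel) for its view of the cluster over CQL instead of running nodetool on every node. The system.cluster_status virtual table and its columns are an assumption about the Scylla version under test, not something confirmed in this thread:

```python
# Rough sketch of a "fast healthcheck": query one node's view of the cluster
# over CQL. The system.cluster_status virtual table and its columns are
# assumed to exist on the tested Scylla version; adjust if they do not.
from cassandra.cluster import Cluster


def fast_health_check(contact_point):
    with Cluster([contact_point]) as cluster:
        session = cluster.connect()
        rows = session.execute("SELECT peer, up FROM system.cluster_status")
        return [str(row.peer) for row in rows if not row.up]  # peers reported down
```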
