
Health checks take more than 2 hours running on 60 DB nodes setup after each nemesis, even skipped ones #9547

Open
vponomaryov opened this issue Dec 13, 2024 · 4 comments
@vponomaryov
Contributor

Packages

Scylla version: 2024.3.0~dev-20241209.b5f1d87f3e83 with build-id a322e4f0d7b174dd5052eb3992c8e459d1a03b7a

Kernel Version: 6.8.0-1019-aws

Issue description

Running a 5-DC, 60 DB node test, the health checks take more than 2 hours after each nemesis:

2024-12-09 21:19:27,050 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1
...
2024-12-09 22:21:44,490 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-30
...
2024-12-09 23:27:39,576 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60
...
...
2024-12-09 23:29:50,603 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1
...
2024-12-10 01:38:26,928 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60
...
...
2024-12-10 01:40:40,444 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1
...
2024-12-10 02:43:18,741 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-30
...
2024-12-10 03:49:01,819 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60

Moreover, the redundant health check cycle runs even if the nemesis was skipped.

See Argus screenshot:
Image
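
For illustration, a minimal sketch of running the per-node checks concurrently instead of one node at a time; the check function is a hypothetical stand-in for whatever per-node verification sdcm.utils.health_checker performs, and this is not SCT's current implementation:

```python
# Minimal sketch, not SCT's actual implementation: run the per-node health
# checks concurrently instead of sequentially. `check_fn` is a hypothetical
# stand-in for the existing per-node verification.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_health_checks(nodes, check_fn, max_workers=16):
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(check_fn, node): node for node in nodes}
        for future in as_completed(futures):
            node = futures[future]
            try:
                future.result()
            except Exception as exc:  # collect failures instead of aborting the whole pass
                failures[node] = exc
    return failures
```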

Impact

Significant waste of test run time.

How frequently does it reproduce?

1/1

Installation details

Cluster size: 60 nodes (i3en.large)

Scylla Nodes used in this run:

  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-9 (3.78.242.164 | 10.11.4.26) (shards: -1)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-8 (18.184.154.54 | 10.11.4.150) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-7 (18.195.241.190 | 10.11.6.161) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60 (18.204.213.63 | 10.12.4.51) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-6 (18.156.122.174 | 10.11.7.240) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-59 (52.202.210.200 | 10.12.5.121) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-58 (18.212.49.124 | 10.12.5.222) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-57 (18.205.27.190 | 10.12.4.178) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-56 (3.86.145.120 | 10.12.4.208) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-55 (3.86.180.35 | 10.12.4.253) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-54 (54.172.31.83 | 10.12.4.133) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-53 (54.152.61.101 | 10.12.7.65) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-52 (3.84.51.77 | 10.12.7.208) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-51 (3.80.131.177 | 10.12.5.152) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-50 (44.203.148.120 | 10.12.7.175) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-5 (3.69.165.13 | 10.11.4.30) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-49 (54.83.154.123 | 10.12.4.235) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-48 (35.178.205.245 | 10.3.7.244) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-47 (13.41.185.36 | 10.3.4.203) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-46 (18.133.242.167 | 10.3.6.76) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-45 (13.40.198.57 | 10.3.4.39) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-44 (3.10.208.246 | 10.3.7.212) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-43 (18.175.190.20 | 10.3.6.14) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-42 (35.179.106.42 | 10.3.7.1) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-41 (13.41.163.90 | 10.3.7.29) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-40 (13.41.188.217 | 10.3.5.48) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-4 (3.122.228.171 | 10.11.4.86) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-39 (13.41.200.137 | 10.3.6.155) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-38 (35.179.176.200 | 10.3.6.185) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-37 (13.41.110.64 | 10.3.6.18) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-36 (63.35.194.131 | 10.4.7.229) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-35 (3.255.87.99 | 10.4.5.166) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-34 (3.249.14.49 | 10.4.7.12) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-33 (3.253.9.164 | 10.4.5.192) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-32 (34.242.177.119 | 10.4.4.33) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-31 (54.171.1.234 | 10.4.4.31) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-30 (52.214.62.37 | 10.4.6.18) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-3 (18.194.139.3 | 10.11.6.105) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-29 (18.201.245.171 | 10.4.7.137) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-28 (34.254.160.253 | 10.4.7.15) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-27 (18.203.68.253 | 10.4.4.56) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-26 (63.34.163.154 | 10.4.6.155) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-25 (63.33.207.112 | 10.4.7.164) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-24 (13.48.44.219 | 10.0.4.135) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-23 (13.60.215.14 | 10.0.6.184) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-22 (13.60.54.137 | 10.0.6.194) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-21 (16.171.40.7 | 10.0.4.248) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-20 (13.53.40.125 | 10.0.7.239) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-2 (35.159.65.51 | 10.11.5.164) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-19 (16.171.2.245 | 10.0.6.220) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-18 (16.171.34.56 | 10.0.5.197) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-17 (16.171.249.49 | 10.0.6.143) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-16 (13.60.58.60 | 10.0.5.149) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-15 (51.20.3.175 | 10.0.5.233) (shards: -1)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-14 (13.51.56.253 | 10.0.7.178) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-13 (13.60.174.157 | 10.0.7.116) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-12 (18.184.139.25 | 10.11.4.225) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-11 (3.72.233.226 | 10.11.7.207) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-10 (3.78.190.183 | 10.11.5.110) (shards: 2)
  • long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1 (3.66.222.194 | 10.11.6.134) (shards: 2)

OS / Image: ami-09e71469bd2c21908 ami-0da21ef58bb231de7 ami-0201515e28dca41b1 ami-0fd8175c8145eb79f ami-03ff16ab9428aadda (aws: eu-central-1, eu-north-1, eu-west-1, eu-west-2, us-east-1)

Test: vp-longevity-aws-custom-d2-workload1-multidc-big
Test id: 54f56c8f-465d-4f59-8ba8-4829871ccff3
Test name: scylla-staging/valerii/vp-longevity-aws-custom-d2-workload1-multidc-big
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 54f56c8f-465d-4f59-8ba8-4829871ccff3
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 54f56c8f-465d-4f59-8ba8-4829871ccff3

Logs:

Jenkins job URL
Argus

@fruch
Contributor

fruch commented Dec 15, 2024

  • Since there are health check calls before and after each nemesis, even skipped nemeses still run the "before" check.
    We might be able to drop the following health check, assuming the skipped nemesis already did it.

  • Regardless, for the 60-node case we should disable it completely, since it's going to take more time than the actual nemesis runs (a rough sketch of both ideas follows below).
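
A rough sketch of how both points could be wired into the nemesis flow; the threshold, flag, and method names are assumptions for illustration, not existing SCT options or APIs:

```python
# Rough sketch only; the threshold, flag, and method names below are
# assumptions for illustration, not existing SCT options or APIs.
HEALTH_CHECK_MAX_NODES = 30  # above this, a full per-node pass outlasts the nemesis itself


def maybe_run_health_check(cluster, nemesis_was_skipped):
    if nemesis_was_skipped:
        # The next nemesis runs its own "before" check, so the "after"
        # check of a skipped nemesis adds no information.
        return
    if len(cluster.nodes) > HEALTH_CHECK_MAX_NODES:
        # On very large clusters (e.g. the 60-node setup here) skip the
        # full check or replace it with a cheaper one.
        return
    cluster.check_cluster_health()  # hypothetical hook for the existing check
```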

@soyacz
Contributor

soyacz commented Dec 16, 2024

  • Since there are health check calls before and after each nemesis, even skipped nemeses still run the "before" check.
    We might be able to drop the following health check, assuming the skipped nemesis already did it.
  • Regardless, for the 60-node case we should disable it completely, since it's going to take more time than the actual nemesis runs.

Maybe we could introduce a 'fast healthcheck'? Shouldn't it be quick using Raft?

@fruch
Contributor

fruch commented Dec 16, 2024

  • Since there are health check calls before and after each nemesis, even skipped nemeses still run the "before" check.
    We might be able to drop the following health check, assuming the skipped nemesis already did it.
  • Regardless, for the 60-node case we should disable it completely, since it's going to take more time than the actual nemesis runs.

Maybe we could introduce a 'fast healthcheck'? Shouldn't it be quick using Raft?

I don't know what that means: not use nodetool? Just examine group0 on one node? Or on multiple nodes?

@soyacz
Contributor

soyacz commented Dec 16, 2024

  • Since there are health check calls before and after each nemesis, even skipped nemeses still run the "before" check.
    We might be able to drop the following health check, assuming the skipped nemesis already did it.
  • Regardless, for the 60-node case we should disable it completely, since it's going to take more time than the actual nemesis runs.

Maybe we could introduce a 'fast healthcheck'? Shouldn't it be quick using Raft?

I don't know what that means: not use nodetool? Just examine group0 on one node? Or on multiple nodes?

Something like that: just check group0, or do it in parallel on all nodes. @aleksbykov can you suggest a fast and reliable way of using Raft to quickly verify cluster health?
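
One possible interpretation of a 'fast healthcheck', as a rough sketch only: ask a single node (or a few nodes in parallel) for its view of the cluster over CQL instead of running nodetool on every node. The system.cluster_status virtual table and its columns are an assumption about the Scylla version under test, not something confirmed in this thread:

```python
# Rough sketch of a "fast healthcheck": query one node's view of the cluster
# over CQL. The system.cluster_status virtual table and its columns are
# assumed to exist on the tested Scylla version; adjust if they do not.
from cassandra.cluster import Cluster


def fast_health_check(contact_point):
    with Cluster([contact_point]) as cluster:
        session = cluster.connect()
        rows = session.execute("SELECT peer, up FROM system.cluster_status")
        return [str(row.peer) for row in rows if not row.up]  # peers reported down
```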
