[disrupt_destroy_data_then_rebuild] nemesis caused lots of raft_topology errors that never led to failure #9031
Comments
@kbr-scylla could you please take a look at this issue? It is created as an SCT issue because |
This nemesis shuts down a Scylla node for a while. During this time the topology coordinator may try to communicate with this node, e.g. during tablet migrations. This is what happened here: the cluster was doing tablet migrations when the node was killed. If a communication attempt fails because the connection is closed, we print an error. So the error is expected whenever one of the nodes is down.
There are a few nemeses that generate closed-connection errors during runtime:
Such errors are expected for these nemeses and may be ignored by SCT to avoid redundant error messages in Argus.
@timtimb0t if you know it happens during a specific phase of a specific nemesis, you can wrap the relevant part of the nemesis in an "ignore_..." context manager and it won't produce errors.
I'm not sure, but it seems there is one more nemesis that needs the same handling (disrupt_stop_wait_start_scylla_server):
According to @kbr-scylla's explanation, this can happen every time we take down a node.
and more seems like the reason is very similar to |
Packages
Scylla version:
6.3.0~dev-20241018.b11d50f59191
with build-id d5fe38f8fd12d9b834688320151f36fb8c1e050d
Kernel Version:
6.8.0-1016-gcp
Issue description
During the disrupt_destroy_data_then_rebuild nemesis, the test initiates a rebuild (as the last step, after the data has been destroyed), which caused a bunch of Scylla errors:
Scylla itself stayed alive and these errors never caused a failure. They seem to have occurred because of the deleted data.
Impact
No visible impact
Installation details
Cluster size: 5 nodes (n2-highmem-16)
Scylla Nodes used in this run:
OS / Image:
https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-6-3-0-dev-x86-64-2024-10-19t02-11-39
(gce: undefined_region)
Test:
longevity-large-partition-200k-pks-4days-gce-test
Test id:
26155658-0ac7-449d-8d60-ed91dba49ce0
Test name:
scylla-master/tier1/longevity-large-partition-200k-pks-4days-gce-test
Test method:
longevity_large_partition_test.LargePartitionLongevityTest.test_large_partition_longevity
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 26155658-0ac7-449d-8d60-ed91dba49ce0
$ hydra investigate show-logs 26155658-0ac7-449d-8d60-ed91dba49ce0
Logs:
Jenkins job URL
Argus