storage_service freeze on rebuild tablet replicas of replaced node #20885
Comments
You ran out of semaphores. Interesting - haven't seen this for some time.
The semaphore is flooded with reads for I think this is a regression introduced by #19745.
As for how to solve it -- I have no good ideas off the top of my head. We can make the streaming reader evictable -- but then we will have the same kind of thrashing on shard 0 that we have with mixed shard repairs (#18269). Another solution is to detach the internal code from the interface presented by the
Packages
Scylla version:
6.3.0~dev-20240927.c17d35371846
with build-id a9b08d0ce1f3cf99eb39d7a8372848fa2840dc1d
Kernel Version:
6.8.0-1016-aws
Issue description
During 'disrupt_terminate_and_replace_node' nemesis, after node termination we add another node and replace the terminated one.
It looks like storage_service is stuck on
storage_service - Waiting for tablet replicas from the replaced node to be rebuilt
It happened with parallel 'create_index' nemesis.
I can see in logs:
but never reached
Tablet replicas from the replaced node have been rebuilt
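For anyone triaging a similar run, a minimal sketch of how the stuck state could be confirmed from a node log. It only matches the two storage_service messages quoted above and the `reader_concurrency_semaphore` name as plain substrings; the function name `scan_replace_log` and the returned field names are hypothetical, and no particular log line format is assumed.

```python
# Substrings taken from the messages quoted in this report.
WAITING_MSG = "Waiting for tablet replicas from the replaced node to be rebuilt"
REBUILT_MSG = "Tablet replicas from the replaced node have been rebuilt"
SEMAPHORE_TAG = "reader_concurrency_semaphore"


def scan_replace_log(lines):
    """Scan log lines and report whether the tablet rebuild wait ever completed.

    Hypothetical helper: returns the 1-based line numbers of the first
    'waiting' and 'rebuilt' messages (or None), the number of lines that
    mention reader_concurrency_semaphore while the wait was pending, and
    a 'stuck' flag that is True when the wait started but never finished.
    """
    waiting_at = None
    rebuilt_at = None
    semaphore_hits = 0
    for lineno, line in enumerate(lines, start=1):
        if WAITING_MSG in line and waiting_at is None:
            waiting_at = lineno
        elif REBUILT_MSG in line and waiting_at is not None and rebuilt_at is None:
            rebuilt_at = lineno
        elif waiting_at is not None and rebuilt_at is None and SEMAPHORE_TAG in line:
            # Only count semaphore mentions inside the pending window.
            semaphore_hits += 1
    return {
        "waiting_at": waiting_at,
        "rebuilt_at": rebuilt_at,
        "semaphore_hits": semaphore_hits,
        "stuck": waiting_at is not None and rebuilt_at is None,
    }
```

Usage would be along the lines of `scan_replace_log(open(path, errors="replace"))` on the replacing node's log, where `path` is wherever that log was downloaded to.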
In the meantime, there were plenty of
reader_concurrency_semaphore
errors like (first one posted):
and right after, streaming errors:
Impact
Stuck node replacement
How frequently does it reproduce?
First time seen. This test changed its list of executed nemeses this week, and this is the first run of this nemesis in this scenario.
Installation details
Cluster size: 12 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-087d814d9b6773015 ami-09b7fc7e317549f14
(aws: undefined_region)
Test:
longevity-multidc-schema-topology-changes-12h-test
Test id:
857012f9-ef88-402f-a211-0318be24be7f
Test name:
scylla-master/tier1/longevity-multidc-schema-topology-changes-12h-test
Test method:
longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 857012f9-ef88-402f-a211-0318be24be7f
$ hydra investigate show-logs 857012f9-ef88-402f-a211-0318be24be7f
Logs:
Jenkins job URL
Argus