fix(nemesis): filter raft-topology errors when starting/stopping nodes #9580
sdcm/nemesis.py
Outdated
with ignore_raft_topology_cmd_failing():
    self.target_node.stop_scylla_server(verify_up=False, verify_down=True)
I think it would be better to use the context manager only around the stop/start operations.
Why? The @decorate_with_context decorator that already exists on this function is a natural fit for this, and was designed for exactly those cases.
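To illustrate the trade-off discussed above, here is a minimal sketch of how a decorator of this kind could wrap a whole function in a context manager. It is not the project's actual implementation: `decorate_with_context_sketch`, `ignore_errors_sketch`, and the `events` list are hypothetical names invented for this example.

```python
import contextlib
import functools

events = []  # records the order of operations, for illustration only


@contextlib.contextmanager
def ignore_errors_sketch():
    """Hypothetical stand-in for a filter such as ignore_raft_topology_cmd_failing."""
    events.append("filter on")
    try:
        yield
    finally:
        # The filter is always removed, even if the wrapped code raises.
        events.append("filter off")


def decorate_with_context_sketch(context_factory):
    """Hypothetical decorator: run the entire wrapped function inside the context."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with context_factory():
                return func(*args, **kwargs)
        return wrapper
    return decorator


@decorate_with_context_sketch(ignore_errors_sketch)
def stop_wait_start():
    # Every step, including the wait, runs with the filter active.
    events.append("stop")
    events.append("wait")
    events.append("start")


stop_wait_start()
```

With the decorator, the filter covers the whole function body, including the wait between stop and start. An inline `with` block around only the stop or start call narrows that window, at the cost of more changes inside the function.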
9cd2665 to 287dfd7
LGTM
@timtimb0t, only the pre-commit checks need fixing.
287dfd7 to 8ff4f7d
Please add a description to the commit explaining why this change is needed, so that someone reviewing it in git blame does not have to look it up on GitHub.
Besides, is it required for every Scylla restart, or only in places where there is a wait between stop and start? If the former, I suppose more places may need adjusting.
sdcm/nemesis.py
Outdated
@@ -1152,7 +1153,8 @@ def _destroy_data_and_restart_scylla(self, keyspaces_for_destroy: list = None, s
         self.log.debug("Chosen tables: %s", tables)

         # Stop scylla service before deleting sstables to avoid partial deletion of files that are under compaction
-        self.target_node.stop_scylla_server(verify_up=False, verify_down=True)
+        with ignore_raft_topology_cmd_failing():
+            self.target_node.stop_scylla_server(verify_up=False, verify_down=True)

         try:
shouldn't this block be inside context?
> shouldn't this block be inside context?
I thought the same, but after a short conversation with @aleksbykov I decided to leave it as is: once the node is stopped, the cluster recognizes this and won't send any more requests to it.
You are right that there may be other places that need a similar change, but this fix is a response to the current issue. If such an error occurs again, we will likely need a new PR.
@@ -675,14 +675,15 @@ def _kill_scylla_daemon(self):

     @target_all_nodes
     def disrupt_stop_wait_start_scylla_server(self, sleep_time=300):  # pylint: disable=invalid-name
I would recommend using @decorate_with_context; it would be less of an interruption to the code.
Yes, a commit should always have a description. Both the title and the description should be something more like
That way people can understand the reasoning and don't need to walk through multiple other issues to understand it.
8ff4f7d to 7ea69ad
@timtimb0t, please add the proper backport labels.
7ea69ad to 3bc8153
3bc8153 to fa1a335
raft is generating the following errors:
raft_topology - topology change coordinator fiber got error std::runtime_error
when one of the nodes is stopped, it was decided we can safely ignore those errors
fixes:9031
fa1a335 to 1ff7b0a
LGTM
raft is generating the following errors:
raft_topology - topology change coordinator fiber got error std::runtime_error
when one of the nodes is stopped, it was decided we can safely ignore those errors
fixes: #9031
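As a rough illustration of the idea in this description, the sketch below shows one way a scoped filter for that error line could behave. It is an assumption-laden toy, not the project's event-filtering code: `ErrorCollector`, its methods, and the sample log suffixes are hypothetical, and only the error text is taken from the description above.

```python
import contextlib
import re

# Error text quoted in the PR description; the regex itself is an assumption.
RAFT_TOPOLOGY_ERROR = re.compile(
    r"raft_topology - topology change coordinator fiber got error std::runtime_error"
)


class ErrorCollector:
    """Hypothetical collector of error lines seen in node logs."""

    def __init__(self):
        self.ignored_patterns = []
        self.errors = []

    def report(self, line):
        # Drop the line if any currently active filter matches it.
        if any(pattern.search(line) for pattern in self.ignored_patterns):
            return
        self.errors.append(line)

    @contextlib.contextmanager
    def ignore(self, pattern):
        """Ignore matching lines only while the with-block is active."""
        self.ignored_patterns.append(pattern)
        try:
            yield
        finally:
            self.ignored_patterns.remove(pattern)


collector = ErrorCollector()
line = "raft_topology - topology change coordinator fiber got error std::runtime_error"

with collector.ignore(RAFT_TOPOLOGY_ERROR):
    # Simulates the error being logged while a node is being stopped.
    collector.report(line + " (during stop)")

# The same error outside the filtered window is still recorded.
collector.report(line + " (later)")
```

The filter is active only inside the `with` block, so the same error raised at any other time would still surface as a failure.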