Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(nemesis): filter raft-topology errors when starting/stopping nodes #9580

Merged
merged 1 commit into from
Dec 19, 2024

Conversation

timtimb0t
Copy link
Contributor

@timtimb0t timtimb0t commented Dec 18, 2024

raft is generating the following errors:
raft_topology - topology change coordinator fiber got error std::runtime_error
when one of the nodes is stopped, it was decided we can safely ignore those errors

fixes: #9031

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

sdcm/nemesis.py Outdated Show resolved Hide resolved
sdcm/nemesis.py Outdated
Comment on lines 1156 to 1157
with ignore_raft_topology_cmd_failing():
self.target_node.stop_scylla_server(verify_up=False, verify_down=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here i think better to use Context manager only for operations stop/start

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why ?

@decorate_with_context that already exists on this function is a natural fit for this, and was design for those cases exactly

@timtimb0t timtimb0t force-pushed the sct_raft_topology_errors_ignoring branch from 9cd2665 to 287dfd7 Compare December 18, 2024 11:30
aleksbykov
aleksbykov previously approved these changes Dec 18, 2024
Copy link
Contributor

@aleksbykov aleksbykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@aleksbykov
Copy link
Contributor

@timtimb0t , fix only precommits

@timtimb0t timtimb0t force-pushed the sct_raft_topology_errors_ignoring branch from 287dfd7 to 8ff4f7d Compare December 18, 2024 12:00
@timtimb0t timtimb0t marked this pull request as ready for review December 18, 2024 12:02
@timtimb0t timtimb0t requested a review from soyacz December 18, 2024 12:02
Copy link
Contributor

@soyacz soyacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add description to commit why this change is needed, so when reviewing in git blane one does not have to peek into gh.

Besides, is it required for every Scylla restart or only in places when there's a wait between stop and start? If so, I suppose there could be more places needed to adjust.

sdcm/nemesis.py Outdated
@@ -1152,7 +1153,8 @@ def _destroy_data_and_restart_scylla(self, keyspaces_for_destroy: list = None, s
self.log.debug("Chosen tables: %s", tables)

# Stop scylla service before deleting sstables to avoid partial deletion of files that are under compaction
self.target_node.stop_scylla_server(verify_up=False, verify_down=True)
with ignore_raft_topology_cmd_failing():
self.target_node.stop_scylla_server(verify_up=False, verify_down=True)

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this block be inside context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this block be inside context?

I thought the same, but after a short conversation with @aleksbykov, I decided to do everything as is, because once the node is stopped, the cluster will recognize this and won’t send any more requests to it.

@timtimb0t
Copy link
Contributor Author

Please add description to commit why this change is needed, so when reviewing in git blane one does not have to peek into gh.

Besides, is it required for every Scylla restart or only in places when there's a wait between stop and start? If so, I suppose there could be more places needed to adjust.

You are right; there may be other places that need a similar change, but this fix is a response to the current issue. If such an error occurs again, we will likely need a new PR.

@@ -675,14 +675,15 @@ def _kill_scylla_daemon(self):

@target_all_nodes
def disrupt_stop_wait_start_scylla_server(self, sleep_time=300): # pylint: disable=invalid-name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend using @decorate_with_context, it would be less of interruption to the code

@fruch
Copy link
Contributor

fruch commented Dec 18, 2024

Please add description to commit why this change is needed, so when reviewing in git blane one does not have to peek into gh.

Besides, is it required for every Scylla restart or only in places when there's a wait between stop and start? If so, I suppose there could be more places needed to adjust.

Yes commit should always have description, also as for the title
sct lib doesn't mean much, in this case fix(nemesis): ...

as for the description should be something more like

fix(nemesis): filter raft-topology errors when starting/stopping nodes

raft is generating the following errors:

[ one example of such print ]

when one of the nodes is stopped, it was decided we can safely ignore those errors

Ref: [scylla issue about how core would address that]
Fixes: #9031

in such manner people can understand the reasoning, and are don't need to go for walk across multiple other issue to understand it.

@soyacz
Copy link
Contributor

soyacz commented Dec 19, 2024

@timtimb0t please add proper backport labels

@timtimb0t timtimb0t added backport/2024.2 Need backport to 2024.2 backport/6.1 Need backport to 6.1 backport/6.2 labels Dec 19, 2024
@timtimb0t timtimb0t force-pushed the sct_raft_topology_errors_ignoring branch from 7ea69ad to 3bc8153 Compare December 19, 2024 09:41
@timtimb0t timtimb0t changed the title fix(sct lib):decorators added to ignore raft-topology errors fix(nemesis): decorators added to ignore raft-topology errors Dec 19, 2024
@timtimb0t timtimb0t force-pushed the sct_raft_topology_errors_ignoring branch from 3bc8153 to fa1a335 Compare December 19, 2024 16:17
raft is generating the following errors:
raft_topology - topology change coordinator fiber got error std::runtime_error
when one of the nodes is stopped, it was decided we can safely ignore those errors
fixes:9031
@timtimb0t timtimb0t force-pushed the sct_raft_topology_errors_ignoring branch from fa1a335 to 1ff7b0a Compare December 19, 2024 16:27
@fruch fruch changed the title fix(nemesis): decorators added to ignore raft-topology errors fix(nemesis): filter raft-topology errors when starting/stopping nodes Dec 19, 2024
Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@scylladbbot scylladbbot added backport/6.2-done backport/2024.2-done Commit backported to 2024.2 backport/6.1-done Commit backported to 6.1 and removed backport/6.2 backport/2024.2 Need backport to 2024.2 backport/6.1 Need backport to 6.1 labels Dec 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[disrupt_destroy_data_then_rebuild] nemises caused lots of raft_topology errors that never lead to failure
5 participants