ARTEMIS-4325 - Ability to failback after failover #4522
base: main
Conversation
This feature makes sense to me if you are not using HA (live/backup pairs). Have you already considered waiting for a topology update to know when a node is up?
I had the same thought about using the topology update to know when the "original" node came back. I think that would be simpler than manually checking, and it would save a thread as well.
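For illustration, a minimal sketch of the topology-listener approach suggested here, assuming it would be registered via `ServerLocator#addClusterTopologyListener` (part of the Artemis core client). The names `originalNodeId` and `triggerFailback()` are hypothetical and not part of this PR:

```java
import org.apache.activemq.artemis.api.core.client.ClusterTopologyListener;
import org.apache.activemq.artemis.api.core.client.TopologyMember;

public class FailbackTopologyListener implements ClusterTopologyListener {

   private final String originalNodeId; // nodeID recorded before failover

   public FailbackTopologyListener(String originalNodeId) {
      this.originalNodeId = originalNodeId;
   }

   @Override
   public void nodeUP(TopologyMember member, boolean last) {
      // Called whenever the cluster topology reports a node as up; if it
      // is the node we failed away from, failback can start.
      if (originalNodeId.equals(member.getNodeId())) {
         triggerFailback();
      }
   }

   @Override
   public void nodeDown(long eventUID, String nodeID) {
      // Not needed for failback.
   }

   private void triggerFailback() {
      // Hypothetical hook: reconnect sessions to the original node.
   }
}
```

Note that this keys on nodeID, which ties into the caveat raised below about the nodeID changing when a broker is replaced.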
Thanks for your feedback, much appreciated! In the case of failback between HA live/backup pairs, that should be covered already, since failback would only follow a failover, which would only happen if the retry logic failed first, i.e. both live and backup servers are unavailable. I think failing over (and back) between different live/backup pairs makes some sense in a scenario like this, as long as it doesn't happen internally within the groups. If not, would it make sense to add an additional check to inhibit failback between live/backup pairs?

I might have missed something with the topology listener, though. It would be nice to have this happen without having to "ping" the old node... but would this work if the nodeID of the broker changes (say the broker cluster is part of a blue/green deploy or similar, such that the FQDN/IP remains the same but the underlying broker is replaced by one started on a new journal)? I can give that a try regardless, but it might take some time due to vacation.

Br,
Isn't the use case in view here live-to-live failback after failover (implemented recently via ARTEMIS-4251)? I'd leave live/backup pairs out of this completely since there are already semantics in place for that use case.
Force-pushed the branch from 0dde73c to 8dba605
Hello, back again from vacation. I made some changes per our discussion and rebased against main. I'm still running the failback check in a separate thread, since I didn't want to potentially stall/lock the TopologyHandler, but now it doesn't start until the node should be available (or available soon). Let me know what you think and whether there are any other changes that should be made here.
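A minimal sketch of the deferred, single-threaded arrangement described here: the topology callback only hands the check off to a dedicated thread and returns immediately. `FailbackScheduler` and `attemptFailback` are hypothetical names for illustration, not the PR's actual classes:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FailbackScheduler {

   // One daemon thread, so the topology handler is never blocked and
   // the thread does not keep the broker alive on shutdown.
   private final ExecutorService failbackExecutor =
         Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "failback-check");
            t.setDaemon(true);
            return t;
         });

   // Invoked from nodeUP() once the original node is reported up (or
   // expected soon); the reconnect attempt then runs asynchronously.
   public void scheduleFailbackCheck(Runnable attemptFailback) {
      failbackExecutor.submit(attemptFailback);
   }
}
```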
Opening for feedback, as this might not be the right solution. It does, however, demonstrate the functionality I'm after.
If the feature makes sense on a high level, please let me know if the implementation needs any changes or can be improved in some way.
Use case detailed in: https://issues.apache.org/jira/browse/ARTEMIS-4325