running brokers don't detect upstream broker restart after a crash #3608
Comments
garlick added a commit to garlick/flux-core that referenced this issue on Aug 30, 2021:
Problem: if broker crashes without getting word to child, the child may reconnect and resume messaging once the broker returns to service. These messages will be dropped, but the child is not informed that it should restart. Send a KEEPALIVE_DISCONNECT in that case, triggering a subtree panic. Also, suppress logging dropped messages from children that have been forcibly disconnected, as it is expected that some messages might be sent before the KEEPALIVE_DISCONNECT is processed on the other end. Improve documentation of expected failure scenarios in the child message handler. Fixes flux-framework#3608
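To make the behavior described in this commit message concrete, here is a minimal Python sketch of the idea. It is a toy model under stated assumptions, not flux-core's actual C implementation: the `Parent` and `Child` classes, the `handle` and `receive` methods, the `simulate_restart` helper, and the `example.request` topic are all hypothetical names invented for illustration. Only `overlay.hello` and `KEEPALIVE_DISCONNECT` come from the discussion above.

```python
from dataclasses import dataclass, field

HELLO = "overlay.hello"                      # handshake topic named in the issue
KEEPALIVE_DISCONNECT = "KEEPALIVE_DISCONNECT"  # notification named in the commit message


@dataclass
class Parent:
    """Toy upstream broker (hypothetical model, not the flux-core API)."""
    online: set = field(default_factory=set)        # children that said hello
    disconnected: set = field(default_factory=set)  # children already told to disconnect
    outbox: list = field(default_factory=list)      # (child, message) sent downstream
    log: list = field(default_factory=list)

    def handle(self, child: str, topic: str) -> bool:
        """Return True if the message is accepted, False if it is dropped."""
        if topic == HELLO:
            self.online.add(child)
            self.disconnected.discard(child)
            return True
        if child in self.online:
            return True
        if child not in self.disconnected:
            # First stray message from a child this broker does not know:
            # log once and tell the child to disconnect.
            self.log.append(f"dropping {topic} from {child}: no hello received")
            self.outbox.append((child, KEEPALIVE_DISCONNECT))
            self.disconnected.add(child)
        # Later strays from a forcibly disconnected child are dropped quietly,
        # since some may be in flight before the child processes the disconnect.
        return False


@dataclass
class Child:
    """Toy downstream broker that panics when told to disconnect."""
    name: str
    panicked: bool = False

    def receive(self, message: str) -> None:
        if message == KEEPALIVE_DISCONNECT:
            self.panicked = True  # stands in for the subtree panic


def simulate_restart():
    """Child says hello, parent crashes and restarts, child keeps sending."""
    child = Child("rank1")
    parent = Parent()
    parent.handle(child.name, HELLO)
    parent = Parent()  # crash and restart: all child state is lost
    results = [parent.handle(child.name, "example.request") for _ in range(3)]
    for name, message in parent.outbox:
        if name == child.name:
            child.receive(message)
    return parent, child, results
```

In this model the restarted parent drops all three stray requests, sends a single disconnect notification, and logs only the first drop. How the real broker tracks child state and delivers the notification may differ.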
garlick added a commit to garlick/flux-core that referenced this issue on Sep 1, 2021 (same commit message).
garlick added a commit to garlick/flux-core that referenced this issue on Sep 2, 2021 (same commit message).
chu11 pushed a commit to chu11/flux-core that referenced this issue on Sep 28, 2021 (same commit message).
Problem: if broker 0 (or a broker acting as a router) crashes and is restarted, it will not have cleanly shut down its downstream brokers. Upon restart, it will be expecting an overlay.hello RPC from them, and will drop any other messages. However, there is no mechanism to notify the downstream brokers that messages are being dropped and that they should recover or restart. The downstream sub-tree will have brokers running, but they will appear down to the instance.
Although online recovery from an upstream crash will not be tackled in the near-term development plan, we can at least add a notification mechanism so the downstream brokers can log something informative to the systemd journal and take some appropriate action.
The action could be
I would tend to favor the latter since this situation may indicate that other things are wrong with Flux that need sys admin intervention.