running brokers don't detect upstream broker restart after a crash #3608

Closed
garlick opened this issue Apr 22, 2021 · 0 comments

garlick commented Apr 22, 2021

Problem: if broker 0 (or a broker acting as a router) crashes and is restarted, it will not have cleanly shut down its downstream brokers. Upon restart, it expects an overlay.hello RPC from each of them and drops any other messages. However, there is no mechanism to notify the downstream brokers that their messages are being dropped and that they should recover or restart. The brokers in the downstream sub-tree keep running, but they appear down to the instance.

Although online recovery from an upstream crash will not be tackled in the near-term development plan, we can at least add a notification mechanism so the downstream brokers can log something informative to the systemd journal and take some appropriate action.

The action could be

  • exit rc=0 and let systemd automatically restart the broker so it can join the restarted instance
  • exit rc != 0 and wait for a sys admin to restart it

I would tend to favor the latter since this situation may indicate that other things are wrong with Flux that need sys admin intervention.
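
The two options above differ only in the exit code the downstream broker chooses after logging the event. A minimal sketch of that decision, assuming hypothetical names (`RestartPolicy`, `on_upstream_restart`) that are not part of flux-core, and assuming the systemd unit is configured to restart the broker after a clean exit (that depends on the unit's `Restart=` setting, which the issue does not show):

```python
import enum


class RestartPolicy(enum.Enum):
    # Hypothetical policy names, for illustration only.
    AUTO_REJOIN = "auto"      # exit 0 and rely on systemd to restart the broker
    WAIT_FOR_ADMIN = "admin"  # exit non-zero and leave the broker down


def on_upstream_restart(policy, log):
    """Log the condition and return the exit code the broker should use.

    `log` is any callable that accepts a string, e.g. a journal logger.
    """
    log("upstream broker restarted; messages from this broker are being dropped")
    return 0 if policy is RestartPolicy.AUTO_REJOIN else 1
```

A caller would pass the result to `sys.exit()`. With `WAIT_FOR_ADMIN`, the non-zero status leaves a visible failure for the sys admin to investigate, which matches the preference stated above.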

garlick added a commit to garlick/flux-core that referenced this issue Aug 30, 2021
Problem: if broker crashes without getting word to child,
the child may reconnect and resume messaging once the broker
returns to service.  These messages will be dropped, but
the child is not informed that it should restart.

Send a KEEPALIVE_DISCONNECT in that case, triggering a
subtree panic.

Also, suppress logging dropped messages from children that
have been forcibly disconnected, as it is expected that some
messages might be sent before the KEEPALIVE_DISCONNECT is
processed on the other end.

Improve documentation of expected failure scenarios in the
child message handler.

Fixes flux-framework#3608
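
The behavior described in this commit message can be modeled roughly as follows. This is a simplified sketch in Python, not the flux-core C implementation: the `Parent` class, its fields, and the example topic string are assumptions made for illustration. Only `overlay.hello` and `KEEPALIVE_DISCONNECT` are names taken from the issue.

```python
from dataclasses import dataclass, field

HELLO = "overlay.hello"
KEEPALIVE_DISCONNECT = "KEEPALIVE_DISCONNECT"


@dataclass
class Parent:
    """Toy model of a restarted upstream broker's child message handler."""
    known: set = field(default_factory=set)         # children that completed hello
    disconnected: set = field(default_factory=set)  # children already told to disconnect
    log: list = field(default_factory=list)

    def handle(self, child, topic):
        """Return the control message to send back to the child, or None."""
        if topic == HELLO:
            # A (re)starting child introduces itself and becomes a known peer.
            self.known.add(child)
            self.disconnected.discard(child)
            return None
        if child in self.known:
            return None  # normal message routing would happen here
        if child in self.disconnected:
            # Stragglers sent before the child processed the disconnect:
            # drop them without logging, since they are expected.
            return None
        # A child is messaging without having said hello, so it predates
        # our restart. Drop the message and tell the child to disconnect.
        self.log.append(f"dropping {topic} from unknown child {child}")
        self.disconnected.add(child)
        return KEEPALIVE_DISCONNECT
```

In this model, the first non-hello message from an unknown child is logged once and answered with `KEEPALIVE_DISCONNECT`; later messages from the same child are dropped silently until it sends `overlay.hello` again. On the receiving end, the commit message says the disconnect triggers a subtree panic, which is not modeled here.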
garlick added a commit to garlick/flux-core that referenced this issue Sep 1, 2021
garlick added a commit to garlick/flux-core that referenced this issue Sep 2, 2021
@mergify mergify bot closed this as completed in fee1ad2 Sep 3, 2021
chu11 pushed a commit to chu11/flux-core that referenced this issue Sep 28, 2021