running brokers don't detect upstream broker restart after a crash #3608

garlick · 2021-04-22T15:17:13Z

Problem: if broker 0 (or a broker acting as a router) crashes and is restarted, it will not have cleanly shut down its downstream brokers. Upon restart, it will be expecting an overlay.hello RPC from them, and will drop any other messages. However, there is no mechanism to notify the downstream brokers that their messages are being dropped and that they should recover or restart. The downstream sub-tree will have brokers running, but they will appear down to the instance.

Although online recovery from an upstream crash will not be tackled in the near-term development plan, we can at least add a notification mechanism so the downstream brokers can log something informative to the systemd journal and take some appropriate action.

The action could be:

- exit with rc=0 and let systemd automatically restart the broker so it can join the restarted instance
- exit with rc != 0 and wait for a sys admin to restart it

I would tend to favor the latter since this situation may indicate that other things are wrong with Flux that need sys admin intervention.

Problem: if a broker crashes without getting word to its child, the child may reconnect and resume messaging once the broker returns to service. These messages will be dropped, but the child is not informed that it should restart. Send a KEEPALIVE_DISCONNECT in that case, triggering a subtree panic. Also, suppress logging of dropped messages from children that have been forcibly disconnected, as it is expected that some messages might be sent before the KEEPALIVE_DISCONNECT is processed on the other end. Improve documentation of expected failure scenarios in the child message handler. Fixes flux-framework#3608
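
The behavior described in that commit message can be sketched as a toy model. This is not the flux-core implementation, only an illustration of the message flow. The names `overlay.hello` and `KEEPALIVE_DISCONNECT` come from the text above; the classes `Parent` and `Child`, the reply strings `"hello-ok"` and `"routed"`, and the topic `"example.request"` are hypothetical. That a later hello clears the disconnected state is also an assumption.

```python
from dataclasses import dataclass, field

HELLO = "overlay.hello"                        # hello RPC named in the issue
KEEPALIVE_DISCONNECT = "KEEPALIVE_DISCONNECT"  # message named in the commit


@dataclass
class Parent:
    """Toy model of a restarted upstream broker (hypothetical, not flux-core)."""
    known: set = field(default_factory=set)         # children that completed hello
    disconnected: set = field(default_factory=set)  # children already told to disconnect
    log: list = field(default_factory=list)

    def recv(self, child_id, topic):
        """Handle one message from a child; return the reply, or None if dropped."""
        if topic == HELLO:
            # Assumption: a fresh hello (re)registers the child.
            self.known.add(child_id)
            self.disconnected.discard(child_id)
            return "hello-ok"
        if child_id in self.known:
            return "routed"
        if child_id in self.disconnected:
            # Expected stragglers sent before the child processed the
            # disconnect: drop them without logging.
            return None
        # Unknown child that skipped hello: log once and tell it to disconnect.
        self.log.append(f"dropping {topic} from unknown child {child_id}")
        self.disconnected.add(child_id)
        return KEEPALIVE_DISCONNECT


@dataclass
class Child:
    """Toy model of a downstream broker that survived the parent's crash."""
    child_id: str
    state: str = "running"

    def send(self, parent, topic):
        reply = parent.recv(self.child_id, topic)
        if reply == KEEPALIVE_DISCONNECT:
            self.state = "panicked"  # stands in for the subtree panic
        return reply
```

In this sketch, a child that resumes messaging against a freshly constructed `Parent()` receives `KEEPALIVE_DISCONNECT` on its first non-hello message and moves to the `"panicked"` state. Any further messages from it are dropped without adding log entries.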

garlick mentioned this issue Aug 30, 2021

broker: ensure subtree restart upon loss of router node #3845

Merged

mergify bot closed this as completed in fee1ad2 Sep 3, 2021
