broker: ensure subtree restart upon loss of router node #3845
Conversation
Force-pushed 523379c to a9175cc
Force-pushed a9175cc to be7ce13
I will work a bit on the testing here, as there are some code paths in the diff that should definitely have coverage.

Ok, that was going to be my one comment, so I'll hold off reviewing for now.
Problem: comment refers to child->connected, but that struct element no longer exists. Rephrase the comment to reflect the current code.
Problem: child_lookup() can only find online children, but we have at least two callers that call it, and then check whether the child is online. Rename to child_lookup_online() to avoid confusion. Update callers to remove redundant checks.
Problem: in child_cb(), 'sender' refers to the peer ID not a request sender ID, and is an artifact of the ROUTER socket that applies to any Flux message type, hence the variable name is confusing. Rename 'sender' to 'uuid'.
Problem: broker/overlay.c contains mixed use of !strcmp() and streq() to test for string equality. Convert remaining !strcmp() calls to streq().
Force-pushed aae103b to 7a41a05
Problem: if a broker crashes without getting word to its child, the child may reconnect and resume messaging once the broker returns to service. These messages will be dropped, but the child is not informed that it should restart.

Send a KEEPALIVE_DISCONNECT in that case, triggering a subtree panic. Also, suppress logging of dropped messages from children that have been forcibly disconnected, since it is expected that some messages may be sent before the KEEPALIVE_DISCONNECT is processed on the other end. Improve documentation of expected failure scenarios in the child message handler.

Fixes flux-framework#3608
Problem: when a broker crashes and reconnects, the parent does not properly transition the child subtree status through SUBTREE_STATUS_LOST before updating to the new status. The fallout is:

1. If the child status has not changed from before, this confuses overlay_child_status_update(), which will not update overlay->child_hash because it only does so upon transition to online. The child node will be unable to communicate.
2. If there are pending RPCs tracked for the old broker uuid, they are not failed, because the old uuid never transitions to the LOST state.
3. The code for updating the status/uuid is not reached if version negotiation fails. The parent continues to think the child is online.

Call overlay_child_status_update() to transition the status to SUBTREE_STATUS_LOST before accepting the new status from the hello message. Relocate the version check between the SUBTREE_STATUS_LOST update and the new status update, so LOST will stick if version negotiation fails. Finally, update logging so the old and new UUIDs are logged at LOG_ERR level, since this occurrence may be useful forensically.

Update a test in t3300-system-basic that was fragile with respect to the changed log output.
Force-pushed b60772e to 453cda1
I'm getting reproducible failures from … Any idea why the …?

I can't understand what's going on here, just to make sure the …
I don't understand how that RPC can get this particular error -- isn't it just an RPC to rank 0, which is clearly online? Maybe I should try a fresh clone of this branch.

Exactly what I was about to say. I don't think it can happen on rank 0. Print out …
Ah! FLUX_URI is still set to rank 2.
This fixes it:

    diff --git a/t/t3306-system-routercrash.t b/t/t3306-system-routercrash.t
    index 98345b3f3..507deac6a 100755
    --- a/t/t3306-system-routercrash.t
    +++ b/t/t3306-system-routercrash.t
    @@ -38,7 +38,9 @@ test_expect_success 'restart broker 1' '
     '
     test_expect_success 'ping broker 1 via broker 2' '
    +(
     FLUX_URI=$(cat uri2) test_must_fail flux ping --count=1 1
    +)
     '
     test_expect_success 'broker 2 was disconnected' '

Because of the way sharness uses …
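The subshell in that diff matters because of a shell portability quirk. Here is a minimal sketch of the effect (the function name `ping_wrapper` is a hypothetical stand-in for sharness's `test_must_fail`, which is a shell function): POSIX leaves it unspecified whether a variable assignment prefixed to a shell *function* call persists in the calling shell afterward, and some shells keep it, whereas a subshell guarantees the assignment cannot escape.

```shell
#!/bin/sh
# Hypothetical stand-in for sharness's test_must_fail; the key point
# is that it is a shell function, not an external command.
ping_wrapper() { :; }

# A prefix assignment on a *function* call may leak into the calling
# shell on some shells (POSIX leaves the behavior unspecified).
FLUX_URI="local:///fake" ping_wrapper
echo "after function call: ${FLUX_URI:-unset}"   # shell-dependent

unset FLUX_URI

# The fix from the diff: run it in a subshell, which guarantees the
# assignment cannot escape in any shell.
( FLUX_URI="local:///fake" ping_wrapper )
echo "after subshell: ${FLUX_URI:-unset}"        # prints "after subshell: unset"
```

This explains why `flux ping` in the later subtest silently inherited the rank 2 URI on some systems but not others.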
Thanks, I was finding that very puzzling. I'll go ahead and incorporate that fix and repush.
Problem: test fails occasionally in CI.

This was observed in CI:

    expecting success:
        pid=$(cat health.pid) &&
        echo waiting for pid $pid &&
        test_expect_code 1 wait $pid &&
        grep "overlay disconnect" health.err
    waiting for pid 604451
    not ok 15 - background RPC fails with overlay disconnect (tracker response from 6)

Since the RPC was requested through rank 13, and rank 13 exits as a result of the crash of broker 6, it may be possible to receive other errors here, such as ECONNRESET. Change the test to just expect failure, not a particular error. The important outcome is that RPCs do not hang.
Problem: t3305-system-disconnect includes a couple of subtests that check that a broker exited, allowing for failure or not; however, the tests would also succeed if the broker died by signal. Use the 'test_might_fail' macro instead of the '|| /bin/true' idiom.
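The distinction can be sketched with a rough approximation (the helper `might_fail` below is illustrative, not sharness's actual implementation): `|| /bin/true` masks every non-zero status, including the 128+N statuses that indicate death by signal, while a `test_might_fail`-style check tolerates ordinary failure but still rejects signal death.

```shell
#!/bin/sh
# Illustrative approximation of the test_might_fail idea, not the
# real sharness helper: ordinary exit codes (0-128) pass, but a
# status above 128 -- death by signal -- is reported as a failure.
might_fail() {
    status=0
    "$@" || status=$?
    if [ "$status" -gt 128 ]; then
        echo >&2 "died by signal $((status - 128))"
        return 1
    fi
    return 0
}

might_fail true    # exit 0: passes
might_fail false   # exit 1: still passes

# The old idiom hides a signal death entirely:
sh -c 'kill -KILL $$' || /bin/true    # status 137 silently swallowed

# A might_fail-style check catches it:
if ! might_fail sh -c 'kill -KILL $$' 2>/dev/null; then
    echo "signal death detected"
fi
```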
Problem: t/python/t0010-job-py occasionally fails:

    not ok 33 __main__.TestJob.test_32_job_result
      ---
      message: |
        Traceback (most recent call last):
          File "python/t0010-job.py", line 574, in test_32_job_result
            self.assertEqual(
        AssertionError: 143 != -15
      ...
    1..33

The test cancels a sleep job just after the start event appears in the main eventlog, but at that point, usually only the shell has started, and the sleep has not yet been spawned. When the sleep actually does get spawned, the wait status is different than expected.

Change the test to wait for the shell.start event in the guest.exec eventlog before canceling, and then expect 143<<8.

Fixes flux-framework#3666
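To make the numbers in that failure concrete, here is a small shell sketch (my own illustration, assuming standard POSIX shell and wait(2) semantics, not part of the test suite): a process killed by SIGTERM surfaces as 128+15=143 at the shell level, while a wait-status word for a *normal* exit of 143 carries that code in its high byte, i.e. 143<<8; direct death by the signal is what the Python side reports as -15.

```shell
#!/bin/sh
# A process killed by SIGTERM (signal 15) reports exit status
# 128 + 15 = 143 to its parent shell.
status=0
sh -c 'kill -TERM $$' || status=$?
echo "shell-level status: $status"       # prints 143

# In a raw wait(2)-style status word, a normal exit code occupies
# the high byte, so a job shell that itself exits with 143 encodes
# as 143 << 8:
echo "raw wait status: $(( 143 << 8 ))"  # prints 36608
```

So once the sleep has actually been spawned and is killed, the job shell exits normally with 143, and the job result carries 143<<8 rather than -15.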
Problem: the system-rpctrack sharness test only tests RPC tracking in the downstream direction, while the system-disconnect test covers the upstream direction. The names do not reflect the purpose of the tests very well.

Rename:

    t3304-system-rpctrack.t   -> t3304-system-rpctrack-down.t
    t3305-system-disconnect.t -> t3305-system-rpctrack-up.t

Improve the test descriptions as well.
Problem: system sharness test descriptions are poor. Flesh out descriptions for the existing t330x-system-*.t tests.
Problem: t3305-system-rpctrack-up.t sets FLUX_URI in a test without running it in a subshell, which may assign FLUX_URI in the main shell as a side effect on some systems. Enclose the assignment in parentheses to ensure FLUX_URI is not set in the main shell.
Force-pushed 453cda1 to 9a77e05
OK, I squashed that fix and one other one that I had added incrementally to the same test, and forced a push.
I wonder if a …
LGTM!
Codecov Report

    @@           Coverage Diff           @@
    ##           master    #3845   +/-   ##
    =======================================
      Coverage   83.52%   83.52%
    =======================================
      Files         355      355
      Lines       51971    51975       +4
    =======================================
    + Hits        43409    43413       +4
      Misses       8562     8562
Thanks - I'm going to set MWP.
This adds code to handle the case described in #3608, where a broker might be forever cut off from the Flux instance if its TBON parent restarts without informing it (for example, after a kill -9 or a node crash). With this change, the parent detects the child trying to communicate without first executing the hello handshake to register its uuid, and sends a KEEPALIVE_DISCONNECT message to ensure it stops. Systemd would then be able to restart it.
Also fixed a problem, discovered during testing, with the crashed broker properly rejoining its parent.
There is also some minor overlay cleanup here.
This is built on top of PR #3843. Once that is merged, this can be rebased.