
broker: ensure subtree restart upon loss of router node #3845

Merged: 14 commits into flux-framework:master, Sep 3, 2021

Conversation

@garlick (Member) commented Aug 30, 2021

This adds code to handle the case described in #3608, where a broker might be forever cut off from the Flux instance if its TBON parent restarts without informing it (for example, after a kill -9 or a node crash). With this change, the parent detects a child trying to communicate without first executing the hello handshake to register its uuid, and sends it a KEEPALIVE_DISCONNECT message so the child shuts down. Systemd is then able to restart it.
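
As a sketch of the recovery sequence this enables, mirroring the t3306 test that appears later in this thread ($startctl is the test-harness helper used there; the actual systemd restart is outside the harness):

	# Crash a router node (broker 1) without warning, then bring it back.
	$startctl kill 1 9      # simulate kill -9 or a node crash
	$startctl run 1         # broker 1 rejoins rank 0 with a new uuid
	# Broker 2 resumes messaging without redoing the hello handshake, so
	# the parent answers with KEEPALIVE_DISCONNECT; broker 2 panics its
	# subtree and exits, and its process manager can restart it.
	$startctl run 2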

Also fixed a problem, discovered during testing, that prevented the crashed broker from properly rejoining its parent.

There is also some minor overlay cleanup here.

This is built on top of PR #3843. Once that is merged, this can be rebased.

@garlick (Member, Author) commented Sep 1, 2021

I will work a bit on the testing here as there are some code paths in the diff that should definitely have coverage.

@grondo (Contributor) commented Sep 1, 2021

> I will work a bit on the testing here as there are some code paths in the diff that should definitely have coverage.

OK, that was going to be my one comment, so I'll hold off reviewing for now.

Problem: comment refers to child->connected, but that struct
element no longer exists.

Rephrase the comment to reflect the current code.

Problem: child_lookup() can only find online children, but
we have at least two callers that call it and then check whether
the child is online.

Rename to child_lookup_online() to avoid confusion.
Update callers to remove redundant checks.

Problem: in child_cb(), 'sender' refers to the peer ID, not
a request sender ID, and is an artifact of the ROUTER socket
that applies to any Flux message type, hence the variable name
is confusing.

Rename 'sender' to 'uuid'.

Problem: broker/overlay.c contains mixed use of !strcmp() and
streq() to test for string equality.

Convert remaining !strcmp() calls to streq().

@garlick garlick force-pushed the upstream_broker_crash branch 3 times, most recently from aae103b to 7a41a05 on September 2, 2021 17:14
@garlick garlick added this to the flux-core v0.29.0 milestone Sep 3, 2021

Problem: if broker crashes without getting word to child,
the child may reconnect and resume messaging once the broker
returns to service.  These messages will be dropped, but
the child is not informed that it should restart.

Send a KEEPALIVE_DISCONNECT in that case, triggering a
subtree panic.

Also, suppress logging dropped messages from children that
have been forcibly disconnected, as it is expected that some
messages might be sent before the KEEPALIVE_DISCONNECT is
processed on the other end.

Improve documentation of expected failure scenarios in the
child message handler.

Fixes flux-framework#3608

Problem: when a broker crashes and reconnects, the parent
does not properly transition the child subtree status through
SUBTREE_STATUS_LOST before updating to the new status.

The fallout is:

1. If the child status has not changed from before, this
confuses overlay_child_status_update(), which will not
update overlay->child_hash because it only does so upon
transition to online.  The child node will be unable
to communicate.

2. If there are pending RPCs tracked for the old broker uuid,
they are not failed because the old uuid never transitions
to the LOST state.

3. The code for updating the status/uuid is not reached if
version negotiation fails.  The parent continues to think
the child is online.

Call overlay_child_status_update() to transition the status to
SUBTREE_STATUS_LOST before accepting the new status from the
hello message.

Relocate the version check between the SUBTREE_STATUS_LOST update
and the new status update, so LOST will stick if version negotiation
fails.

Finally, update logging so the old and new UUIDs are logged at
LOG_ERR level, since this occurrence may be forensically useful.

Update a test in t3300-system-basic that was fragile with respect
to the changed log output.

@garlick garlick force-pushed the upstream_broker_crash branch 2 times, most recently from b60772e to 453cda1 on September 3, 2021 15:27
@grondo (Contributor) commented Sep 3, 2021

I'm getting reproducible failures from t3306-system-routercrash on my system.
Must be my environment because all tests appear to have passed in CI...

Any idea why the --wait option to flux overlay status doesn't appear to be working here?

 grondo@asp:~/git/f.git/t$ git describe
v0.28.0-211-g453cda11b
 grondo@asp:~/git/f.git/t$ ./t3306-system-routercrash.t -d -v
sharness: loading extensions from /home/grondo/git/f.git/t/sharness.d/01-setup.sh
sharness: loading extensions from /home/grondo/git/f.git/t/sharness.d/flux-sharness.sh
sharness: loading extensions from /home/grondo/git/f.git/t/sharness.d/01-setup.sh
sharness: loading extensions from /home/grondo/git/f.git/t/sharness.d/flux-sharness.sh
expecting success: 
	flux exec flux setattr log-stderr-mode local

ok 1 - tell each broker to log to stderr

expecting success: 
	echo "local://$(flux getattr rundir)/local-2" >uri2 &&
	test $(FLUX_URI=$(cat uri2) flux getattr rank) -eq 2

ok 2 - construct FLUX_URI for rank 2 (child of 1)

expecting success: 
	$startctl kill 1 9 &&
	test_expect_code 137 $startctl wait 1

flux-start: 1 (pid 16898) Killed
ok 3 - kill -9 broker 1

expecting success: 
	$startctl run 1

broker.err[0]: rank 1 uuid efeea11f-eb52-4753-852a-a6f806da9143 lost, to be superceded by new hello
broker.err[0]: hello from rank 1 uuid 8a6c9873-b258-471c-a332-6a749a4379c4 status partial
ok 4 - restart broker 1

expecting success: 
	FLUX_URI=$(cat uri2) test_must_fail flux ping --count=1 1

broker.err[2]: parent disconnect
broker.err[2]: state-machine.monitor: No route to host
flux-ping: 1!broker.ping: No route to host
ok 5 - ping broker 1 via broker 2

expecting success: 
	test_might_fail $startctl wait 2

flux-start: 2 (pid 16914) exited with rc=1
ok 6 - broker 2 was disconnected

expecting success: 
	$startctl run 2

ok 7 - restart broker 2

expecting success: 
	run_timeout 10 flux overlay status -vvv --pretty --color --wait full

flux-overlay: Upstream Flux broker is offline. Try again later.
not ok 8 - wait for rank 0 to report subtree status of full
#	
#		run_timeout 10 flux overlay status -vvv --pretty --color --wait full
#	

expecting success: 
	flux ping --count=1 2

flux-ping: 2!broker.ping: Resource temporarily unavailable
not ok 9 - ping broker 2 via broker 0
#	
#		flux ping --count=1 2
#	

expecting success: 
        flux mini run -n3 -N3 /bin/true

ok 10 - run a 3 node job

# failed 2 among 10 test(s)
1..10
broker.err[0]: rc2.0: /bin/bash -c sh ./t3306-system-routercrash.t  --verbose --debug --root=/home/grondo/git/f.git/t/trash-directory.t3306-system-routercrash Exited (rc=1) 4.1s
flux-start: 0 (pid 16882) exited with rc=1

@grondo (Contributor) commented Sep 3, 2021

I can't understand what's going on here. Just to make sure the overlay.health RPC was being generated as expected, I added FLUX_HANDLE_TRACE and got:

expecting success: 
	FLUX_HANDLE_TRACE=1 flux overlay status -vvv --pretty --color --wait full

--------------------------------------
> 0.00000
> overlay.health
> flags=topic,payload,route userid=unknown rolemask=none nodeid=0 matchtag=1
> {"wait":"full"}
--------------------------------------
< 0.00038
< overlay.health
< flags=topic,payload,route userid=1000 rolemask=owner errnum=11 matchtag=1
< Upstream Flux broker is offline. Try again later.
flux-overlay: Upstream Flux broker is offline. Try again later.
not ok 8 - wait for rank 0 to report subtree status of full
#	
#		FLUX_HANDLE_TRACE=1 flux overlay status -vvv --pretty --color --wait full
#	

I don't understand how that RPC can get this particular error -- isn't it just an RPC to rank 0, which is clearly online?

Maybe I should try a fresh clone of this branch.

@garlick (Member, Author) commented Sep 3, 2021

> I don't understand how that RPC can get this particular error -- isn't it just an RPC to rank 0, which is clearly online?

Exactly what I was about to say. I don't think it can happen on rank 0. Print out $FLUX_URI maybe? Did something earlier on have a side effect?

@grondo (Contributor) commented Sep 3, 2021

Ah!

expecting success: 
	echo FLUX_URI=$FLUX_URI &&
	FLUX_HANDLE_TRACE=1 flux overlay status -vvv --pretty --color --wait full

FLUX_URI=local:///tmp/flux-system-xy88hs/local-2
--------------------------------------
> 0.00000
> overlay.health
> flags=topic,payload,route userid=unknown rolemask=none nodeid=0 matchtag=1
> {"wait":"full"}
--------------------------------------
< 0.00048
< overlay.health
< flags=topic,payload,route userid=1000 rolemask=owner errnum=11 matchtag=1
< Upstream Flux broker is offline. Try again later.
flux-overlay: Upstream Flux broker is offline. Try again later.

FLUX_URI is still set to the rank 2 URI...

@grondo (Contributor) commented Sep 3, 2021

This fixes it:

diff --git a/t/t3306-system-routercrash.t b/t/t3306-system-routercrash.t
index 98345b3f3..507deac6a 100755
--- a/t/t3306-system-routercrash.t
+++ b/t/t3306-system-routercrash.t
@@ -38,7 +38,9 @@ test_expect_success 'restart broker 1' '
 '
 
 test_expect_success 'ping broker 1 via broker 2' '
+       (
        FLUX_URI=$(cat uri2) test_must_fail flux ping --count=1 1
+       )
 '
 
 test_expect_success 'broker 2 was disconnected' '

Because of the way sharness uses eval on tests, I guess env vars set on the command line still get exported to the global shell environment. Ugh.
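
A minimal standalone sketch of that kind of leak (an assumption about the mechanism: POSIX leaves it unspecified whether an assignment prefixed to a shell function call persists in the caller; dash and ksh persist it, bash does not, and sharness runs test bodies through functions and eval):

	#!/bin/sh
	leaky() { echo "inside: VAR=$VAR"; }
	VAR=oops leaky                    # assignment meant to be call-scoped
	echo "after: VAR=${VAR:-unset}"   # "oops" in dash/ksh, "unset" in bash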

@garlick (Member, Author) commented Sep 3, 2021

Thanks, I was finding that very puzzling. I'll go ahead and incorporate that fix and repush.

Problem: test fails occasionally in CI

This was observed in CI:

  expecting success:
    pid=$(cat health.pid) &&
    echo waiting for pid $pid &&
    test_expect_code 1 wait $pid &&
    grep "overlay disconnect" health.err

  waiting for pid 604451
  not ok 15 - background RPC fails with overlay disconnect (tracker response from 6)

Since the RPC was requested through rank 13, and rank 13 exits
as a result of the crash of broker 6, it may be possible to receive
other errors here, such as ECONNRESET.

Change the test to just expect failure, not a particular one.
The important outcome is that RPCs do not hang.

Problem: t3305-system-disconnect includes a couple of subtests
that check whether a broker exited, allowing for either success
or failure, but the tests would also succeed if the broker died
by signal.

Use the 'test_might_fail' macro instead of the '|| /bin/true' idiom.
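
For illustration, the difference between the two idioms (a sketch using the wait test from this PR; per the git/sharness convention, test_might_fail accepts exit 0 or an ordinary nonzero exit, but flags exit codes above 125, i.e. 126, 127, and 128+signal):

	# Old idiom: accepts any outcome, including death by signal.
	$startctl wait 2 || /bin/true
	# New idiom: still fails the test if the broker was killed by a signal.
	test_might_fail $startctl wait 2
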
Problem: t/python/t0010-job-py occasionally fails:

  not ok 33 __main__.TestJob.test_32_job_result
  ---
    message: |
      Traceback (most recent call last):
        File "python/t0010-job.py", line 574, in test_32_job_result
          self.assertEqual(
      AssertionError: 143 != -15
  ...
  1..33

The test cancels a sleep job just after the start event appears in
the main eventlog, but at that point, _usually_ only the shell
has started, and the sleep has not yet been spawned.  When the
sleep actually does get spawned, the wait status is different from
the one expected.

Change the test to wait for the shell.start event in the guest.exec
eventlog before canceling, and then expect 143<<8.

Fixes flux-framework#3666
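
For reference, a sketch of the two encodings at play (an aside, not part of the change: 143 = 128+15 is the shell convention for a child killed by SIGTERM, 143<<8 = 36608 is the raw waitpid(2) status for a normal exit with code 143, and -15 follows the Python convention of reporting death by signal N as -N):

	sleep 30 &
	pid=$!
	kill -TERM $pid
	wait $pid
	echo $?   # 143: shell convention (128+15) for death by SIGTERM
	# If the job shell itself then exits normally with code 143, the raw
	# wait status observed by the test is 143<<8 = 36608, not signal 15.
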
Problem: the system-rpctrack sharness test only tests rpc
tracking in the downstream direction, while the system-disconnect
test covers the upstream direction.  The names do not reflect
the purpose of the tests very well.

Rename:
t3304-system-rpctrack.t -> t3304-system-rpctrack-down.t
t3305-system-disconnect.t -> t3305-system-rpctrack-up.t

Improve the test descriptions as well.

Problem: system sharness test descriptions are poor.

Flesh out descriptions for existing t330x-system-*.t tests.

Problem: t3305-system-rpctrack-up.t sets FLUX_URI
in a test without running it in a subshell, which
may assign FLUX_URI in the main shell as a side effect
on some systems.

Enclose the assignment in parentheses to ensure FLUX_URI
is not set in the main shell.
@garlick (Member, Author) commented Sep 3, 2021

OK, I squashed that fix and one other that I had added incrementally to the same test, and force-pushed.
Oh, and I noticed the same kind of thing in t3305-system-rpctrack-up.t, so I fixed it there too in another commit.

@grondo (Contributor) commented Sep 3, 2021

I wonder if a flux --uri=URI ... option would be useful when the environment should be avoided. flux(1) could modify the environment before starting the subcommand or calling flux_open(3).

@grondo (Contributor) left a comment

LGTM!

@codecov (bot) commented Sep 3, 2021

Codecov Report

Merging #3845 (9a77e05) into master (67999f2) will increase coverage by 0.00%.
The diff coverage is 85.71%.

@@           Coverage Diff           @@
##           master    #3845   +/-   ##
=======================================
  Coverage   83.52%   83.52%           
=======================================
  Files         355      355           
  Lines       51971    51975    +4     
=======================================
+ Hits        43409    43413    +4     
  Misses       8562     8562           
Impacted Files Coverage Δ
src/broker/overlay.c 88.73% <85.71%> (+0.06%) ⬆️
src/modules/cron/cron.c 81.77% <0.00%> (-0.48%) ⬇️
src/modules/job-exec/job-exec.c 75.82% <0.00%> (-0.21%) ⬇️
src/common/libflux/message.c 95.06% <0.00%> (-0.13%) ⬇️
src/broker/content-cache.c 83.33% <0.00%> (+0.25%) ⬆️
src/common/libsubprocess/local.c 79.65% <0.00%> (+0.34%) ⬆️
src/modules/job-archive/job-archive.c 60.08% <0.00%> (+0.80%) ⬆️

@garlick (Member, Author) commented Sep 3, 2021

Thanks - I'm going to set MWP (merge-when-passing).

@mergify mergify bot merged commit 3ad7f3c into flux-framework:master Sep 3, 2021
@garlick garlick deleted the upstream_broker_crash branch September 3, 2021 19:43