
broker: ensure subtree restart upon loss of router node #3845

Merged: 14 commits into flux-framework:master, Sep 3, 2021

Conversation

@garlick (Member) commented Aug 30, 2021

This adds code to handle the case described in #3608, where a broker might be forever cut off from the Flux instance if its TBON parent restarts without informing it (for example, after a kill -9 or a node crash). With this change, the parent detects a child trying to communicate without first executing the hello handshake to register its uuid, and sends it a KEEPALIVE_DISCONNECT message so the child shuts down. Systemd is then able to restart it.
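
As a sketch of the recovery sequence this enables, mirroring the t3306 test that appears later in this thread ($startctl is the test-harness helper used there; the actual systemd restart is outside the harness):

	# Crash a router node (broker 1) without warning, then bring it back.
	$startctl kill 1 9      # simulate kill -9 or a node crash
	$startctl run 1         # broker 1 rejoins rank 0 with a new uuid
	# Broker 2 resumes messaging without redoing the hello handshake, so
	# the parent answers with KEEPALIVE_DISCONNECT; broker 2 panics its
	# subtree and exits, and its process manager can restart it.
	$startctl run 2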

Also fixed a problem, discovered during testing, that prevented the crashed broker from properly rejoining its parent.

There is also some minor overlay cleanup here.

This is built on top of PR #3843. Once that is merged, this can be rebased.

@garlick (Member, Author) commented Sep 1, 2021

I will work a bit on the testing here as there are some code paths in the diff that should definitely have coverage.

@grondo (Contributor) commented Sep 1, 2021

> I will work a bit on the testing here as there are some code paths in the diff that should definitely have coverage.

OK, that was going to be my one comment, so I'll hold off reviewing for now.

Problem: comment refers to child->connected, but that struct
element no longer exists.

Rephrase the comment to reflect the current code.

Problem: child_lookup() can only find online children, but
we have at least two callers that call it and then check whether
the child is online.

Rename to child_lookup_online() to avoid confusion.
Update callers to remove redundant checks.

Problem: in child_cb(), 'sender' refers to the peer ID, not
a request sender ID, and is an artifact of the ROUTER socket
that applies to any Flux message type, hence the variable name
is confusing.

Rename 'sender' to 'uuid'.

Problem: broker/overlay.c contains mixed use of !strcmp() and
streq() to test for string equality.

Convert remaining !strcmp() calls to streq().

@garlick garlick force-pushed the upstream_broker_crash branch 3 times, most recently from aae103b to 7a41a05 on September 2, 2021 17:14
@garlick garlick added this to the flux-core v0.29.0 milestone Sep 3, 2021

Problem: if broker crashes without getting word to child,
the child may reconnect and resume messaging once the broker
returns to service.  These messages will be dropped, but
the child is not informed that it should restart.

Send a KEEPALIVE_DISCONNECT in that case, triggering a
subtree panic.

Also, suppress logging dropped messages from children that
have been forcibly disconnected, as it is expected that some
messages might be sent before the KEEPALIVE_DISCONNECT is
processed on the other end.

Improve documentation of expected failure scenarios in the
child message handler.

Fixes flux-framework#3608

Problem: when a broker crashes and reconnects, the parent
does not properly transition the child subtree status through
SUBTREE_STATUS_LOST before updating to the new status.

The fallout is:

1. If the child status has not changed from before, this
confuses overlay_child_status_update(), which will not
update overlay->child_hash because it only does so upon
transition to online.  The child node will be unable
to communicate.

2. If there are pending RPCs tracked for the old broker uuid,
they are not failed because the old uuid never transitions
to the LOST state.

3. The code for updating the status/uuid is not reached if
version negotiation fails.  The parent continues to think
the child is online.

Call overlay_child_status_update() to transition the status to
SUBTREE_STATUS_LOST before accepting the new status from the
hello message.

Relocate the version check between the SUBTREE_STATUS_LOST update
and the new status update, so LOST will stick if version negotiation
fails.

Finally, update logging so the old and new UUIDs are logged at
LOG_ERR level, since this occurrence may be forensically useful.

Update a test in t3300-system-basic that was fragile with respect
to the changed log output.

@garlick garlick force-pushed the upstream_broker_crash branch 2 times, most recently from b60772e to 453cda1 on September 3, 2021 15:27
@grondo (Contributor) commented Sep 3, 2021

I'm getting reproducible failures from t3306-system-routercrash on my system.
Must be my environment because all tests appear to have passed in CI...

Any idea why the --wait option to flux overlay status doesn't appear to be working here?

 grondo@asp:~/git/f.git/t$ git describe
v0.28.0-211-g453cda11b
 grondo@asp:~/git/f.git/t$ ./t3306-system-routercrash.t -d -v
sharness: loading extensions from /home/grondo/git/f.git/t/sharness.d/01-setup.sh
sharness: loading extensions from /home/grondo/git/f.git/t/sharness.d/flux-sharness.sh
sharness: loading extensions from /home/grondo/git/f.git/t/sharness.d/01-setup.sh
sharness: loading extensions from /home/grondo/git/f.git/t/sharness.d/flux-sharness.sh
expecting success: 
	flux exec flux setattr log-stderr-mode local

ok 1 - tell each broker to log to stderr

expecting success: 
	echo "local://$(flux getattr rundir)/local-2" >uri2 &&
	test $(FLUX_URI=$(cat uri2) flux getattr rank) -eq 2

ok 2 - construct FLUX_URI for rank 2 (child of 1)

expecting success: 
	$startctl kill 1 9 &&
	test_expect_code 137 $startctl wait 1

flux-start: 1 (pid 16898) Killed
ok 3 - kill -9 broker 1

expecting success: 
	$startctl run 1

broker.err[0]: rank 1 uuid efeea11f-eb52-4753-852a-a6f806da9143 lost, to be superceded by new hello
broker.err[0]: hello from rank 1 uuid 8a6c9873-b258-471c-a332-6a749a4379c4 status partial
ok 4 - restart broker 1

expecting success: 
	FLUX_URI=$(cat uri2) test_must_fail flux ping --count=1 1

broker.err[2]: parent disconnect
broker.err[2]: state-machine.monitor: No route to host
flux-ping: 1!broker.ping: No route to host
ok 5 - ping broker 1 via broker 2

expecting success: 
	test_might_fail $startctl wait 2

flux-start: 2 (pid 16914) exited with rc=1
ok 6 - broker 2 was disconnected

expecting success: 
	$startctl run 2

ok 7 - restart broker 2

expecting success: 
	run_timeout 10 flux overlay status -vvv --pretty --color --wait full

flux-overlay: Upstream Flux broker is offline. Try again later.
not ok 8 - wait for rank 0 to report subtree status of full
#	
#		run_timeout 10 flux overlay status -vvv --pretty --color --wait full
#	

expecting success: 
	flux ping --count=1 2

flux-ping: 2!broker.ping: Resource temporarily unavailable
not ok 9 - ping broker 2 via broker 0
#	
#		flux ping --count=1 2
#	

expecting success: 
        flux mini run -n3 -N3 /bin/true

ok 10 - run a 3 node job

# failed 2 among 10 test(s)
1..10
broker.err[0]: rc2.0: /bin/bash -c sh ./t3306-system-routercrash.t  --verbose --debug --root=/home/grondo/git/f.git/t/trash-directory.t3306-system-routercrash Exited (rc=1) 4.1s
flux-start: 0 (pid 16882) exited with rc=1

@grondo (Contributor) commented Sep 3, 2021

I can't understand what's going on here. Just to make sure the overlay.health RPC was being generated as expected, I added FLUX_HANDLE_TRACE and got:

expecting success: 
	FLUX_HANDLE_TRACE=1 flux overlay status -vvv --pretty --color --wait full

--------------------------------------
> 0.00000
> overlay.health
> flags=topic,payload,route userid=unknown rolemask=none nodeid=0 matchtag=1
> {"wait":"full"}
--------------------------------------
< 0.00038
< overlay.health
< flags=topic,payload,route userid=1000 rolemask=owner errnum=11 matchtag=1
< Upstream Flux broker is offline. Try again later.
flux-overlay: Upstream Flux broker is offline. Try again later.
not ok 8 - wait for rank 0 to report subtree status of full
#	
#		FLUX_HANDLE_TRACE=1 flux overlay status -vvv --pretty --color --wait full
#	

I don't understand how that RPC can get this particular error -- isn't it just an RPC to rank 0, which is clearly online?

Maybe I should try a fresh clone of this branch.

@garlick (Member, Author) commented Sep 3, 2021

> I don't understand how that RPC can get this particular error -- isn't it just an RPC to rank 0, which is clearly online?

Exactly what I was about to say. I don't think it can happen on rank 0. Print out $FLUX_URI maybe? Did something earlier on have a side effect?

@grondo (Contributor) commented Sep 3, 2021

Ah!

expecting success: 
	echo FLUX_URI=$FLUX_URI &&
	FLUX_HANDLE_TRACE=1 flux overlay status -vvv --pretty --color --wait full

FLUX_URI=local:///tmp/flux-system-xy88hs/local-2
--------------------------------------
> 0.00000
> overlay.health
> flags=topic,payload,route userid=unknown rolemask=none nodeid=0 matchtag=1
> {"wait":"full"}
--------------------------------------
< 0.00048
< overlay.health
< flags=topic,payload,route userid=1000 rolemask=owner errnum=11 matchtag=1
< Upstream Flux broker is offline. Try again later.
flux-overlay: Upstream Flux broker is offline. Try again later.

FLUX_URI is still set to the rank 2 URI...

@grondo (Contributor) commented Sep 3, 2021

This fixes it:

diff --git a/t/t3306-system-routercrash.t b/t/t3306-system-routercrash.t
index 98345b3f3..507deac6a 100755
--- a/t/t3306-system-routercrash.t
+++ b/t/t3306-system-routercrash.t
@@ -38,7 +38,9 @@ test_expect_success 'restart broker 1' '
 '
 
 test_expect_success 'ping broker 1 via broker 2' '
+       (
        FLUX_URI=$(cat uri2) test_must_fail flux ping --count=1 1
+       )
 '
 
 test_expect_success 'broker 2 was disconnected' '

Because of the way sharness uses eval on tests, I guess env vars set on the command line still get exported to the global shell environment. Ugh.
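
A minimal standalone sketch of that kind of leak (an assumption about the mechanism: POSIX leaves it unspecified whether an assignment prefixed to a shell function call persists in the caller; dash and ksh persist it, bash does not, and sharness runs test bodies through functions and eval):

	#!/bin/sh
	leaky() { echo "inside: VAR=$VAR"; }
	VAR=oops leaky                    # assignment meant to be call-scoped
	echo "after: VAR=${VAR:-unset}"   # "oops" in dash/ksh, "unset" in bash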

@garlick (Member, Author) commented Sep 3, 2021

Thanks, I was finding that very puzzling. I'll go ahead and incorporate that fix and repush.

Problem: test fails occasionally in CI

This was observed in CI:

  expecting success:
    pid=$(cat health.pid) &&
    echo waiting for pid $pid &&
    test_expect_code 1 wait $pid &&
    grep "overlay disconnect" health.err

  waiting for pid 604451
  not ok 15 - background RPC fails with overlay disconnect (tracker response from 6)

Since the RPC was requested through rank 13, and rank 13 exits
as a result of the crash of broker 6, it may be possible to receive
other errors here, such as ECONNRESET.

Change the test to just expect failure, not a particular one.
The important outcome is that RPCs do not hang.

Problem: t3305-system-disconnect includes a couple of subtests
that check whether a broker exited, allowing for either success
or failure, but the tests would also succeed if the broker died
by signal.

Use the 'test_might_fail' macro instead of the '|| /bin/true' idiom.
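
For illustration, the difference between the two idioms (a sketch using the wait test from this PR; per the git/sharness convention, test_might_fail accepts exit 0 or an ordinary nonzero exit, but flags exit codes above 125, i.e. 126, 127, and 128+signal):

	# Old idiom: accepts any outcome, including death by signal.
	$startctl wait 2 || /bin/true
	# New idiom: still fails the test if the broker was killed by a signal.
	test_might_fail $startctl wait 2
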
Problem: t/python/t0010-job-py occasionally fails:

  not ok 33 __main__.TestJob.test_32_job_result
  ---
    message: |
      Traceback (most recent call last):
        File "python/t0010-job.py", line 574, in test_32_job_result
          self.assertEqual(
      AssertionError: 143 != -15
  ...
  1..33

The test cancels a sleep job just after the start event appears in
the main eventlog, but at that point, _usually_ only the shell
has started, and the sleep has not yet been spawned.  When the
sleep actually does get spawned, the wait status is different from
the one expected.

Change the test to wait for the shell.start event in the guest.exec
eventlog before canceling, and then expect 143<<8.

Fixes flux-framework#3666
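
For reference, a sketch of the two encodings at play (an aside, not part of the change: 143 = 128+15 is the shell convention for a child killed by SIGTERM, 143<<8 = 36608 is the raw waitpid(2) status for a normal exit with code 143, and -15 follows the Python convention of reporting death by signal N as -N):

	sleep 30 &
	pid=$!
	kill -TERM $pid
	wait $pid
	echo $?   # 143: shell convention (128+15) for death by SIGTERM
	# If the job shell itself then exits normally with code 143, the raw
	# wait status observed by the test is 143<<8 = 36608, not signal 15.
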
Problem: the system-rpctrack sharness test only tests rpc
tracking in the downstream direction, while the system-disconnect
test covers the upstream direction.  The names do not reflect
the purpose of the tests very well.

Rename:
t3304-system-rpctrack.t -> t3304-system-rpctrack-down.t
t3305-system-disconnect.t -> t3305-system-rpctrack-up.t

Improve the test descriptions as well.

Problem: system sharness test descriptions are poor.

Flesh out descriptions for existing t330x-system-*.t tests.

Problem: t3305-system-rpctrack-up.t sets FLUX_URI
in a test without running it in a subshell, which
may assign FLUX_URI in the main shell as a side effect
on some systems.

Enclose the assignment in parentheses to ensure FLUX_URI
is not set in the main shell.
@garlick (Member, Author) commented Sep 3, 2021

OK, I squashed that fix and one other that I had added incrementally to the same test, and force-pushed.
Oh, and I noticed the same kind of thing in t3305-system-rpctrack-up.t, so I fixed it there too in another commit.

@grondo (Contributor) commented Sep 3, 2021

I wonder if a flux --uri=URI ... option would be useful when the environment should be avoided. flux(1) could modify the environment before starting the subcommand or calling flux_open(3).

@grondo (Contributor) left a comment

LGTM!

@codecov (bot) commented Sep 3, 2021

Codecov Report

Merging #3845 (9a77e05) into master (67999f2) will increase coverage by 0.00%.
The diff coverage is 85.71%.

@@           Coverage Diff           @@
##           master    #3845   +/-   ##
=======================================
  Coverage   83.52%   83.52%           
=======================================
  Files         355      355           
  Lines       51971    51975    +4     
=======================================
+ Hits        43409    43413    +4     
  Misses       8562     8562           
Impacted Files Coverage Δ
src/broker/overlay.c 88.73% <85.71%> (+0.06%) ⬆️
src/modules/cron/cron.c 81.77% <0.00%> (-0.48%) ⬇️
src/modules/job-exec/job-exec.c 75.82% <0.00%> (-0.21%) ⬇️
src/common/libflux/message.c 95.06% <0.00%> (-0.13%) ⬇️
src/broker/content-cache.c 83.33% <0.00%> (+0.25%) ⬆️
src/common/libsubprocess/local.c 79.65% <0.00%> (+0.34%) ⬆️
src/modules/job-archive/job-archive.c 60.08% <0.00%> (+0.80%) ⬆️

@garlick (Member, Author) commented Sep 3, 2021

Thanks - I'm going to set MWP (merge-when-passing).

@mergify mergify bot merged commit 3ad7f3c into flux-framework:master Sep 3, 2021
@garlick garlick deleted the upstream_broker_crash branch September 3, 2021 19:43