System Instance L2 Resiliency
Initial Flux system instance resiliency will hinge on two main principles. (See also [reference to broker resiliency docs here?])
- Subtree panics: Any time a broker becomes unresponsive or goes down, RPCs will get an error and the entire subtree will be restarted. Recovering brokers will reconnect to the instance.
- Recoverable jobs: After a subtree panic, recovering brokers will "rediscover" all running jobs so that no running jobs are lost.
Successful implementation of both principles shall be termed Level 2 (L2) resiliency.
Exit criteria: A restart of the rank 0 broker with active jobs, both running and pending, results in a restart of all brokers participating in the system instance. After the restart, running and pending jobs are recovered.
Recoverable jobs:
- Replace fork/waitpid with a mechanism which allows "rediscovery" of broker subprocesses.
  - Investigate use of cgroups and/or systemd for this purpose
  - Investigate whether a libsubprocess flag or a job-exec-specific facility would be most effective
- Redesign/rewrite job-exec for recoverable jobs and for the issues described in flux-core #3346
- Design and implement job shell "detached" mode
  - The job shell will need to be able to operate temporarily in a mode where it has lost its connection to the local broker.
  - Reconnect will be driven by job rediscovery
- Preserve guest KVS namespaces across rank 0 restart
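To make the "rediscovery" idea concrete: `waitpid` only works on children of the calling process, so a restarted broker cannot use it to find job shells started by its previous incarnation. The sketch below shows one possible shape of a replacement, assuming a persisted registry file that maps job IDs to PIDs. The registry format and the function names (`record_job`, `rediscover`, `pid_alive`) are hypothetical and are not part of flux-core. A bare PID can also be reused by the kernel, which is one reason a stronger identity such as a cgroup or systemd unit may be preferable.

```python
import json
import os


def load_registry(registry_path):
    """Return the persisted {jobid: pid} map, or {} if none exists yet."""
    try:
        with open(registry_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def record_job(registry_path, jobid, pid):
    """Persist jobid -> pid so a restarted supervisor can find the process."""
    jobs = load_registry(registry_path)
    jobs[str(jobid)] = pid
    with open(registry_path, "w") as f:
        json.dump(jobs, f)


def pid_alive(pid):
    """Probe a pid without waitpid(): signal 0 only checks existence."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def rediscover(registry_path):
    """Return {jobid: pid} for recorded jobs whose process still exists."""
    registry = load_registry(registry_path)
    return {jobid: pid for jobid, pid in registry.items() if pid_alive(pid)}
```

A restarted supervisor would call `rediscover()` once at startup and then reattach to each surviving job, rather than assuming that all jobs were lost along with its child table.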
Broker resiliency model:
- Document resiliency mode/protocol in broker docs #3804
- Brokers detect upstream peer reboot and take themselves down #3608
- Implement RPC "health check" to help diagnose stuck services: #2797
- broker: track RPC state and send error responses for lost peers #3800
- Administratively declare live broker peers dead to cause RPCs to fail fast and force a subtree restart. #3805
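The RPC state tracking item above can be sketched roughly as follows: record each in-flight request against the peer it was sent to, and when that peer is lost (or administratively declared dead), complete every outstanding request with an error so callers fail fast instead of hanging. The class and method names are hypothetical, and the choice of `EHOSTUNREACH` as the error is an assumption for illustration, not a statement of what the broker does.

```python
import errno
from collections import defaultdict


class RpcTracker:
    """Hypothetical tracker of in-flight requests, keyed by destination peer."""

    def __init__(self):
        # peer rank -> {matchtag: callback(errnum, payload)}
        self._pending = defaultdict(dict)

    def request_sent(self, peer, matchtag, callback):
        """Remember a request until its response arrives or the peer is lost."""
        self._pending[peer][matchtag] = callback

    def response_received(self, peer, matchtag, payload):
        """Complete a request normally; unknown or late responses are dropped."""
        callback = self._pending[peer].pop(matchtag, None)
        if callback is not None:
            callback(0, payload)

    def peer_lost(self, peer):
        """Fail every request still waiting on `peer`; return how many failed."""
        lost = self._pending.pop(peer, {})
        for callback in lost.values():
            callback(errno.EHOSTUNREACH, None)  # assumed error code
        return len(lost)
```

In this sketch the same `peer_lost()` path would serve both automatic detection of a down peer and an administrative "declare dead" command, which keeps the fail-fast behavior in one place.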
Aug 2021:
- design or prototype for fork/exec/waitpid replacement scheme
- design or prototype for RPC state tracking in broker/handle
- refine work breakdown based on results
Oct 2021:
- job shell offline mode implementation
Dec 2021:
- Demonstrate restart of size=1
Feb 2022:
- Demonstrate L2 resiliency on Fluke: restart rank 0 with running and pending jobs with no loss of jobs
- Scale testing on larger systems as determined by need and system availability.