CY2021 Q1-2 system instance planning

Jim Garlick edited this page Jan 21, 2021 · 2 revisions

System instance development CY21 Q1-2

Note: In the descriptions below, an idle node is one that has not communicated within a configurable number of heartbeat periods. A down node is one that can no longer communicate, either because it has disconnected or because it has been denied access after being idle for too long.
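The two definitions above can be read as a small classification rule. The following is a minimal sketch, not Flux code: the names `Node`, `NodeState`, `classify`, and `idle_threshold` are hypothetical, time is counted in whole heartbeat periods, and it assumes a node becomes idle once at least `idle_threshold` periods have passed since it was last heard from.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    ONLINE = "online"
    IDLE = "idle"
    DOWN = "down"


@dataclass
class Node:
    """Hypothetical per-node bookkeeping, measured in heartbeat periods."""
    last_seen: int           # heartbeat number of the most recent message
    connected: bool = True   # False once the node has disconnected
    denied: bool = False     # True once access was denied for being idle too long


def classify(node: Node, now: int, idle_threshold: int) -> NodeState:
    """Apply the idle/down definitions to one node at heartbeat `now`."""
    # Down: the node can no longer communicate at all.
    if not node.connected or node.denied:
        return NodeState.DOWN
    # Idle: still connected, but silent for the configured number of periods
    # (assumption: the boundary case counts as idle).
    if now - node.last_seen >= idle_threshold:
        return NodeState.IDLE
    return NodeState.ONLINE
```

Under these assumptions, idle is a recoverable state that clears as soon as the node is heard from again, whereas down only clears on reconnect.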

Feb release

  • drain idle nodes, and undrain them when they are no longer idle
  • mark nodes down once they have been idle for some period
  • drain down nodes, and require a manual undrain after they reconnect
  • (prolog/epilog design placeholder)
  • (partial resource release design placeholder)
  • (rpc failure on down broker design placeholder)
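The first three items of the Feb release describe a drain policy that can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: `NodeRecord`, `on_heartbeat`, `on_reconnect`, `manual_undrain`, and both thresholds are hypothetical names, and the drain reason strings are invented so that only drains caused by idleness are undone automatically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeRecord:
    """Hypothetical drain bookkeeping for one node."""
    idle_periods: int = 0
    down: bool = False
    drained: bool = False
    drain_reason: Optional[str] = None


def on_heartbeat(rec: NodeRecord, heard_from: bool,
                 idle_threshold: int, down_threshold: int) -> None:
    """Update one node at each heartbeat period."""
    if rec.down:
        return  # a down node stays down until it reconnects
    if heard_from:
        rec.idle_periods = 0
        # Undo only the drain that idleness caused; other drains are kept.
        if rec.drained and rec.drain_reason == "idle":
            rec.drained, rec.drain_reason = False, None
        return
    rec.idle_periods += 1
    if rec.idle_periods >= down_threshold:
        # Idle for too long: mark down and drain.
        rec.down, rec.drained, rec.drain_reason = True, True, "down"
    elif rec.idle_periods >= idle_threshold and not rec.drained:
        rec.drained, rec.drain_reason = True, "idle"


def on_reconnect(rec: NodeRecord) -> None:
    """The node is reachable again, but its drain is deliberately kept."""
    rec.down = False
    rec.idle_periods = 0


def manual_undrain(rec: NodeRecord) -> None:
    """Administrator action required after a down node reconnects."""
    rec.drained, rec.drain_reason = False, None
```

Recording a reason alongside the drain flag is one way to tell an automatic idle drain, which may be reversed, from a drain that must wait for an administrator.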

Mar release

  • implement prolog/epilog
  • drain node on prolog/epilog failure
  • (partial resource release design placeholder)
  • (rpc failure on down broker design placeholder)

Apr release

  • raise job exception(s) when nodes fail
  • implement partial resource release
  • (rpc failure on down broker design placeholder)
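One possible reading of the first two Apr items, sketched with hypothetical names (`Job`, `handle_node_failure`, `free_pool`): when a node fails, each job using it gets an exception, the job's surviving nodes go back to the free pool, and the failed node stays allocated because it cannot be cleaned up. This is an assumption about the intended behavior, not a description of the design.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Job:
    """Hypothetical job record: an id, its allocated nodes, raised exceptions."""
    jobid: int
    nodes: Set[str]
    exceptions: List[str] = field(default_factory=list)


def handle_node_failure(jobs: List[Job], failed: str,
                        free_pool: Set[str]) -> List[int]:
    """Raise an exception on affected jobs and partially release their nodes.

    Returns the ids of the jobs that were using the failed node.
    """
    affected = []
    for job in jobs:
        if failed not in job.nodes:
            continue
        job.exceptions.append(f"node failure: {failed}")
        # Partial release: surviving nodes become available again,
        # while the failed node remains held by the job.
        free_pool |= job.nodes - {failed}
        job.nodes = {failed}
        affected.append(job.jobid)
    return affected
```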

May release

  • RPCs to down nodes eventually fail
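As a rough illustration of this goal, the sketch below fails outstanding requests when their target is marked down and rejects new requests to a node that is already down. `RpcTracker`, its methods, and the error string are hypothetical; the source does not say how or when the failure would be delivered.

```python
from typing import Callable, Dict, List, Optional, Set

# A callback receives None on success or an error message on failure.
Callback = Callable[[Optional[str]], None]


class RpcTracker:
    """Hypothetical tracker of in-flight RPCs, keyed by target rank."""

    def __init__(self) -> None:
        self.down: Set[int] = set()
        self.pending: Dict[int, List[Callback]] = {}

    def send(self, rank: int, on_done: Callback) -> None:
        # Fail immediately if the target is already known to be down.
        if rank in self.down:
            on_done(f"rank {rank} is down")
            return
        self.pending.setdefault(rank, []).append(on_done)

    def respond(self, rank: int) -> None:
        # Complete the oldest outstanding RPC to this rank successfully.
        callbacks = self.pending.get(rank)
        if callbacks:
            callbacks.pop(0)(None)

    def mark_down(self, rank: int) -> None:
        # Fail every RPC still waiting on the node so no caller hangs forever.
        self.down.add(rank)
        for on_done in self.pending.pop(rank, []):
            on_done(f"rank {rank} is down")
```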

Milestone: level 1 resiliency

  • nodes are automatically drained when they fail
  • Flux remains responsive despite compute node failures
  • jobs are killed when nodes they are running on fail
  • allocated resources can be partially reclaimed on node failure

(More releases TBD)

Milestone: level 2 resiliency

  • overlay routing can be "restarted"
  • job shells can survive their brokers being restarted
  • rolling software upgrade