ULFM Fault Tolerance (slice 2: agree) #582

@abouteiller

Description

Problem

The monolithic ULFM proposal has been split into morsels so that the MPI Forum can focus on individual topics.

Main topic issue
#20

Proposal

The second topic slice contains the following concepts for communicators:

  • MPI_COMM_AGREE

Changes to the Text

Adds a fault tolerance (FT) chapter containing the proposed constructs.

Read text (Sept'23) https://github.com/mpi-forum/mpi-standard/pull/715/commits/9e81233953a280f867eb48fbe890f5108a5ed9af
no-no reading (diff from Sept'23) https://github.com/mpi-forum/mpi-standard/pull/715/commits/58283a760a35934c0331744f8245e552644d252a

Impact on Implementations

Implementations may optionally implement fault tolerance.
Implementations add the procedure MPI_COMM_AGREE; implementations that do not support FT can provide a non-fault-tolerant stub based on MPI_ALLREDUCE.

Impact on Users

Users can react to fault events, validate progress in collective phases, and synchronize knowledge of failures across ranks. Slice 3 will add features for repairing communicators, which are needed to use collectives and process respawning after a fault.

References and Pull Requests

https://github.com/mpi-forum/mpi-standard/pull/715

Metadata

Assignees

Labels

  • had reading: Completed the formal proposal reading
  • mpi-next: For inclusion in the MPI 5.1 or 6.0 standard
  • wg-ft: Fault Tolerance Working Group

Projects

Status

Had Reading
