Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Disco] Treat hangup of disco worker process as kShutdown #16989

Merged

Conversation

Lunderberg
Copy link
Contributor

Prior to this commit, each disco worker needed to receive DiscoAction::kShutdown in order to close cleanly. While this is sent from the destructor of ProcessSessionObj, which owns the worker processes, this does not guarantee that the disco workers will receive the shutdown command. For example, the controller process holding the ProcessSessionObj may reach a timeout and be terminated, preventing it from sending the DiscoAction::kShutdown command.

This commit updates the disco worker to check for a closed pipe that occurs between two packets, and to treat this as if the DiscoAction::kShutdown command were received. A closed pipe that occurs at any other location is still treated as an error and reported.

Prior to this commit, each disco worker needed to receive
`DiscoAction::kShutdown` in order to close cleanly.  While this is
sent from the destructor of `ProcessSessionObj`, which owns the worker
processes, this does not guarantee that the disco workers will receive
the shutdown command.  For example, the controller process holding the
`ProcessSessionObj` may reach a timeout and be terminated, preventing
it from sending the `DiscoAction::kShutdown` command.

This commit updates the disco worker to check for a closed pipe that
occurs between two packets, and to treat this as if the
`DiscoAction::kShutdown` command were received.  A closed pipe that
occurs at any other location is still treated as an error and
reported.
@Lunderberg Lunderberg merged commit 93233a9 into apache:main May 14, 2024
19 checks passed
@Lunderberg Lunderberg deleted the disco_perform_clean_shutdown_on_hangup branch May 14, 2024 14:39
Lunderberg added a commit to Lunderberg/tvm that referenced this pull request May 14, 2024
This resolves a conflict between two recent changes.  In
apache#16989, reads of size zero are used
to identify hangups in `ProcessSession`.  In
apache#16992, reads of size zero are
treated as an error to avoid infinite loops while waiting for data to
be ready.

For a long-term resolution, the `dmlc::Stream` interface will need to
be updated, so that the `Write` method returns the number of bytes
written, just as the `Read` method currently does.  This will allow
the calling scope to verify the number of bytes received.
Lunderberg added a commit that referenced this pull request May 15, 2024
This resolves a conflict between two recent changes.  In
#16989, reads of size zero are used
to identify hangups in `ProcessSession`.  In
#16992, reads of size zero are
treated as an error to avoid infinite loops while waiting for data to
be ready.

For a long-term resolution, the `dmlc::Stream` interface will need to
be updated, so that the `Write` method returns the number of bytes
written, just as the `Read` method currently does.  This will allow
the calling scope to verify the number of bytes received.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants