Add Error propagation/handle mechanism for PrototypeMultiprocessingReadingService #969

Closed
3 of 6 tasks
ejguan opened this issue Jan 26, 2023 · 0 comments

ejguan commented Jan 26, 2023

🚀 The feature

Currently, if an error occurs in a subprocess, the child process simply exits without passing the error message back to the main process, and the data pipeline in the main process keeps running instead of exiting properly.
To achieve this feature, we need to do the following:

  • Add a way to pass the error message
  • Handle the error
    • Properly exit the other worker processes when one worker sends an error message to the main process
    • Properly exit the worker processes when the error comes from the dispatching process, then exit the main process
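
The steps above can be sketched with plain `multiprocessing`. This is only an illustration under assumptions, not the code of `PrototypeMultiprocessingReadingService`: the names `worker_loop` and `run_workers`, the `("data" | "done" | "error", ...)` message tuples, and the `fail_on` switch are all hypothetical, and the dispatching process is not modelled. The `"fork"` start method is also an assumption made to keep the sketch short (POSIX only).

```python
import multiprocessing as mp
import traceback


def worker_loop(worker_id, result_queue, fail_on):
    """Hypothetical worker: report an exception instead of exiting silently."""
    try:
        for i in range(3):
            if worker_id == fail_on and i == 1:
                raise ValueError(f"bad sample {i}")
            result_queue.put(("data", worker_id, i))
        result_queue.put(("done", worker_id, None))
    except Exception:
        # Traceback objects cannot be pickled, so send the formatted text.
        result_queue.put(("error", worker_id, traceback.format_exc()))


def run_workers(num_workers, fail_on=None):
    """Collect items in the main process; stop all workers on the first error."""
    ctx = mp.get_context("fork")  # assumption: POSIX platform
    result_queue = ctx.Queue()
    procs = [
        ctx.Process(target=worker_loop, args=(w, result_queue, fail_on), daemon=True)
        for w in range(num_workers)
    ]
    for p in procs:
        p.start()

    items, finished = [], 0
    try:
        while finished < num_workers:
            kind, worker_id, payload = result_queue.get(timeout=10)
            if kind == "error":
                # Surface the worker's error in the main process.
                raise RuntimeError(
                    f"Caught exception in worker process {worker_id}:\n{payload}"
                )
            if kind == "done":
                finished += 1
            else:
                items.append((worker_id, payload))
    finally:
        # Whether we finished or failed, make sure no worker is left running.
        for p in procs:
            if p.is_alive():
                p.terminate()
            p.join()
    return items
```

Calling `run_workers(2, fail_on=1)` raises a `RuntimeError` in the main process whose message names the failing worker and carries the worker-side traceback text, while the `finally` block terminates the remaining worker.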

Motivation, pitch

Increase reliability of DataLoader with PrototypeMultiprocessingReadingService

Alternatives

No response

Additional context

No response

ejguan added a commit to ejguan/data that referenced this issue Feb 22, 2023
Summary:
Partially fixes pytorch#969

### Changes

- Add `ExceptionWrapper` to attach the traceback to the exception
  - Reason: a traceback object cannot be serialized, so it has to be passed as a string
  - To produce an informative error message, pass a name for each process, such as `dispatching process` or `worker process <id>`
- Add tests to validate error propagation from the dispatching process
  - Parametrize the tests
- Fix a bug in `round_robin_demux` so that it returns a list of DataPipes rather than a single DataPipe when `num_of_instances` is 1
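
A minimal sketch of the traceback-as-string idea described above. It is not the `ExceptionWrapper` added by this change; the class name `ExceptionWrapperSketch`, its `where` parameter, and the message layout are assumptions made for illustration.

```python
import sys
import traceback


class ExceptionWrapperSketch:
    """Hypothetical wrapper that carries a traceback across processes as text."""

    def __init__(self, exc_info=None, where="in background"):
        if exc_info is None:
            # Must be called from inside an `except` block in this case.
            exc_info = sys.exc_info()
        self.exc_type = exc_info[0]
        # Format the traceback now: the traceback object itself cannot be
        # pickled, but the resulting string can.
        self.exc_msg = "".join(traceback.format_exception(*exc_info))
        # Process name, e.g. "in dispatching process" or "in worker process 0".
        self.where = where

    def reraise(self):
        """Raise in the receiving process, keeping the original type if possible."""
        msg = f"Caught {self.exc_type.__name__} {self.where}.\nOriginal {self.exc_msg}"
        try:
            exc = self.exc_type(msg)
        except TypeError:
            # Some exception types need extra constructor arguments.
            exc = RuntimeError(msg)
        raise exc
```

A worker would build the wrapper inside its `except` block and put it on a queue; the main process would call `reraise()` on whatever wrapper it receives, so the final message names the originating process and includes the original traceback text.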

Pull Request resolved: pytorch#1036

Reviewed By: NivekT

Differential Revision: D43472709

Pulled By: ejguan

fbshipit-source-id: e5c9e581ca881f523fb568b6f46bf16ecfc243d2
ejguan added a commit that referenced this issue Feb 22, 2023