Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attach traceback to Exception & Test disatpching process #1036

Closed
wants to merge 4 commits into from

Conversation

ejguan
Copy link
Contributor

@ejguan ejguan commented Feb 21, 2023

Partially fixes #969

Changes

  • Add ExceptionWrapper to attach traceback to the Exception
    • Reason: traceback is unserializable. So, it has to be passed by string
    • In order to provide informative Error message, pass name for each process like dispatching process and worker process <id>.
  • Add tests to validate Error propagation from the dispatching process
    • parametrize the tests
  • Fix a bug for round_robin_demux to return a list of DataPipe rather than a single DataPipe when num_of_instances is 1.

@ejguan ejguan requested a review from NivekT February 21, 2023 19:16
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 21, 2023
@ejguan ejguan requested a review from wenleix February 21, 2023 19:16
@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good to me! I ran the code locally to confirm that it works and the message looks right. It worked.

One thing is that it took awhile to raise the exception (and eventually shutdown). Any reason why?

)
return datapipe
return [datapipe]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: In what cases will users intentionally want to set num_instances=1 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the case when num_workers=1 with dispatching process.

)
return process, req_queue, res_queue


def CreateThreadForDataPipeline(datapipe):
def CreateThreadForDataPipeline(datapipe, name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to ask before: if this function is used anywhere currently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any usage for sure. We can remove it in the future since we might rely on asyncio to achieve threading.

torchdata/dataloader2/communication/eventloop.py Outdated Show resolved Hide resolved
self.assertTrue("Caught _CustomException in dispatching process" in exc_msg)
self.assertTrue("Original Traceback" in exc_msg)
self.assertTrue("_CustomException: oops" in exc_msg)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe check if DL2 is shutdown properly? Is that possible to test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, DL2 is not properly shutdown for now because we only raise Error in the main process. We haven't handled to shutdown other worker processes when on process has Error. This probably the same reason that it takes longer time to shutdown.

This is tracked in #969

@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ejguan merged this pull request in f083d52.

ejguan added a commit to ejguan/data that referenced this pull request Feb 22, 2023
Summary:
Partially fixes pytorch#969

### Changes

- Add `ExceptionWrapper` to attach traceback to the Exception
  - Reason: traceback is unserializable. So, it has to be passed by string
  - In order to provide informative Error message, pass name for each process like `dispatching process` and `worker process <id>`.
- Add tests to validate Error propagation from the dispatching process
  - parametrize the tests
- Fix a bug for `round_robin_demux` to return a list of DataPipe rather than a single DataPipe when `num_of_instances` is 1.

Pull Request resolved: pytorch#1036

Reviewed By: NivekT

Differential Revision: D43472709

Pulled By: ejguan

fbshipit-source-id: e5c9e581ca881f523fb568b6f46bf16ecfc243d2
ejguan added a commit that referenced this pull request Feb 22, 2023
Summary:
Partially fixes #969

### Changes

- Add `ExceptionWrapper` to attach traceback to the Exception
  - Reason: traceback is unserializable. So, it has to be passed by string
  - In order to provide informative Error message, pass name for each process like `dispatching process` and `worker process <id>`.
- Add tests to validate Error propagation from the dispatching process
  - parametrize the tests
- Fix a bug for `round_robin_demux` to return a list of DataPipe rather than a single DataPipe when `num_of_instances` is 1.

Pull Request resolved: #1036

Reviewed By: NivekT

Differential Revision: D43472709

Pulled By: ejguan

fbshipit-source-id: e5c9e581ca881f523fb568b6f46bf16ecfc243d2
@ejguan ejguan added the topic: improvements topic category label Mar 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged topic: improvements topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Error propagation/handle mechanism for PrototypeMultiprocessingReadingService
3 participants