Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guarantee datapipe being reset iterator when all loops have received reset request in the dispatching process #994

Closed
wants to merge 6 commits into from

Conversation

ejguan
Copy link
Contributor

@ejguan ejguan commented Feb 7, 2023

Changes

  • Fix S3 Tests in Fix test_remote_io.py due to mutating public s3 bucket #997
  • Ignore AttributeError for ReadingService when gc gets involved
  • Add a reset_iterator_counter to dispatching process to guarantee that the datapipe is reset when all loops have received the request. Otherwise, datapipe can be reset in the middle of iteration of the other loops.
  • Remove reference of the thread from Prefetcher. This would prevent racing condition when both finally in generator and reset function are accessing the same thread.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 7, 2023
@ejguan ejguan changed the title Fix tests Fix S3 Tests and guarantee datapipe being reset when all loops have received reset request in the dispatching process Feb 7, 2023
@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

[["s3://ai2-public-datasets/charades/"], 18], # folder without '/'
[["s3://ai2-public-datasets/charad"], 18], # prefix
[
(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change 1

Comment on lines 97 to 104
def __del__(self):
try:
self.finalize()
except AttributeError:
pass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change 2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment - a little bit surprised we need this given ProtoMultiRS's finalize seems to be catching every possible AttributeError? Maybe the issue is Distributed? Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProtoMPRS doesn't handle all of AttributeError like the self._worker_processes from

for process, req_queue, res_queue in self._worker_processes:

Technical speaking, I should remove those try-except clauses in finalize to simplify the codebase

Comment on lines +148 to +159
# Ensure only reset iterator once for the dispatching process
if reset_iterator_counter is not None:
reset_iterator_counter.increment()
while not reset_iterator_counter.is_reached():
yield True
# Sync between loops within the dispatching process
source_datapipe.reset_iterator()
yield True
reset_iterator_counter.reset()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might need to do something similar to this for resume dispatching process. cc: @NivekT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm I understand - this is to handle the situation where some workers are handling GetNextRequest while some are trying to reset? You want all GetNext to be done before the dispatching process executes reset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not workers. It only happens to the dispatching process when multiple leaf DataPipes shares the same data source (Round robin demux on the same DataPipe) in a single process.
It handles the case when some loops have received reset while the others haven't. We want to wait to request getNext until all loops have received reset. Otherwise, there will be a case that the data source is reset during the middle of iteration for other loops

torchdata/datapipes/iter/util/prefetcher.py Show resolved Hide resolved
@ejguan
Copy link
Contributor Author

ejguan commented Feb 7, 2023

I will rerun all tests tmrw when PyTorch nightly has been updated.

@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Hope the nightly gets uploaded and all the CIs get fixed

Comment on lines 97 to 104
def __del__(self):
try:
self.finalize()
except AttributeError:
pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment - a little bit surprised we need this given ProtoMultiRS's finalize seems to be catching every possible AttributeError? Maybe the issue is Distributed? Thoughts?

Comment on lines 53 to 54
if self._reached:
return self._reached
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: These two lines can be removed? But I guess it is slightly faster, so I'm indifferent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, you are right. I will remove them tmrw

Comment on lines +148 to +159
# Ensure only reset iterator once for the dispatching process
if reset_iterator_counter is not None:
reset_iterator_counter.increment()
while not reset_iterator_counter.is_reached():
yield True
# Sync between loops within the dispatching process
source_datapipe.reset_iterator()
yield True
reset_iterator_counter.reset()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm I understand - this is to handle the situation where some workers are handling GetNextRequest while some are trying to reset? You want all GetNext to be done before the dispatching process executes reset?

@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan
Copy link
Contributor Author

ejguan commented Feb 8, 2023

The test takes way more time to finish. However, I can't really reproduce it either on linux or mac

Edit: Find the culprit test test_basic_mapdatapipe_threading because I forget to update map.DataPipeBehindQueues. Will fix it shrotly.

@ejguan
Copy link
Contributor Author

ejguan commented Feb 8, 2023

And, I am going to remove all S3 related commits. To fix S3 test, I plan to rely on #997

@ejguan ejguan changed the title Fix S3 Tests and guarantee datapipe being reset when all loops have received reset request in the dispatching process Guarantee datapipe being reset iterator when all loops have received reset request in the dispatching process Feb 8, 2023
@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ejguan merged this pull request in b450cfd.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants