
Implement FullSyncIterDataPipe #713

Closed · wants to merge 5 commits

Conversation

ejguan (Contributor) commented Aug 3, 2022

Changes

  • Add _PrefetchExecutor to run prefetching in a background thread
    • This should be reusable by PrefetchIterDataPipe
  • Add FullSyncIterDataPipe
    • Tested the result using both elastic training and normal distributed training
    • Add a distributed unit test for GLOO
    • Add a serialization test (how to checkpoint FullSyncIterDataPipe is still unclear to me)
      • Serialization works properly, since spawn is used to serialize/deserialize the DataPipe to the distributed processes

facebook-github-bot added the CLA Signed label (authors need to sign the CLA before a PR can be reviewed) on Aug 3, 2022
return self.error is not None


class _PrefetchExecutor:
ejguan (Contributor, Author) commented Aug 3, 2022

This executor takes reference from the internal implementation: https://fburl.com/code/7dk6mvs4
On top of that implementation, I added prefetch_size and attached an index to the Expected object to make sure it can work with Prefetch in the future.
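
For readers without access to the internal link, here is a minimal sketch of the pattern described above (my own illustration, not the PR's actual code): a single worker thread keeps up to prefetch_size fetches in flight, and each result comes back as an Expected-style object carrying its index plus either data or an error. The class and method names below (PrefetchExecutorSketch, return_next) are assumptions for illustration only.

# Hypothetical sketch of the prefetching pattern described above; names are
# illustrative assumptions, not the PR's actual API.
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class Expected:
    index: int
    data: Any = None
    error: Optional[BaseException] = None

    def has_error(self) -> bool:
        return self.error is not None


class PrefetchExecutorSketch:
    def __init__(self, source_iter: Iterator, prefetch_size: int = 1) -> None:
        self.source_iter = source_iter
        self.prefetch_size = prefetch_size
        # A single worker keeps next() calls on the source iterator serialized.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._futures: deque = deque()
        self._next_index = 0
        for _ in range(prefetch_size):
            self._schedule()

    def _fetch(self, index: int) -> Expected:
        try:
            return Expected(index, data=next(self.source_iter))
        except BaseException as exc:  # StopIteration is surfaced as an error result
            return Expected(index, error=exc)

    def _schedule(self) -> None:
        self._futures.append(self._pool.submit(self._fetch, self._next_index))
        self._next_index += 1

    def return_next(self) -> Expected:
        # Block on the oldest in-flight fetch, then keep the pipeline full.
        result = self._futures.popleft().result()
        self._schedule()
        return result

    def shutdown(self) -> None:
        self._pool.shutdown(wait=False)

A consumer would call return_next() repeatedly and treat an Expected whose error is a StopIteration as the end of the stream.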

@ejguan ejguan marked this pull request as draft August 4, 2022 15:58
ejguan (Contributor, Author) commented Aug 4, 2022

Adding test now.

ejguan force-pushed the fullsync branch 2 times, most recently from ffe3a1e to fd98a28, on August 5, 2022 16:30
facebook-github-bot: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan ejguan marked this pull request as ready for review August 5, 2022 17:45
Comment on lines 52 to 53
data_length = 23
dp = IterableWrapper(list(range(data_length))).sharding_filter().fullsync()
ejguan (Contributor, Author) commented:

Without fullsync, this pipeline would hang forever.
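
To make the hang concrete, a small illustration (mine, not part of the test): round-robin sharding of 23 elements across 2 ranks leaves one rank with an extra element, so any per-step synchronization on the longer rank would wait forever for a peer that has already finished.

# Illustrative only: shard sizes produced by round-robin sharding of 23 elements
# across 2 ranks (the situation the test above exercises).
data_length, world_size = 23, 2
per_rank = [len(range(rank, data_length, world_size)) for rank in range(world_size)]
print(per_rank)  # [12, 11] -> rank 0 has one extra step that rank 1 never joins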

facebook-github-bot: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan ejguan requested a review from Miiira August 8, 2022 16:35
facebook-github-bot: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

# LICENSE file in the root directory of this source tree.

# Use the same timeout as PyTorch Distributed
default_timeout_in_s = 30 * 60
A contributor commented:

Just a comment (no change needed): we should put other defaults here too, such as the default buffer size.

which is caused by uneven sharded data (functional name: ``fullsync``). It should
be appended at the end of the graph of ``DataPipe`` by ``DistributedReadingService``
automatically.

A contributor commented:

Question: do we recommend against usage of this DataPipe outside of a ReadingService? If not, can we potentially include an example?

ejguan (Contributor, Author) commented:

Makes sense. Will add it, even though we should always recommend that users rely on the ReadingService.
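
For reference, a minimal standalone sketch of what such an example might look like (my own, not the PR's documentation). It assumes a GLOO process group with a placeholder address/port, and uses torch.utils.data.graph_settings.apply_sharding to assign each rank its shard; with fullsync attached, every rank stops together once any rank runs out of data.

import os

import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data.graph_settings import apply_sharding
from torchdata.datapipes.iter import IterableWrapper


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"  # placeholder address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # 23 elements shard unevenly across 2 ranks; with fullsync attached,
    # all ranks stop together at the shortest shard instead of hanging.
    dp = IterableWrapper(list(range(23))).sharding_filter().fullsync()
    apply_sharding(dp, world_size, rank)

    print(rank, list(dp))
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)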

torchdata/datapipes/iter/util/prefetch.py: three resolved review threads (collapsed)
facebook-github-bot: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines +205 to +212
def __getstate__(self):
if IterDataPipe.getstate_hook is not None:
return IterDataPipe.getstate_hook(self)
state = (
self.datapipe,
self.timeout,
)
return state
ejguan (Contributor, Author) commented:

IMHO, checkpointing for fullsync or prefetch is a little tricky.
Let's confirm the expected behavior: when we take a checkpoint, we should pause any further prefetching and save all prefetched data into a buffer. Then we serialize the buffer and the inner datapipe (because we have to serialize the datapipe after prefetching is done). Only when we start iteration again would we resume prefetching.

WDYT: @VitalyFedyunin @NivekT

Then the whole logic of fullsync would have to change. This gets even more complicated when the data ends while prefetched items are being moved into the buffer. I might open a new PR to implement serialization.
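
To make the proposed checkpoint behavior concrete, a rough hypothetical sketch (not the PR's code) of what __getstate__/__setstate__ could look like if the buffer were captured; _executor, drain(), and shutdown() are made-up helpers standing in for the real prefetching machinery.

# Hypothetical sketch of the checkpoint behavior described above; _executor,
# drain(), and shutdown() are assumed helpers, not the PR's actual attributes.
class _CheckpointableFullSyncSketch:
    def __init__(self, datapipe, timeout):
        self.datapipe = datapipe
        self.timeout = timeout
        self._executor = None  # created lazily in __iter__
        self._buffer = []      # prefetched items not yet yielded

    def __getstate__(self):
        # Pause prefetching and capture everything already fetched.
        if self._executor is not None:
            self._buffer.extend(self._executor.drain())
            self._executor.shutdown()
            self._executor = None
        return (self.datapipe, self.timeout, list(self._buffer))

    def __setstate__(self, state):
        self.datapipe, self.timeout, self._buffer = state
        # Prefetching resumes only when iteration starts again.
        self._executor = None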

A contributor commented:

Yeah, I think we should stop the prefetch and capture the current data. This feels similar to the internal client snapshot, so https://fburl.com/code/6hrjawgh may be a helpful reference.

ejguan (Contributor, Author) commented Aug 12, 2022

I will land this PR for now. Two follow-up work items:

  • Implement serialization logic for fullsync
  • Implement padding for fullsync (I need to take a deeper look to see if we can combine it with wraparound; a rough illustration of the two options follows)
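
A tiny illustration (mine, not from the PR) of the difference between the two strategies for the longest rank's extra steps: padding repeats a filler value, while wraparound cycles back to the start of the shard.

# Illustrative only: how a short shard could be extended to match the longest
# shard across ranks. "shard" and "target_len" are made-up example values.
shard = [1, 3, 5]   # this rank's data
target_len = 4      # length of the longest shard across all ranks

padded = shard + [shard[-1]] * (target_len - len(shard))      # padding: [1, 3, 5, 5]
wrapped = [shard[i % len(shard)] for i in range(target_len)]  # wraparound: [1, 3, 5, 1]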
