-
Notifications
You must be signed in to change notification settings - Fork 152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataLoader2] Implementing single iterator constraint #700
Conversation
[ghstack-poisoned]
ghstack-source-id: 7343eea9cd48b50fce4c88e614fa46fb5c05c143 Pull Request resolved: #700
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if those comments make sense to you.
torchdata/dataloader2/dataloader2.py
Outdated
""" | ||
Delegate methods (e.g. `limit`, `pause`, `resume`, etc) to the iterator | ||
created by the ReadingService and DataPipe. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we directly delegate to datapipe_iter
from dataloader_iter
, we don't need this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with this change, but I would flag that methods such as state_dict
and shutdown
will not be delegated to DataLoader2
. I am going to change the implementation to do that. As long as that is not an issue, we can delegate directly to dataloader_iter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we want to allow users to do dl2.limit()
, dl2.resume()
, and etc? Or we want them to always invoke those from the iterator?
If we want the former, we will need to keep this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, users should only invoke such APIs on iterator object. I am fine either.
cc: @Miiira
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think invoking from one single place (i.e. iterator) rather than multiple places is better and less error-prone.
The internal test is calling .resume
on the iterator, so I think it is fine to remove the API from the DataLoader2
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we only need limit
, resume
on iterator. I'm good with removing this from DataLoader2
. Maybe we also want to look at Lightning DataModule train_dataloader
return type with this change
Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867) [ghstack-poisoned]
ghstack-source-id: 5255e18391a09b4c3d05771f6bd3cca03a5d1235 Pull Request resolved: #700
Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867) [ghstack-poisoned]
ghstack-source-id: 7d139b7a313f304cdd6676fcc944dcfa896bff02 Pull Request resolved: #700
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Changes: * Imposing the single iterator constraint on `DataLoader2` by having its `__iter__` method returns a wrapper that can track if an iterator is valid or not * Previously, `__iter__` returns `self`, but now it will return an instance of `_DataLoader2Iterator` * Aside from the behavior related to reset (see below), we expect other behaviors to stay the same. Prior behavior: ```python dl = DataLoader2(IterableWrapper(range(10))) it1 = iter(dl) print(next(it1)) # 0 it2 = iter(dl) # No reset here print(next(it2)) # 1 print(next(it1)) # 2 ``` New behavior with this PR: ```python dl = DataLoader2(IterableWrapper(range(10))) it1 = iter(dl) print(next(it1)) # 0 it2 = iter(dl) # DataLoader2 resets with the creation of a new iterator print(next(it2)) # 0 print(next(it1)) # Raises exception, since it1 is no longer valid ``` Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867) [ghstack-poisoned]
ghstack-source-id: b84c6c7ae0be647e53315fffc86964ae1f2eba52 Pull Request resolved: #700
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this PR (one fix) and added a test to make through attributes/methods are being passed through. I will be looking at internal tests again shortly.
class _ReadingServiceWrapper: | ||
def __init__(self, dp): | ||
self.dp = dp | ||
|
||
def __iter__(self): | ||
self.it = iter(self.dp) | ||
return self | ||
|
||
def __next__(self): | ||
return next(self.it) | ||
|
||
@staticmethod | ||
def return_one(): | ||
return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A method that dataloader._datapipe_iter
will have
Changes: * Imposing the single iterator constraint on `DataLoader2` by having its `__iter__` method returns a wrapper that can track if an iterator is valid or not * Previously, `__iter__` returns `self`, but now it will return an instance of `_DataLoader2Iterator` * Aside from the behavior related to reset (see below), we expect other behaviors to stay the same. Prior behavior: ```python dl = DataLoader2(IterableWrapper(range(10))) it1 = iter(dl) print(next(it1)) # 0 it2 = iter(dl) # No reset here print(next(it2)) # 1 print(next(it1)) # 2 ``` New behavior with this PR: ```python dl = DataLoader2(IterableWrapper(range(10))) it1 = iter(dl) print(next(it1)) # 0 it2 = iter(dl) # DataLoader2 resets with the creation of a new iterator print(next(it2)) # 0 print(next(it1)) # Raises exception, since it1 is no longer valid ``` Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867) [ghstack-poisoned]
ghstack-source-id: 50e50b8909cf9d27656893cd4b185e1d577cf3a0 Pull Request resolved: #700
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Stack from ghstack:
Changes:
DataLoader2
by having its__iter__
method returns a wrapper that can track if an iterator is valid or not__iter__
returnsself
, but now it will return an instance of_DataLoader2Iterator
Prior behavior:
New behavior with this PR:
Differential Revision: D38262867