Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataLoader2] Implementing single iterator constraint #700

Closed
wants to merge 5 commits into from

Conversation

NivekT
Copy link
Contributor

@NivekT NivekT commented Jul 29, 2022

Stack from ghstack:

Changes:

  • Imposing the single iterator constraint on DataLoader2 by having its __iter__ method returns a wrapper that can track if an iterator is valid or not
  • Previously, __iter__ returns self, but now it will return an instance of _DataLoader2Iterator
  • Aside from the behavior related to reset (see below), we expect other behaviors to stay the same.

Prior behavior:

dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # No reset here
print(next(it2))  # 1
print(next(it1))  # 2

New behavior with this PR:

dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
print(next(it2))  # 0
print(next(it1))  # Raises exception, since it1 is no longer valid

Differential Revision: D38262867

NivekT added a commit that referenced this pull request Jul 29, 2022
ghstack-source-id: 7343eea9cd48b50fce4c88e614fa46fb5c05c143
Pull Request resolved: #700
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 29, 2022
@NivekT NivekT requested a review from ejguan July 29, 2022 00:00
@NivekT
Copy link
Contributor Author

NivekT commented Jul 29, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if those comments make sense to you.

torchdata/dataloader2/dataloader2.py Outdated Show resolved Hide resolved
torchdata/dataloader2/dataloader2.py Show resolved Hide resolved
Comment on lines 130 to 133
"""
Delegate methods (e.g. `limit`, `pause`, `resume`, etc) to the iterator
created by the ReadingService and DataPipe.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we directly delegate to datapipe_iter from dataloader_iter, we don't need this function.

Copy link
Contributor Author

@NivekT NivekT Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with this change, but I would flag that methods such as state_dict and shutdown will not be delegated to DataLoader2. I am going to change the implementation to do that. As long as that is not an issue, we can delegate directly to dataloader_iter.

Copy link
Contributor Author

@NivekT NivekT Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we want to allow users to do dl2.limit(), dl2.resume(), and etc? Or we want them to always invoke those from the iterator?

If we want the former, we will need to keep this method.

Copy link
Contributor

@ejguan ejguan Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, users should only invoke such APIs on iterator object. I am fine either.

cc: @Miiira

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think invoking from one single place (i.e. iterator) rather than multiple places is better and less error-prone.

The internal test is calling .resume on the iterator, so I think it is fine to remove the API from the DataLoader2 class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we only need limit, resume on iterator. I'm good with removing this from DataLoader2. Maybe we also want to look at Lightning DataModule train_dataloader return type with this change

torchdata/dataloader2/dataloader2.py Outdated Show resolved Hide resolved
torchdata/dataloader2/dataloader2.py Show resolved Hide resolved
test/test_dataloader2.py Outdated Show resolved Hide resolved
NivekT added a commit that referenced this pull request Jul 29, 2022
ghstack-source-id: 5255e18391a09b4c3d05771f6bd3cca03a5d1235
Pull Request resolved: #700
NivekT added a commit that referenced this pull request Jul 29, 2022
ghstack-source-id: 7d139b7a313f304cdd6676fcc944dcfa896bff02
Pull Request resolved: #700
@NivekT
Copy link
Contributor Author

NivekT commented Jul 29, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Changes:
* Imposing the single iterator constraint on `DataLoader2` by having its `__iter__` method returns a wrapper that can track if an iterator is valid or not
* Previously, `__iter__` returns `self`, but now it will return an instance of `_DataLoader2Iterator`
* Aside from the behavior related to reset (see below), we expect other behaviors to stay the same.

Prior behavior:
```python
dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # No reset here
print(next(it2))  # 1
print(next(it1))  # 2
```

New behavior with this PR:
```python
dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
print(next(it2))  # 0
print(next(it1))  # Raises exception, since it1 is no longer valid
```

Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 1, 2022
ghstack-source-id: b84c6c7ae0be647e53315fffc86964ae1f2eba52
Pull Request resolved: #700
@NivekT
Copy link
Contributor Author

NivekT commented Aug 1, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor Author

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this PR (one fix) and added a test to make through attributes/methods are being passed through. I will be looking at internal tests again shortly.

Comment on lines +24 to +37
class _ReadingServiceWrapper:
def __init__(self, dp):
self.dp = dp

def __iter__(self):
self.it = iter(self.dp)
return self

def __next__(self):
return next(self.it)

@staticmethod
def return_one():
return 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A method that dataloader._datapipe_iter will have

Changes:
* Imposing the single iterator constraint on `DataLoader2` by having its `__iter__` method returns a wrapper that can track if an iterator is valid or not
* Previously, `__iter__` returns `self`, but now it will return an instance of `_DataLoader2Iterator`
* Aside from the behavior related to reset (see below), we expect other behaviors to stay the same.

Prior behavior:
```python
dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # No reset here
print(next(it2))  # 1
print(next(it1))  # 2
```

New behavior with this PR:
```python
dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
print(next(it2))  # 0
print(next(it1))  # Raises exception, since it1 is no longer valid
```

Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 1, 2022
ghstack-source-id: 50e50b8909cf9d27656893cd4b185e1d577cf3a0
Pull Request resolved: #700
@NivekT
Copy link
Contributor Author

NivekT commented Aug 1, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants