Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataPipe] Correct the type of exception that is being raised by ShufflerMapDataPipe #82666

Closed
wants to merge 2 commits into from

Conversation

NivekT
Copy link
Contributor

@NivekT NivekT commented Aug 2, 2022

Stack from ghstack:

Fixes pytorch/data#708

The following code snippet used to fail, now it has been added as a test case:

dp1 = dp.map.SequenceWrapper(range(10))
shuffle_dp1 = dp1.shuffle()
dp2 = dp.map.SequenceWrapper(range(10))
shuffle_dp2 = dp2.shuffle()
zip_dp = shuffle_dp1.zip(shuffle_dp2)
list(zip_dp)  # This used to fail

The issue was that ShufflerMapDataPipe raises a KeyError when an out of bound index is passed into it, but that was not handled by zip_dp's __getitem__ which only handled IndexError. With this change, it handles both.

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Aug 2, 2022

🔗 Helpful links

✅ No Failures (3 Pending)

As of commit 8d9ac2f (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@NivekT NivekT requested a review from ejguan August 2, 2022 20:17
@NivekT NivekT added module: data torch.utils.data release notes: dataloader release notes category topic: bug fixes topic category labels Aug 2, 2022
Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, why not change Shuffler to raise IndexError when key doesn't exist?

@NivekT
Copy link
Contributor Author

NivekT commented Aug 2, 2022

Then, why not change Shuffler to raise IndexError when key doesn't exist?

I have considered that and I'm mostly indifferent. I found the definitions:

IndexError is raised when trying to access an invalid index within a list , the KeyError is raised when accessing an invalid key within a dict.

The exception comes from Shuffler's __getitem__, it is a KeyError because it is raised by a dict. This makes sense to me since I don't necessarily think of MapDataPipe as a list.

More practically, I worry that if there is another MapDataPipe that raises KeyError rather than IndexError for a similar reason, Zipper will not be able to handle it. Therefore fixing it for Zipper may make more sense.

@ejguan
Copy link
Contributor

ejguan commented Aug 2, 2022

The exception comes from Shuffler's __getitem__, it is a KeyError because it is raised by a dict. This makes sense to me since I don't necessarily think of MapDataPipe as a list.

Even though we don't necessarily think of MapDataPipe as a list, users still want to do list(dp) right? This is the reason that IndexError is still well handled by list inside Zipper. Let's say I have a custom DataPipe:

class DP(MapDataPipe):
    def __getitem__(self, index):
        return self.dp[index]
    def __len__(self):
        return len(self.dp)

dp = SequenceWrapper(...).shuffle()
dp = DP(dp)
list(dp)

I would expect KeyError is raised as well. IndexError is used as a protocol for list function to notice the end of iteration. Without changing Shuffler, we are adding extra requirement for users to implement all DataPipe handling KeyError in __getitem__.

More practically, I worry that if there is another MapDataPipe that raises KeyError rather than IndexError for a similar reason, Zipper will not be able to handle it. Therefore fixing it for Zipper may make more sense.

I think this would be more dangerous as such KeyError might be expected. We are silently suppressing such Error.

WDYT?

@NivekT
Copy link
Contributor Author

NivekT commented Aug 2, 2022

If we leave Zipper as it is, aren't we expecting dp[index] to always raise IndexError if the index is out of bound (for any dp that gets passed into Zipper)?

This is because we aren't calling list() inside Zipper, but dp[index]

@ejguan
Copy link
Contributor

ejguan commented Aug 2, 2022

If we leave Zipper as it is, aren't we expecting dp[index] to always raise IndexError if the index is out of bound (for any dp that gets passed into Zipper)?

Yeah. Here is the link from Python official about __getitem__: https://docs.python.org/3/reference/datamodel.html#object.__getitem__
And, here is the link for the definition of Sequence and Mapping: https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes

I would expect MapDataPipe as a Sequence since only __getitem__ and __len__ are required.

@NivekT
Copy link
Contributor Author

NivekT commented Aug 2, 2022

Okay if we treat MapDataPipe as Sequence then it makes sense to always raise IndexError. I can make the change.

Fixes pytorch/data#708

The following code snippet used to fail, now it has been added as a test case:
```python
dp1 = dp.map.SequenceWrapper(range(10))
shuffle_dp1 = dp1.shuffle()
dp2 = dp.map.SequenceWrapper(range(10))
shuffle_dp2 = dp2.shuffle()
zip_dp = shuffle_dp1.zip(shuffle_dp2)
list(zip_dp)  # This used to fail
```

The issue was that `ShufflerMapDataPipe` raises a `KeyError` when an out of bound index is passed into it, but that was not handled by `zip_dp`'s `__getitem__` which only handled `IndexError`. With this change, it handles both.

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 2, 2022
ghstack-source-id: 05929c9b4dbe55d2968c8123217dc6a27775be6d
Pull Request resolved: #82666
@NivekT NivekT requested a review from ejguan August 3, 2022 18:30
@NivekT NivekT changed the title [DataPipe] Fix ZipperMapDataPipe handling of KeyError [DataPipe] Correct the type of exception that is being raised by ShufflerMapDataPipe Aug 3, 2022
Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@NivekT
Copy link
Contributor Author

NivekT commented Aug 3, 2022

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

facebook-github-bot pushed a commit that referenced this pull request Aug 4, 2022
…flerMapDataPipe (#82666) (#82666)

Summary:
Fixes pytorch/data#708

The following code snippet used to fail, now it has been added as a test case:
```python
dp1 = dp.map.SequenceWrapper(range(10))
shuffle_dp1 = dp1.shuffle()
dp2 = dp.map.SequenceWrapper(range(10))
shuffle_dp2 = dp2.shuffle()
zip_dp = shuffle_dp1.zip(shuffle_dp2)
list(zip_dp)  # This used to fail
```

The issue was that `ShufflerMapDataPipe` raises a `KeyError` when an out of bound index is passed into it, but that was not handled by `zip_dp`'s `__getitem__` which only handled `IndexError`. With this change, it handles both.

Pull Request resolved: #82666
Approved by: https://github.com/ejguan

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/14b660fcc0c32a7478c69a25a253b67d0ed36364

Reviewed By: kit1980

Differential Revision: D38424978

Pulled By: NivekT

fbshipit-source-id: f3de6e0bb0d74249b51dd77f7a58de4ac35c4e2e
@facebook-github-bot facebook-github-bot deleted the gh/nivekt/54/head branch August 7, 2022 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants