Add list_file() functional API to FSSpecFileLister and IoPathFileLister #463

Closed
wants to merge 22 commits into from

Conversation

bushshrub
Contributor

@bushshrub bushshrub commented May 25, 2022

Fixes #387

Changes

  • Adds list_file() method on IoPathFileListerIterDataPipe
  • Adds list_file() method on FSSpecFileListerIterDataPipe
  • Add tests for those methods

Additional comments

I feel as if the implementation is quite naive. Would appreciate any feedback on it.

@facebook-github-bot
Contributor

Hi @xiurobert!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@bushshrub
Contributor Author

Additional notes: Perhaps return list(self) could work as well.

Contributor

@ejguan ejguan left a comment

Sorry, the functional API refers to functional_datapipe. See ref:

@functional_datapipe("open_file_by_fsspec")

By adding this decorator to the class, we can invoke the class through a functional call.

@functional_datapipe("list_file_by_fsspec")
class FSSpecFileListerIterDataPipe(IterDataPipe[str]):
    ...

dp = IterableWrapper(["file://folder", ])
dp = dp.list_file_by_fsspec()  # Functional API here 
list(dp)  # return list of files in folder
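
To make the idea of a name-based functional call concrete, here is a minimal pure-Python sketch of how a decorator can register a class under a name and expose it as a method on other pipes. Everything in it (ToyIterDataPipe, toy_functional_datapipe, ToyWrapper, ToyLister, FAKE_FS, list_children) is hypothetical and written only for illustration; it is not the PyTorch or torchdata implementation, and it uses an in-memory dict instead of a real filesystem.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Type


class ToyIterDataPipe:
    """Hypothetical stand-in for an iterable data pipe (not the PyTorch class)."""

    # Maps a functional name to the class registered under that name.
    _functions: Dict[str, Type["ToyIterDataPipe"]] = {}

    def __getattr__(self, name: str) -> Callable[..., "ToyIterDataPipe"]:
        # Only reached when normal attribute lookup fails.
        registry = ToyIterDataPipe._functions
        if name in registry:
            cls = registry[name]
            # The current pipe becomes the first constructor argument.
            return lambda *args, **kwargs: cls(self, *args, **kwargs)
        raise AttributeError(name)

    def __iter__(self) -> Iterator[str]:
        raise NotImplementedError


def toy_functional_datapipe(name: str):
    """Register the decorated class under `name` (illustrative only)."""

    def decorator(cls: Type[ToyIterDataPipe]) -> Type[ToyIterDataPipe]:
        ToyIterDataPipe._functions[name] = cls
        return cls

    return decorator


class ToyWrapper(ToyIterDataPipe):
    """Wraps a plain iterable so it can start a pipeline."""

    def __init__(self, items: Iterable[str]) -> None:
        self.items = list(items)

    def __iter__(self) -> Iterator[str]:
        return iter(self.items)


@toy_functional_datapipe("list_children")
class ToyLister(ToyIterDataPipe):
    """Yields the entries of each root taken from the source pipe."""

    def __init__(self, source_dp: ToyIterDataPipe, listing: Dict[str, List[str]]) -> None:
        self.source_dp = source_dp
        self.listing = listing

    def __iter__(self) -> Iterator[str]:
        for root in self.source_dp:
            yield from self.listing.get(root, [])


# In-memory stand-in for a directory tree.
FAKE_FS = {
    "folder_a": ["folder_a/1.txt", "folder_a/2.txt"],
    "folder_b": ["folder_b/3.txt"],
}

dp = ToyWrapper(["folder_a", "folder_b"])
dp = dp.list_children(FAKE_FS)  # functional call resolved through the registry
result = list(dp)
```

The design point is that the caller never names ToyLister directly: the registered string is the public entry point, which is why the review asks for a decorator on the existing class rather than a new method.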

@bushshrub
Contributor Author

Sorry, my bad. Must have misunderstood. Will make the changes

@bushshrub
Contributor Author

@ejguan updated it along with relevant tests

test/test_local_io.py (outdated, resolved)
test/test_local_io.py (outdated, resolved)
test/test_fsspec.py (outdated, resolved)
@facebook-github-bot facebook-github-bot added the CLA Signed label May 25, 2022
@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@bushshrub
Contributor Author

All tests are passing on my machine so far.

Contributor

@ejguan ejguan left a comment

Thank you so much. Overall, LGTM with two minor comments.

test/test_fsspec.py (outdated, resolved)
test/test_local_io.py (outdated, resolved)
Contributor

@NivekT NivekT left a comment

Thanks for working on this!

Question: Do you plan on working on pytorch/pytorch#78263 as well? I see an issue but not a PR for PyTorch Core.

@bushshrub
Contributor Author

Yep, I do intend to do so. I opened the issue because the CONTRIBUTING.md file specifies I should be opening one before working on any "features"

@bushshrub
Contributor Author

@NivekT is it ok if I directly open the pull request with PyTorch core? Their contributing guidelines mention that I should be opening an issue before implementing a feature.

@bushshrub
Contributor Author

Thanks for working on this!

Question: Do you plan on working on pytorch/pytorch#78263 as well? I see an issue but not a PR for PyTorch Core.

Just saw the PyTorch FAQ, I could probably open the PR since it's a small change.

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

torchdata/datapipes/iter/load/iopath.py (outdated, resolved)
torchdata/datapipes/iter/load/fsspec.py (outdated, resolved)
Contributor

@ejguan ejguan left a comment

Thank you, LGTM

@bushshrub bushshrub requested a review from NivekT May 31, 2022 15:33
@bushshrub
Contributor Author

Will the APIs in torchdata be updated with the new grammar as well?

@ejguan
Contributor

ejguan commented May 31, 2022

Will the APIs in torchdata be updated with the new grammar as well?

Yeah. @NivekT Just opened a PR to change the names of DataPipe. See: #479

@bushshrub
Contributor Author

Awesome.

test/test_fsspec.py (outdated, resolved)
Contributor

@ejguan ejguan left a comment

There are a few bugs in your tests.

test/test_local_io.py (outdated, resolved)
test/test_fsspec.py (outdated, resolved)
test/test_fsspec.py (outdated, resolved)
@bushshrub bushshrub requested a review from ejguan June 1, 2022 04:02
Contributor

@ejguan ejguan left a comment

Could you please also add tests for list_files_by_iopath in

data/test/test_local_io.py

Lines 655 to 673 in aab67f3

    @skipIfNoIoPath
    def test_io_path_file_lister_iterdatapipe(self):
        datapipe = IoPathFileLister(root=self.temp_sub_dir.name)
        # check all file paths within sub_folder are listed
        for path in datapipe:
            self.assertTrue(path in self.temp_sub_files)

    @skipIfNoIoPath
    def test_io_path_file_lister_iterdatapipe_with_list(self):
        datapipe = IoPathFileLister(root=[self.temp_sub_dir.name, self.temp_sub_dir_2.name])
        file_lister = list(datapipe)
        file_lister.sort()
        all_temp_files = list(self.temp_sub_files + self.temp_sub_files_2)
        all_temp_files.sort()
        # check all file paths within sub_folder are listed
        self.assertEqual(file_lister, all_temp_files)
?

datapipe = IterableWrapper(["file://" + self.temp_sub_dir.name, "file://" + self.temp_sub_dir_2.name])
datapipe = datapipe.list_files_by_fsspec()
res = list(datapipe).sort()
self.assertEqual(res, temp_files)
Contributor

It seems the test is still failing here. Could you please fix it by mimicking the test case from line 80 to line 89?
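
Judging only from the snippet under review, one likely cause of the failure is that `list.sort()` sorts in place and returns `None`, so `res` ends up as `None` rather than a sorted list. The sketch below shows the difference; the file names are hypothetical placeholders, not the actual temporary files used by the test suite.

```python
# Hypothetical file names standing in for the temporary test files.
listed = ["sub/b.txt", "sub/a.txt"]
expected = ["sub/a.txt", "sub/b.txt"]

# list.sort() sorts in place and returns None, so this binds None to res.
res = list(listed).sort()

# Option 1: keep a reference to the list, then sort it in place.
in_place = list(listed)
in_place.sort()

# Option 2: sorted() returns a new sorted list and leaves its input untouched.
via_sorted = sorted(listed)
```

Either option gives a list that can be compared with `assertEqual`; the in-place form matches the pattern already used by the existing IoPathFileLister tests quoted in this thread.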

@bushshrub
Contributor Author

Could you please also add tests for list_files_by_iopath in

data/test/test_local_io.py

Lines 655 to 673 in aab67f3

    @skipIfNoIoPath
    def test_io_path_file_lister_iterdatapipe(self):
        datapipe = IoPathFileLister(root=self.temp_sub_dir.name)
        # check all file paths within sub_folder are listed
        for path in datapipe:
            self.assertTrue(path in self.temp_sub_files)

    @skipIfNoIoPath
    def test_io_path_file_lister_iterdatapipe_with_list(self):
        datapipe = IoPathFileLister(root=[self.temp_sub_dir.name, self.temp_sub_dir_2.name])
        file_lister = list(datapipe)
        file_lister.sort()
        all_temp_files = list(self.temp_sub_files + self.temp_sub_files_2)
        all_temp_files.sort()
        # check all file paths within sub_folder are listed
        self.assertEqual(file_lister, all_temp_files)

?

I believe they are here https://github.com/pytorch/data/pull/463/files#diff-6e69ca11dfe73793a94592ec3e4a303e6807afa8a7fed4d88168b50d9be829e3R663-R667

@ejguan
Contributor

ejguan commented Jun 2, 2022

You can run your test on your local machine via python test/test_fsspec.py and python test/test_local_io.py. You might need to install fsspec and iopath via pip install fsspec iopath.

@bushshrub
Contributor Author

Changes made @ejguan

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@ejguan ejguan left a comment

Thank you, LGTM

@bushshrub
Contributor Author

bushshrub commented Jun 2, 2022

Did this work? I squashed the commits locally and force-pushed.

@ejguan
Contributor

ejguan commented Jun 2, 2022

Did this work?

It works now. Let me import your PR. Don't worry about squashing the commits. We will do it automatically.

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@bushshrub
Contributor Author

Great.

ejguan pushed a commit to ejguan/data that referenced this pull request Jun 6, 2022
…er (pytorch#463)

Summary:
Fixes pytorch#387

### Changes
- Adds `list_file()` method on `IoPathFileListerIterDataPipe`
- Adds `list_file()` method on `FSSpecFileListerIterDataPipe`
- Add tests for those methods

#### Additional comments
I feel as if the implementation is quite naive. Would appreciate any feedback on it.

Pull Request resolved: pytorch#463

Reviewed By: NivekT

Differential Revision: D36777142

Pulled By: ejguan

fbshipit-source-id: 1c4474776f3fcd377ae545bd8bd7bf26d0b2fa88
ejguan pushed a commit that referenced this pull request Jun 6, 2022
…er (#463)

Labels
CLA Signed
Development

Successfully merging this pull request may close these issues.

Add functional API to FileLister
4 participants