Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[train][tune] Safely check if the storage filesystem is pyarrow.fs.S3FileSystem #48216

Merged
merged 4 commits into from
Oct 30, 2024

Conversation

ccoulombe
Copy link
Contributor

@ccoulombe ccoulombe commented Oct 23, 2024

Why are these changes needed?

When PyArrow does not have support, checking if the instance is pyarrow.fs.S3FileSystem raises an exception.

>>> import pyarrow.fs
>>> pyarrow.fs.S3FileSystem
...
ImportError: The pyarrow installation is not built with support for 'S3FileSystem'

Fix by checking if the S3FileSystem is importable before doing this typecheck.

Related issue number

Fixes #48187

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
    • Tested by a hotfix locally

@rkooo567
Copy link
Contributor

Thanks for the contribution! Btw, it has a test failure;

Screenshot 2024-10-24 at 1 53 26 PM

@rkooo567 rkooo567 added the go add ONLY when ready to merge, run all tests label Oct 24, 2024
@rkooo567
Copy link
Contributor

can you fix that? I guess there are times the destination_file_system is None

@justinvyu
Copy link
Contributor

@ccoulombe I'd prefer not to do the string check. Let's do this instead:

try:
    from pyarrow.fs import S3FileSystem
except ImportError:
    S3FileSystem = None

if S3FileSystem and isinstance(..., S3FileSystem):
    ...

@ccoulombe
Copy link
Contributor Author

ccoulombe commented Oct 25, 2024

@justinvyu

I'd prefer not to do the string check.

Sure, but out of curiosity : why ?

fyi, type_name is used elsewhere in the code base to check for the type of FS

@justinvyu
Copy link
Contributor

String check may be less stable than checking for the class since pyarrow could change the type name. Other pyarrow.fs.FileSystem subclasses could also potentially collide with the same type name (ex: the fsspec s3fs, although this one is fine because there's a fsspec:: prefix). But I would rather just guarantee that this logic is only done for this specific pyarrow implementation since it's a bug in pyarrow.fs.S3FileSystem specifically.

Yeah, type_name should probably be removed for type checks, but I think the only other usage is for collecting telemetry.

ccoulombe and others added 2 commits October 25, 2024 15:58
When PyArrow does not have support, checking if the instance is `pyarrow.fs.S3FileSystem` raises an exception.

Fixes ray-project#48187

Signed-off-by: Charles Coulombe <ccoulombe@users.noreply.github.com>
Signed-off-by: Charles Coulombe <charles.coulombe@calculquebec.ca>
…mport

Signed-off-by: Charles Coulombe <charles.coulombe@calculquebec.ca>
@ccoulombe
Copy link
Contributor Author

@justinvyu Updated, and I signed off the commits afterwards. Cheers

ccoulombe and others added 2 commits October 25, 2024 16:43
Signed-off-by: Charles Coulombe <charles.coulombe@calculquebec.ca>
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@justinvyu justinvyu changed the title Updated storage to check type_name instead of instance [train][tune] Safely check if the storage filesystem is pyarrow.fs.S3FileSystem Oct 30, 2024
@justinvyu justinvyu merged commit 11d5ca4 into ray-project:master Oct 30, 2024
5 checks passed
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
…3FileSystem` (ray-project#48216)

## Why are these changes needed?

When PyArrow does not have support for S3, accessing `pyarrow.fs.S3FileSystem` raises an exception.
Fix by checking if the `S3FileSystem` is importable before doing this typecheck.

---------

Signed-off-by: Charles Coulombe <ccoulombe@users.noreply.github.com>
Signed-off-by: Charles Coulombe <charles.coulombe@calculquebec.ca>
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…3FileSystem` (ray-project#48216)

## Why are these changes needed?

When PyArrow does not have support for S3, accessing `pyarrow.fs.S3FileSystem` raises an exception.
Fix by checking if the `S3FileSystem` is importable before doing this typecheck.

---------

Signed-off-by: Charles Coulombe <ccoulombe@users.noreply.github.com>
Signed-off-by: Charles Coulombe <charles.coulombe@calculquebec.ca>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…3FileSystem` (ray-project#48216)

## Why are these changes needed?

When PyArrow does not have support for S3, accessing `pyarrow.fs.S3FileSystem` raises an exception.
Fix by checking if the `S3FileSystem` is importable before doing this typecheck.

---------

Signed-off-by: Charles Coulombe <ccoulombe@users.noreply.github.com>
Signed-off-by: Charles Coulombe <charles.coulombe@calculquebec.ca>
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Ray Train] hardcoded pyarrow.fs.S3Filesystem check in Ray Train but Arrow S3 support is optional
3 participants