Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add interface to fsspec #943

Closed
wants to merge 5 commits into from
Closed

Add interface to fsspec #943

wants to merge 5 commits into from

Conversation

mariosasko
Copy link
Contributor

@mariosasko mariosasko commented Jul 7, 2022

Adds an interface to fsspec to enable access to the Hub as if it were a local file system.

This implementation uses the following scheme to infer parameters required for the protocol class initialization from a remote URL:
hf://{repo_id}[@{revision}]:/path_in_repo

For instance, this means one can simply push a Pandas dataframe to the Hub as follows:

df.to_csv("hf://repo_id:/path_in_repo", storage_options={"repo_type": "dataset"})

Things to discuss

  • if open's mode is "w", the current implementation doesn't create a repo to store the file if it doesn't already exist and throws an error instead. Let me know if you think it should create a repo.
  • also, feel free to suggest a better protocol scheme

Notes

  • this PR builds upon datasets' HfFileSystem (that version doesn't expose methods that can modify a Hub repo)
  • (for local testing) currently requires calling fsspec.register_implementation(HfFileSystem.protocol, HfFileSystem) to register the hf protocol in the local registry -> as soon as this PR is merged I'll add the protocol to fsspec's "official" registry to make this step unnecesarry

TODOs

  • add docs in a subsequent PR
  • add the hf protocol to fsspec's "official" registry

@mariosasko mariosasko requested review from adrinjalali and lhoestq July 7, 2022 13:35
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jul 7, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it !

Note that usually such a filesystem implementation is in its own package, e.g. s3fs, gcsfs, ossfs, adlfs. So if we want to be consistent with the rest of the ecosystem we could have hffs that requires huggingface_hub

cc @severo @julien-c this can be useful for datasets-server to use dask to process parquet files in hub repos 👀

# TODO(QL): add sizes
self._repo_entries_spec[hf_file.rfilename] = {
"name": hf_file.rfilename,
"size": None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file sizes are still not available in the siblings ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, still not available. Maybe @SBrandeis can help us with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On it

@mariosasko
Copy link
Contributor Author

@lhoestq Yes, I think having a dedicated package is also OK. huggingface_hub usually stores Hub integrations (and is well-established), so this is the reason why I put the code here. But as long as we register this implementation in the "official" registry, discoverability shouldn't be a problem anyways.

Still, I would like to hear what others think.

@lhoestq
Copy link
Member

lhoestq commented Jul 20, 2022

cc @adrinjalali @osanseviero do you have an opinion about whether this class should be in huggingface_hub or in a separate package hffs ?

@adrinjalali
Copy link
Contributor

If it's going to depend on huggingface_hub then I have no preference. But as a user, I'd expect a package like hffs to be or to depend on a package much smaller than hfh. So no strong preference here on my side.

@lhoestq
Copy link
Member

lhoestq commented Jul 20, 2022

Given that it requires an extra dependency fsspec and for consistency with the other filesystems contributed by the community, I'd be down to create the hffs package

@julien-c
Copy link
Member

no strong opinion either

Given that it's not a lot of code I think having it here would be fine, for simplicity purposes.

Does fsspec bring other transitive dependencies, or is it dependency-less?

@lhoestq
Copy link
Member

lhoestq commented Jul 20, 2022

Does fsspec bring other transitive dependencies, or is it dependency-less?

It has no dependencies

@julien-c
Copy link
Member

what do you think @osanseviero @Pierrci @SBrandeis?

@SBrandeis
Copy link
Contributor

Can we publish a "subpart" of huggingface_hub as a package on PyPI?
Maybe it makes sense to have two separate packages but to maintain them in the same repo (huggingface_hub)?
I don't know if this is doable or desirable.

@adrinjalali
Copy link
Contributor

Can we publish a "subpart" of huggingface_hub as a package on PyPI?

I've also argued in favor of this in the past, but folks have made the argument that it might not get enough attention if it's a small library. I like the idea of a huggingface_hub-core which only includes the REST wrappers and probably the git interfaces. If that happens, we can also re-work the git interface part.

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion either regarding hosting it in its own package. If that's the convention (as it seems to be with s3fs, gcsfs etc) then definitely fine for me to split it, but if you'd like to contribute it here for simplicity's sake and to offer a powerful API on top of it, it's also fine by me.

Do you expect a lot of changes to be done to this code, or do you expect it to be pretty static? Do you expect users to use this feature (or hffs) directly, or for it to be used under the hood in libraries such as datasets?

@@ -40,6 +40,7 @@ def get_version() -> str:
"pytest-cov",
"datasets",
"soundfile",
"fsspec",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be added to its own extras so that it can be installed via pip install huggingface_hub[fsspec] or be added as a requirement for a downstream library requiring it

@lhoestq
Copy link
Member

lhoestq commented Jul 21, 2022

Do you expect users to use this feature (or hffs) directly, or for it to be used under the hood in libraries such as datasets?

We already have a similar class in datasets (read-only though) that we use to run glob patterns in dataset repositories.

I think the main usage is going to be

  1. for pandas/dask users to read/write to HF Hub (this also includes the datasets-server project, which is likely to use dask for distributed processing)
  2. for HF users training models and streaming data from the Hub (via datasets or others like torch datapipes)
  3. for HF power users who need to do advanced manipulations in repos

@mariosasko
Copy link
Contributor Author

Even though the implementation is pretty simple at the moment, it can quickly become more complex if we decide to override some more fsspec methods, add support for async IO, etc., so I think it's OK to have a separate repo to avoid unnecessary clutter in this repo (I'm also not a fan of having multiple packages/submodules in this repo tbh) and to follow fsspec's convention.

@osanseviero
Copy link
Contributor

Maybe it makes sense to have two separate packages but to maintain them in the same repo (huggingface_hub)?
I don't know if this is doable or desirable.

I would avoid two libraries in this same repo as this has led to confusion in the past + maintenance becomes a bit trickier.

I don't have a strong opinion between having the code within huggingface_hub library or if it should be consistent with rest of ecosystem hffs, but hffs makes a bit more sense I think.

@mariosasko
Copy link
Contributor Author

@lhoestq I've made some changes (namely the use of self.dircache for caching) to be more consistent with the existing implementations. Let me know what you think.

@lhoestq
Copy link
Member

lhoestq commented Jul 27, 2022

created the repo here: https://github.com/huggingface/hffs

@Wauplin
Copy link
Contributor

Wauplin commented Aug 17, 2022

Hey everyone 👋 It seems you all came to the conclusion to continue the work in the separate repo https://github.com/huggingface/hffs. Does that mean we can close this PR then ?

Btw I really like the df.to_csv("hf://repo_id:/path_in_repo") syntax ! It looks super cool 🔥

@mariosasko mariosasko closed this Aug 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants