Add interface to fsspec
#943
The documentation is not available anymore as the PR was closed or merged.
I love it!
Note that usually such a filesystem implementation lives in its own package, e.g. s3fs, gcsfs, ossfs, adlfs. So if we want to be consistent with the rest of the ecosystem, we could have hffs that requires huggingface_hub.
cc @severo @julien-c this can be useful for datasets-server to use dask to process parquet files in hub repos 👀
src/huggingface_hub/hf_filesystem.py
Outdated
# TODO(QL): add sizes
self._repo_entries_spec[hf_file.rfilename] = {
    "name": hf_file.rfilename,
    "size": None,
The file sizes are still not available in the siblings?
Yep, still not available. Maybe @SBrandeis can help us with that.
On it
@lhoestq Yes, I think having a dedicated package is also OK. Still, I would like to hear what others think.
cc @adrinjalali @osanseviero do you have an opinion about whether this class should be in |
If it's going to depend on |
Given that it requires an extra dependency |
No strong opinion either. Given that it's not a lot of code, I think having it here would be fine, for simplicity's sake. Does |
It has no dependencies |
What do you think @osanseviero @Pierrci @SBrandeis?
Can we publish a "subpart" of |
I've also argued in favor of this in the past, but folks have made the argument that it might not get enough attention if it's a small library. I like the idea of a |
No strong opinion either regarding hosting it in its own package. If that's the convention (as it seems to be with s3fs, gcsfs etc) then definitely fine for me to split it, but if you'd like to contribute it here for simplicity's sake and to offer a powerful API on top of it, it's also fine by me.
Do you expect a lot of changes to be done to this code, or do you expect it to be pretty static? Do you expect users to use this feature (or hffs) directly, or for it to be used under the hood in libraries such as datasets?
@@ -40,6 +40,7 @@ def get_version() -> str:
    "pytest-cov",
    "datasets",
    "soundfile",
    "fsspec",
It should be added to its own extras so that it can be installed via `pip install huggingface_hub[fsspec]`, or be added as a requirement for a downstream library requiring it.
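To make the suggestion concrete, here is a minimal sketch of how a dedicated extra could be declared. The dictionary name `extras` and the `testing` group are assumptions for illustration and are not taken from the project's actual `setup.py`; only the package names come from the diff above.

```python
# Hypothetical excerpt of a setup.py; the variable name and the "testing"
# group are assumptions, not the project's real layout.
extras = {}

# Dedicated extra, so users can opt in with: pip install "huggingface_hub[fsspec]"
extras["fsspec"] = ["fsspec"]

# The test dependencies reuse the extra instead of listing fsspec twice.
extras["testing"] = ["pytest-cov", "datasets", "soundfile"] + extras["fsspec"]

# The mapping would then be handed to setuptools, e.g.:
#     setup(..., extras_require=extras)
```

With this layout, a downstream library could equally depend on `huggingface_hub[fsspec]` in its own requirements.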
We already have a similar class in I think the main usage is going to be
|
Even though the implementation is pretty simple at the moment, it can quickly become more complex if we decide to override some more |
I would avoid two libraries in this same repo as this has led to confusion in the past + maintenance becomes a bit trickier. I don't have a strong opinion between having the code within |
@lhoestq I've made some changes (namely the use of |
created the repo here: https://github.com/huggingface/hffs |
Hey everyone 👋 It seems you all came to the conclusion to continue the work in the separate repo https://github.com/huggingface/hffs. Does that mean we can close this PR then ? Btw I really like the |
Adds an interface to `fsspec` to enable access to the Hub as if it were a local file system. This implementation uses the following scheme to infer parameters required for the protocol class initialization from a remote URL:
hf://{repo_id}[@{revision}]:/path_in_repo
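As a rough illustration of how a URL in this scheme could be split into its parts, here is a small sketch. The names `HubPath` and `parse_hf_url` are hypothetical and this is not the parsing code from the PR; it also assumes that `path_in_repo` is stored without the leading slash.

```python
import re
from typing import NamedTuple, Optional


class HubPath(NamedTuple):
    """Hypothetical container for the parts of an hf:// URL."""
    repo_id: str
    revision: Optional[str]
    path_in_repo: str


# hf://{repo_id}[@{revision}]:/path_in_repo
_HF_URL = re.compile(
    r"^hf://(?P<repo_id>[^@:]+)(?:@(?P<revision>[^:]+))?:/(?P<path>.*)$"
)


def parse_hf_url(url: str) -> HubPath:
    """Split a URL into repo id, optional revision, and path in the repo."""
    match = _HF_URL.match(url)
    if match is None:
        raise ValueError(f"not a valid hf:// URL: {url!r}")
    # An absent "@revision" part yields None for the revision.
    return HubPath(match["repo_id"], match["revision"], match["path"])
```

For example, `parse_hf_url("hf://user/repo@main:/data/train.csv")` would yield the repo id `user/repo`, the revision `main`, and the path `data/train.csv`.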
For instance, this means one can simply push a Pandas dataframe to the Hub as follows:
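The original snippet did not survive on this page. The following is only a sketch of what such a call might look like, not the author's example. It assumes that the `hf` protocol is registered with `fsspec`, that `df` is a `pandas.DataFrame` (pandas resolves non-local URLs through `fsspec`), and that the caller is authenticated; `hf_url` and `push_dataframe_as_csv` are hypothetical helper names.

```python
from typing import Optional


def hf_url(repo_id: str, path_in_repo: str, revision: Optional[str] = None) -> str:
    """Build a URL following the hf://{repo_id}[@{revision}]:/path_in_repo scheme."""
    rev = f"@{revision}" if revision else ""
    path = path_in_repo.lstrip("/")
    return f"hf://{repo_id}{rev}:/{path}"


def push_dataframe_as_csv(df, repo_id: str, path_in_repo: str,
                          revision: Optional[str] = None) -> str:
    """Write `df` (expected to be a pandas.DataFrame) to a Hub repo as CSV.

    Assumes the "hf" protocol is registered with fsspec and that the repo
    already exists, since the PR notes that writing does not create it.
    """
    url = hf_url(repo_id, path_in_repo, revision)
    df.to_csv(url, index=False)
    return url
```

A call such as `push_dataframe_as_csv(df, "user/repo", "data.csv")` would then write to `hf://user/repo:/data.csv`.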
Things to discuss
- When `open`'s mode is `"w"`, the current implementation doesn't create a repo to store the file if it doesn't already exist and throws an error instead. Let me know if you think it should create a repo.
Notes
- `datasets`' `HfFileSystem` (that version doesn't expose methods that can modify a Hub repo)
- Use `fsspec.register_implementation(HfFileSystem.protocol, HfFileSystem)` to register the `hf` protocol in the local registry -> as soon as this PR is merged I'll add the protocol to `fsspec`'s "official" registry to make this step unnecessary
TODOs
- add the `hf` protocol to `fsspec`'s "official" registry
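The registration step mentioned in the notes can be sketched as follows. `DemoHubFileSystem` is a hypothetical stand-in for the PR's `HfFileSystem`, and it registers under the made-up protocol name `hf-demo` so that it does not shadow any real `hf` implementation. The sketch only shows the local registry round trip, not a working filesystem.

```python
from fsspec import AbstractFileSystem, get_filesystem_class, register_implementation


class DemoHubFileSystem(AbstractFileSystem):
    """Hypothetical stand-in for HfFileSystem; implements no I/O methods."""

    # "hf-demo" is an illustrative protocol name, not the one from the PR.
    protocol = "hf-demo"


# Add the class to fsspec's local (in-process) registry.
# clobber=True allows re-running this snippet in the same session.
register_implementation(DemoHubFileSystem.protocol, DemoHubFileSystem, clobber=True)

# fsspec can now resolve the protocol name back to the class.
resolved = get_filesystem_class("hf-demo")
```

Once a protocol is listed in the registry that ships with `fsspec`, this explicit call is no longer needed, which is the point of the TODO above.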