[RFC] Proposal for a way to cache files in downstream libraries #1088
Codecov Report
Base: 84.91% // Head: 84.99% // Increases project coverage by +0.07%

Additional details and impacted files (coverage diff, main vs. #1088):

             main    #1088      +/-
Coverage   84.91%   84.99%   +0.07%
Files          38       39       +1
Lines        3931     3951      +20
Hits         3338     3358      +20
Misses        593      593
Cool thanks ! Can it even be possible to have the "datasets" folder one level higher ?
Then there's "modules" (top level as well), which is the directory where we copy Python scripts. IMO this one doesn't need to be part of this change (for now ?) For context: the "modules" directory is added to the Python path in I think we keep using one directory "modules" at the top level; this way any HF lib only needs to add one directory to the Python path. For now it can be excluded from the HFH cache since it only contains small files |
We also use the |
Thanks for the feedback :) To answer your questions 1 by 1:
What would be the goal of having it at the root of the cache ? The main advantage when I thought about it was that, since it is a new folder in the cache, it would not conflict with any of the existing libraries (datasets, transformers and others). In any case, the downstream library shouldn't care too much where the cache is; it is only a path returned from
I am fine with keeping the
In any case, as I see it, the
Does
Yep, very specific use case (and already existing) so no need to change IMO. |
Users of Therefore if the main concern is to avoid conflicts I think we can just rename it "datasets-cache"
Yup it's quite small (usually a few KB per dataset)
Both add |
I would also like to avoid conflicts with existing folders in the future. Also, it would allow a clear distinction between folders that can be scanned via
Then let's not focus too much on that. It is lightweight and already working, so no need to touch it. |
(nit) I just thought about it but renaming |
Ok I see ! I like EDIT: hmm actually let me think more about this |
Here is another idea we had with @polinaeterna
In my opinion this is ideal for a This is a bit different from what's currently done in |
yes this structure is good for us ( but it would require either adding both |
@lhoestq @polinaeterna I'm sorry but I think this will be unlikely to happen exactly like this. Repos from the Hub are now cached under I see at least 2 reasons to keep the
That being said, it is still possible to move the In the end, you would end up with something like this:
Would that be ok ? |
Hi @Wauplin, Indeed, yesterday we had a meeting of the datasets team and we discussed this topic. The request by @lhoestq and @polinaeterna to put everything in a per-dataset directory has an underlying need: to be able to easily find/delete all cached files of a specific dataset:
Therefore, if we agree on your proposed directory structure:
Additionally, for the "modules" directory, I think you were right (@lhoestq could you confirm this?): they need to be in a separate parent directory (that will be added to the import path), for all datasets: Another remark: please note the specificity of the "downloaded" directory:
I think this difference must be implemented downstream by datasets team. |
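To illustrate the underlying need mentioned above (find/delete all cached files of one dataset), here is a rough stdlib-only sketch. It assumes a layout where each dataset gets its own folder under a library-level directory; `delete_namespace`, the dataset names and the `extracted` subfolder are hypothetical and only there for the demo, not something `datasets` or `huggingface_hub` is confirmed to provide.

```python
import shutil
import tempfile
from pathlib import Path


def delete_namespace(library_dir, namespace):
    """Hypothetical helper: remove everything cached under one namespace.

    Returns the number of files that were removed.
    """
    target = Path(library_dir) / namespace
    if not target.is_dir():
        return 0
    # Count regular files before removing the whole per-dataset folder.
    removed = sum(1 for p in target.rglob("*") if p.is_file())
    shutil.rmtree(target)
    return removed


# Demo in a throwaway directory (assumed layout: <library>/<dataset>/<subfolder>).
root = Path(tempfile.mkdtemp()) / "datasets"
for rel in ("squad/downloaded/a.bin", "squad/extracted/b.txt", "glue/downloaded/c.bin"):
    f = root / rel
    f.parent.mkdir(parents=True, exist_ok=True)
    f.write_text("x")

removed_count = delete_namespace(root, "squad")
```

With one folder per dataset, deleting a dataset's cache is a single `rmtree` and other datasets are left untouched, which is the property the per-dataset request was after.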
I see, thanks @Wauplin and @albertvillanova :) sounds good to me super excited about this !
We could also add the |
Not so much opinionated about this but if you do so I think you should align with the |
I see - let's keep |
Ok everyone, thanks for the feedback and discussions. I think we reached a common agreement for this PR then. @LysandreJik @lhoestq I've pinged you for the review, but other contributors are welcome to review as well. Thanks in advance ! |
Very impressive PR, thanks @Wauplin!
Thanks for the review @LysandreJik , I'm merging :) |
Awesome thanks :) |
This is a proposal following discussions started with the `datasets` team (@lhoestq @albertvillanova).

The goal is to have a proper way to cache any kind of files from a downstream library and manage them (e.g. scan and delete) from `huggingface_hub`. From `hfh`'s perspective, there is not much work to do. We should have a canonical procedure to generate cache paths for a library. Then, within a cache folder, the downstream library handles its files as it wants. Once this helper starts to be used, we can adapt the `scan-cache` and `delete-cache` commands.

I tried to document the `cached_assets_path()` helper to describe the way I see it. Any feedback is welcome, this is really just a proposal. All the examples are very `datasets`-focused but I think this could benefit other libraries such as `transformers` (@sgugger @LysandreJik), `diffusers` (@apolinario @patrickvonplaten) or `skops` (@adrinjalali @merveenoyan) to store any kind of intermediate files. IMO the difficulty mainly resides in getting the feature used 😄.

EDIT: see generated documentation here.

EDIT 2: `assets/` might be a better naming here (common naming in dev). WDYT ?

(cc @julien-c @osanseviero as well)
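To make the "canonical procedure to generate cache paths" idea a bit more concrete, here is a minimal stdlib-only sketch. It is not the actual `cached_assets_path()` implementation: `sketch_assets_path`, its parameters, the `library/namespace/subfolder` layout and the `--` replacement for slashes are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def sketch_assets_path(assets_dir, library_name, namespace="default", subfolder="default"):
    """Hypothetical stand-in for the proposed helper.

    Builds (and creates) a per-library folder inside a shared assets directory
    and returns it; what goes inside is up to the downstream library.
    """
    # Assumption: replace "/" so a namespace like "user/name" stays one folder.
    parts = [str(p).replace("/", "--") for p in (library_name, namespace, subfolder)]
    path = Path(assets_dir).joinpath(*parts)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a throwaway directory so nothing touches a real cache.
demo_root = Path(tempfile.mkdtemp())
demo_path = sketch_assets_path(demo_root, "datasets", namespace="SQuAD", subfolder="download")
(demo_path / "file.bin").write_bytes(b"\x00")
```

The point of the sketch is only the division of responsibilities: the helper hands back a folder, and the downstream library manages the files in it as it wants.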
Example:
And the generated tree: