-
Notifications
You must be signed in to change notification settings - Fork 581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: add an utility to scan cache #990
Conversation
The documentation is not available anymore as the PR was closed or merged. |
Side note: this only scans the "new" cache layout and makes some assumptions on the expected structure. I am not yet sure all weird cases are handled. Especially:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks a lot for diving into this. Left a couple of comments but I don't feel strongly about anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach looks good to me! Looking forward the final PR :)
@osanseviero If you have time could you try to checkout to this branch and run If too complicated, I will update the PR with a proper CLI when I have time. |
I got the following error
|
@osanseviero Thanks for trying it out ! I am not exactly sure what happened here (surely a cache folder without |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a few comments, looks generally good to me
Another question/comment: maybe the CachedRepoInfo
should also expose a dict of refs, or maybe a helper to get a CachedRevisionInfo
from a ref
About the pruning, IMO we can keep it for later and/or leave it as the user's responsibility. It will be hard to find a single strategy that works well to decide what to prune, so it should maybe be up to the user to implement. For instance, you mention lastModified
(of a file on disk?) but i have an old snapshot of gpt2 I use all the time... Or you have a lastFileAccess file stat on some OSes but not sure it's robust either...
@julien-c Yep I talked about In general pruning/cache management will be done separately. My opinion is that we can have an helper to delete a specific def delete_cached_revision(repo_id: str, repo_type: str, revision: str) -> None:
... EDIT: added |
Good idea, I've added a |
Codecov Report
@@ Coverage Diff @@
## main #990 +/- ##
==========================================
+ Coverage 81.45% 82.16% +0.70%
==========================================
Files 31 35 +4
Lines 3419 3588 +169
==========================================
+ Hits 2785 2948 +163
- Misses 634 640 +6
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
I took the feedback, made some changes and added a proper command to the existing In the meantime I will write documentation and docstrings for this tool.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems pretty solid to me, approving, but please get another reviewer's feedback before merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool and excellent docs! Thanks for this 🔥 (I didn't review all the code in depth yet, mostly tests, usage and documentation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very impressive PR! You have codecov complaining about a few internal methods that aren't tested; Not super important, but I would also aim for coverage for these methods.
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Thanks @LysandreJik and @osanseviero for your review and feedback! EDIT: i'll add some tests as suggested by Lysandre :) |
…ggingface_hub into 972-utility-to-list-cache
Following issue #972.
This is a draft of what the scan of the cache directory could look like. At the moment there is only a
scan_cache
method that returns structured information about the cache.I think the most important question is what information do we want in
HFCacheInfo
,CachedRepoInfo
,CachedRevisionInfo
andCachedFileInfo
objects. In theory, the current content would be enough to make the utils that @sgugger described in the issue. Another information that I have not yet scanned is "last_updated" timestamp. I guess that could be useful to "prune" a repo to keep only the latest downloaded version (to confirm...).Any suggestion / feedback / comment is very welcomed as this is a draft (ping @osanseviero @sgugger @LysandreJik @mariosasko who showed interest in this feature).
I also made a quick and dirty script (scan.py
) to show what a CLI command could display:EDIT: refer to the updated CLI example in #990 (comment)