Feature: add an utility to scan cache #990

Merged: 39 commits, Aug 30, 2022

@Wauplin Wauplin commented Aug 11, 2022

Following issue #972.

This is a draft of what the scan of the cache directory could look like. At the moment there is only a scan_cache method that returns structured information about the cache.

I think the most important question is what information we want in the HFCacheInfo, CachedRepoInfo, CachedRevisionInfo and CachedFileInfo objects. In theory, the current content should be enough to build the utilities that @sgugger described in the issue. One piece of information that I have not yet scanned is a "last_updated" timestamp. I guess it could be useful to "prune" a repo and keep only the latest downloaded version (to be confirmed...).
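To make the question concrete, here is a minimal sketch of how the four objects could nest. The class names come from this PR, but every field name below is an assumption chosen for illustration and may not match the actual implementation. The sketch also assumes that a blob can be shared between revisions, so sizes are computed per unique blob rather than per file.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import FrozenSet


@dataclass(frozen=True)
class CachedFileInfo:
    # All field names are hypothetical, not taken from the PR.
    file_name: str
    file_path: Path    # symlink located under snapshots/<revision>/
    blob_path: Path    # target of the symlink, located under blobs/
    size_on_disk: int  # size of the blob, in bytes


@dataclass(frozen=True)
class CachedRevisionInfo:
    commit_hash: str
    files: FrozenSet[CachedFileInfo]
    refs: FrozenSet[str] = frozenset()

    @property
    def size_on_disk(self) -> int:
        # Count each blob once, even if several files point to it.
        return sum({f.blob_path: f.size_on_disk for f in self.files}.values())


@dataclass(frozen=True)
class CachedRepoInfo:
    repo_id: str
    repo_type: str
    revisions: FrozenSet[CachedRevisionInfo]

    @property
    def size_on_disk(self) -> int:
        # Blobs shared between revisions are counted only once.
        blobs = {f.blob_path: f.size_on_disk for rev in self.revisions for f in rev.files}
        return sum(blobs.values())


@dataclass(frozen=True)
class HFCacheInfo:
    repos: FrozenSet[CachedRepoInfo]

    @property
    def size_on_disk(self) -> int:
        return sum(repo.size_on_disk for repo in self.repos)
```

With this shape, the size of a repo can be smaller than the sum of its revision sizes, because a blob referenced by two snapshots is stored only once.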

Any suggestion, feedback or comment is very welcome as this is a draft (ping @osanseviero @sgugger @LysandreJik @mariosasko, who showed interest in this feature).


I also made a quick and dirty script (scan.py) to show what a CLI command could display:

EDIT: refer to the updated CLI example in #990 (comment)

Wauplin commented Aug 11, 2022

Side note: this only scans the "new" cache layout and makes some assumptions about the expected structure. I am not yet sure that all edge cases are handled. In particular:

  1. Is it possible to have a blob that is never referenced (no symlink pointing to it)?
  2. Is it possible for a symlink to refer to a non-existing blob?
  3. Is it possible for a blob to be only partially downloaded?
  4. Is there a cache lock that I should be aware of? If someone is scanning and downloading at the same time, results might not be accurate.
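The first two questions can be checked empirically on an existing cache. The sketch below is not part of the PR: it assumes a repo folder containing a `blobs/` directory and a `snapshots/<revision>/` tree of symlinks, and the function name `check_repo_consistency` is hypothetical. It does not address partial downloads or locking.

```python
import os
from pathlib import Path
from typing import List, Set, Tuple


def check_repo_consistency(repo_path: Path) -> Tuple[Set[Path], List[Path]]:
    """Return (unreferenced blobs, dangling symlinks) for one cached repo folder.

    Assumes the layout <repo>/blobs/* and <repo>/snapshots/<revision>/**.
    """
    blobs_dir = repo_path / "blobs"
    snapshots_dir = repo_path / "snapshots"

    all_blobs: Set[Path] = set()
    if blobs_dir.is_dir():
        all_blobs = {p.resolve() for p in blobs_dir.iterdir() if p.is_file()}

    referenced: Set[Path] = set()
    dangling: List[Path] = []
    # os.walk lists broken symlinks among the files and does not follow them.
    for root, _dirs, files in os.walk(snapshots_dir):
        for name in files:
            path = Path(root) / name
            if not path.is_symlink():
                continue
            if path.exists():  # exists() follows the link: False if the blob is missing
                referenced.add(path.resolve())
            else:
                dangling.append(path)

    return all_blobs - referenced, sorted(dangling)
```

Running it over each repo folder of a large, old cache would show whether cases 1 and 2 occur in practice.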

@sgugger sgugger left a comment

LGTM! Thanks a lot for diving into this. Left a couple of comments but I don't feel strongly about anything.

@Wauplin Wauplin marked this pull request as draft August 12, 2022 06:11
@Wauplin Wauplin added help wanted Extra attention is needed discussion labels Aug 12, 2022
@osanseviero osanseviero left a comment

The approach looks good to me! Looking forward to the final PR :)

Wauplin commented Aug 23, 2022

@osanseviero If you have time, could you check out this branch, run python scan.py and copy-paste the output? I'd like to know if you get any errors. I ran some tests on my local cache, but it is a very recent and tiny one. It would be good to get feedback from someone who is more likely to have discrepancies in their cache.

If too complicated, I will update the PR with a proper CLI when I have time.

@osanseviero

I got the following error:

python scan.py
Error while scanning /home/osanseviero/.cache/huggingface/hub/models--bert-base-cased: Snapshots dir doesn't exist !
Datasets:

Models:
  osanseviero--wine-quality 268.0KiB (5 files)
    /home/osanseviero/.cache/huggingface/hub/models--osanseviero--wine-quality
    Revisions:
        974ee1d10a13d2cb802c5b9c26b504a9c58a09af: main
          268.0KiB (5 files)

Wauplin commented Aug 23, 2022

@osanseviero Thanks for trying it out!

I am not exactly sure what happened here (presumably a cache folder without a snapshots folder, but why?).
I will update the script to make it robust to errors (just log and ignore them) and add better error messages. I'll ping you when I have a better version to test.
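As an illustration of the "log and ignore" approach, a scan could collect one error per malformed repo folder and keep going. This is only a sketch under assumptions: `CorruptedCacheError`, `scan_repo` and `scan_cache_tolerant` are hypothetical names, and the per-repo result is reduced to a snapshot count to keep the example short.

```python
from pathlib import Path
from typing import Dict, List, Tuple


class CorruptedCacheError(Exception):
    """Hypothetical exception for a cached repo folder with an unexpected layout."""


def scan_repo(repo_path: Path) -> int:
    """Return the number of snapshots in a repo folder, or raise if it is malformed."""
    snapshots_dir = repo_path / "snapshots"
    if not snapshots_dir.is_dir():
        raise CorruptedCacheError(
            f"Snapshots dir doesn't exist in cached repo: {snapshots_dir}"
        )
    return sum(1 for p in snapshots_dir.iterdir() if p.is_dir())


def scan_cache_tolerant(cache_dir: Path) -> Tuple[Dict[str, int], List[CorruptedCacheError]]:
    """Scan every repo folder; collect errors instead of aborting on the first one."""
    results: Dict[str, int] = {}
    errors: List[CorruptedCacheError] = []
    for repo_path in sorted(cache_dir.iterdir()):
        if not repo_path.is_dir():
            continue
        try:
            results[repo_path.name] = scan_repo(repo_path)
        except CorruptedCacheError as err:
            errors.append(err)  # reported to the user at the end of the scan
    return results, errors
```

Returning the errors next to the results lets the caller decide how much detail to print, for example only a count by default.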

@julien-c julien-c left a comment

Left a few comments; looks generally good to me.

Another question/comment: maybe the CachedRepoInfo should also expose a dict of refs, or maybe a helper to get a CachedRevisionInfo from a ref

About the pruning, IMO we can keep it for later and/or leave it as the user's responsibility. It will be hard to find a single strategy that works well to decide what to prune, so it should maybe be up to the user to implement. For instance, you mention lastModified (of a file on disk?) but i have an old snapshot of gpt2 I use all the time... Or you have a lastFileAccess file stat on some OSes but not sure it's robust either...

Wauplin commented Aug 25, 2022

> About the pruning, IMO we can keep it for later and/or leave it as the user's responsibility. It will be hard to find a single strategy that works well to decide what to prune, so it should maybe be up to the user to implement. For instance, you mention lastModified (of a file on disk?) but i have an old snapshot of gpt2 I use all the time... Or you have a lastFileAccess file stat on some OSes but not sure it's robust either...

@julien-c Yep, I mentioned the lastUpdated date because it's the only robust-enough timestamp that I could think of. I saw it more as information to display to the user than as a value used to decide whether to delete a file.

In general, pruning/cache management will be done separately. My opinion is that we can have a helper to delete a specific revision from a specific repo_id. The helper deletes the blob files that are symlinked only from this revision, so that other revisions remain valid. In addition, it also deletes the refs that point to this revision. This helper is generic enough that we don't make any decisions for the user, who can reuse it to build their own high-level pruning strategy.

def delete_cached_revision(repo_id: str, repo_type: str, revision: str) -> None:
    ...

EDIT: added repo_type.
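The behaviour described for this helper can be sketched as follows. This is not the proposed implementation: to stay self-contained it takes the repo folder path instead of `repo_id` and `repo_type`, the name `delete_revision` is hypothetical, and it assumes the `blobs/`, `snapshots/<revision>/` and `refs/` layout, with each ref stored as a file containing a commit hash.

```python
import os
import shutil
from pathlib import Path
from typing import Set


def _blobs_of(snapshot_dir: Path) -> Set[Path]:
    """Resolve every valid symlink of a snapshot to its blob path."""
    blobs: Set[Path] = set()
    for root, _dirs, files in os.walk(snapshot_dir):
        for name in files:
            path = Path(root) / name
            if path.is_symlink() and path.exists():
                blobs.add(path.resolve())
    return blobs


def delete_revision(repo_path: Path, revision: str) -> None:
    """Delete one snapshot, the blobs only it uses, and the refs pointing to it."""
    snapshots_dir = repo_path / "snapshots"
    target = snapshots_dir / revision
    if not target.is_dir():
        raise ValueError(f"Revision not found in cache: {revision}")

    # Blobs referenced by any other revision must be kept.
    still_used: Set[Path] = set()
    for other in snapshots_dir.iterdir():
        if other.is_dir() and other.name != revision:
            still_used |= _blobs_of(other)

    for blob in _blobs_of(target) - still_used:
        blob.unlink()
    shutil.rmtree(target)  # removes the symlinks themselves, never their targets

    # Assumption: each file under refs/ contains the commit hash it points to.
    refs_dir = repo_path / "refs"
    if refs_dir.is_dir():
        for ref_file in list(refs_dir.rglob("*")):
            if ref_file.is_file() and ref_file.read_text().strip() == revision:
                ref_file.unlink()
```

A pruning strategy such as "keep only what `main` points to" could then be built by the user as a loop over revisions that calls this helper.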

Wauplin commented Aug 25, 2022

> Another question/comment: maybe the CachedRepoInfo should also expose a dict of refs, or maybe a helper to get a CachedRevisionInfo from a ref

Good idea, I've added a CachedRepoInfo.refs property that computes the mapping "ref -> revision info"
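For readers unfamiliar with the layout, such a mapping could be derived from the `refs/` folder roughly as below. This is a sketch, not the code of the property: the function names are hypothetical, it maps each ref to a snapshot folder rather than to a revision info object, and it assumes that each file under `refs/` holds a commit hash and that nested refs such as `refs/pr/1` are stored as nested paths.

```python
from pathlib import Path
from typing import Dict


def read_refs(repo_path: Path) -> Dict[str, str]:
    """Map each ref name (e.g. "main", "refs/pr/1") to the commit hash it contains."""
    refs_dir = repo_path / "refs"
    refs: Dict[str, str] = {}
    if refs_dir.is_dir():
        for ref_file in sorted(refs_dir.rglob("*")):
            if ref_file.is_file():
                ref_name = ref_file.relative_to(refs_dir).as_posix()
                refs[ref_name] = ref_file.read_text().strip()
    return refs


def refs_to_snapshots(repo_path: Path) -> Dict[str, Path]:
    """Map each ref to its snapshot folder, skipping refs whose snapshot is missing."""
    snapshots_dir = repo_path / "snapshots"
    mapping: Dict[str, Path] = {}
    for ref_name, commit_hash in read_refs(repo_path).items():
        snapshot = snapshots_dir / commit_hash
        if snapshot.is_dir():
            mapping[ref_name] = snapshot
    return mapping
```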

codecov bot commented Aug 25, 2022

Codecov Report

Merging #990 (e4a3478) into main (6173c38) will increase coverage by 0.70%.
The diff coverage is 91.54%.

@@            Coverage Diff             @@
##             main     #990      +/-   ##
==========================================
+ Coverage   81.45%   82.16%   +0.70%     
==========================================
  Files          31       35       +4     
  Lines        3419     3588     +169     
==========================================
+ Hits         2785     2948     +163     
- Misses        634      640       +6     
| Impacted Files | Coverage Δ |
|---|---|
| src/huggingface_hub/__init__.py | 78.37% <ø> (ø) |
| src/huggingface_hub/commands/huggingface_cli.py | 0.00% <0.00%> (ø) |
| src/huggingface_hub/commands/cache.py | 82.35% <82.35%> (ø) |
| src/huggingface_hub/utils/_cache_manager.py | 92.85% <92.85%> (ø) |
| src/huggingface_hub/_commit_api.py | 92.03% <100.00%> (-0.21%) ⬇️ |
| src/huggingface_hub/commands/_cli_utils.py | 100.00% <100.00%> (ø) |
| src/huggingface_hub/commands/user.py | 30.13% <100.00%> (-3.20%) ⬇️ |
| src/huggingface_hub/community.py | 91.35% <100.00%> (-0.31%) ⬇️ |
| src/huggingface_hub/hf_api.py | 86.84% <100.00%> (-0.05%) ⬇️ |
| src/huggingface_hub/utils/__init__.py | 100.00% <100.00%> (ø) |
| ... and 4 more | |

Wauplin commented Aug 25, 2022

I took the feedback into account, made some changes and added a proper command to the existing huggingface-cli tool. I also changed the printed output to something more tabular and grep-able. This is most likely the final version of this tool for this PR. If you have any feedback, please tell me.

In the meantime I will write documentation and docstrings for this tool.

➜ huggingface-cli scan-cache --help
usage: huggingface-cli <command> [<args>] scan-cache [-h] [--dir DIR] [-v]

options:
  -h, --help     show this help message and exit
  --dir DIR      cache directory to scan (optional). Default to the default HuggingFace cache.
  -v, --verbose  show a more verbose output
➜ huggingface-cli scan-cache       
REPO ID                     REPO TYPE SIZE ON DISK NB FILES REFS                LOCAL PATH                                                                
--------------------------- --------- ------------ -------- ------------------- ------------------------------------------------------------------------- 
glue                        dataset         116.3K       15 2.4.0, main, 1.17.0 /Users/lucain/.cache/huggingface/hub/datasets--glue                       
google/fleurs               dataset          64.9M        6 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs             
Jean-Baptiste/camembert-ner model           441.0M        7 main                /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner 
bert-base-cased             model             1.9G       13 main                /Users/lucain/.cache/huggingface/hub/models--bert-base-cased              
t5-base                     model            10.1K        3 main                /Users/lucain/.cache/huggingface/hub/models--t5-base                      
t5-small                    model           970.7M       11 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/models--t5-small                     

Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
Got 1 error(s) while scanning. Use -vvv to print details.
➜ huggingface-cli scan-cache -v
REPO ID                     REPO TYPE REVISION                                 SIZE ON DISK NB FILES REFS        LOCAL PATH                                                                                                                   
--------------------------- --------- ---------------------------------------- ------------ -------- ----------- ---------------------------------------------------------------------------------------------------------------------------- 
glue                        dataset   9338f7b671827df886678df2bdd7cc7b4f36dffd        97.7K       14 main, 2.4.0 /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/9338f7b671827df886678df2bdd7cc7b4f36dffd                       
glue                        dataset   f021ae41c879fcabcf823648ec685e3fead91fe7        97.8K       14 1.17.0      /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/f021ae41c879fcabcf823648ec685e3fead91fe7                       
google/fleurs               dataset   129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8        25.4K        3 refs/pr/1   /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8             
google/fleurs               dataset   24f85a01eb955224ca3946e70050869c56446805        64.9M        4 main        /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/24f85a01eb955224ca3946e70050869c56446805             
Jean-Baptiste/camembert-ner model     dbec8489a1c44ecad9da8a9185115bccabd799fe       441.0M        7 main        /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner/snapshots/dbec8489a1c44ecad9da8a9185115bccabd799fe 
bert-base-cased             model     378aa1bda6387fd00e824948ebe3488630ad8565         1.5G        9             /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/378aa1bda6387fd00e824948ebe3488630ad8565              
bert-base-cased             model     a8d257ba9925ef39f3036bfc338acf5283c512d9         1.4G        9 main        /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9              
t5-base                     model     23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9        10.1K        3 main        /Users/lucain/.cache/huggingface/hub/models--t5-base/snapshots/23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9                      
t5-small                    model     98ffebbb27340ec1b1abd7c45da12c253ee1882a       726.2M        6 refs/pr/1   /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a                     
t5-small                    model     d0a119eedb3718e34c648e594394474cf95e0617       485.8M        6             /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617                     
t5-small                    model     d78aea13fa7ecd06c29e3e46195d6341255065d5       970.7M        9 main        /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5                     

Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
Got 1 error(s) while scanning. Use -vvv to print details.
➜ huggingface-cli scan-cache -vvv
REPO ID                     REPO TYPE REVISION                                 SIZE ON DISK NB FILES REFS        LOCAL PATH                                                                                                                   
--------------------------- --------- ---------------------------------------- ------------ -------- ----------- ---------------------------------------------------------------------------------------------------------------------------- 
glue                        dataset   9338f7b671827df886678df2bdd7cc7b4f36dffd        97.7K       14 main, 2.4.0 /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/9338f7b671827df886678df2bdd7cc7b4f36dffd                       
glue                        dataset   f021ae41c879fcabcf823648ec685e3fead91fe7        97.8K       14 1.17.0      /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/f021ae41c879fcabcf823648ec685e3fead91fe7                       
google/fleurs               dataset   129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8        25.4K        3 refs/pr/1   /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8             
google/fleurs               dataset   24f85a01eb955224ca3946e70050869c56446805        64.9M        4 main        /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/24f85a01eb955224ca3946e70050869c56446805             
Jean-Baptiste/camembert-ner model     dbec8489a1c44ecad9da8a9185115bccabd799fe       441.0M        7 main        /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner/snapshots/dbec8489a1c44ecad9da8a9185115bccabd799fe 
bert-base-cased             model     378aa1bda6387fd00e824948ebe3488630ad8565         1.5G        9             /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/378aa1bda6387fd00e824948ebe3488630ad8565              
bert-base-cased             model     a8d257ba9925ef39f3036bfc338acf5283c512d9         1.4G        9 main        /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9              
t5-base                     model     23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9        10.1K        3 main        /Users/lucain/.cache/huggingface/hub/models--t5-base/snapshots/23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9                      
t5-small                    model     98ffebbb27340ec1b1abd7c45da12c253ee1882a       726.2M        6 refs/pr/1   /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a                     
t5-small                    model     d0a119eedb3718e34c648e594394474cf95e0617       485.8M        6             /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617                     
t5-small                    model     d78aea13fa7ecd06c29e3e46195d6341255065d5       970.7M        9 main        /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5                     

Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
Got 1 error(s) while scanning.
Snapshots dir doesn't exist in cached repo: /Users/lucain/.cache/huggingface/hub/models--tests--fixtures--working_repo_2--FROM_PRETRAINED/snapshots
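A tabular, grep-able output like the one above can be produced with very little code. The sketch below is illustrative only: `format_size` and `tabulate` are hypothetical helpers, the size unit base (1000) is an assumption since the output does not say whether "K" means 1000 or 1024 bytes, and all columns are left-aligned for simplicity.

```python
from typing import List, Sequence


def format_size(num_bytes: float) -> str:
    """Format a byte count with one decimal, e.g. 116300 -> "116.3K" (base 1000 assumed)."""
    for unit in ("", "K", "M", "G", "T"):
        if abs(num_bytes) < 1000.0:
            return f"{num_bytes:.1f}{unit}"
        num_bytes /= 1000.0
    return f"{num_bytes:.1f}P"


def tabulate(rows: Sequence[Sequence[object]], headers: Sequence[str]) -> str:
    """Render rows as space-separated columns padded to the widest cell."""
    str_rows: List[List[str]] = [[str(cell) for cell in row] for row in rows]
    widths = [max(len(cell) for cell in column) for column in zip(headers, *str_rows)]
    line_format = " ".join("{:<" + str(width) + "}" for width in widths)
    lines = [
        line_format.format(*headers),
        line_format.format(*("-" * width for width in widths)),
    ]
    lines += [line_format.format(*row) for row in str_rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(tabulate([["glue", "dataset", format_size(116300)]],
                   ["REPO ID", "REPO TYPE", "SIZE ON DISK"]))
```

Keeping one record per line with fixed column order is what makes the output easy to filter with tools such as grep.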

@Wauplin Wauplin removed help wanted Extra attention is needed discussion labels Aug 26, 2022
@Wauplin Wauplin changed the title [WIP] First proposal for a utility to scan cache Feature: add an utility to scan cache Aug 26, 2022
@Wauplin Wauplin marked this pull request as ready for review August 26, 2022 14:09
@Wauplin Wauplin added this to the v0.10 milestone Aug 26, 2022
@julien-c julien-c left a comment

Seems pretty solid to me, approving, but please get another reviewer's feedback before merge?

@LysandreJik LysandreJik self-requested a review August 30, 2022 09:23
@osanseviero osanseviero left a comment

Very cool and excellent docs! Thanks for this 🔥 (I didn't review all the code in depth yet, mostly tests, usage and documentation)

@LysandreJik LysandreJik left a comment

Very impressive PR! Codecov is complaining about a few internal methods that aren't tested. Not super important, but I would also aim for coverage of these methods.

Wauplin and others added 7 commits August 30, 2022 13:40
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Wauplin commented Aug 30, 2022

Thanks @LysandreJik and @osanseviero for your review and feedback!
There are still two small comments left from Lysandre but apart from that I think we are good to merge.

EDIT: I'll add some tests as suggested by Lysandre :)

@Wauplin Wauplin merged commit 48ddc62 into main Aug 30, 2022
@Wauplin Wauplin deleted the 972-utility-to-list-cache branch August 30, 2022 14:25

7 participants