Feature: add an utility to scan cache (#990)
* WIP utility to scan cache
* add example in scan.py
* code quality
* property in CachedRepoInfo + typing fixes
* changes from feedback
* rename to private util module
* remove scan.py script and make proper CLI
* review CLI help
* remove unused colors
* start documentation
* add file to doctree
* fix snippets ?
* try generating doc from docstring
* doc
* refacto to frozen dataclasses
* always more doc
* always more docs
* forgotten line
* finalize doc
* add text to snippet type
* test cache scanner
* mypy
* siort
* add tests for CLI
* fix cli tests
* Update docs/source/how-to-cache.mdx Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Update docs/source/how-to-cache.mdx Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Update docs/source/how-to-cache.mdx Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Add valuerror is cache dir is missing
* update doc
* Change from errors to warnings
* Update src/huggingface_hub/commands/_cli_utils.py Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
* test cli utils
* Test scan cache cli initialization
* make style
* typing

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
1 parent 7b57719, commit 48ddc62. Showing 18 changed files with 1,442 additions and 166 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,240 @@ | ||
# Manage `huggingface_hub` cache-system | ||
|
||
## Understand caching | ||
|
||
The Hugging Face Hub cache-system is designed to be the central cache shared across libraries | |
that depend on the Hub. It was updated in v0.8.0 to prevent re-downloading the same files | |
between revisions. | |
|
||
The caching system is designed as follows: | ||
|
||
``` | ||
<CACHE_DIR> | ||
├─ <MODELS> | ||
├─ <DATASETS> | ||
├─ <SPACES> | ||
``` | ||
|
||
The `<CACHE_DIR>` is usually located in your user's home directory (`~/.cache/huggingface/hub` in the examples below). However, it is customizable with the | |
`cache_dir` argument on all methods, or by setting either the `HF_HOME` or | |
`HUGGINGFACE_HUB_CACHE` environment variable. | |
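
The lookup described above can be sketched as follows. This is only an illustration, not the library's actual code: the helper name `resolve_cache_dir`, the precedence between the two environment variables, the `hub` subfolder under `HF_HOME`, and the default path are all assumptions (the default is inferred from the paths shown in the scan reports).

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None, cache_dir=None):
    """Return the cache directory to use (hypothetical helper, assumed precedence)."""
    env = os.environ if env is None else env
    if cache_dir is not None:
        # An explicit `cache_dir` argument takes priority over the environment.
        return Path(cache_dir)
    if env.get("HUGGINGFACE_HUB_CACHE"):
        return Path(env["HUGGINGFACE_HUB_CACHE"])
    if env.get("HF_HOME"):
        # Assumption: the Hub cache lives in a "hub" subfolder of HF_HOME.
        return Path(env["HF_HOME"]) / "hub"
    # Assumed default, matching the paths printed by the scan reports.
    return Path.home() / ".cache" / "huggingface" / "hub"
```

Passing the environment as a plain dictionary keeps the sketch easy to try without touching your real settings.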
|
||
Models, datasets and spaces share a common root. Each repository folder is named after the | |
repository type, the namespace (organization or username) if it exists, and the | |
repository name: | |
|
||
``` | ||
<CACHE_DIR> | ||
├─ models--julien-c--EsperBERTo-small | ||
├─ models--lysandrejik--arxiv-nlp | ||
├─ models--bert-base-cased | ||
├─ datasets--glue | ||
├─ datasets--huggingface--DataMeasurementsFiles | ||
├─ spaces--dalle-mini--dalle-mini | ||
``` | ||
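
The naming scheme shown in this listing can be reproduced with a few lines of Python. The two helpers below are hypothetical names written for this illustration; they only mirror the folder names listed above and are not presented as the library's own functions.

```python
def repo_folder_name(repo_id: str, repo_type: str = "model") -> str:
    """Build a cache folder name such as 'models--julien-c--EsperBERTo-small'."""
    # The repo type is pluralized and "/" in the repo id becomes "--".
    return f"{repo_type}s--" + repo_id.replace("/", "--")


def parse_repo_folder_name(folder_name: str):
    """Split a cache folder name back into (repo_id, repo_type)."""
    repo_type_plural, _, flat_id = folder_name.partition("--")
    # Drop the trailing "s" of the type and restore the "/" of the namespace.
    return flat_id.replace("--", "/"), repo_type_plural[:-1]
```

Note that the reverse mapping assumes repository names never contain `--` themselves.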
|
||
It is within these folders that all files will now be downloaded from the Hub. Caching ensures that | ||
a file isn't downloaded twice if it already exists and wasn't updated; but if it was updated, | ||
and you're asking for the latest file, then it will download the latest file (while keeping | ||
the previous file intact in case you need it again). | ||
|
||
In order to achieve this, all folders contain the same skeleton: | ||
|
||
``` | ||
<CACHE_DIR> | ||
├─ datasets--glue | ||
│ ├─ refs | ||
│ ├─ blobs | ||
│ ├─ snapshots | ||
... | ||
``` | ||
|
||
Each folder is designed to contain the following: | ||
|
||
### Refs | ||
|
||
The `refs` folder contains files which indicate the latest revision of the given reference. For example, | |
if we have previously fetched a file from the `main` branch of a repository, the `refs` | ||
folder will contain a file named `main`, which will itself contain the commit identifier of the current head. | ||
|
||
If the latest commit of `main` has `aaaaaa` as identifier, then it will contain `aaaaaa`. | ||
|
||
If that same branch gets updated with a new commit, that has `bbbbbb` as an identifier, then | ||
re-downloading a file from that reference will update the `refs/main` file to contain `bbbbbb`. | ||
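
As a rough sketch of this mechanism, the snippet below writes and reads a ref file inside a temporary folder. The helper names `read_ref` and `write_ref` are invented for the example, and the folder is a throwaway directory rather than your real cache.

```python
import tempfile
from pathlib import Path


def read_ref(repo_dir: Path, ref: str = "main") -> str:
    """Return the commit identifier stored in `refs/<ref>`."""
    return (repo_dir / "refs" / ref).read_text().strip()


def write_ref(repo_dir: Path, ref: str, commit_hash: str) -> None:
    """Point `refs/<ref>` at `commit_hash`, creating folders as needed."""
    ref_path = repo_dir / "refs" / ref
    ref_path.parent.mkdir(parents=True, exist_ok=True)
    ref_path.write_text(commit_hash)


# Throwaway repo folder used only for this demonstration.
repo_dir = Path(tempfile.mkdtemp()) / "datasets--glue"
write_ref(repo_dir, "main", "aaaaaa")   # first download from `main`
first_head = read_ref(repo_dir)
write_ref(repo_dir, "main", "bbbbbb")   # `main` moved to a new commit
second_head = read_ref(repo_dir)
```

After the second write, `refs/main` holds `bbbbbb`, which matches the update described above.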
|
||
### Blobs | ||
|
||
The `blobs` folder contains the actual files that we have downloaded. Each file is named after its hash. | |
|
||
### Snapshots | ||
|
||
The `snapshots` folder contains symlinks to the blobs mentioned above. It is itself made up of several folders: | ||
one per known revision! | ||
|
||
In the explanation above, we had initially fetched a file from the `aaaaaa` revision, before fetching a file from | ||
the `bbbbbb` revision. In this situation, we would now have two folders in the `snapshots` folder: `aaaaaa` | ||
and `bbbbbb`. | ||
|
||
Each of these folders contains symlinks named after the files that we have downloaded. For example, | |
if we had downloaded the `README.md` file at revision `aaaaaa`, we would have the following path: | ||
|
||
``` | ||
<CACHE_DIR>/<REPO_NAME>/snapshots/aaaaaa/README.md | ||
``` | ||
|
||
That `README.md` file is actually a symlink linking to the blob that has the hash of the file. | ||
|
||
By creating the skeleton this way we open the mechanism to file sharing: if the same file was fetched in | ||
revision `bbbbbb`, it would have the same hash and the file would not need to be re-downloaded. | ||
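
To make the sharing concrete, here is a minimal sketch that builds such a skeleton in a temporary folder. It is an illustration under stated assumptions: `add_file` is a hypothetical helper, the blob name is a SHA-256 of the content (the hash actually used by the Hub may differ), and symlinks must be supported by your file system.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def add_file(repo_dir: Path, revision: str, file_name: str, content: bytes) -> Path:
    """Store `content` as a blob (only if new) and symlink it from a snapshot."""
    # Assumption: the blob is named after the SHA-256 of its content.
    blob_name = hashlib.sha256(content).hexdigest()
    blob_path = repo_dir / "blobs" / blob_name
    blob_path.parent.mkdir(parents=True, exist_ok=True)
    if not blob_path.exists():
        blob_path.write_bytes(content)  # "download" happens only once per hash
    link_path = repo_dir / "snapshots" / revision / file_name
    link_path.parent.mkdir(parents=True, exist_ok=True)
    # Relative link, like `README.md -> ../../blobs/<hash>` in a real cache.
    os.symlink(Path("..") / ".." / "blobs" / blob_name, link_path)
    return link_path


# Throwaway repo: the same README content is fetched at two revisions.
repo_dir = Path(tempfile.mkdtemp()) / "models--demo"
readme_a = add_file(repo_dir, "aaaaaa", "README.md", b"hello")
readme_b = add_file(repo_dir, "bbbbbb", "README.md", b"hello")
blob_count = len(list((repo_dir / "blobs").iterdir()))
```

Both snapshot entries resolve to the same blob, so only one copy of the content exists on disk.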
|
||
### In practice | ||
|
||
In practice, your cache should look like the following tree: | ||
|
||
```text | ||
[  96]  . | |
└── [ 160]  models--julien-c--EsperBERTo-small | |
    ├── [ 160]  blobs | |
    │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd | |
    │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e | |
    │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812 | |
    ├── [  96]  refs | |
    │   └── [  40]  main | |
    └── [ 128]  snapshots | |
        ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f | |
        │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812 | |
        │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd | |
        └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48 | |
            ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e | |
            └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd | |
``` | ||
|
||
## Scan your cache | ||
|
||
At the moment, cached files are never deleted from your local directory: when you download | ||
a new revision of a branch, previous files are kept in case you need them again. | ||
Therefore it can be useful to scan your cache directory in order to know which repos | ||
and revisions are taking the most disk space. `huggingface_hub` provides a helper to | |
do so, which can be used via `huggingface-cli` or in a Python script. | |
|
||
### From the terminal | ||
|
||
The easiest way to scan your HF cache-system is to use the `scan-cache` command from | |
the `huggingface-cli` tool. This command scans the cache and prints a report with information | |
like repo id, repo type, disk usage, refs and full local path. | ||
|
||
The snippet below shows a scan report in a folder in which 4 models and 2 datasets are | ||
cached. | ||
|
||
```text | ||
➜ huggingface-cli scan-cache | ||
REPO ID                     REPO TYPE SIZE ON DISK NB FILES REFS                LOCAL PATH | |
--------------------------- --------- ------------ -------- ------------------- ------------------------------------------------------------------------- | |
glue                        dataset         116.3K       15 2.4.0, main, 1.17.0 /Users/lucain/.cache/huggingface/hub/datasets--glue | |
google/fleurs               dataset          64.9M        6 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs | |
Jean-Baptiste/camembert-ner model           441.0M        7 main                /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner | |
bert-base-cased             model             1.9G       13 main                /Users/lucain/.cache/huggingface/hub/models--bert-base-cased | |
t5-base                     model            10.1K        3 main                /Users/lucain/.cache/huggingface/hub/models--t5-base | |
t5-small                    model           970.7M       11 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/models--t5-small | |
Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G. | ||
Got 1 warning(s) while scanning. Use -vvv to print details. | ||
``` | ||
|
||
To get a more detailed report, use the `--verbose` option. For each repo, you get a | ||
list of all revisions that have been downloaded. As explained above, the files that don't | ||
change between 2 revisions are shared thanks to the symlinks. This means that the size of | ||
the repo on disk is expected to be less than the sum of the size of each of its revisions. | ||
For example, here `bert-base-cased` has 2 revisions of 1.4G and 1.5G but the total disk | ||
usage is only 1.9G. | ||
|
||
```text | ||
➜ huggingface-cli scan-cache -v | ||
REPO ID                     REPO TYPE REVISION                                 SIZE ON DISK NB FILES REFS        LOCAL PATH | |
--------------------------- --------- ---------------------------------------- ------------ -------- ----------- ---------------------------------------------------------------------------------------------------------------------------- | |
glue                        dataset   9338f7b671827df886678df2bdd7cc7b4f36dffd        97.7K       14 main, 2.4.0 /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/9338f7b671827df886678df2bdd7cc7b4f36dffd | |
glue                        dataset   f021ae41c879fcabcf823648ec685e3fead91fe7        97.8K       14 1.17.0      /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/f021ae41c879fcabcf823648ec685e3fead91fe7 | |
google/fleurs               dataset   129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8        25.4K        3 refs/pr/1   /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8 | |
google/fleurs               dataset   24f85a01eb955224ca3946e70050869c56446805        64.9M        4 main        /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/24f85a01eb955224ca3946e70050869c56446805 | |
Jean-Baptiste/camembert-ner model     dbec8489a1c44ecad9da8a9185115bccabd799fe       441.0M        7 main        /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner/snapshots/dbec8489a1c44ecad9da8a9185115bccabd799fe | |
bert-base-cased             model     378aa1bda6387fd00e824948ebe3488630ad8565         1.5G        9             /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/378aa1bda6387fd00e824948ebe3488630ad8565 | |
bert-base-cased             model     a8d257ba9925ef39f3036bfc338acf5283c512d9         1.4G        9 main        /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9 | |
t5-base                     model     23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9        10.1K        3 main        /Users/lucain/.cache/huggingface/hub/models--t5-base/snapshots/23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9 | |
t5-small                    model     98ffebbb27340ec1b1abd7c45da12c253ee1882a       726.2M        6 refs/pr/1   /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a | |
t5-small                    model     d0a119eedb3718e34c648e594394474cf95e0617       485.8M        6             /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617 | |
t5-small                    model     d78aea13fa7ecd06c29e3e46195d6341255065d5       970.7M        9 main        /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5 | |
Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G. | ||
Got 1 warning(s) while scanning. Use -vvv to print details. | ||
``` | ||
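
The reason a repo can be smaller than the sum of its revisions can be sketched with plain dictionaries. The function names and all hashes and sizes below are made up for the illustration; the only idea taken from the text is that a blob shared by several revisions is stored once.

```python
def revision_size(files: dict) -> int:
    """Size of one revision: every file it references counts in full."""
    return sum(files.values())


def repo_size_on_disk(revisions: dict) -> int:
    """Size of the repo: each blob is counted once, however many revisions use it."""
    unique_blobs = {}
    for files in revisions.values():
        unique_blobs.update(files)  # same blob hash -> same entry
    return sum(unique_blobs.values())


# Hypothetical data: {commit_hash: {blob_hash: size_in_bytes}}.
revisions = {
    "aaaaaa": {"blob_weights": 1000, "blob_readme_v1": 10},
    "bbbbbb": {"blob_weights": 1000, "blob_readme_v2": 12},
}
sum_of_revisions = sum(revision_size(files) for files in revisions.values())
on_disk = repo_size_on_disk(revisions)
```

Here the two revisions add up to 2022 bytes while the repo only occupies 1022 bytes, because the weights blob is shared.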
|
||
#### Grep example | ||
|
||
Since the output is in tabular format, you can combine it with any `grep`-like tool to | |
filter the entries. Here is an example that keeps only the revisions of the "t5-small" | |
model on a Unix-based machine. | |
|
||
```text | ||
➜ eval "huggingface-cli scan-cache -v" | grep "t5-small" | ||
t5-small                    model     98ffebbb27340ec1b1abd7c45da12c253ee1882a       726.2M        6 refs/pr/1   /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a | |
t5-small                    model     d0a119eedb3718e34c648e594394474cf95e0617       485.8M        6             /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617 | |
t5-small                    model     d78aea13fa7ecd06c29e3e46195d6341255065d5       970.7M        9 main        /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5 | |
``` | ||
|
||
### From Python | ||
|
||
For more advanced usage, use [`scan_cache_dir`], the Python utility called by | |
the CLI tool. | |
|
||
You can use it to get a detailed report structured around 4 dataclasses: | ||
|
||
- [`HFCacheInfo`]: complete report returned by [`scan_cache_dir`] | ||
- [`CachedRepoInfo`]: information about a cached repo | ||
- [`CachedRevisionInfo`]: information about a cached revision (i.e. a "snapshot") inside a repo | |
- [`CachedFileInfo`]: information about a cached file in a snapshot | ||
|
||
Here is a simple usage example. See the reference for details. | |
|
||
```py | ||
>>> from huggingface_hub import scan_cache_dir | ||
|
||
>>> hf_cache_info = scan_cache_dir() | |
>>> hf_cache_info | |
HFCacheInfo( | |
    size_on_disk=3398085269, | |
    repos=frozenset({ | |
        CachedRepoInfo( | |
            repo_id='t5-small', | |
            repo_type='model', | |
            repo_path=PosixPath(...), | |
            size_on_disk=970726914, | |
            nb_files=11, | |
            revisions=frozenset({ | |
                CachedRevisionInfo( | |
                    commit_hash='d78aea13fa7ecd06c29e3e46195d6341255065d5', | |
                    size_on_disk=970726339, | |
                    snapshot_path=PosixPath(...), | |
                    files=frozenset({ | |
                        CachedFileInfo( | |
                            file_name='config.json', | |
                            size_on_disk=1197, | |
                            file_path=PosixPath(...), | |
                            blob_path=PosixPath(...), | |
                        ), | |
                        CachedFileInfo(...), | |
                        ... | |
                    }), | |
                ), | |
                CachedRevisionInfo(...), | |
                ... | |
            }), | |
        ), | |
        CachedRepoInfo(...), | |
        ... | |
    }), | |
    warnings=[ | |
        CorruptedCacheException("Snapshots dir doesn't exist in cached repo: ..."), | |
        CorruptedCacheException(...), | |
        ... | |
    ], | |
) | |
``` |
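
As a possible follow-up, the report can be used to find the repos that take the most disk space. The sketch below relies only on the `repos`, `repo_id`, `repo_type` and `size_on_disk` attributes shown in the output above; `largest_repos` and the `Fake*` stand-in classes are hypothetical names, used so the example runs without a real cache, and the sizes other than the `t5-small` one are invented.

```python
from dataclasses import dataclass


def largest_repos(cache_info, top_n=3):
    """Return (repo_id, repo_type, size_on_disk) for the biggest cached repos."""
    repos = sorted(cache_info.repos, key=lambda repo: repo.size_on_disk, reverse=True)
    return [(repo.repo_id, repo.repo_type, repo.size_on_disk) for repo in repos[:top_n]]


# Stand-ins mimicking the attributes of the report (hypothetical classes).
@dataclass(frozen=True)
class FakeRepo:
    repo_id: str
    repo_type: str
    size_on_disk: int


@dataclass(frozen=True)
class FakeCacheInfo:
    repos: frozenset


fake_report = FakeCacheInfo(
    repos=frozenset({
        FakeRepo("t5-small", "model", 970726914),
        FakeRepo("t5-base", "model", 10100),   # made-up size
        FakeRepo("glue", "dataset", 116300),   # made-up size
    })
)
top_two = largest_repos(fake_report, top_n=2)

# With `huggingface_hub` installed, the same helper should accept the real report:
# from huggingface_hub import scan_cache_dir
# print(largest_repos(scan_cache_dir()))
```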
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# Cache-system reference | ||
|
||
The caching system was updated in v0.8.0 to become the central cache-system shared | ||
across libraries that depend on the Hub. Read the [cache-system guide](../how-to-cache) | ||
for a detailed presentation of caching at HF. | ||
|
||
## Helpers | ||
|
||
### scan_cache_dir | ||
|
||
[[autodoc]] huggingface_hub.scan_cache_dir | ||
|
||
## Data structures | ||
|
||
All structures are built and returned by [`scan_cache_dir`] and are immutable. | ||
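
What immutability means in practice can be shown with a small stand-in. The commit message mentions a refactoring to frozen dataclasses, so the sketch assumes that mechanism; `DemoFileInfo` is a hypothetical class written for this example, not one of the structures documented below.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class DemoFileInfo:
    """Hypothetical stand-in with two of the fields shown in the guide."""
    file_name: str
    size_on_disk: int


info = DemoFileInfo(file_name="config.json", size_on_disk=1197)

try:
    info.size_on_disk = 0  # attempts to mutate a frozen dataclass are rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

# Frozen dataclasses are hashable, so equal instances collapse inside a set.
unique_infos = {info, DemoFileInfo(file_name="config.json", size_on_disk=1197)}
```

Being hashable is what allows such objects to be stored in a `frozenset`, as in the report printed by the guide.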
|
||
### HFCacheInfo | ||
|
||
[[autodoc]] huggingface_hub.HFCacheInfo | ||
|
||
### CachedRepoInfo | ||
|
||
[[autodoc]] huggingface_hub.CachedRepoInfo | ||
- size_on_disk_str | ||
- refs | ||
|
||
### CachedRevisionInfo | ||
|
||
[[autodoc]] huggingface_hub.CachedRevisionInfo | ||
- size_on_disk_str | ||
- nb_files | ||
|
||
### CachedFileInfo | ||
|
||
[[autodoc]] huggingface_hub.CachedFileInfo | ||
- size_on_disk_str | ||
|
||
## Exceptions | ||
|
||
### CorruptedCacheException | ||
|
||
[[autodoc]] huggingface_hub.CorruptedCacheException |