Feature: add an utility to scan cache #990

Merged
merged 39 commits into from
Aug 30, 2022
39 commits
df1c40f
WIP utility to scan cache
Wauplin Aug 11, 2022
f09758e
add example in scan.py
Wauplin Aug 11, 2022
0012b9f
code quality
Wauplin Aug 11, 2022
f5a1a49
tMerge branch 'main' into 972-utility-to-list-cache
Wauplin Aug 23, 2022
5dfad62
property in CachedRepoInfo + typing fixes
Wauplin Aug 25, 2022
ebd3eb0
changes from feedback
Wauplin Aug 25, 2022
fdfab6d
rename to private util module
Wauplin Aug 25, 2022
48a4f1b
remove scan.py script and make proper CLI
Wauplin Aug 25, 2022
855faa2
review CLI help
Wauplin Aug 25, 2022
c17ead6
remove unused colors
Wauplin Aug 25, 2022
fff9134
start documentation
Wauplin Aug 25, 2022
4b167f3
add file to doctree
Wauplin Aug 25, 2022
787b2e8
fix snippets ?
Wauplin Aug 25, 2022
f569eec
try generating doc from docstring
Wauplin Aug 26, 2022
a9cc5e3
doc
Wauplin Aug 26, 2022
f1564ca
refacto to frozen dataclasses
Wauplin Aug 26, 2022
256c111
always more doc
Wauplin Aug 26, 2022
2003ad9
always more docs
Wauplin Aug 26, 2022
15c1b46
forgotten line
Wauplin Aug 26, 2022
0602e03
finalize doc
Wauplin Aug 26, 2022
b5af99e
add text to snippet type
Wauplin Aug 26, 2022
2f3ee6d
test cache scanner
Wauplin Aug 26, 2022
f325e45
mypy
Wauplin Aug 26, 2022
78f1124
siort
Wauplin Aug 26, 2022
792fb1a
add tests for CLI
Wauplin Aug 26, 2022
f92c92b
fix cli tests
Wauplin Aug 26, 2022
d1343eb
Merge branch 'main' into 972-utility-to-list-cache
Wauplin Aug 26, 2022
94e8c07
Update docs/source/how-to-cache.mdx
Wauplin Aug 30, 2022
00f3815
Update docs/source/how-to-cache.mdx
Wauplin Aug 30, 2022
26534e3
Update docs/source/how-to-cache.mdx
Wauplin Aug 30, 2022
f86a7e3
Add valuerror is cache dir is missing
Wauplin Aug 30, 2022
8a0f37e
update doc
Wauplin Aug 30, 2022
72eb799
Change from errors to warnings
Wauplin Aug 30, 2022
c9a357a
Update src/huggingface_hub/commands/_cli_utils.py
Wauplin Aug 30, 2022
2c1163f
test cli utils
Wauplin Aug 30, 2022
1a435e4
Test scan cache cli initialization
Wauplin Aug 30, 2022
434b9bc
Merge branch '972-utility-to-list-cache' of github.com:huggingface/hu…
Wauplin Aug 30, 2022
6511094
make style
Wauplin Aug 30, 2022
e4a3478
typing
Wauplin Aug 30, 2022
6 changes: 5 additions & 1 deletion docs/source/_toctree.yml
@@ -16,7 +16,9 @@
- local: how-to-inference
title: Access the Inference API
- local: how-to-discussions-and-pull-requests
title: Interact with Discussions and Pull Requests
title: Interact with Discussions and Pull Requests
- local: how-to-cache
title: Manage the Cache
title: "Guides"
- sections:
- local: package_reference/repository
@@ -33,4 +35,6 @@
title: Utilities
- local: package_reference/community
title: Discussions and Pull Requests
- local: package_reference/cache
title: Cache-system reference
title: "Reference"
240 changes: 240 additions & 0 deletions docs/source/how-to-cache.mdx
@@ -0,0 +1,240 @@
# Manage `huggingface_hub` cache-system

## Understand caching

The Hugging Face Hub cache-system is designed to be the central cache shared across libraries
that depend on the Hub. It was updated in v0.8.0 to avoid re-downloading the same files
between revisions.

The caching system is designed as follows:

```
<CACHE_DIR>
├─ <MODELS>
├─ <DATASETS>
├─ <SPACES>
```

The `<CACHE_DIR>` is usually located under your user's home directory. It can be customized
with the `cache_dir` argument on all methods, or by setting either the `HF_HOME` or the
`HUGGINGFACE_HUB_CACHE` environment variable.
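
As a rough sketch of how such a lookup could work, the helper below picks a cache directory from an explicit argument or from the environment. The function name `resolve_cache_dir`, the precedence (`HUGGINGFACE_HUB_CACHE` first, then `HF_HOME` with a `hub` subfolder) and the `~/.cache/huggingface/hub` fallback are assumptions made for illustration; they are not copied from the library's internals.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(
    cache_dir: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Illustrative only: pick a cache directory from an argument or the environment."""
    # 1. An explicit `cache_dir` argument wins over everything else.
    if cache_dir is not None:
        return Path(cache_dir).expanduser()
    # 2. Assumed precedence: the most specific variable is checked first.
    if env.get("HUGGINGFACE_HUB_CACHE"):
        return Path(env["HUGGINGFACE_HUB_CACHE"]).expanduser()
    # 3. Assumption: `HF_HOME` is a parent folder and the Hub cache lives in `hub/`.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # 4. Assumed default location under the user's home directory.
    return Path.home() / ".cache" / "huggingface" / "hub"
```

Passing the environment as a parameter keeps the sketch easy to try with different settings without touching the real environment.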

Models, datasets and spaces share a common root. The name of each repository folder is built
from the repository type, the namespace (organization or username) if it exists, and the
repository name:

```
<CACHE_DIR>
├─ models--julien-c--EsperBERTo-small
├─ models--lysandrejik--arxiv-nlp
├─ models--bert-base-cased
├─ datasets--glue
├─ datasets--huggingface--DataMeasurementsFiles
├─ spaces--dalle-mini--dalle-mini
```
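
The naming convention above can be sketched in a few lines. The two helpers below (`repo_folder_name` and `parse_repo_folder_name`) are hypothetical names introduced for this illustration. They only mirror the pattern visible in the listing: a pluralized repo type, then the namespace and the name, joined with `--`.

```python
from typing import Tuple


def repo_folder_name(repo_id: str, repo_type: str = "model") -> str:
    """Build a cache folder name such as `models--julien-c--EsperBERTo-small`."""
    # Pluralized repo type, then namespace (if any) and name, all joined with "--".
    return "--".join([f"{repo_type}s", *repo_id.split("/")])


def parse_repo_folder_name(folder_name: str) -> Tuple[str, str]:
    """Inverse of `repo_folder_name`: return `(repo_type, repo_id)`."""
    repo_type_plural, _, rest = folder_name.partition("--")
    # Drop the trailing "s" and restore the "/" between namespace and name.
    return repo_type_plural[:-1], rest.replace("--", "/")
```

Note that the parsing direction is only a sketch: a repository whose own name contains `--` would not round-trip with this simple approach.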

It is within these folders that all files will now be downloaded from the Hub. Caching ensures that
a file isn't downloaded twice if it already exists and wasn't updated; but if it was updated,
and you're asking for the latest file, then it will download the latest file (while keeping
the previous file intact in case you need it again).

In order to achieve this, all folders contain the same skeleton:

```
<CACHE_DIR>
├─ datasets--glue
│ ├─ refs
│ ├─ blobs
│ ├─ snapshots
...
```

Each folder is designed to contain the following:

### Refs

The `refs` folder contains files which indicate the latest revision of the given reference. For example,
if we have previously fetched a file from the `main` branch of a repository, the `refs`
folder will contain a file named `main`, which will itself contain the commit identifier of the current head.

If the latest commit of `main` has `aaaaaa` as identifier, then it will contain `aaaaaa`.

If that same branch gets updated with a new commit, that has `bbbbbb` as an identifier, then
re-downloading a file from that reference will update the `refs/main` file to contain `bbbbbb`.
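
The behaviour described above can be reproduced on a throwaway folder. In this minimal sketch, `read_ref` is a hypothetical helper (not a library function), and the `aaaaaa`/`bbbbbb` identifiers are the placeholders used in this section.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_ref(repo_folder: Path, ref: str = "main") -> Optional[str]:
    """Return the commit identifier stored in `refs/<ref>`, or None if unknown."""
    ref_file = repo_folder / "refs" / ref
    if not ref_file.is_file():
        return None  # this reference was never fetched
    return ref_file.read_text().strip()


# Demo on a temporary folder that mimics one cached repo.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "datasets--glue"
    (repo / "refs").mkdir(parents=True)

    (repo / "refs" / "main").write_text("aaaaaa")
    before = read_ref(repo)

    # Simulate fetching a file after `main` moved to a new commit.
    (repo / "refs" / "main").write_text("bbbbbb")
    after = read_ref(repo)

    missing = read_ref(repo, "dev")  # no such ref in this fake cache
```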

### Blobs

The `blobs` folder contains the actual files that we have downloaded. Each file is named after its hash.

### Snapshots

The `snapshots` folder contains symlinks to the blobs mentioned above. It is itself made up of several folders:
one per known revision!

In the explanation above, we had initially fetched a file from the `aaaaaa` revision, before fetching a file from
the `bbbbbb` revision. In this situation, we would now have two folders in the `snapshots` folder: `aaaaaa`
and `bbbbbb`.

Each of these folders contains symlinks named after the files that we have downloaded. For example,
if we had downloaded the `README.md` file at revision `aaaaaa`, we would have the following path:

```
<CACHE_DIR>/<REPO_NAME>/snapshots/aaaaaa/README.md
```

That `README.md` file is actually a symlink linking to the blob that has the hash of the file.

Structuring the cache this way enables file sharing between revisions: if the same file was fetched in
revision `bbbbbb`, it would have the same hash and the file would not need to be re-downloaded.
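
To make the sharing mechanism concrete, here is a small sketch that builds a fake repo folder with one blob and two snapshots, then follows the symlinks. The folder name `models--demo`, the blob name `hash1` and the helper `blob_for` are made up for the example. It also assumes a platform where creating symlinks is allowed.

```python
import os
import tempfile
from pathlib import Path


def blob_for(snapshot_file: Path) -> Path:
    """Follow the symlink of a snapshot file to the blob that stores its content."""
    return snapshot_file.resolve()


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "models--demo"
    (repo / "blobs").mkdir(parents=True)
    (repo / "blobs" / "hash1").write_text("# Demo README")

    for revision in ("aaaaaa", "bbbbbb"):
        snapshot = repo / "snapshots" / revision
        snapshot.mkdir(parents=True)
        # Relative link with the shape `../../blobs/<hash>`.
        os.symlink(os.path.join("..", "..", "blobs", "hash1"), snapshot / "README.md")

    blob_a = blob_for(repo / "snapshots" / "aaaaaa" / "README.md")
    blob_b = blob_for(repo / "snapshots" / "bbbbbb" / "README.md")

    shared = blob_a == blob_b  # both revisions point to the same blob
    blob_name = blob_a.name
    nb_blobs = len(list((repo / "blobs").iterdir()))  # stored only once on disk
```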

### In practice

In practice, your cache should look like the following tree:

```text
[ 96] .
└── [ 160] models--julien-c--EsperBERTo-small
    ├── [ 160] blobs
    │   ├── [321M] 403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
    │   ├── [ 398] 7cb18dc9bafbfcf74629a4b760af1b160957a83e
    │   └── [1.4K] d7edf6bd2a681fb0175f7735299831ee1b22b812
    ├── [ 96] refs
    │   └── [ 40] main
    └── [ 128] snapshots
        ├── [ 128] 2439f60ef33a0d46d85da5001d52aeda5b00ce9f
        │   ├── [ 52] README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
        │   └── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        └── [ 128] bbc77c8132af1cc5cf678da3f1ddf2de43606d48
            ├── [ 52] README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
            └── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
```

## Scan your cache

At the moment, cached files are never deleted from your local directory: when you download
a new revision of a branch, previous files are kept in case you need them again.
It can therefore be useful to scan your cache directory to find out which repos
and revisions take up the most disk space. `huggingface_hub` provides a helper to
do so, which can be used via `huggingface-cli` or in a Python script.

### From the terminal

The easiest way to scan your HF cache-system is to use the `scan-cache` command from
the `huggingface-cli` tool. This command scans the cache and prints a report with information
such as repo id, repo type, disk usage, refs and full local path.

The snippet below shows a scan report in a folder in which 4 models and 2 datasets are
cached.

```text
➜ huggingface-cli scan-cache
REPO ID                     REPO TYPE SIZE ON DISK NB FILES REFS                LOCAL PATH
--------------------------- --------- ------------ -------- ------------------- -------------------------------------------------------------------------
glue                        dataset         116.3K       15 2.4.0, main, 1.17.0 /Users/lucain/.cache/huggingface/hub/datasets--glue
google/fleurs               dataset          64.9M        6 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs
Jean-Baptiste/camembert-ner model           441.0M        7 main                /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner
bert-base-cased             model             1.9G       13 main                /Users/lucain/.cache/huggingface/hub/models--bert-base-cased
t5-base                     model            10.1K        3 main                /Users/lucain/.cache/huggingface/hub/models--t5-base
t5-small                    model           970.7M       11 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/models--t5-small

Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
Got 1 warning(s) while scanning. Use -vvv to print details.
```

To get a more detailed report, use the `--verbose` option. For each repo, you get a
list of all revisions that have been downloaded. As explained above, the files that don't
change between 2 revisions are shared thanks to the symlinks. This means that the size of
the repo on disk is expected to be less than the sum of the size of each of its revisions.
For example, here `bert-base-cased` has 2 revisions of 1.4G and 1.5G but the total disk
usage is only 1.9G.

```text
➜ huggingface-cli scan-cache -v
REPO ID                     REPO TYPE REVISION                                 SIZE ON DISK NB FILES REFS        LOCAL PATH
--------------------------- --------- ---------------------------------------- ------------ -------- ----------- ----------------------------------------------------------------------------------------------------------------------------
glue                        dataset   9338f7b671827df886678df2bdd7cc7b4f36dffd        97.7K       14 main, 2.4.0 /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/9338f7b671827df886678df2bdd7cc7b4f36dffd
glue                        dataset   f021ae41c879fcabcf823648ec685e3fead91fe7        97.8K       14 1.17.0      /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/f021ae41c879fcabcf823648ec685e3fead91fe7
google/fleurs               dataset   129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8        25.4K        3 refs/pr/1   /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8
google/fleurs               dataset   24f85a01eb955224ca3946e70050869c56446805        64.9M        4 main        /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/24f85a01eb955224ca3946e70050869c56446805
Jean-Baptiste/camembert-ner model     dbec8489a1c44ecad9da8a9185115bccabd799fe       441.0M        7 main        /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner/snapshots/dbec8489a1c44ecad9da8a9185115bccabd799fe
bert-base-cased             model     378aa1bda6387fd00e824948ebe3488630ad8565         1.5G        9             /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/378aa1bda6387fd00e824948ebe3488630ad8565
bert-base-cased             model     a8d257ba9925ef39f3036bfc338acf5283c512d9         1.4G        9 main        /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9
t5-base                     model     23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9        10.1K        3 main        /Users/lucain/.cache/huggingface/hub/models--t5-base/snapshots/23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9
t5-small                    model     98ffebbb27340ec1b1abd7c45da12c253ee1882a       726.2M        6 refs/pr/1   /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a
t5-small                    model     d0a119eedb3718e34c648e594394474cf95e0617       485.8M        6             /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617
t5-small                    model     d78aea13fa7ecd06c29e3e46195d6341255065d5       970.7M        9 main        /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5

Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
Got 1 warning(s) while scanning. Use -vvv to print details.
```
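
The reason a repo can be smaller on disk than the sum of its revisions can be illustrated with a toy computation. The data layout, the blob names and all sizes below are made up for the example. The only idea taken from this guide is that a blob shared by several revisions is stored once.

```python
from typing import Dict, Tuple

# revision -> {file name: (blob hash, size in bytes)}; all values are made up.
Revision = Dict[str, Tuple[str, int]]


def revision_size(files: Revision) -> int:
    """Size of one revision: every file it links to is counted."""
    return sum(size for _, size in files.values())


def repo_size(revisions: Dict[str, Revision]) -> int:
    """Size of the repo on disk: every blob is counted once, even if shared."""
    blobs: Dict[str, int] = {}
    for files in revisions.values():
        for blob_hash, size in files.values():
            blobs[blob_hash] = size
    return sum(blobs.values())


revisions = {
    "aaaaaa": {"README.md": ("blob-1", 1_000), "pytorch_model.bin": ("blob-2", 500_000)},
    "bbbbbb": {"README.md": ("blob-3", 1_200), "pytorch_model.bin": ("blob-2", 500_000)},
}

sum_of_revisions = sum(revision_size(files) for files in revisions.values())
on_disk = repo_size(revisions)
```

Here the two revisions add up to 1,002,200 bytes, while the repo only occupies 502,200 bytes because `pytorch_model.bin` is unchanged and therefore shared.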

#### Grep example

Since the output is in tabular format, you can combine it with any `grep`-like tool to
filter the entries. Here is an example that keeps only the revisions of the "t5-small"
model on a Unix-based machine.

```text
➜ eval "huggingface-cli scan-cache -v" | grep "t5-small"
t5-small                    model     98ffebbb27340ec1b1abd7c45da12c253ee1882a       726.2M        6 refs/pr/1   /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a
t5-small                    model     d0a119eedb3718e34c648e594394474cf95e0617       485.8M        6             /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617
t5-small                    model     d78aea13fa7ecd06c29e3e46195d6341255065d5       970.7M        9 main        /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5
```

### From Python

For more advanced usage, use [`scan_cache_dir`], the Python utility called by
the CLI tool.

You can use it to get a detailed report structured around 4 dataclasses:

- [`HFCacheInfo`]: complete report returned by [`scan_cache_dir`]
- [`CachedRepoInfo`]: information about a cached repo
- [`CachedRevisionInfo`]: information about a cached revision (i.e. a "snapshot") inside a repo
- [`CachedFileInfo`]: information about a cached file in a snapshot

Here is a simple usage example. See reference for details.

```py
>>> from huggingface_hub import scan_cache_dir

>>> hf_cache_info = scan_cache_dir()
>>> hf_cache_info
HFCacheInfo(
    size_on_disk=3398085269,
    repos=frozenset({
        CachedRepoInfo(
            repo_id='t5-small',
            repo_type='model',
            repo_path=PosixPath(...),
            size_on_disk=970726914,
            nb_files=11,
            revisions=frozenset({
                CachedRevisionInfo(
                    commit_hash='d78aea13fa7ecd06c29e3e46195d6341255065d5',
                    size_on_disk=970726339,
                    snapshot_path=PosixPath(...),
                    files=frozenset({
                        CachedFileInfo(
                            file_name='config.json',
                            size_on_disk=1197,
                            file_path=PosixPath(...),
                            blob_path=PosixPath(...),
                        ),
                        CachedFileInfo(...),
                        ...
                    }),
                ),
                CachedRevisionInfo(...),
                ...
            }),
        ),
        CachedRepoInfo(...),
        ...
    }),
    warnings=[
        CorruptedCacheException("Snapshots dir doesn't exist in cached repo: ..."),
        CorruptedCacheException(...),
        ...
    ],
)
```
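
As an illustration of what can be built on top of such a report, the sketch below lists the biggest cached repos. It relies only on the `repos`, `repo_type`, `repo_id` and `size_on_disk` fields shown above. The helper name `largest_repos` is hypothetical, and the stand-in object uses partly made-up sizes so that the snippet runs without `huggingface_hub` or a real cache.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


def largest_repos(cache_info: Any, top: int = 3) -> List[Tuple[str, str, int]]:
    """Return `(repo_type, repo_id, size_on_disk)` for the biggest cached repos."""
    repos = sorted(cache_info.repos, key=lambda repo: repo.size_on_disk, reverse=True)
    return [(repo.repo_type, repo.repo_id, repo.size_on_disk) for repo in repos[:top]]


# Stand-in for a scan report (assumption: only the fields used above are mimicked).
fake_cache_info = SimpleNamespace(
    repos=[
        SimpleNamespace(repo_type="model", repo_id="t5-small", size_on_disk=970_726_914),
        SimpleNamespace(repo_type="dataset", repo_id="glue", size_on_disk=119_000),
        SimpleNamespace(repo_type="model", repo_id="bert-base-cased", size_on_disk=1_900_000_000),
    ]
)

top_two = largest_repos(fake_cache_info, top=2)
```

With `huggingface_hub` installed, the same function should accept the object returned by `scan_cache_dir()`, since it only reads attributes that appear in the report above.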
42 changes: 42 additions & 0 deletions docs/source/package_reference/cache.mdx
@@ -0,0 +1,42 @@
# Cache-system reference

The caching system was updated in v0.8.0 to become the central cache-system shared
across libraries that depend on the Hub. Read the [cache-system guide](../how-to-cache)
for a detailed presentation of caching at HF.

## Helpers

### scan_cache_dir

[[autodoc]] huggingface_hub.scan_cache_dir

## Data structures

All structures are built and returned by [`scan_cache_dir`] and are immutable.

### HFCacheInfo

[[autodoc]] huggingface_hub.HFCacheInfo

### CachedRepoInfo

[[autodoc]] huggingface_hub.CachedRepoInfo
    - size_on_disk_str
    - refs

### CachedRevisionInfo

[[autodoc]] huggingface_hub.CachedRevisionInfo
    - size_on_disk_str
    - nb_files

### CachedFileInfo

[[autodoc]] huggingface_hub.CachedFileInfo
    - size_on_disk_str

## Exceptions

### CorruptedCacheException

[[autodoc]] huggingface_hub.CorruptedCacheException