Skip to content

Commit

Permalink
Feature: add an utility to scan cache (#990)
Browse files Browse the repository at this point in the history
* WIP utility to scan cache

* add example in scan.py

* code quality

* property in CachedRepoInfo + typing fixes

* changes from feedback

* rename to private util module

* remove scan.py script and make proper CLI

* review CLI help

* remove unused colors

* start documentation

* add file to doctree

* fix snippets ?

* try generating doc from docstring

* doc

* refacto to frozen dataclasses

* always more doc

* always more docs

* forgotten line

* finalize doc

* add text to snippet type

* test cache scanner

* mypy

* siort

* add tests for CLI

* fix cli tests

* Update docs/source/how-to-cache.mdx

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update docs/source/how-to-cache.mdx

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update docs/source/how-to-cache.mdx

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Add valuerror is cache dir is missing

* update doc

* Change from errors to warnings

* Update src/huggingface_hub/commands/_cli_utils.py

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* test cli utils

* Test scan cache cli initialization

* make style

* typing

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
  • Loading branch information
3 people authored Aug 30, 2022
1 parent 7b57719 commit 48ddc62
Show file tree
Hide file tree
Showing 18 changed files with 1,442 additions and 166 deletions.
6 changes: 5 additions & 1 deletion docs/source/_toctree.yml
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,9 @@
- local: how-to-inference
title: Access the Inference API
- local: how-to-discussions-and-pull-requests
title: Interact with Discussions and Pull Requests
title: Interact with Discussions and Pull Requests
- local: how-to-cache
title: Manage the Cache
title: "Guides"
- sections:
- local: package_reference/repository
Expand All @@ -33,4 +35,6 @@
title: Utilities
- local: package_reference/community
title: Discussions and Pull Requests
- local: package_reference/cache
title: Cache-system reference
title: "Reference"
240 changes: 240 additions & 0 deletions docs/source/how-to-cache.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,240 @@
# Manage `huggingface_hub` cache-system

## Understand caching

The Hugging Face Hub cache-system is designed to be the central cache shared across libraries
that depend on the Hub. It has been updated in v0.8.0 to prevent re-downloading same files
between revisions.

The caching system is designed as follows:

```
<CACHE_DIR>
├─ <MODELS>
├─ <DATASETS>
├─ <SPACES>
```

The `<CACHE_DIR>` is usually your user's home directory. However, it is customizable with the
`cache_dir` argument on all methods, or by specifying either `HF_HOME` or
`HUGGINGFACE_HUB_CACHE` environment variable.

Models, datasets and spaces share a common root. Each of these repositories contains the
repository type, the namespace (organization or username) if it exists and the
repository name:

```
<CACHE_DIR>
├─ models--julien-c--EsperBERTo-small
├─ models--lysandrejik--arxiv-nlp
├─ models--bert-base-cased
├─ datasets--glue
├─ datasets--huggingface--DataMeasurementsFiles
├─ spaces--dalle-mini--dalle-mini
```

It is within these folders that all files will now be downloaded from the Hub. Caching ensures that
a file isn't downloaded twice if it already exists and wasn't updated; but if it was updated,
and you're asking for the latest file, then it will download the latest file (while keeping
the previous file intact in case you need it again).

In order to achieve this, all folders contain the same skeleton:

```
<CACHE_DIR>
├─ datasets--glue
│ ├─ refs
│ ├─ blobs
│ ├─ snapshots
...
```

Each folder is designed to contain the following:

### Refs

The `refs` folder contains files which indicates the latest revision of the given reference. For example,
if we have previously fetched a file from the `main` branch of a repository, the `refs`
folder will contain a file named `main`, which will itself contain the commit identifier of the current head.

If the latest commit of `main` has `aaaaaa` as identifier, then it will contain `aaaaaa`.

If that same branch gets updated with a new commit, that has `bbbbbb` as an identifier, then
re-downloading a file from that reference will update the `refs/main` file to contain `bbbbbb`.

### Blobs

The `blobs` folder contains the actual files that we have downloaded. The name of each file is their hash.

### Snapshots

The `snapshots` folder contains symlinks to the blobs mentioned above. It is itself made up of several folders:
one per known revision!

In the explanation above, we had initially fetched a file from the `aaaaaa` revision, before fetching a file from
the `bbbbbb` revision. In this situation, we would now have two folders in the `snapshots` folder: `aaaaaa`
and `bbbbbb`.

In each of these folders, live symlinks that have the names of the files that we have downloaded. For example,
if we had downloaded the `README.md` file at revision `aaaaaa`, we would have the following path:

```
<CACHE_DIR>/<REPO_NAME>/snapshots/aaaaaa/README.md
```

That `README.md` file is actually a symlink linking to the blob that has the hash of the file.

By creating the skeleton this way we open the mechanism to file sharing: if the same file was fetched in
revision `bbbbbb`, it would have the same hash and the file would not need to be re-downloaded.

### In practice

In practice, your cache should look like the following tree:

```text
[ 96] .
└── [ 160] models--julien-c--EsperBERTo-small
├── [ 160] blobs
│ ├── [321M] 403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
│ ├── [ 398] 7cb18dc9bafbfcf74629a4b760af1b160957a83e
│ └── [1.4K] d7edf6bd2a681fb0175f7735299831ee1b22b812
├── [ 96] refs
│ └── [ 40] main
└── [ 128] snapshots
├── [ 128] 2439f60ef33a0d46d85da5001d52aeda5b00ce9f
│ ├── [ 52] README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
│ └── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
└── [ 128] bbc77c8132af1cc5cf678da3f1ddf2de43606d48
├── [ 52] README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
└── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
```

## Scan your cache

At the moment, cached files are never deleted from your local directory: when you download
a new revision of a branch, previous files are kept in case you need them again.
Therefore it can be useful to scan your cache directory in order to know which repos
and revisions are taking the most disk space. `huggingface_hub` provides an helper to
do so that can be used via `huggingface-cli` or in a python script.

### From the terminal

The easiest way to scan your HF cache-system is to use the `scan-cache` command from
`huggingface-cli` tool. This command scans the cache and prints a report with information
like repo id, repo type, disk usage, refs and full local path.

The snippet below shows a scan report in a folder in which 4 models and 2 datasets are
cached.

```text
➜ huggingface-cli scan-cache
REPO ID REPO TYPE SIZE ON DISK NB FILES REFS LOCAL PATH
--------------------------- --------- ------------ -------- ------------------- -------------------------------------------------------------------------
glue dataset 116.3K 15 2.4.0, main, 1.17.0 /Users/lucain/.cache/huggingface/hub/datasets--glue
google/fleurs dataset 64.9M 6 refs/pr/1, main /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs
Jean-Baptiste/camembert-ner model 441.0M 7 main /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner
bert-base-cased model 1.9G 13 main /Users/lucain/.cache/huggingface/hub/models--bert-base-cased
t5-base model 10.1K 3 main /Users/lucain/.cache/huggingface/hub/models--t5-base
t5-small model 970.7M 11 refs/pr/1, main /Users/lucain/.cache/huggingface/hub/models--t5-small
Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
Got 1 warning(s) while scanning. Use -vvv to print details.
```

To get a more detailed report, use the `--verbose` option. For each repo, you get a
list of all revisions that have been downloaded. As explained above, the files that don't
change between 2 revisions are shared thanks to the symlinks. This means that the size of
the repo on disk is expected to be less than the sum of the size of each of its revisions.
For example, here `bert-base-cased` has 2 revisions of 1.4G and 1.5G but the total disk
usage is only 1.9G.

```text
➜ huggingface-cli scan-cache -v
REPO ID REPO TYPE REVISION SIZE ON DISK NB FILES REFS LOCAL PATH
--------------------------- --------- ---------------------------------------- ------------ -------- ----------- ----------------------------------------------------------------------------------------------------------------------------
glue dataset 9338f7b671827df886678df2bdd7cc7b4f36dffd 97.7K 14 main, 2.4.0 /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/9338f7b671827df886678df2bdd7cc7b4f36dffd
glue dataset f021ae41c879fcabcf823648ec685e3fead91fe7 97.8K 14 1.17.0 /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/f021ae41c879fcabcf823648ec685e3fead91fe7
google/fleurs dataset 129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8 25.4K 3 refs/pr/1 /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8
google/fleurs dataset 24f85a01eb955224ca3946e70050869c56446805 64.9M 4 main /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/24f85a01eb955224ca3946e70050869c56446805
Jean-Baptiste/camembert-ner model dbec8489a1c44ecad9da8a9185115bccabd799fe 441.0M 7 main /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner/snapshots/dbec8489a1c44ecad9da8a9185115bccabd799fe
bert-base-cased model 378aa1bda6387fd00e824948ebe3488630ad8565 1.5G 9 /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/378aa1bda6387fd00e824948ebe3488630ad8565
bert-base-cased model a8d257ba9925ef39f3036bfc338acf5283c512d9 1.4G 9 main /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9
t5-base model 23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9 10.1K 3 main /Users/lucain/.cache/huggingface/hub/models--t5-base/snapshots/23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9
t5-small model 98ffebbb27340ec1b1abd7c45da12c253ee1882a 726.2M 6 refs/pr/1 /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a
t5-small model d0a119eedb3718e34c648e594394474cf95e0617 485.8M 6 /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617
t5-small model d78aea13fa7ecd06c29e3e46195d6341255065d5 970.7M 9 main /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5
Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
Got 1 warning(s) while scanning. Use -vvv to print details.
```

#### Grep example

Since the output is in tabular format, you can combine it with any `grep`-like tools to
filter the entries. Here is an example to filter only revisions from the "t5-small"
model on a Unix-based machine.

```text
➜ eval "huggingface-cli scan-cache -v" | grep "t5-small"
t5-small model 98ffebbb27340ec1b1abd7c45da12c253ee1882a 726.2M 6 refs/pr/1 /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a
t5-small model d0a119eedb3718e34c648e594394474cf95e0617 485.8M 6 /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617
t5-small model d78aea13fa7ecd06c29e3e46195d6341255065d5 970.7M 9 main /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5
```

### From Python

For a more advanced usage, use [`scan_cache_dir`] which is the python utility called by
the CLI tool.

You can use it to get a detailed report structured around 4 dataclasses:

- [`HFCacheInfo`]: complete report returned by [`scan_cache_dir`]
- [`CachedRepoInfo`]: information about a cached repo
- [`CachedRevisionInfo`]: information about a cached revision (e.g. "snapshot") inside a repo
- [`CachedFileInfo`]: information about a cached file in a snapshot

Here is a simple usage example. See reference for details.

```py
>>> from huggingface_hub import scan_cache_dir

>>> hf_cache_info = scan_cache_dir()
HFCacheInfo(
size_on_disk=3398085269,
repos=frozenset({
CachedRepoInfo(
repo_id='t5-small',
repo_type='model',
repo_path=PosixPath(...),
size_on_disk=970726914,
nb_files=11,
revisions=frozenset({
CachedRevisionInfo(
commit_hash='d78aea13fa7ecd06c29e3e46195d6341255065d5',
size_on_disk=970726339,
snapshot_path=PosixPath(...),
files=frozenset({
CachedFileInfo(
file_name='config.json',
size_on_disk=1197
file_path=PosixPath(...),
blob_path=PosixPath(...),
),
CachedFileInfo(...),
...
}),
),
CachedRevisionInfo(...),
...
}),
),
CachedRepoInfo(...),
...
}),
warnings=[
CorruptedCacheException("Snapshots dir doesn't exist in cached repo: ..."),
CorruptedCacheException(...),
...
],
)
```
42 changes: 42 additions & 0 deletions docs/source/package_reference/cache.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# Cache-system reference

The caching system was updated in v0.8.0 to become the central cache-system shared
across libraries that depend on the Hub. Read the [cache-system guide](../how-to-cache)
for a detailed presentation of caching at HF.

## Helpers

### scan_cache_dir

[[autodoc]] huggingface_hub.scan_cache_dir

## Data structures

All structures are built and returned by [`scan_cache_dir`] and are immutable.

### HFCacheInfo

[[autodoc]] huggingface_hub.HFCacheInfo

### CachedRepoInfo

[[autodoc]] huggingface_hub.CachedRepoInfo
- size_on_disk_str
- refs

### CachedRevisionInfo

[[autodoc]] huggingface_hub.CachedRevisionInfo
- size_on_disk_str
- nb_files

### CachedFileInfo

[[autodoc]] huggingface_hub.CachedFileInfo
- size_on_disk_str

## Exceptions

### CorruptedCacheException

[[autodoc]] huggingface_hub.CorruptedCacheException
Loading

0 comments on commit 48ddc62

Please sign in to comment.