Feature: add a utility to scan cache (#990)

* WIP utility to scan cache
* add example in scan.py
* code quality
* property in CachedRepoInfo + typing fixes
* changes from feedback
* rename to private util module
* remove scan.py script and make proper CLI
* review CLI help
* remove unused colors
* start documentation
* add file to doctree
* fix snippets ?
* try generating doc from docstring
* doc
* refacto to frozen dataclasses
* always more doc
* always more docs
* forgotten line
* finalize doc
* add text to snippet type
* test cache scanner
* mypy
* siort
* add tests for CLI
* fix cli tests
* Update docs/source/how-to-cache.mdx
  Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Update docs/source/how-to-cache.mdx
  Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Update docs/source/how-to-cache.mdx
  Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Add valuerror is cache dir is missing
* update doc
* Change from errors to warnings
* Update src/huggingface_hub/commands/_cli_utils.py
  Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
* test cli utils
* Test scan cache cli initialization
* make style
* typing

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
huggingface · Aug 30, 2022 · 48ddc62 · 48ddc62
1 parent 7b57719
commit 48ddc62
Showing 18 changed files with 1,442 additions and 166 deletions.
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -16,7 +16,9 @@
   - local: how-to-inference
     title: Access the Inference API
   - local: how-to-discussions-and-pull-requests
-    title: Interact with Discussions and Pull Requests 
+    title: Interact with Discussions and Pull Requests
+  - local: how-to-cache
+    title: Manage the Cache
   title: "Guides"
 - sections:
     - local: package_reference/repository
@@ -33,4 +35,6 @@
       title: Utilities
     - local: package_reference/community
       title: Discussions and Pull Requests
+    - local: package_reference/cache
+      title: Cache-system reference
   title: "Reference"
diff --git a/docs/source/how-to-cache.mdx b/docs/source/how-to-cache.mdx
@@ -0,0 +1,240 @@
+# Manage `huggingface_hub` cache-system
+
+## Understand caching
+
+The Hugging Face Hub cache-system is designed to be the central cache shared across libraries
+that depend on the Hub. It was updated in v0.8.0 to avoid re-downloading the same files
+between revisions.
+
+The caching system is designed as follows:
+
+```
+<CACHE_DIR>
+├─ <MODELS>
+├─ <DATASETS>
+├─ <SPACES>
+```
+
+By default, `<CACHE_DIR>` lives under your user's home directory (`~/.cache/huggingface/hub`).
+However, it is customizable with the `cache_dir` argument on all methods, or by setting either
+the `HF_HOME` or the `HUGGINGFACE_HUB_CACHE` environment variable.
+
+Models, datasets and spaces share a common root. Each repository folder is named after the
+repository type, the namespace (organization or username) if it exists, and the
+repository name:
+
+```
+<CACHE_DIR>
+├─ models--julien-c--EsperBERTo-small
+├─ models--lysandrejik--arxiv-nlp
+├─ models--bert-base-cased
+├─ datasets--glue
+├─ datasets--huggingface--DataMeasurementsFiles
+├─ spaces--dalle-mini--dalle-mini
+```
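
The naming pattern in the listing above can be sketched in a few lines of Python. This is an illustration inferred from the example folder names only; `cache_folder_name` is a hypothetical helper, not a function exposed by `huggingface_hub`.

```python
def cache_folder_name(repo_id: str, repo_type: str = "model") -> str:
    """Build a cache folder name that follows the pattern shown in the listing.

    Hypothetical helper for illustration; not part of huggingface_hub.
    """
    if repo_type not in {"model", "dataset", "space"}:
        raise ValueError(f"Unknown repo type: {repo_type!r}")
    # "<type>s--<namespace>--<name>", or "<type>s--<name>" when there is no namespace
    return "--".join([f"{repo_type}s", *repo_id.split("/")])


print(cache_folder_name("julien-c/EsperBERTo-small"))  # models--julien-c--EsperBERTo-small
print(cache_folder_name("glue", repo_type="dataset"))  # datasets--glue
```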
+
+It is within these folders that all files will now be downloaded from the Hub. Caching ensures that
+a file isn't downloaded twice if it already exists and wasn't updated; but if it was updated,
+and you're asking for the latest file, then it will download the latest file (while keeping
+the previous file intact in case you need it again).
+
+In order to achieve this, all folders contain the same skeleton:
+
+```
+<CACHE_DIR>
+├─ datasets--glue
+│  ├─ refs
+│  ├─ blobs
+│  ├─ snapshots
+...
+```
+
+Each folder is designed to contain the following:
+
+### Refs
+
+The `refs` folder contains files which indicate the latest revision of the given reference. For example,
+if we have previously fetched a file from the `main` branch of a repository, the `refs`
+folder will contain a file named `main`, which will itself contain the commit identifier of the current head.
+
+If the latest commit of `main` has `aaaaaa` as identifier, then it will contain `aaaaaa`.
+
+If that same branch gets updated with a new commit, that has `bbbbbb` as an identifier, then
+re-downloading a file from that reference will update the `refs/main` file to contain `bbbbbb`.
+
+### Blobs
+
+The `blobs` folder contains the actual files that we have downloaded. Each file is named after its hash.
+
+### Snapshots
+
+The `snapshots` folder contains symlinks to the blobs mentioned above. It is itself made up of several folders:
+one per known revision!
+
+In the explanation above, we had initially fetched a file from the `aaaaaa` revision, before fetching a file from
+the `bbbbbb` revision. In this situation, we would now have two folders in the `snapshots` folder: `aaaaaa`
+and `bbbbbb`.
+
+Each of these folders contains symlinks named after the files that we have downloaded. For example,
+if we had downloaded the `README.md` file at revision `aaaaaa`, we would have the following path:
+
+```
+<CACHE_DIR>/<REPO_NAME>/snapshots/aaaaaa/README.md
+```
+
+That `README.md` file is actually a symlink to the blob named after the hash of the file.
+
+This layout makes file sharing between revisions possible: if the same file was fetched in
+revision `bbbbbb`, it would have the same hash and the file would not need to be re-downloaded.
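
To make the `refs` / `snapshots` / `blobs` relationship concrete, here is a minimal sketch that rebuilds the layout in a temporary directory and follows it by hand. The helper `resolve_cached_file`, the blob name `0123abcd` and the file content are made up for illustration; this is not how `huggingface_hub` resolves files internally, and it assumes a platform that allows creating symlinks.

```python
import tempfile
from pathlib import Path


def resolve_cached_file(repo_dir: Path, ref: str, filename: str) -> Path:
    """Follow refs -> snapshots -> blobs for one file (hypothetical helper)."""
    # refs/<ref> holds the commit identifier that the reference points to
    commit_hash = (repo_dir / "refs" / ref).read_text().strip()
    # snapshots/<commit>/<filename> is a symlink; resolve() follows it to the blob
    return (repo_dir / "snapshots" / commit_hash / filename).resolve()


with tempfile.TemporaryDirectory() as tmp:
    repo_dir = Path(tmp) / "datasets--glue"
    for sub in ("refs", "blobs", "snapshots/aaaaaa", "snapshots/bbbbbb"):
        (repo_dir / sub).mkdir(parents=True)

    # One blob shared by both revisions (made-up hash and content)
    blob = repo_dir / "blobs" / "0123abcd"
    blob.write_text("# README")
    for revision in ("aaaaaa", "bbbbbb"):
        link = repo_dir / "snapshots" / revision / "README.md"
        link.symlink_to(Path("../../blobs") / blob.name)

    # The `main` branch currently points to revision bbbbbb
    (repo_dir / "refs" / "main").write_text("bbbbbb")

    resolved = resolve_cached_file(repo_dir, "main", "README.md")
    older = (repo_dir / "snapshots" / "aaaaaa" / "README.md").resolve()
    same_blob = resolved == older

print(resolved.name, same_blob)  # 0123abcd True
```

Both revisions resolve to the same blob, which is why the unchanged file is stored only once.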
+
+### In practice
+
+In practice, your cache should look like the following tree:
+
+```text
+    [  96]  .
+    └── [ 160]  models--julien-c--EsperBERTo-small
+        ├── [ 160]  blobs
+        │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
+        │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
+        │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
+        ├── [  96]  refs
+        │   └── [  40]  main
+        └── [ 128]  snapshots
+            ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
+            │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
+            │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
+            └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
+                ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
+                └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
+```
+
+## Scan your cache
+
+At the moment, cached files are never deleted from your local directory: when you download
+a new revision of a branch, previous files are kept in case you need them again.
+Therefore it can be useful to scan your cache directory in order to know which repos
+and revisions are taking the most disk space. `huggingface_hub` provides a helper to
+do so that can be used via `huggingface-cli` or in a Python script.
+
+### From the terminal
+
+The easiest way to scan your HF cache-system is to use the `scan-cache` command from
+the `huggingface-cli` tool. This command scans the cache and prints a report with information
+like repo id, repo type, disk usage, refs and full local path.
+
+The snippet below shows a scan report in a folder in which 4 models and 2 datasets are
+cached.
+
+```text
+➜ huggingface-cli scan-cache
+REPO ID                     REPO TYPE SIZE ON DISK NB FILES REFS                LOCAL PATH
+--------------------------- --------- ------------ -------- ------------------- -------------------------------------------------------------------------
+glue                        dataset         116.3K       15 2.4.0, main, 1.17.0 /Users/lucain/.cache/huggingface/hub/datasets--glue
+google/fleurs               dataset          64.9M        6 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs
+Jean-Baptiste/camembert-ner model           441.0M        7 main                /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner
+bert-base-cased             model             1.9G       13 main                /Users/lucain/.cache/huggingface/hub/models--bert-base-cased
+t5-base                     model            10.1K        3 main                /Users/lucain/.cache/huggingface/hub/models--t5-base
+t5-small                    model           970.7M       11 refs/pr/1, main     /Users/lucain/.cache/huggingface/hub/models--t5-small
+
+Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
+Got 1 warning(s) while scanning. Use -vvv to print details.
+```
+
+To get a more detailed report, use the `--verbose` option. For each repo, you get a
+list of all revisions that have been downloaded. As explained above, the files that don't
+change between 2 revisions are shared thanks to the symlinks. This means that the size of
+the repo on disk is expected to be less than the sum of the size of each of its revisions.
+For example, here `bert-base-cased` has 2 revisions of 1.4G and 1.5G but the total disk
+usage is only 1.9G.
+
+```text
+➜ huggingface-cli scan-cache -v
+REPO ID                     REPO TYPE REVISION                                 SIZE ON DISK NB FILES REFS        LOCAL PATH
+--------------------------- --------- ---------------------------------------- ------------ -------- ----------- ----------------------------------------------------------------------------------------------------------------------------
+glue                        dataset   9338f7b671827df886678df2bdd7cc7b4f36dffd        97.7K       14 main, 2.4.0 /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/9338f7b671827df886678df2bdd7cc7b4f36dffd
+glue                        dataset   f021ae41c879fcabcf823648ec685e3fead91fe7        97.8K       14 1.17.0      /Users/lucain/.cache/huggingface/hub/datasets--glue/snapshots/f021ae41c879fcabcf823648ec685e3fead91fe7
+google/fleurs               dataset   129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8        25.4K        3 refs/pr/1   /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/129b6e96cf1967cd5d2b9b6aec75ce6cce7c89e8
+google/fleurs               dataset   24f85a01eb955224ca3946e70050869c56446805        64.9M        4 main        /Users/lucain/.cache/huggingface/hub/datasets--google--fleurs/snapshots/24f85a01eb955224ca3946e70050869c56446805
+Jean-Baptiste/camembert-ner model     dbec8489a1c44ecad9da8a9185115bccabd799fe       441.0M        7 main        /Users/lucain/.cache/huggingface/hub/models--Jean-Baptiste--camembert-ner/snapshots/dbec8489a1c44ecad9da8a9185115bccabd799fe
+bert-base-cased             model     378aa1bda6387fd00e824948ebe3488630ad8565         1.5G        9             /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/378aa1bda6387fd00e824948ebe3488630ad8565
+bert-base-cased             model     a8d257ba9925ef39f3036bfc338acf5283c512d9         1.4G        9 main        /Users/lucain/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9
+t5-base                     model     23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9        10.1K        3 main        /Users/lucain/.cache/huggingface/hub/models--t5-base/snapshots/23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9
+t5-small                    model     98ffebbb27340ec1b1abd7c45da12c253ee1882a       726.2M        6 refs/pr/1   /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a
+t5-small                    model     d0a119eedb3718e34c648e594394474cf95e0617       485.8M        6             /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617
+t5-small                    model     d78aea13fa7ecd06c29e3e46195d6341255065d5       970.7M        9 main        /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5
+
+Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
+Got 1 warning(s) while scanning. Use -vvv to print details.
+```
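
The size arithmetic described above (a repo takes less space than the sum of its revisions) can be illustrated with a short sketch. The revisions, blob names and sizes below are hypothetical and chosen only to show the deduplication; the sketch does not reproduce how the scanner computes sizes.

```python
# Hypothetical revisions: file name -> (blob name, size in bytes)
revisions = {
    "aaaaaa": {
        "README.md": ("blob-readme-v1", 1_000),
        "pytorch_model.bin": ("blob-weights", 400_000),
    },
    "bbbbbb": {
        "README.md": ("blob-readme-v2", 1_200),
        "pytorch_model.bin": ("blob-weights", 400_000),  # unchanged, same blob
    },
}


def revision_size(files: dict) -> int:
    """Size of one revision: every file it references is counted."""
    return sum(size for _blob, size in files.values())


def repo_size(all_revisions: dict) -> int:
    """Size of the repo on disk: each blob is counted once, however many revisions use it."""
    unique_blobs = {
        blob: size
        for files in all_revisions.values()
        for blob, size in files.values()
    }
    return sum(unique_blobs.values())


sum_of_revisions = sum(revision_size(files) for files in revisions.values())
print(sum_of_revisions)      # 802200
print(repo_size(revisions))  # 402200
```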
+
+#### Grep example
+
+Since the output is in tabular format, you can combine it with any `grep`-like tool to
+filter the entries. Here is an example that keeps only the revisions of the "t5-small"
+model on a Unix-based machine.
+
+```text
+➜ huggingface-cli scan-cache -v | grep "t5-small"
+t5-small                    model     98ffebbb27340ec1b1abd7c45da12c253ee1882a       726.2M        6 refs/pr/1   /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/98ffebbb27340ec1b1abd7c45da12c253ee1882a
+t5-small                    model     d0a119eedb3718e34c648e594394474cf95e0617       485.8M        6             /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d0a119eedb3718e34c648e594394474cf95e0617
+t5-small                    model     d78aea13fa7ecd06c29e3e46195d6341255065d5       970.7M        9 main        /Users/lucain/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5
+```
+
+### From Python
+
+For more advanced usage, use [`scan_cache_dir`], the Python utility called by
+the CLI tool.
+
+You can use it to get a detailed report structured around 4 dataclasses:
+
+- [`HFCacheInfo`]: complete report returned by [`scan_cache_dir`]
+- [`CachedRepoInfo`]: information about a cached repo
+- [`CachedRevisionInfo`]: information about a cached revision (i.e. a "snapshot") inside a repo
+- [`CachedFileInfo`]: information about a cached file in a snapshot
+
+Here is a simple usage example. See the reference for details.
+
+```py
+>>> from huggingface_hub import scan_cache_dir
+
+>>> hf_cache_info = scan_cache_dir()
+>>> hf_cache_info
+HFCacheInfo(
+    size_on_disk=3398085269,
+    repos=frozenset({
+        CachedRepoInfo(
+            repo_id='t5-small',
+            repo_type='model',
+            repo_path=PosixPath(...),
+            size_on_disk=970726914,
+            nb_files=11,
+            revisions=frozenset({
+                CachedRevisionInfo(
+                    commit_hash='d78aea13fa7ecd06c29e3e46195d6341255065d5',
+                    size_on_disk=970726339,
+                    snapshot_path=PosixPath(...),
+                    files=frozenset({
+                        CachedFileInfo(
+                            file_name='config.json',
+                            size_on_disk=1197,
+                            file_path=PosixPath(...),
+                            blob_path=PosixPath(...),
+                        ),
+                        CachedFileInfo(...),
+                        ...
+                    }),
+                ),
+                CachedRevisionInfo(...),
+                ...
+            }),
+        ),
+        CachedRepoInfo(...),
+        ...
+    }),
+    warnings=[
+        CorruptedCacheException("Snapshots dir doesn't exist in cached repo: ..."),
+        CorruptedCacheException(...),
+        ...
+    ],
+)
+```
diff --git a/docs/source/package_reference/cache.mdx b/docs/source/package_reference/cache.mdx
@@ -0,0 +1,42 @@
+# Cache-system reference
+
+The caching system was updated in v0.8.0 to become the central cache-system shared
+across libraries that depend on the Hub. Read the [cache-system guide](../how-to-cache)
+for a detailed presentation of caching at HF.
+
+## Helpers
+
+### scan_cache_dir
+
+[[autodoc]] huggingface_hub.scan_cache_dir
+
+## Data structures
+
+All structures are built and returned by [`scan_cache_dir`] and are immutable.
+
+### HFCacheInfo
+
+[[autodoc]] huggingface_hub.HFCacheInfo
+
+### CachedRepoInfo
+
+[[autodoc]] huggingface_hub.CachedRepoInfo
+    - size_on_disk_str
+    - refs
+
+### CachedRevisionInfo
+
+[[autodoc]] huggingface_hub.CachedRevisionInfo
+    - size_on_disk_str
+    - nb_files
+
+### CachedFileInfo
+
+[[autodoc]] huggingface_hub.CachedFileInfo
+    - size_on_disk_str
+
+## Exceptions
+
+### CorruptedCacheException
+
+[[autodoc]] huggingface_hub.CorruptedCacheException