Commit: New git-aware cache file layout (#801)

* light typing

(cherry picked from commit b2c8f9b970c505cdf2c685e645e9e36cc472b0d3)

* remove this seminal comment

(cherry picked from commit 12a841a605c94733154f3b22e812c0f5e69ef37b)

* I don't understand why we don't early return here

cc @patrickvonplaten care to take a look? cc @LysandreJik

(cherry picked from commit 259ab36f03ab3eed6eeb4fc4984bc259619b442f)

* following last commit, unnest this

(cherry picked from commit 54957f3f049d887af21dd8f6950873a2823c4247)

* [BIG] This should work for all repo_types not just models!

(cherry picked from commit 9a3f96ccb2de6663cf4cf2d9a60dd7f415227c1b)

* one more

(cherry picked from commit b74871250616c44a2125b26d5de29b1189e82e12)

* forgot a repo_type and reorder code

(cherry picked from commit 3ef7d79a44087e971e10e35d3b9f5bea3474f297)

* also rename this cache folder

(cherry picked from commit 4c518b861723a6d28d59108403c37edf5208f2fe)

* Use `hf_hub_download`, will be simpler later

(cherry picked from commit c7478d58fe62da02625b8ca17796ad1419a048b1)

* in this new version, `force_filename` does not make sense anymore

(cherry picked from commit 9a674bc795d5c8a26aecf5429d391fff92e47e8d)

* Just inline everything inside `hf_hub_download` for now

(cherry picked from commit ee49f8f57ba4e7e66f237df8f64c804862fe3ee8)

* Big prototype! it works! 🎉

(cherry picked from commit 7fe19ec66a2c5a7386a956cb9b65616cb209608a)

* wip wip

* do not touch `cached_download`

* Prompt user to upgrade to `hf_hub_download`

* Add a `legacy_cache_layout=True` to preserve old behavior, just in case

* Create `relative symlinks` + add some doc

* Fix behavior when no network

* This test now is legacy

* Fix-ish conflict-ish

* minimize diff

* refactor `repo_folder_name`

* windows support + shortcut if user passes a commit hash

* Rewrite `snapshot_download` and make it more robust

* OOops

* Create example-transformers-tf.py

* Fix + add a way more complete example (running on Ubuntu)

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/huggingface_hub/file_download.py

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* Update src/huggingface_hub/file_download.py

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* Only allow full revision hashes otherwise the `revision != commit_hash` test is not reliable

* add a little bit more doc + consistency

* Update src/huggingface_hub/snapshot_download.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update snapshot download

* First pass on tests

* Wrap up tests

* 🐺 Fix for bug reported by @thomwolf

see huggingface/huggingface_hub#801 (comment)

* Special case for Windows

* Address comments and docs

* Clean up with ternary cc @julien-c

* Add argument to `cached_download`

* Opt-in for filename_to_url

* Opt-in for filename_to_url

* Pass the flag

* Update docs/source/package_reference/file_download.mdx

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/huggingface_hub/file_download.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Address review comments

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
5 people committed May 25, 2022
1 parent 9b8dc7a commit eb05af7
Showing 10 changed files with 1,037 additions and 173 deletions.
106 changes: 106 additions & 0 deletions docs/source/package_reference/file_download.mdx
@@ -8,3 +8,109 @@

[[autodoc]] huggingface_hub.hf_hub_url

## Caching

The methods displayed above are designed to work with a caching system that prevents re-downloading files.
The caching system was updated in v0.8.0 so that the directory structure and downloaded files can be
shared across libraries that depend on the Hub.

The caching system is designed as follows:

```
<CACHE_DIR>
├─ <MODELS>
├─ <DATASETS>
├─ <SPACES>
```

The `<CACHE_DIR>` usually lives in your user's home directory (for example under `~/.cache/huggingface`).
It is customizable with the `cache_dir` argument on all methods, or by specifying the `HF_HOME` environment variable.
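
For example, a minimal sketch of pointing the cache at a custom location (the path below is hypothetical; any writable folder works):

```python
from huggingface_hub import hf_hub_download

# Files are cached under /tmp/my_hub_cache instead of the default
# location, using the same layout described below.
path = hf_hub_download(
    repo_id="bert-base-cased",
    filename="config.json",
    cache_dir="/tmp/my_hub_cache",
)
print(path)
```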

Models, datasets and spaces share a common root. Each of these repositories contains the namespace
(organization, username) if it exists, alongside the repository name:

```
<CACHE_DIR>
├─ models--julien-c--EsperBERTo-small
├─ models--lysandrejik--arxiv-nlp
├─ models--bert-base-cased
├─ datasets--glue
├─ datasets--huggingface--DataMeasurementsFiles
├─ spaces--dalle-mini--dalle-mini
```
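
The folder names above follow a simple serialization scheme. Here is a sketch of the `repo_folder_name` helper this PR introduces (assuming the `--` separator shown above; the actual implementation may differ in details):

```python
REPO_ID_SEPARATOR = "--"  # this substring is not allowed in repo_ids on hf.co

def repo_folder_name(*, repo_id: str, repo_type: str) -> str:
    # "model" + "julien-c/EsperBERTo-small" -> "models--julien-c--EsperBERTo-small"
    parts = [f"{repo_type}s", *repo_id.split("/")]
    return REPO_ID_SEPARATOR.join(parts)
```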

It is within these folders that all files will now be downloaded from the Hub. Caching ensures that
a file isn't downloaded twice if it already exists and wasn't updated; if it was updated and you ask
for the latest file, the latest file is downloaded, while the previous file is kept intact in case
you need it again.
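
As a sketch of the expected behavior (assuming no new commit lands between the two calls):

```python
from huggingface_hub import hf_hub_download

# The first call downloads the file; the second finds it in the cache
# and returns the same path without re-downloading.
first = hf_hub_download(repo_id="bert-base-cased", filename="config.json")
second = hf_hub_download(repo_id="bert-base-cased", filename="config.json")
assert first == second
```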

In order to achieve this, all folders contain the same skeleton:

```
<CACHE_DIR>
├─ datasets--glue
│ ├─ refs
│ ├─ blobs
│ ├─ snapshots
...
```

Each folder is designed to contain the following:

### Refs

The `refs` folder contains files that indicate the latest revision of a given reference. For example,
if we have previously fetched a file from the `main` branch of a repository, the `refs`
folder will contain a file named `main`, which itself contains the commit identifier of the current head.

If the latest commit of `main` has `aaaaaa` as its identifier, the `refs/main` file will contain `aaaaaa`.

If that same branch gets updated with a new commit that has `bbbbbb` as its identifier, then
re-downloading a file from that reference will update the `refs/main` file to contain `bbbbbb`.
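
This is also how the offline code path resolves a branch name to a commit hash; a hand-rolled equivalent of what the library does internally might look like this (the cache path is an assumption):

```python
import os

# Assumed default cache location; adjust if HF_HOME or cache_dir is set.
storage_folder = os.path.expanduser(
    "~/.cache/huggingface/hub/models--bert-base-cased"
)
with open(os.path.join(storage_folder, "refs", "main")) as f:
    commit_hash = f.read()  # e.g. "aaaaaa..." (a full commit hash)
```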

### Blobs

The `blobs` folder contains the actual files that we have downloaded. Each file is named after its hash.

### Snapshots

The `snapshots` folder contains symlinks to the blobs mentioned above. It is itself made up of several folders:
one per known revision!

In the explanation above, we had initially fetched a file from the `aaaaaa` revision, before fetching a file from
the `bbbbbb` revision. In this situation, we would now have two folders in the `snapshots` folder: `aaaaaa`
and `bbbbbb`.

In each of these folders live symlinks named after the files that we have downloaded. For example,
if we had downloaded the `README.md` file at revision `aaaaaa`, we would have the following path:

```
<CACHE_DIR>/<REPO_NAME>/snapshots/aaaaaa/README.md
```

That `README.md` file is actually a symlink pointing to the blob whose name is the hash of the file.

Creating the skeleton this way opens up the mechanism to file sharing: if the same file was fetched at
revision `bbbbbb`, it would have the same hash and would not need to be re-downloaded.
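
A minimal sketch of how such a relative symlink could be created (names are illustrative; the library special-cases Windows, where symlinks need extra care):

```python
import os

blob_hash = "d7edf6bd2a681fb0175f7735299831ee1b22b812"
snapshot_dir = os.path.join(
    "models--julien-c--EsperBERTo-small", "snapshots", "aaaaaa"
)
os.makedirs(snapshot_dir, exist_ok=True)

# Relative target: climb out of snapshots/<revision>/ into blobs/.
target = os.path.join("..", "..", "blobs", blob_hash)
os.symlink(target, os.path.join(snapshot_dir, "README.md"))
```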

### In practice

In practice, it should look like the following tree in your cache:

```
[ 96] .
└── [ 160] models--julien-c--EsperBERTo-small
├── [ 160] blobs
│ ├── [321M] 403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
│ ├── [ 398] 7cb18dc9bafbfcf74629a4b760af1b160957a83e
│ └── [1.4K] d7edf6bd2a681fb0175f7735299831ee1b22b812
├── [ 96] refs
│ └── [ 40] main
└── [ 128] snapshots
├── [ 128] 2439f60ef33a0d46d85da5001d52aeda5b00ce9f
│ ├── [ 52] README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
│ └── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
└── [ 128] bbc77c8132af1cc5cf678da3f1ddf2de43606d48
├── [ 52] README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
└── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
```
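
Given that tree, a download call returns a path inside a snapshot folder, never inside `blobs` directly; as a sketch:

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="julien-c/EsperBERTo-small", filename="README.md")
# e.g. <CACHE_DIR>/models--julien-c--EsperBERTo-small/snapshots/<commit-hash>/README.md
print(path)
```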
3 changes: 2 additions & 1 deletion setup.py
@@ -58,7 +58,8 @@ def get_version() -> str:
author="Hugging Face, Inc.",
author_email="julien@huggingface.co",
description=(
"Client library to download and publish models on the huggingface.co hub"
"Client library to download and publish models, datasets and other repos on the"
" huggingface.co hub"
),
long_description=open("README.md", "r", encoding="utf-8").read(),
long_description_content_type="text/markdown",
215 changes: 93 additions & 122 deletions src/huggingface_hub/_snapshot_download.py
@@ -1,29 +1,50 @@
import os
from fnmatch import fnmatch
from glob import glob
from pathlib import Path
from typing import Dict, List, Optional, Union

from .constants import DEFAULT_REVISION, HUGGINGFACE_HUB_CACHE
from .file_download import cached_download, hf_hub_url
from .constants import DEFAULT_REVISION, HUGGINGFACE_HUB_CACHE, REPO_TYPES
from .file_download import REGEX_COMMIT_HASH, hf_hub_download, repo_folder_name
from .hf_api import HfApi, HfFolder
from .utils import logging
from .utils._deprecation import _deprecate_positional_args


REPO_ID_SEPARATOR = "--"
# ^ this substring is not allowed in repo_ids on hf.co
# and is the canonical one we use for serialization of repo ids elsewhere.
logger = logging.get_logger(__name__)


logger = logging.get_logger(__name__)
def _filter_repo_files(
*,
repo_files: List[str],
allow_regex: Optional[Union[List[str], str]] = None,
ignore_regex: Optional[Union[List[str], str]] = None,
) -> List[str]:
allow_regex = [allow_regex] if isinstance(allow_regex, str) else allow_regex
ignore_regex = [ignore_regex] if isinstance(ignore_regex, str) else ignore_regex
filtered_files = []
for repo_file in repo_files:
# if there's an allowlist, skip download if file does not match any regex
if allow_regex is not None and not any(
fnmatch(repo_file, r) for r in allow_regex
):
continue

# if there's a denylist, skip download if file matches any regex
if ignore_regex is not None and any(
fnmatch(repo_file, r) for r in ignore_regex
):
continue

filtered_files.append(repo_file)
return filtered_files


@_deprecate_positional_args
def snapshot_download(
repo_id: str,
*,
revision: Optional[str] = None,
repo_type: Optional[str] = None,
cache_dir: Union[str, Path, None] = None,
library_name: Optional[str] = None,
library_version: Optional[str] = None,
@@ -52,6 +73,9 @@ def snapshot_download(
revision (`str`, *optional*):
An optional Git revision id which can be a branch name, a tag, or a
commit hash.
repo_type (`str`, *optional*):
Set to `"dataset"` or `"space"` if downloading from a dataset or space,
`None` or `"model"` if downloading from a model. Default is `None`.
cache_dir (`str`, `Path`, *optional*):
Path to the folder where cached files are stored.
library_name (`str`, *optional*):
@@ -97,9 +121,6 @@ def snapshot_download(
</Tip>
"""
# Note: at some point maybe this format of storage should actually replace
# the flat storage structure we've used so far (initially from allennlp
# if I remember correctly).

if cache_dir is None:
cache_dir = HUGGINGFACE_HUB_CACHE
@@ -120,133 +141,83 @@
else:
token = None

# remove all `/` occurrences to correctly convert repo to directory name
repo_id_flattened = repo_id.replace("/", REPO_ID_SEPARATOR)

# if we have no internet connection we will look for the
# last modified folder in the cache
if local_files_only:
# possible repos have <path/to/cache_dir>/<flatten_repo_id> prefix
repo_folders_prefix = os.path.join(cache_dir, repo_id_flattened)

# list all possible folders that can correspond to the repo_id
# and are of the format <flattened-repo-id>.<revision>.<commit-sha>
# now let's list all cached repos that have to be included in the revision.
# There are 3 cases that we have to consider.

# 1) cached repos of format <repo_id>.{revision}.<any-hash>
# -> in this case {revision} has to be a branch
repo_folders_branch = glob(repo_folders_prefix + "." + revision + ".*")

# 2) cached repos of format <repo_id>.{revision}
# -> in this case {revision} has to be a commit sha
repo_folders_commit_only = glob(repo_folders_prefix + "." + revision)

# 3) cached repos of format <repo_id>.<any-branch>.{revision}
# -> in this case {revision} also has to be a commit sha
repo_folders_branch_commit = glob(repo_folders_prefix + ".*." + revision)

# combine all possible fetched cached repos
repo_folders = (
repo_folders_branch + repo_folders_commit_only + repo_folders_branch_commit
if repo_type is None:
repo_type = "model"
if repo_type not in REPO_TYPES:
raise ValueError(
f"Invalid repo type: {repo_type}. Accepted repo types are:"
f" {str(REPO_TYPES)}"
)

if len(repo_folders) == 0:
raise ValueError(
"Cannot find the requested files in the cached path and outgoing"
" traffic has been disabled. To enable model look-ups and downloads"
" online, set 'local_files_only' to False."
)
storage_folder = os.path.join(
cache_dir, repo_folder_name(repo_id=repo_id, repo_type=repo_type)
)

# check if repo id was previously cached from a commit sha revision
# and passed {revision} is not a commit sha
# in this case snapshotting repos locally might lead to unexpected
# behavior the user should be warned about

# get all folders that were cached with just a sha commit revision
all_repo_folders_from_sha = set(glob(repo_folders_prefix + ".*")) - set(
glob(repo_folders_prefix + ".*.*")
)
# 1) is there any repo id that was previously cached from a commit sha?
has_a_sha_revision_been_cached = len(all_repo_folders_from_sha) > 0
# 2) is the passed {revision} is a branch
is_revision_a_branch = (
len(repo_folders_commit_only + repo_folders_branch_commit) == 0
)

if has_a_sha_revision_been_cached and is_revision_a_branch:
# -> in this case let's warn the user
logger.warn(
f"The repo {repo_id} was previously downloaded from a commit hash"
" revision and has created the following cached directories"
f" {all_repo_folders_from_sha}. In this case, trying to load a repo"
f" from the branch {revision} in offline mode might lead to unexpected"
" behavior by not taking into account the latest commits."
)

# find last modified folder
storage_folder = max(repo_folders, key=os.path.getmtime)

# get commit sha
repo_id_sha = storage_folder.split(".")[-1]
model_files = os.listdir(storage_folder)
else:
# if we have internet connection we retrieve the correct folder name from the huggingface api
_api = HfApi()
model_info = _api.model_info(repo_id=repo_id, revision=revision, token=token)

storage_folder = os.path.join(cache_dir, repo_id_flattened + "." + revision)

# if passed revision is not identical to the commit sha
# then revision has to be a branch name, e.g. "main"
# in this case make sure that the branch name is included
# cached storage folder name
if revision != model_info.sha:
storage_folder += f".{model_info.sha}"

repo_id_sha = model_info.sha
model_files = [f.rfilename for f in model_info.siblings]

allow_regex = [allow_regex] if isinstance(allow_regex, str) else allow_regex
ignore_regex = [ignore_regex] if isinstance(ignore_regex, str) else ignore_regex
# if we have no internet connection we will look for an
# appropriate folder in the cache
# If the specified revision is a commit hash, look inside "snapshots".
# If the specified revision is a branch or tag, look inside "refs".
if local_files_only:

for model_file in model_files:
# if there's an allowlist, skip download if file does not match any regex
if allow_regex is not None and not any(
fnmatch(model_file, r) for r in allow_regex
):
continue
if REGEX_COMMIT_HASH.match(revision):
commit_hash = revision
else:
# retrieve commit_hash from file
ref_path = os.path.join(storage_folder, "refs", revision)
with open(ref_path) as f:
commit_hash = f.read()

# if there's a denylist, skip download if file does matches any regex
if ignore_regex is not None and any(
fnmatch(model_file, r) for r in ignore_regex
):
continue
snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)

url = hf_hub_url(repo_id, filename=model_file, revision=repo_id_sha)
relative_filepath = os.path.join(*model_file.split("/"))
if os.path.exists(snapshot_folder):
return snapshot_folder

# Create potential nested dir
nested_dirname = os.path.dirname(
os.path.join(storage_folder, relative_filepath)
raise ValueError(
"Cannot find an appropriate cached snapshot folder for the specified"
" revision on the local disk and outgoing traffic has been disabled. To"
" enable repo look-ups and downloads online, set 'local_files_only' to"
" False."
)
os.makedirs(nested_dirname, exist_ok=True)

path = cached_download(
url,
cache_dir=storage_folder,
force_filename=relative_filepath,
# if we have internet connection we retrieve the correct folder name from the huggingface api
_api = HfApi()
repo_info = _api.repo_info(
repo_id=repo_id, repo_type=repo_type, revision=revision, token=token
)
filtered_repo_files = _filter_repo_files(
repo_files=[f.rfilename for f in repo_info.siblings],
allow_regex=allow_regex,
ignore_regex=ignore_regex,
)
commit_hash = repo_info.sha
snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
# if passed revision is not identical to commit_hash
# then revision has to be a branch name or tag name.
# In that case store a ref.
if revision != commit_hash:
ref_path = os.path.join(storage_folder, "refs", revision)
os.makedirs(os.path.dirname(ref_path), exist_ok=True)
with open(ref_path, "w") as f:
f.write(commit_hash)

# we pass the commit_hash to hf_hub_download
# so no network call happens if we already
# have the file locally.

for repo_file in filtered_repo_files:
_ = hf_hub_download(
repo_id,
filename=repo_file,
repo_type=repo_type,
revision=commit_hash,
cache_dir=cache_dir,
library_name=library_name,
library_version=library_version,
user_agent=user_agent,
proxies=proxies,
etag_timeout=etag_timeout,
resume_download=resume_download,
use_auth_token=use_auth_token,
local_files_only=local_files_only,
)

if os.path.exists(path + ".lock"):
os.remove(path + ".lock")

return storage_folder
return snapshot_folder
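
For reference, a usage sketch of the rewritten `snapshot_download` (repo and patterns are illustrative; `allow_regex`/`ignore_regex` take `fnmatch`-style patterns, as `_filter_repo_files` above shows):

```python
from huggingface_hub import snapshot_download

# Download a full snapshot of a dataset repo, keeping only JSON files.
folder = snapshot_download(
    "glue",
    repo_type="dataset",
    allow_regex="*.json",
)
print(folder)  # .../datasets--glue/snapshots/<commit-hash>
```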