Commit: New git-aware cache file layout (#801)

* light typing

(cherry picked from commit b2c8f9b970c505cdf2c685e645e9e36cc472b0d3)

* remove this seminal comment

(cherry picked from commit 12a841a605c94733154f3b22e812c0f5e69ef37b)

* I don't understand why we don't early return here

cc @patrickvonplaten care to take a look? cc @LysandreJik

(cherry picked from commit 259ab36f03ab3eed6eeb4fc4984bc259619b442f)

* following last commit, unnest this

(cherry picked from commit 54957f3f049d887af21dd8f6950873a2823c4247)

* [BIG] This should work for all repo_types not just models!

(cherry picked from commit 9a3f96ccb2de6663cf4cf2d9a60dd7f415227c1b)

* one more

(cherry picked from commit b74871250616c44a2125b26d5de29b1189e82e12)

* forgot a repo_type and reorder code

(cherry picked from commit 3ef7d79a44087e971e10e35d3b9f5bea3474f297)

* also rename this cache folder

(cherry picked from commit 4c518b861723a6d28d59108403c37edf5208f2fe)

* Use `hf_hub_download`, will be simpler later

(cherry picked from commit c7478d58fe62da02625b8ca17796ad1419a048b1)

* in this new version, `force_filename` does not make sense anymore

(cherry picked from commit 9a674bc795d5c8a26aecf5429d391fff92e47e8d)

* Just inline everything inside `hf_hub_download` for now

(cherry picked from commit ee49f8f57ba4e7e66f237df8f64c804862fe3ee8)

* Big prototype! it works! 🎉

(cherry picked from commit 7fe19ec66a2c5a7386a956cb9b65616cb209608a)

* wip wip

* do not touch `cached_download`

* Prompt user to upgrade to `hf_hub_download`

* Add a `legacy_cache_layout=True` to preserve old behavior, just in case

* Create `relative symlinks` + add some doc

* Fix behavior when no network

* This test now is legacy

* Fix-ish conflict-ish

* minimize diff

* refactor `repo_folder_name`

* windows support + shortcut if user passes a commit hash

* Rewrite `snapshot_download` and make it more robust

* OOops

* Create example-transformers-tf.py

* Fix + add a way more complete example (running on Ubuntu)

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/huggingface_hub/file_download.py

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* Update src/huggingface_hub/file_download.py

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* Only allow full revision hashes otherwise the `revision != commit_hash` test is not reliable

* add a little bit more doc + consistency

* Update src/huggingface_hub/snapshot_download.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update snapshot download

* First pass on tests

* Wrap up tests

* 🐺 Fix for bug reported by @thomwolf

see huggingface/huggingface_hub#801 (comment)

* Special case for Windows

* Address comments and docs

* Clean up with ternary cc @julien-c

* Add argument to `cached_download`

* Opt-in for filename_to_url

* Opt-in for filename_to_url

* Pass the flag

* Update docs/source/package_reference/file_download.mdx

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/huggingface_hub/file_download.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Address review comments

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
5 people committed May 25, 2022
1 parent 9b8dc7a commit eb05af7
Showing 10 changed files with 1,037 additions and 173 deletions.
106 changes: 106 additions & 0 deletions docs/source/package_reference/file_download.mdx
@@ -8,3 +8,109 @@

[[autodoc]] huggingface_hub.hf_hub_url

## Caching

The methods displayed above are designed to work with a caching system that prevents re-downloading files.
The caching system was updated in v0.8.0 so that the directory structure and downloaded files can be
shared across libraries that depend on the Hub.

The caching system is designed as follows:

```
<CACHE_DIR>
├─ <MODELS>
├─ <DATASETS>
├─ <SPACES>
```

The `<CACHE_DIR>` usually lives in your user's home directory (for example under `~/.cache/huggingface`).
It is customizable with the `cache_dir` argument on all methods, or by specifying the `HF_HOME` environment variable.
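
For example, a minimal sketch of pointing the cache at a custom location (the path below is hypothetical; any writable folder works):

```python
from huggingface_hub import hf_hub_download

# Files are cached under /tmp/my_hub_cache instead of the default
# location, using the same layout described below.
path = hf_hub_download(
    repo_id="bert-base-cased",
    filename="config.json",
    cache_dir="/tmp/my_hub_cache",
)
print(path)
```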

Models, datasets and spaces share a common root. Each of these repositories contains the namespace
(organization, username) if it exists, alongside the repository name:

```
<CACHE_DIR>
├─ models--julien-c--EsperBERTo-small
├─ models--lysandrejik--arxiv-nlp
├─ models--bert-base-cased
├─ datasets--glue
├─ datasets--huggingface--DataMeasurementsFiles
├─ spaces--dalle-mini--dalle-mini
```
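
The folder names above follow a simple serialization scheme. Here is a sketch of the `repo_folder_name` helper this PR introduces (assuming the `--` separator shown above; the actual implementation may differ in details):

```python
REPO_ID_SEPARATOR = "--"  # this substring is not allowed in repo_ids on hf.co

def repo_folder_name(*, repo_id: str, repo_type: str) -> str:
    # "model" + "julien-c/EsperBERTo-small" -> "models--julien-c--EsperBERTo-small"
    parts = [f"{repo_type}s", *repo_id.split("/")]
    return REPO_ID_SEPARATOR.join(parts)
```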

It is within these folders that all files will now be downloaded from the Hub. Caching ensures that
a file isn't downloaded twice if it already exists and wasn't updated; if it was updated and you ask
for the latest file, the latest file is downloaded, while the previous file is kept intact in case
you need it again.
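
As a sketch of the expected behavior (assuming no new commit lands between the two calls):

```python
from huggingface_hub import hf_hub_download

# The first call downloads the file; the second finds it in the cache
# and returns the same path without re-downloading.
first = hf_hub_download(repo_id="bert-base-cased", filename="config.json")
second = hf_hub_download(repo_id="bert-base-cased", filename="config.json")
assert first == second
```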

In order to achieve this, all folders contain the same skeleton:

```
<CACHE_DIR>
├─ datasets--glue
│ ├─ refs
│ ├─ blobs
│ ├─ snapshots
...
```

Each folder is designed to contain the following:

### Refs

The `refs` folder contains files that indicate the latest revision of a given reference. For example,
if we have previously fetched a file from the `main` branch of a repository, the `refs`
folder will contain a file named `main`, which itself contains the commit identifier of the current head.

If the latest commit of `main` has `aaaaaa` as its identifier, the `refs/main` file will contain `aaaaaa`.

If that same branch gets updated with a new commit that has `bbbbbb` as its identifier, then
re-downloading a file from that reference will update the `refs/main` file to contain `bbbbbb`.
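
This is also how the offline code path resolves a branch name to a commit hash; a hand-rolled equivalent of what the library does internally might look like this (the cache path is an assumption):

```python
import os

# Assumed default cache location; adjust if HF_HOME or cache_dir is set.
storage_folder = os.path.expanduser(
    "~/.cache/huggingface/hub/models--bert-base-cased"
)
with open(os.path.join(storage_folder, "refs", "main")) as f:
    commit_hash = f.read()  # e.g. "aaaaaa..." (a full commit hash)
```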

### Blobs

The `blobs` folder contains the actual files that we have downloaded. Each file is named after its hash.

### Snapshots

The `snapshots` folder contains symlinks to the blobs mentioned above. It is itself made up of several folders:
one per known revision!

In the explanation above, we had initially fetched a file from the `aaaaaa` revision, before fetching a file from
the `bbbbbb` revision. In this situation, we would now have two folders in the `snapshots` folder: `aaaaaa`
and `bbbbbb`.

In each of these folders live symlinks named after the files that we have downloaded. For example,
if we had downloaded the `README.md` file at revision `aaaaaa`, we would have the following path:

```
<CACHE_DIR>/<REPO_NAME>/snapshots/aaaaaa/README.md
```

That `README.md` file is actually a symlink pointing to the blob whose name is the hash of the file.

Creating the skeleton this way opens up the mechanism to file sharing: if the same file was fetched at
revision `bbbbbb`, it would have the same hash and would not need to be re-downloaded.
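
A minimal sketch of how such a relative symlink could be created (names are illustrative; the library special-cases Windows, where symlinks need extra care):

```python
import os

blob_hash = "d7edf6bd2a681fb0175f7735299831ee1b22b812"
snapshot_dir = os.path.join(
    "models--julien-c--EsperBERTo-small", "snapshots", "aaaaaa"
)
os.makedirs(snapshot_dir, exist_ok=True)

# Relative target: climb out of snapshots/<revision>/ into blobs/.
target = os.path.join("..", "..", "blobs", blob_hash)
os.symlink(target, os.path.join(snapshot_dir, "README.md"))
```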

### In practice

In practice, it should look like the following tree in your cache:

```
[ 96] .
└── [ 160] models--julien-c--EsperBERTo-small
├── [ 160] blobs
│ ├── [321M] 403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
│ ├── [ 398] 7cb18dc9bafbfcf74629a4b760af1b160957a83e
│ └── [1.4K] d7edf6bd2a681fb0175f7735299831ee1b22b812
├── [ 96] refs
│ └── [ 40] main
└── [ 128] snapshots
├── [ 128] 2439f60ef33a0d46d85da5001d52aeda5b00ce9f
│ ├── [ 52] README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
│ └── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
└── [ 128] bbc77c8132af1cc5cf678da3f1ddf2de43606d48
├── [ 52] README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
└── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
```
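
Given that tree, a download call returns a path inside a snapshot folder, never inside `blobs` directly; as a sketch:

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="julien-c/EsperBERTo-small", filename="README.md")
# e.g. <CACHE_DIR>/models--julien-c--EsperBERTo-small/snapshots/<commit-hash>/README.md
print(path)
```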
3 changes: 2 additions & 1 deletion setup.py
@@ -58,7 +58,8 @@ def get_version() -> str:
author="Hugging Face, Inc.",
author_email="julien@huggingface.co",
description=(
"Client library to download and publish models on the huggingface.co hub"
"Client library to download and publish models, datasets and other repos on the"
" huggingface.co hub"
),
long_description=open("README.md", "r", encoding="utf-8").read(),
long_description_content_type="text/markdown",
215 changes: 93 additions & 122 deletions src/huggingface_hub/_snapshot_download.py
@@ -1,29 +1,50 @@
import os
from fnmatch import fnmatch
from glob import glob
from pathlib import Path
from typing import Dict, List, Optional, Union

from .constants import DEFAULT_REVISION, HUGGINGFACE_HUB_CACHE
from .file_download import cached_download, hf_hub_url
from .constants import DEFAULT_REVISION, HUGGINGFACE_HUB_CACHE, REPO_TYPES
from .file_download import REGEX_COMMIT_HASH, hf_hub_download, repo_folder_name
from .hf_api import HfApi, HfFolder
from .utils import logging
from .utils._deprecation import _deprecate_positional_args


REPO_ID_SEPARATOR = "--"
# ^ this substring is not allowed in repo_ids on hf.co
# and is the canonical one we use for serialization of repo ids elsewhere.
logger = logging.get_logger(__name__)


logger = logging.get_logger(__name__)
def _filter_repo_files(
*,
repo_files: List[str],
allow_regex: Optional[Union[List[str], str]] = None,
ignore_regex: Optional[Union[List[str], str]] = None,
) -> List[str]:
allow_regex = [allow_regex] if isinstance(allow_regex, str) else allow_regex
ignore_regex = [ignore_regex] if isinstance(ignore_regex, str) else ignore_regex
filtered_files = []
for repo_file in repo_files:
# if there's an allowlist, skip download if file does not match any regex
if allow_regex is not None and not any(
fnmatch(repo_file, r) for r in allow_regex
):
continue

# if there's a denylist, skip download if file matches any regex
if ignore_regex is not None and any(
fnmatch(repo_file, r) for r in ignore_regex
):
continue

filtered_files.append(repo_file)
return filtered_files


@_deprecate_positional_args
def snapshot_download(
repo_id: str,
*,
revision: Optional[str] = None,
repo_type: Optional[str] = None,
cache_dir: Union[str, Path, None] = None,
library_name: Optional[str] = None,
library_version: Optional[str] = None,
@@ -52,6 +73,9 @@ def snapshot_download(
revision (`str`, *optional*):
An optional Git revision id which can be a branch name, a tag, or a
commit hash.
repo_type (`str`, *optional*):
Set to `"dataset"` or `"space"` if downloading from a dataset or space,
`None` or `"model"` if downloading from a model. Default is `None`.
cache_dir (`str`, `Path`, *optional*):
Path to the folder where cached files are stored.
library_name (`str`, *optional*):
@@ -97,9 +121,6 @@ def snapshot_download(
</Tip>
"""
# Note: at some point maybe this format of storage should actually replace
# the flat storage structure we've used so far (initially from allennlp
# if I remember correctly).

if cache_dir is None:
cache_dir = HUGGINGFACE_HUB_CACHE
@@ -120,133 +141,83 @@
else:
token = None

# remove all `/` occurrences to correctly convert repo to directory name
repo_id_flattened = repo_id.replace("/", REPO_ID_SEPARATOR)

# if we have no internet connection we will look for the
# last modified folder in the cache
if local_files_only:
# possible repos have <path/to/cache_dir>/<flatten_repo_id> prefix
repo_folders_prefix = os.path.join(cache_dir, repo_id_flattened)

# list all possible folders that can correspond to the repo_id
# and are of the format <flattened-repo-id>.<revision>.<commit-sha>
# now let's list all cached repos that have to be included in the revision.
# There are 3 cases that we have to consider.

# 1) cached repos of format <repo_id>.{revision}.<any-hash>
# -> in this case {revision} has to be a branch
repo_folders_branch = glob(repo_folders_prefix + "." + revision + ".*")

# 2) cached repos of format <repo_id>.{revision}
# -> in this case {revision} has to be a commit sha
repo_folders_commit_only = glob(repo_folders_prefix + "." + revision)

# 3) cached repos of format <repo_id>.<any-branch>.{revision}
# -> in this case {revision} also has to be a commit sha
repo_folders_branch_commit = glob(repo_folders_prefix + ".*." + revision)

# combine all possible fetched cached repos
repo_folders = (
repo_folders_branch + repo_folders_commit_only + repo_folders_branch_commit
if repo_type is None:
repo_type = "model"
if repo_type not in REPO_TYPES:
raise ValueError(
f"Invalid repo type: {repo_type}. Accepted repo types are:"
f" {str(REPO_TYPES)}"
)

if len(repo_folders) == 0:
raise ValueError(
"Cannot find the requested files in the cached path and outgoing"
" traffic has been disabled. To enable model look-ups and downloads"
" online, set 'local_files_only' to False."
)
storage_folder = os.path.join(
cache_dir, repo_folder_name(repo_id=repo_id, repo_type=repo_type)
)

# check if repo id was previously cached from a commit sha revision
# and passed {revision} is not a commit sha
# in this case snapshotting repos locally might lead to unexpected
# behavior the user should be warned about

# get all folders that were cached with just a sha commit revision
all_repo_folders_from_sha = set(glob(repo_folders_prefix + ".*")) - set(
glob(repo_folders_prefix + ".*.*")
)
# 1) is there any repo id that was previously cached from a commit sha?
has_a_sha_revision_been_cached = len(all_repo_folders_from_sha) > 0
# 2) is the passed {revision} is a branch
is_revision_a_branch = (
len(repo_folders_commit_only + repo_folders_branch_commit) == 0
)

if has_a_sha_revision_been_cached and is_revision_a_branch:
# -> in this case let's warn the user
logger.warn(
f"The repo {repo_id} was previously downloaded from a commit hash"
" revision and has created the following cached directories"
f" {all_repo_folders_from_sha}. In this case, trying to load a repo"
f" from the branch {revision} in offline mode might lead to unexpected"
" behavior by not taking into account the latest commits."
)

# find last modified folder
storage_folder = max(repo_folders, key=os.path.getmtime)

# get commit sha
repo_id_sha = storage_folder.split(".")[-1]
model_files = os.listdir(storage_folder)
else:
# if we have internet connection we retrieve the correct folder name from the huggingface api
_api = HfApi()
model_info = _api.model_info(repo_id=repo_id, revision=revision, token=token)

storage_folder = os.path.join(cache_dir, repo_id_flattened + "." + revision)

# if passed revision is not identical to the commit sha
# then revision has to be a branch name, e.g. "main"
# in this case make sure that the branch name is included
# cached storage folder name
if revision != model_info.sha:
storage_folder += f".{model_info.sha}"

repo_id_sha = model_info.sha
model_files = [f.rfilename for f in model_info.siblings]

allow_regex = [allow_regex] if isinstance(allow_regex, str) else allow_regex
ignore_regex = [ignore_regex] if isinstance(ignore_regex, str) else ignore_regex
# if we have no internet connection we will look for an
# appropriate folder in the cache
# If the specified revision is a commit hash, look inside "snapshots".
# If the specified revision is a branch or tag, look inside "refs".
if local_files_only:

for model_file in model_files:
# if there's an allowlist, skip download if file does not match any regex
if allow_regex is not None and not any(
fnmatch(model_file, r) for r in allow_regex
):
continue
if REGEX_COMMIT_HASH.match(revision):
commit_hash = revision
else:
# retrieve commit_hash from file
ref_path = os.path.join(storage_folder, "refs", revision)
with open(ref_path) as f:
commit_hash = f.read()

# if there's a denylist, skip download if file does matches any regex
if ignore_regex is not None and any(
fnmatch(model_file, r) for r in ignore_regex
):
continue
snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)

url = hf_hub_url(repo_id, filename=model_file, revision=repo_id_sha)
relative_filepath = os.path.join(*model_file.split("/"))
if os.path.exists(snapshot_folder):
return snapshot_folder

# Create potential nested dir
nested_dirname = os.path.dirname(
os.path.join(storage_folder, relative_filepath)
raise ValueError(
"Cannot find an appropriate cached snapshot folder for the specified"
" revision on the local disk and outgoing traffic has been disabled. To"
" enable repo look-ups and downloads online, set 'local_files_only' to"
" False."
)
os.makedirs(nested_dirname, exist_ok=True)

path = cached_download(
url,
cache_dir=storage_folder,
force_filename=relative_filepath,
# if we have internet connection we retrieve the correct folder name from the huggingface api
_api = HfApi()
repo_info = _api.repo_info(
repo_id=repo_id, repo_type=repo_type, revision=revision, token=token
)
filtered_repo_files = _filter_repo_files(
repo_files=[f.rfilename for f in repo_info.siblings],
allow_regex=allow_regex,
ignore_regex=ignore_regex,
)
commit_hash = repo_info.sha
snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
# if passed revision is not identical to commit_hash
# then revision has to be a branch name or tag name.
# In that case store a ref.
if revision != commit_hash:
ref_path = os.path.join(storage_folder, "refs", revision)
os.makedirs(os.path.dirname(ref_path), exist_ok=True)
with open(ref_path, "w") as f:
f.write(commit_hash)

# we pass the commit_hash to hf_hub_download
# so no network call happens if we already
# have the file locally.

for repo_file in filtered_repo_files:
_ = hf_hub_download(
repo_id,
filename=repo_file,
repo_type=repo_type,
revision=commit_hash,
cache_dir=cache_dir,
library_name=library_name,
library_version=library_version,
user_agent=user_agent,
proxies=proxies,
etag_timeout=etag_timeout,
resume_download=resume_download,
use_auth_token=use_auth_token,
local_files_only=local_files_only,
)

if os.path.exists(path + ".lock"):
os.remove(path + ".lock")

return storage_folder
return snapshot_folder
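
For reference, a usage sketch of the rewritten `snapshot_download` (repo and patterns are illustrative; `allow_regex`/`ignore_regex` take `fnmatch`-style patterns, as `_filter_repo_files` above shows):

```python
from huggingface_hub import snapshot_download

# Download a full snapshot of a dataset repo, keeping only JSON files.
folder = snapshot_download(
    "glue",
    repo_type="dataset",
    allow_regex="*.json",
)
print(folder)  # .../datasets--glue/snapshots/<commit-hash>
```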