New git-aware cache file layout #801

julien-c · 2022-03-25T17:53:22Z

A new way to lay out cached files on disk, unifying what we do for `snapshot_download` and `cached_download`, and laying the groundwork for new features (such as the ability to query the disk to figure out which local files/revisions we have).

See old comment from last year inside snapshot_download:

# Note: at some point maybe this format of storage should actually replace
# the flat storage structure we've used so far (initially from allennlp
# if I remember correctly).

This new layout will be git-aware (compatible with versioning)

One picture is worth 1,000 words

Here's a screenshot of the cache's file tree generated by a few file downloads from a model repo (sample code below)

import torch
from huggingface_hub.file_download import hf_hub_download

OLDER_REVISION = "bbc77c8132af1cc5cf678da3f1ddf2de43606d48"

hf_hub_download("julien-c/EsperBERTo-small", filename="README.md")

hf_hub_download("julien-c/EsperBERTo-small", filename="pytorch_model.bin")

hf_hub_download(
    "julien-c/EsperBERTo-small", filename="README.md", revision=OLDER_REVISION
)

weights_file = hf_hub_download(
    "julien-c/EsperBERTo-small", filename="pytorch_model.bin", revision=OLDER_REVISION
)

w = torch.load(weights_file, map_location=torch.device("cpu"))
# Yay it works! just loaded a torch file from a symlink

Preliminary mini-spec

The cache directory contains one subfolder per repo_id (namespaced by repo type).
Inside each repo folder:
- refs holds the latest known revision => commit_hash pairs
- blobs contains the actual file blobs (identified by their git-sha or sha256, depending on whether they're LFS files or not)
- snapshots contains one subfolder per commit, each "commit" contains the subset of the files that have been resolved at that particular commit. Each filename is a symlink to the blob at that particular commit.
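To make the mini-spec concrete, here is a minimal sketch that builds a toy repo folder with the three subfolders on disk. It is only an illustration under stated assumptions: the `models--julien-c--EsperBERTo-small` folder name, the placeholder commit hash and the placeholder blob id are not taken from the PR, and `build_toy_cache` is a hypothetical helper, not a library function.

```python
import os
import tempfile


def build_toy_cache(cache_dir: str) -> str:
    """Create a miniature repo folder with refs/, blobs/ and snapshots/."""
    # Assumption: this folder name is illustrative only.
    repo_dir = os.path.join(cache_dir, "models--julien-c--EsperBERTo-small")
    commit_hash = "a" * 40  # placeholder commit hash
    blob_id = "0123abcd"  # placeholder; a real blob is named after its git-sha or sha256

    for sub in ("refs", "blobs", os.path.join("snapshots", commit_hash)):
        os.makedirs(os.path.join(repo_dir, sub), exist_ok=True)

    # refs/<revision> stores the commit hash that the revision last resolved to.
    with open(os.path.join(repo_dir, "refs", "main"), "w") as f:
        f.write(commit_hash)

    # blobs/<blob_id> holds the actual file content.
    blob_path = os.path.join(repo_dir, "blobs", blob_id)
    with open(blob_path, "w") as f:
        f.write("# README\n")

    # snapshots/<commit_hash>/<filename> is a relative symlink to the blob.
    link_path = os.path.join(repo_dir, "snapshots", commit_hash, "README.md")
    os.symlink(os.path.relpath(blob_path, os.path.dirname(link_path)), link_path)
    return repo_dir


with tempfile.TemporaryDirectory() as tmp:
    repo_dir = build_toy_cache(tmp)
    # Print every file in the toy repo folder, relative to its root.
    for root, _dirs, files in sorted(os.walk(repo_dir)):
        for name in sorted(files):
            print(os.path.relpath(os.path.join(root, name), repo_dir))
```

Creating symlinks may require extra privileges on Windows, so this sketch is best tried on Linux or macOS.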

Explaining the result

In the sample code I'm downloading the same two filenames at two different commits.

The README.md differs between the two commits, so both versions of it are stored in blobs.
The pytorch_model.bin however, is the same at both commits, so it's not downloaded again, and the disk space is shared between the two snapshots. 🎉 🎉
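The sharing described above can be sketched with a small content-addressed store. This is a simplified illustration, not the PR's implementation: `store_file` is a hypothetical helper, the commit names are placeholders, and every blob is keyed by sha256 here, whereas the spec distinguishes git-sha and sha256.

```python
import hashlib
import os
import tempfile


def store_file(repo_dir: str, commit_hash: str, filename: str, content: bytes) -> bool:
    """Link `filename` into a snapshot; return True if a new blob was written."""
    # Assumption: every blob is keyed by sha256 here, for simplicity.
    blob_id = hashlib.sha256(content).hexdigest()
    blob_path = os.path.join(repo_dir, "blobs", blob_id)
    os.makedirs(os.path.dirname(blob_path), exist_ok=True)

    # Only write the blob if identical content is not already stored.
    newly_written = not os.path.exists(blob_path)
    if newly_written:
        with open(blob_path, "wb") as f:
            f.write(content)

    link_path = os.path.join(repo_dir, "snapshots", commit_hash, filename)
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    if not os.path.lexists(link_path):
        os.symlink(os.path.relpath(blob_path, os.path.dirname(link_path)), link_path)
    return newly_written


with tempfile.TemporaryDirectory() as repo_dir:
    weights = b"same weights at both commits"
    store_file(repo_dir, "commit_a", "README.md", b"old readme")
    store_file(repo_dir, "commit_b", "README.md", b"new readme")
    store_file(repo_dir, "commit_a", "pytorch_model.bin", weights)
    reused = not store_file(repo_dir, "commit_b", "pytorch_model.bin", weights)
    print("blobs on disk:", len(os.listdir(os.path.join(repo_dir, "blobs"))))
    print("weights blob reused:", reused)
```

Four snapshot entries end up backed by three blobs: two README versions and one shared weights file.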

julien-c · 2022-03-28T09:29:53Z

src/huggingface_hub/snapshot_download.py

+    repo_info = _api.repo_info(
+        repo_id=repo_id, repo_type=repo_type, revision=revision, token=token
+    )


This uses the new repo_info() function from #792

julien-c · 2022-04-04T06:28:40Z

Some additional context for this is in huggingface/transformers#15927 (comment)

julien-c · 2022-05-04T14:24:58Z

Ok I tentatively rebased this on main, let's see if it still works

HuggingFaceDocBuilderDev · 2022-05-05T13:19:02Z

The documentation is not available anymore as the PR was closed or merged.

julien-c · 2022-05-05T15:15:36Z

Ok this should be ready for a deeper review @LysandreJik 🎉

snapshot_download is way simpler now and only makes one networking call if you already have the repo snapshot locally 🎉

Also cc @patrickvonplaten and @sgugger

patrickvonplaten · 2022-05-05T17:26:20Z

Wow this is super cool! No more re-downloading all files when just the README is changed ❤️

patrickvonplaten · 2022-05-05T17:29:43Z

src/huggingface_hub/snapshot_download.py

-        # possible repos have <path/to/cache_dir>/<flatten_repo_id> prefix
-        repo_folders_prefix = os.path.join(cache_dir, repo_id_flattened)
-
-        # list all possible folders that can correspond to the repo_id


All that beautiful code is gone 😢

src/huggingface_hub/snapshot_download.py

patrickvonplaten · 2022-05-05T17:59:59Z

src/huggingface_hub/snapshot_download.py

+        repo_id=repo_id, repo_type=repo_type, revision=revision, token=token
+    )
+    filtered_repo_files = _filter_repo_files(
+        repo_files=[f.rfilename for f in repo_info.siblings],


would this also work with three directory folder layers deep?

what do you mean?

it should support arbitrary nesting, but i haven't extensively tested yet :)
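For the nesting question, one way to see why depth should not matter is that a relative symlink target can be computed from wherever the link lives. The sketch below is an assumption about how this could be done with `os.path.relpath`; `link_nested_file` and the file names are hypothetical and not the PR's code.

```python
import os
import tempfile


def link_nested_file(snapshot_dir: str, blobs_dir: str, filename: str, blob_id: str) -> str:
    """Symlink snapshot_dir/<filename> to blobs_dir/<blob_id>, at any depth."""
    link_path = os.path.join(snapshot_dir, *filename.split("/"))
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    # The relative target gains one ".." component per extra directory level.
    target = os.path.relpath(os.path.join(blobs_dir, blob_id), os.path.dirname(link_path))
    os.symlink(target, link_path)
    return target


with tempfile.TemporaryDirectory() as root:
    blobs_dir = os.path.join(root, "blobs")
    snapshot_dir = os.path.join(root, "snapshots", "abc")
    os.makedirs(blobs_dir)
    with open(os.path.join(blobs_dir, "blob1"), "w") as f:
        f.write("nested content")

    # Three directory levels deep inside the snapshot.
    print(link_nested_file(snapshot_dir, blobs_dir, "a/b/c/config.json", "blob1"))
    with open(os.path.join(snapshot_dir, "a", "b", "c", "config.json")) as f:
        print(f.read())
```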

src/huggingface_hub/snapshot_download.py

src/huggingface_hub/file_download.py

patrickvonplaten

Super excited about this feature. We can advertise it well for all Wav2Vec2 + ngram !

Left only some nits, can all be disregarded

sgugger

Super nice new system! Can't wait to be able to use in Transformers :-)

sgugger · 2022-05-05T18:48:08Z

src/huggingface_hub/file_download.py

+        except (requests.exceptions.SSLError, requests.exceptions.ProxyError):
+            # Actually raise for those subclasses of ConnectionError
+            raise


On the Transformers side, we also raise the RepositoryNotFoundError, EntryNotFoundError and RevisionNotFoundError here (which are raised by some code in the try block). These are different from connection/timeout errors and give explicit information to the user about what's wrong, so it would be great to integrate those here too.

(Can go in a separate PR if it makes more sense.)

yes 👍 on adding them

No strong opinion on whether we want to add them here or in another PR. I'd say another PR given we might want to also update cached_download, which this PR does not touch. WDYT @LysandreJik?

Taking care of that in #878
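The pattern under discussion, re-raising specific errors while treating generic connectivity failures as a cue to fall back on the cache, can be sketched as follows. The exception classes and `fetch_etag` are hypothetical stand-ins defined only for this example; the real code uses `requests` exceptions and the library's own error types.

```python
from typing import Callable, Optional


class ExplicitHubError(Exception):
    """Hypothetical stand-in for errors like a missing repo, entry or revision."""


class SSLLikeError(ConnectionError):
    """Hypothetical stand-in for a ConnectionError subclass that must surface."""


def fetch_etag(request_etag: Callable[[], str]) -> Optional[str]:
    """Return the remote etag, or None when the network is simply unreachable."""
    try:
        return request_etag()
    except (SSLLikeError, ExplicitHubError):
        # Specific, actionable failures: re-raise so the user sees them.
        raise
    except ConnectionError:
        # Generic connectivity problem: signal the caller to try the local cache.
        return None
```

The order of the `except` clauses matters: the subclass must be listed before its `ConnectionError` parent, otherwise it would be swallowed by the generic handler.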

src/huggingface_hub/file_download.py

LysandreJik

Thanks for working on it, this is super nice and clean.

It downloads files under ~/.cache/huggingface/hub. Is this the expected place for them to be downloaded at? Should we expect other libraries (transformers, datasets) to go tap in that folder instead of the current transformers/datasets folders?

PS: When playing around with it, I had several connection errors that I didn't seem to have previously when using snapshot_download. Might be an isolated incident as I don't think this should put any more stress on the backend (quite the contrary).

LysandreJik · 2022-05-05T15:48:25Z

src/huggingface_hub/file_download.py

+    warnings.warn(
+        "`filename_to_url` uses the legacy way cache file layout",
+        FutureWarning,
+    )


Thank you for implementing warnings!

src/huggingface_hub/file_download.py

LysandreJik · 2022-05-05T15:55:30Z

src/huggingface_hub/file_download.py

-    force_filename: Optional[str] = None,
    proxies: Optional[Dict] = None,
    etag_timeout: Optional[float] = 10,
    resume_download: Optional[bool] = False,
    use_auth_token: Union[bool, str, None] = None,
    local_files_only: Optional[bool] = False,
+    legacy_cache_layout: Optional[bool] = False,


Quick notes regarding the management of backward compatibility: you could add a **kwargs at the end that would retrieve the force_filename argument. It would still be passed to cached_download in case the user chooses legacy_cache_layout, and would otherwise print a warning that this value does not make sense with the new method.

why not, though it might be a bit superfluous because if the user adds the legacy_cache_layout=True flag to their code (which is opt-in) and they want the actual filename, they might as well just remove the flag and they get the actual filename. No?
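As a rough illustration of the `**kwargs` suggestion, the sketch below accepts `force_filename` for backward compatibility, forwards it only on the legacy path, and warns otherwise. `download_sketch` and its return values are hypothetical; this is not the signature that was merged.

```python
import warnings


def download_sketch(repo_id, filename, legacy_cache_layout=False, **kwargs):
    """Hypothetical wrapper showing one way to tolerate a removed argument."""
    force_filename = kwargs.pop("force_filename", None)
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")

    if legacy_cache_layout:
        # The legacy path would forward the value to the old download function.
        return ("legacy", force_filename or filename)

    if force_filename is not None:
        warnings.warn(
            "`force_filename` has no effect with the new cache layout",
            FutureWarning,
        )
    # With the new layout the snapshot always exposes the actual filename.
    return ("new", filename)
```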

src/huggingface_hub/snapshot_download.py

src/huggingface_hub/file_download.py

julien-c · 2022-05-06T08:15:47Z

It downloads files under ~/.cache/huggingface/hub. Is this the expected place for them to be downloaded at? Should we expect other libraries (transformers, datasets) to go tap in that folder instead of the current transformers/datasets folders?

Yes, that was my idea – or we could also just go up one level and store in ~/.cache/huggingface, considering that this is going to be library-agnostic and huggingface-wide anyways

julien-c · 2022-05-06T08:17:33Z

PS: When playing around with it, I had several connection errors that I didn't seem to have previously when using snapshot_download. Might be an isolated incident

The API calls themselves are the same as before, so it shouldn't change much – let's investigate if it continues

src/huggingface_hub/file_download.py


LysandreJik · 2022-05-24T23:34:04Z

src/huggingface_hub/file_download.py

+    warnings.warn(
+        "`cached_download` is the legacy way to download files from the HF hub, please"
+        " consider upgrading to `hf_hub_download`",
+        FutureWarning,
+    )


@julien-c, cached_download implements features that are not available with hf_hub_download (namely, caching from a URL). Is this a feature that you think should be abandoned? I think there is room for both methods to exist, but this warning says otherwise.

Is this a feature that you think should be abandoned?

I think it's a feature that does not really make sense in the context of the HF hub, and a HF Hub client library.

My concern is that if we don't add this warning, users will not upgrade to the new way.

As a middle ground, maybe we could add a legacy_cache_layout flag that just disables the warning? And mention in the warning "If you want to cache from an arbitrary URL instead, pass legacy_cache_layout=True"

WDYT?

Yes, sounds good to me. I'll proceed this way.
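The agreed middle ground could look roughly like this: the deprecation warning fires by default and is skipped when the caller opts in explicitly. `cached_download_sketch` and its return value are hypothetical and only show the control flow, not the merged implementation.

```python
import warnings


def cached_download_sketch(url: str, legacy_cache_layout: bool = False) -> str:
    """Hypothetical stand-in: warn unless the caller opted into the legacy layout."""
    if not legacy_cache_layout:
        warnings.warn(
            "`cached_download` is the legacy way to download files from the HF hub,"
            " please consider upgrading to `hf_hub_download`. If you want to cache"
            " from an arbitrary URL instead, pass legacy_cache_layout=True",
            FutureWarning,
        )
    # Placeholder for the actual download-and-cache logic.
    return f"cached:{url}"
```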

src/huggingface_hub/_snapshot_download.py

LysandreJik · 2022-05-25T12:14:51Z

src/huggingface_hub/_snapshot_download.py

-        commit_hash = revision
-        if not REGEX_COMMIT_HASH.match(commit_hash):
-            # rertieve commit_hash from file
+
+        def resolve_ref(revision) -> str:
+            # retrieve commit_hash from file
            ref_path = os.path.join(storage_folder, "refs", revision)
            with open(ref_path) as f:
-                commit_hash = f.read()
+                return f.read()

+        commit_hash = (
+            revision if REGEX_COMMIT_HASH.match(revision) else resolve_ref(revision)
+        )
        snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
+


See side by side comparison of before/after @julien-c:

After:

if local_files_only:

    def resolve_ref(revision) -> str:
        # retrieve commit_hash from file
        ref_path = os.path.join(storage_folder, "refs", revision)
        with open(ref_path) as f:
            return f.read()

    commit_hash = (
        revision if REGEX_COMMIT_HASH.match(revision) else resolve_ref(revision)
    )
    snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)

    if os.path.exists(snapshot_folder):
        return snapshot_folder

    raise ValueError(
        "Cannot find an appropriate cached snapshot folder for the specified"
        " revision on the local disk and outgoing traffic has been disabled. To"
        " enable repo look-ups and downloads online, set 'local_files_only' to"
        " False."
    )

Before:

if local_files_only:
    if REGEX_COMMIT_HASH.match(revision):
        snapshot_folder = os.path.join(storage_folder, "snapshots", revision)
        if os.path.exists(snapshot_folder):
            return snapshot_folder
    else:
        ref_path = os.path.join(storage_folder, "refs", revision)
        with open(ref_path) as f:
            commit_hash = f.read()

        snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
        if os.path.exists(snapshot_folder):
            return snapshot_folder

    raise ValueError(
        "Cannot find an appropriate cached folder for the specified revision on the"
        " local disk and outgoing traffic has been disabled. To enable repo"
        " look-ups and downloads online, set 'local_files_only' to False."
    )

Yep, looks great to me!

the same logic exists in hf_hub_download no? (not 100% sure anymore)
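A standalone version of this lookup can be useful for experimenting outside the library. The sketch below makes a few assumptions: the regex for a full 40-character hex hash is a guess at what `REGEX_COMMIT_HASH` matches, `find_local_snapshot` is a hypothetical name, and returning `None` for a missing ref (rather than raising) is a choice made for this example only.

```python
import os
import re
from typing import Optional

# Assumption: a full 40-character hex SHA-1; the real REGEX_COMMIT_HASH may differ.
REGEX_COMMIT_HASH = re.compile(r"^[0-9a-f]{40}$")


def find_local_snapshot(storage_folder: str, revision: str) -> Optional[str]:
    """Return the cached snapshot folder for `revision`, or None if absent."""
    if REGEX_COMMIT_HASH.match(revision):
        # A full commit hash names its snapshot folder directly.
        commit_hash = revision
    else:
        # Otherwise look up the branch or tag in refs/<revision>.
        ref_path = os.path.join(storage_folder, "refs", revision)
        if not os.path.isfile(ref_path):
            return None
        with open(ref_path) as f:
            commit_hash = f.read().strip()

    snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
    return snapshot_folder if os.path.isdir(snapshot_folder) else None
```

A caller in offline mode could raise its own error when the function returns `None`, mirroring the `ValueError` in the snippet above.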

sgugger

Left some nits, feel free to ignore.
Very nice tests!

docs/source/package_reference/file_download.mdx

src/huggingface_hub/_snapshot_download.py

src/huggingface_hub/file_download.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

LysandreJik · 2022-05-25T14:51:55Z

Thank you, @patrickvonplaten, @sgugger, @julien-c and @thomwolf for the reviews! I'll go ahead and merge this now so that we can test it extensively while on main to ensure nothing breaks.

@patrickvonplaten

* light typing (cherry picked from commit b2c8f9b970c505cdf2c685e645e9e36cc472b0d3)
* remove this seminal comment (cherry picked from commit 12a841a605c94733154f3b22e812c0f5e69ef37b)
* I don't understand why we don't early return here cc @patrickvonplaten care to take a look? cc @LysandreJik (cherry picked from commit 259ab36f03ab3eed6eeb4fc4984bc259619b442f)
* following last commit, unnest this (cherry picked from commit 54957f3f049d887af21dd8f6950873a2823c4247)
* [BIG] This should work for all repo_types not just models! (cherry picked from commit 9a3f96ccb2de6663cf4cf2d9a60dd7f415227c1b)
* one more (cherry picked from commit b74871250616c44a2125b26d5de29b1189e82e12)
* forgot a repo_type and reorder code (cherry picked from commit 3ef7d79a44087e971e10e35d3b9f5bea3474f297)
* also rename this cache folder (cherry picked from commit 4c518b861723a6d28d59108403c37edf5208f2fe)
* Use `hf_hub_download`, will be simpler later (cherry picked from commit c7478d58fe62da02625b8ca17796ad1419a048b1)
* in this new version, `force_filename` does not make sense anymore (cherry picked from commit 9a674bc795d5c8a26aecf5429d391fff92e47e8d)
* Just inline everything inside `hf_hub_download` for now (cherry picked from commit ee49f8f57ba4e7e66f237df8f64c804862fe3ee8)
* Big prototype! it works! 🎉 (cherry picked from commit 7fe19ec66a2c5a7386a956cb9b65616cb209608a)
* wip wip
* do not touch `cached_download`
* Prompt user to upgrade to `hf_hub_download`
* Add a `legacy_cache_layout=True` to preserve old behavior, just in case
* Create `relative symlinks` + add some doc
* Fix behavior when no network
* This test now is legacy
* Fix-ish conflict-ish
* minimize diff
* refactor `repo_folder_name`
* windows support + shortcut if user passes a commit hash
* Rewrite `snapshot_download` and make it more robust
* OOops
* Create example-transformers-tf.py
* Fix + add a way more complete example (running on Ubuntu)
* Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/huggingface_hub/file_download.py Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
* Update src/huggingface_hub/file_download.py Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
* Only allow full revision hashes otherwise the `revision != commit_hash` test is not reliable
* add a little bit more doc + consistency
* Update src/huggingface_hub/snapshot_download.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update snapshot download
* First pass on tests
* Wrap up tests
* 🐺 Fix for bug reported by @thomwolf see huggingface/huggingface_hub#801 (comment)
* Special case for Windows
* Address comments and docs
* Clean up with ternary cc @julien-c
* Add argument to `cached_download`
* Opt-in for filename_to-url
* Opt-in for filename_to-url
* Pass the flag
* Update docs/source/package_reference/file_download.mdx Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/huggingface_hub/file_download.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Address review comments

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

julien-c changed the title ~~[PoC] Prototype for a new cache file layout~~ [PoC] Prototype for a new git-aware cache file layout Mar 25, 2022

julien-c mentioned this pull request Mar 28, 2022

Allow snapshot_download to download dataset and space repos as well #739

Closed

julien-c commented Mar 28, 2022

LysandreJik self-assigned this Mar 31, 2022

LysandreJik force-pushed the hf-api-adds branch from 0616aaf to 9049f06 Compare March 31, 2022 11:42

LysandreJik force-pushed the hf-api-adds branch 2 times, most recently from a05ad04 to b0ca349 Compare April 12, 2022 11:43

Base automatically changed from hf-api-adds to main April 14, 2022 10:46

julien-c force-pushed the cache-file-layout branch from 7fe19ec to 1087e54 Compare May 4, 2022 14:24

julien-c changed the title ~~[PoC] Prototype for a new git-aware cache file layout~~ New git-aware cache file layout May 4, 2022

julien-c linked an issue May 4, 2022 that may be closed by this pull request

Allow snapshot_download to download dataset and space repos as well #739

Closed

patrickvonplaten reviewed May 5, 2022

patrickvonplaten approved these changes May 5, 2022

sgugger approved these changes May 5, 2022

LysandreJik reviewed May 5, 2022

julien-c mentioned this pull request May 6, 2022

Allow boolean values for force_filename in hf_hub_download #863

Closed

julien-c commented May 9, 2022

julien-c and others added 9 commits May 24, 2022 19:16

Only allow full revision hashes otherwise the `revision != commit_has…

2c5c94b

…h` test is not reliable

add a little bit more doc + consistency

35fff44

Update src/huggingface_hub/snapshot_download.py

383022f

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Update snapshot download

7112146

First pass on tests

8f99d22

Wrap up tests

01e3a61

🐺 Fix for bug reported by @thomwolf

2d54dd1

see #801 (comment)

Special case for Windows

8f71ad6

Address comments and docs

442b5b1

LysandreJik force-pushed the cache-file-layout branch from 3042731 to 442b5b1 Compare May 24, 2022 23:16

LysandreJik marked this pull request as ready for review May 24, 2022 23:18

LysandreJik reviewed May 24, 2022

julien-c commented May 25, 2022

Clean up with ternary cc @julien-c

e1ff4eb

LysandreJik reviewed May 25, 2022

LysandreJik added 4 commits May 25, 2022 08:30

Add argument to cached_download

e69ef4f

Opt-in for filename_to-url

0d06b55

Opt-in for filename_to-url

d6bb92e

Pass the flag

4259642

sgugger approved these changes May 25, 2022

LysandreJik and others added 3 commits May 25, 2022 10:11

Update docs/source/package_reference/file_download.mdx

a8348ab

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Update src/huggingface_hub/file_download.py

8c599ee

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Address review comments

3dfc00c

LysandreJik merged commit f124f8b into main May 25, 2022

LysandreJik deleted the cache-file-layout branch May 25, 2022 14:52

julien-c mentioned this pull request Aug 9, 2022

Add a utility to list cached things #972

Closed

Wauplin mentioned this pull request Aug 16, 2022

Enable snapshot_download for different repo types #496

Closed

Wauplin mentioned this pull request Sep 12, 2023

Keep lock files in a /locks folder to prevent rare concurrency issue #1659

Merged
