Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New git-aware cache file layout #801

Merged
merged 47 commits into from
May 25, 2022
Merged

New git-aware cache file layout #801

merged 47 commits into from
May 25, 2022

Conversation

julien-c
Copy link
Member

@julien-c julien-c commented Mar 25, 2022

A new way to layout cached files on disk, unifying what we do for snapshot_download and cached_download, and laying the ground for new features (ability to query the disk to figure out which local files/revisions we have)

See old comment from last year inside snapshot_download:

# Note: at some point maybe this format of storage should actually replace
# the flat storage structure we've used so far (initially from allennlp
# if I remember correctly).

This new layout will be git-aware (compatible with versioning)

One picture is worth 1,000 words

Here's a screenshot of the cache's file tree generated by a few file downloads from a model repo (sample code below)

image

import torch
from huggingface_hub.file_download import hf_hub_download

OLDER_REVISION = "bbc77c8132af1cc5cf678da3f1ddf2de43606d48"

hf_hub_download("julien-c/EsperBERTo-small", filename="README.md")

hf_hub_download("julien-c/EsperBERTo-small", filename="pytorch_model.bin")

hf_hub_download(
    "julien-c/EsperBERTo-small", filename="README.md", revision=OLDER_REVISION
)

weights_file = hf_hub_download(
    "julien-c/EsperBERTo-small", filename="pytorch_model.bin", revision=OLDER_REVISION
)

w = torch.load(weights_file, map_location=torch.device("cpu"))
# Yay it works! just loaded a torch file from a symlink

Preliminary mini-spec

  • The cache directory contains one subfolder per repo_id (namespaced by repo type)
  • inside each repo folder:
    • refs is a list of the latest known revision => commit_hash pairs
    • blobs contains the actual file blobs (identified by their git-sha or sha256, depending on whether they're LFS files or not)
    • snapshots contains one subfolder per commit, each "commit" contains the subset of the files that have been resolved at that particular commit. Each filename is a symlink to the blob at that particular commit.

Explaining the result

In the sample code I'm downloading the same two filenames at two different commits.

  • The README.md is different at both commits so there's both versions of it in blobs.
  • The pytorch_model.bin however, is the same at both commits, so it's not downloaded again, and the disk space is shared between the two snapshots. 🎉 🎉

@julien-c julien-c changed the title [PoC] Prototype for a new cache file layout [PoC] Prototype for a new git-aware cache file layout Mar 25, 2022
Comment on lines 197 to 208
repo_info = _api.repo_info(
repo_id=repo_id, repo_type=repo_type, revision=revision, token=token
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses the new repo_info() function from #792

@julien-c
Copy link
Member Author

julien-c commented Apr 4, 2022

Some additional context for this is in huggingface/transformers#15927 (comment)

@LysandreJik LysandreJik force-pushed the hf-api-adds branch 2 times, most recently from a05ad04 to b0ca349 Compare April 12, 2022 11:43
Base automatically changed from hf-api-adds to main April 14, 2022 10:46
@julien-c julien-c force-pushed the cache-file-layout branch from 7fe19ec to 1087e54 Compare May 4, 2022 14:24
@julien-c
Copy link
Member Author

julien-c commented May 4, 2022

Ok I tentatively rebased this on main, let's see if it still works

@julien-c julien-c changed the title [PoC] Prototype for a new git-aware cache file layout New git-aware cache file layout May 4, 2022
@julien-c julien-c linked an issue May 4, 2022 that may be closed by this pull request
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@julien-c
Copy link
Member Author

julien-c commented May 5, 2022

Ok this should be ready for a deeper review @LysandreJik 🎉

snapshot_download is way simpler now and only makes one networking call if you already have the repo snapshot locally 🎉

Also cc @patrickvonplaten and @sgugger

@patrickvonplaten
Copy link
Contributor

Wow this is super cool! No more re-downloading all files when just the README is changed ❤️

# possible repos have <path/to/cache_dir>/<flatten_repo_id> prefix
repo_folders_prefix = os.path.join(cache_dir, repo_id_flattened)

# list all possible folders that can correspond to the repo_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All that beautiful code is gone 😢

repo_id=repo_id, repo_type=repo_type, revision=revision, token=token
)
filtered_repo_files = _filter_repo_files(
repo_files=[f.rfilename for f in repo_info.siblings],
Copy link
Contributor

@patrickvonplaten patrickvonplaten May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this also work with three directory folder layers deep?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean?

it should support arbitrary nesting, but i haven't extensively tested yet :)

Copy link
Contributor

@patrickvonplaten patrickvonplaten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super excited about this feature. We can advertise it well for all Wav2Vec2 + ngram !

Left only some nits, can all be disregarded

Copy link
Contributor

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nice new system! Can't wait to be able to use in Transformers :-)

Comment on lines +960 to +1014
except (requests.exceptions.SSLError, requests.exceptions.ProxyError):
# Actually raise for those subclasses of ConnectionError
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Transformers side, we also raise the RepositoryNotFoundError, EntryNotFoundError and RevisionNotFoundError here (which are raised by some code in the try block). There are different from connection/timeout errors and give explicit information to the user about what's wrong, so would be great to integrate those here too.

(Can go in a separate PR if it makes more sense.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes 👍 on adding them

No strong opinion on whether we want to add them here or in another PR. I'd say another PR given we might want to also update cached_download, which this PR does not touch. WDYT @LysandreJik?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking care of that in #878

src/huggingface_hub/file_download.py Show resolved Hide resolved
Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on it, this is super nice and clean.

It downloads files under ~/.cache/huggingface/hub. Is this the expected place for them to be downloaded at? Should we expect other libraries (transformers, datasets) to go tap in that folder instead of the current transformers/datasets folders?

PS: When playing around with it, I had several connection errors that I didn't seem to have previously when using snapshot_download. Might be an isolated incident as I don't think this should put any more stress on the backend (quite the contrary).

Comment on lines 271 to 274
warnings.warn(
"`filename_to_url` uses the legacy way cache file layout",
FutureWarning,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for implementing warnings!

src/huggingface_hub/file_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/file_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/file_download.py Outdated Show resolved Hide resolved
Comment on lines 705 to 795
force_filename: Optional[str] = None,
proxies: Optional[Dict] = None,
etag_timeout: Optional[float] = 10,
resume_download: Optional[bool] = False,
use_auth_token: Union[bool, str, None] = None,
local_files_only: Optional[bool] = False,
legacy_cache_layout: Optional[bool] = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick notes regarding the management of backward compatibility: you could add a **kwargs at the end that would retrieve the force_filename argument. It would still be passed to cached_download in case the user chooses legacy_cache_layout, and would otherwise print a warning that this value does not make sense with the new method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not, though it might be a bit superfluous because if the user adds the legacy_cache_layout=True flag to their code (which is opt-in) and they want the actual filename, they might as well just remove the flag and they get the actual filename. No?

src/huggingface_hub/snapshot_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/snapshot_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/file_download.py Show resolved Hide resolved
src/huggingface_hub/file_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/file_download.py Show resolved Hide resolved
@julien-c
Copy link
Member Author

julien-c commented May 6, 2022

It downloads files under ~/.cache/huggingface/hub. Is this the expected place for them to be downloaded at? Should we expect other libraries (transformers, datasets) to go tap in that folder instead of the current transformers/datasets folders?

Yes, that was my idea – or we could also just go up one level and store in ~/.cache/huggingface, considering that this is going to be library-agnostic and huggingface-wide anyways

@julien-c
Copy link
Member Author

julien-c commented May 6, 2022

PS: When playing around with it, I had several connection errors that I didn't seem to have previously when using snapshot_download. Might be an isolated incident

The API calls themselves are the same as before, so it shouldn't change much – let's investigate if it continues

@LysandreJik LysandreJik marked this pull request as ready for review May 24, 2022 23:18
Comment on lines 529 to 533
warnings.warn(
"`cached_download` is the legacy way to download files from the HF hub, please"
" consider upgrading to `hf_hub_download`",
FutureWarning,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julien-c, cached_download implements features that are not available with hf_hub_download (namely, caching from a URL). Is this a feature that you think should be abandonned? I think there is room for both methods to exist, but this warning says otherwise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a feature that you think should be abandonned?

I think it's a feature that does not really make sense in the context of the HF hub, and a HF Hub client library.

My concern is that if we don't add this warning, users will not upgrade to the new way.

As a middleground, maybe we could add a legacy_cache_layout flag that just disables the warning? and mention in the warning "If you want to cache from an arbitrary URL instead, pass legacy_cache_layout=True"

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds good to me. I'll proceed this way.

Comment on lines 161 to 172
commit_hash = revision
if not REGEX_COMMIT_HASH.match(commit_hash):
# rertieve commit_hash from file

def resolve_ref(revision) -> str:
# retrieve commit_hash from file
ref_path = os.path.join(storage_folder, "refs", revision)
with open(ref_path) as f:
commit_hash = f.read()
return f.read()

commit_hash = (
revision if REGEX_COMMIT_HASH.match(revision) else resolve_ref(revision)
)
snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See side by side comparison of before/after @julien-c:

After Before
if local_files_only:

    def resolve_ref(revision) -> str:
        # retrieve commit_hash from file
        ref_path = os.path.join(storage_folder, "refs", revision)
        with open(ref_path) as f:
            return f.read()

    commit_hash = (
        revision if REGEX_COMMIT_HASH.match(revision) else resolve_ref(revision)
    )
    snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)

    if os.path.exists(snapshot_folder):
        return snapshot_folder

    raise ValueError(
        "Cannot find an appropriate cached snapshot folder for the specified"
        " revision on the local disk and outgoing traffic has been disabled. To"
        " enable repo look-ups and downloads online, set 'local_files_only' to"
        " False."
    )
if local_files_only:
    if REGEX_COMMIT_HASH.match(revision):
        snapshot_folder = os.path.join(storage_folder, "snapshots", revision)
        if os.path.exists(snapshot_folder):
            return snapshot_folder
    else:
        ref_path = os.path.join(storage_folder, "refs", revision)
        with open(ref_path) as f:
            commit_hash = f.read()
        snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
        if os.path.exists(snapshot_folder):
            return snapshot_folder
     raise ValueError(
        "Cannot find an appropriate cached folder for the specified revision on the"
        " local disk and outgoing traffic has been disabled. To enable repo"
        " look-ups and downloads online, set 'local_files_only' to False."
    )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, looks great to me!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same logic exists in hf_hub_download no? (not 100% sure anymore)

Copy link
Contributor

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some nits, feel free to ignore.
Very nice tests!

docs/source/package_reference/file_download.mdx Outdated Show resolved Hide resolved
docs/source/package_reference/file_download.mdx Outdated Show resolved Hide resolved
src/huggingface_hub/_snapshot_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/file_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/file_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/file_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/file_download.py Show resolved Hide resolved
src/huggingface_hub/file_download.py Show resolved Hide resolved
LysandreJik and others added 3 commits May 25, 2022 10:11
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@LysandreJik
Copy link
Member

Thank you, @patrickvonplaten, @sgugger, @julien-c and @thomwolf for the reviews! I'll go ahead and merge this now so that we can test it extensively while on main to ensure nothing breaks.

@LysandreJik LysandreJik merged commit f124f8b into main May 25, 2022
@LysandreJik LysandreJik deleted the cache-file-layout branch May 25, 2022 14:52
worldlight425 added a commit to worldlight425/huggingface_hub that referenced this pull request Feb 22, 2023
* light typing

(cherry picked from commit b2c8f9b970c505cdf2c685e645e9e36cc472b0d3)

* remove this seminal comment

(cherry picked from commit 12a841a605c94733154f3b22e812c0f5e69ef37b)

* I don't understand why we don't early return here

cc @patrickvonplaten care to take a look? cc @LysandreJik

(cherry picked from commit 259ab36f03ab3eed6eeb4fc4984bc259619b442f)

* following last commit, unnest this

(cherry picked from commit 54957f3f049d887af21dd8f6950873a2823c4247)

* [BIG] This should work for all repo_types not just models!

(cherry picked from commit 9a3f96ccb2de6663cf4cf2d9a60dd7f415227c1b)

* one more

(cherry picked from commit b74871250616c44a2125b26d5de29b1189e82e12)

* forgot a repo_type and reorder code

(cherry picked from commit 3ef7d79a44087e971e10e35d3b9f5bea3474f297)

* also rename this cache folder

(cherry picked from commit 4c518b861723a6d28d59108403c37edf5208f2fe)

* Use `hf_hub_download`, will be simpler later

(cherry picked from commit c7478d58fe62da02625b8ca17796ad1419a048b1)

* in this new version, `force_filename` does not make sense anymore

(cherry picked from commit 9a674bc795d5c8a26aecf5429d391fff92e47e8d)

* Just inline everything inside `hf_hub_download` for now

(cherry picked from commit ee49f8f57ba4e7e66f237df8f64c804862fe3ee8)

* Big prototype! it works! 🎉

(cherry picked from commit 7fe19ec66a2c5a7386a956cb9b65616cb209608a)

* wip wip

* do not touch `cached_download`

* Prompt user to upgrade to `hf_hub_download`

* Add a `legacy_cache_layout=True` to preserve old behavior, just in case

* Create `relative symlinks` + add some doc

* Fix behavior when no network

* This test now is legacy

* Fix-ish conflict-ish

* minimize diff

* refactor `repo_folder_name`

* windows support + shortcut if user passes a commit hash

* Rewrite `snapshot_download` and make it more robust

* OOops

* Create example-transformers-tf.py

* Fix + add a way more complete example (running on Ubuntu)

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/huggingface_hub/file_download.py

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* Update src/huggingface_hub/file_download.py

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* Only allow full revision hashes otherwise the `revision != commit_hash` test is not reliable

* add a little bit more doc + consistency

* Update src/huggingface_hub/snapshot_download.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update snapshot download

* First pass on tests

* Wrap up tests

* 🐺 Fix for bug reported by @thomwolf

see huggingface/huggingface_hub#801 (comment)

* Special case for Windows

* Address comments and docs

* Clean up with ternary cc @julien-c

* Add argument to `cached_download`

* Opt-in for filename_to-url

* Opt-in for filename_to-url

* Pass the flag

* Update docs/source/package_reference/file_download.mdx

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/huggingface_hub/file_download.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Address review comments

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants