Cache non-existence of files or completeness of repo #986

Merged · 11 commits merged into main from cache_no_exist on Aug 16, 2022

Conversation

@sgugger (Contributor) commented Aug 10, 2022

In Transformers, we very often try to download files that do not exist: each tokenizer has a list of optional files, so we try to download them all and intercept the errors for those that do not exist. There is also the situation of very large models, which have an index file instead of a single weights file like regular models.

This all works fine, but some users are trying to optimize things like calls to pipeline, and each request for a file that does not exist takes a bit of time. This PR therefore proposes a way to cache the non-existence of files at a given commit, so we can use that cached information instead of trying to download a file we know does not exist at a specific commit.

I suggest using the same architecture as the snapshots folder, with a .no_exist folder: one level for the commit shas, then an empty file for each file we tried to download but that does not exist.

@julien-c also suggested adding a .complete folder which contains the commit hashes for which we know we have all files (as a result of snapshot_download for instance).

A picture is worth a thousand words:
[image: proposed cache folder layout]

I haven't implemented any tests yet, just want to see what everyone thinks and will add some if this is of interest.
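A minimal sketch of the marker scheme described above, using a throwaway directory. The helper names (`cache_no_exist`, `is_cached_non_existent`) are hypothetical, not a real huggingface_hub API; only the `.no_exist/<commit_sha>/<filename>` layout comes from the proposal:

```python
import os
import tempfile

def cache_no_exist(repo_cache_dir: str, commit_sha: str, filename: str) -> str:
    """Record that `filename` does not exist at `commit_sha` by touching
    an empty marker file under .no_exist/<commit_sha>/."""
    marker_dir = os.path.join(repo_cache_dir, ".no_exist", commit_sha)
    os.makedirs(marker_dir, exist_ok=True)
    marker = os.path.join(marker_dir, filename)
    open(marker, "a").close()  # empty file: its mere presence is the information
    return marker

def is_cached_non_existent(repo_cache_dir: str, commit_sha: str, filename: str) -> bool:
    """True if a previous lookup recorded a 404 for this file at this commit."""
    return os.path.isfile(os.path.join(repo_cache_dir, ".no_exist", commit_sha, filename))

# Demo with a throwaway cache directory
cache = tempfile.mkdtemp()
cache_no_exist(cache, "abc123", "added_tokens.json")
print(is_cached_non_existent(cache, "abc123", "added_tokens.json"))  # True
print(is_cached_non_existent(cache, "abc123", "config.json"))        # False
```

Because the markers live under the commit sha, nothing ever needs invalidating: pinning a new revision simply starts from an empty folder.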

@HuggingFaceDocBuilderDev commented Aug 10, 2022

The documentation is not available anymore as the PR was closed or merged.

@Wauplin (Contributor) commented Aug 11, 2022

At first glance, it seems legit. Just to be sure we are aligned, we would have a workflow like this, right?

  1. Try to get a file from local disk for a given commit (folder snapshots)
  2. If not there, check in .complete if the commit id is present. If yes, we know for sure our file doesn't exist.
  3. If the commit id is not in .complete, check in .no_exist/commit_id. If an empty file with same filename is there, we know for sure our file doesn't exist.
  4. Otherwise, make an HTTP call and either download to blob + snapshots or add a file in .no_exist.

This means the first call to pipeline(...) would still take a lot of time, since we will check each file individually unless a snapshot_download has been done before. But I guess it's fine if the first instantiation takes time?
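The four steps above could be sketched as one lookup helper. The folder names follow the proposal in this PR; `NO_EXIST` and `resolve_cached` are hypothetical illustrations, not actual huggingface_hub functions:

```python
import os
import tempfile

NO_EXIST = object()  # sentinel: the file is known to be absent at this commit

def resolve_cached(repo_cache_dir: str, commit_sha: str, filename: str):
    """Return a local path, NO_EXIST, or None (unknown: an HTTP call is needed)."""
    # 1. Look for the file under snapshots/<commit_sha>/
    snapshot_path = os.path.join(repo_cache_dir, "snapshots", commit_sha, filename)
    if os.path.isfile(snapshot_path):
        return snapshot_path
    # 2. A commit listed in .complete/ has all its files cached, so a miss
    #    in snapshots/ means the file does not exist remotely either
    if os.path.exists(os.path.join(repo_cache_dir, ".complete", commit_sha)):
        return NO_EXIST
    # 3. An empty marker under .no_exist/<commit_sha>/ records a previous 404
    if os.path.isfile(os.path.join(repo_cache_dir, ".no_exist", commit_sha, filename)):
        return NO_EXIST
    # 4. No cached knowledge: the caller makes the HTTP call, then writes to
    #    blobs/ + snapshots/ on success or touches a .no_exist marker on 404
    return None

# Demo with a throwaway cache directory
cache = tempfile.mkdtemp()
sha = "abc123"
os.makedirs(os.path.join(cache, "snapshots", sha), exist_ok=True)
open(os.path.join(cache, "snapshots", sha, "config.json"), "w").close()
assert resolve_cached(cache, sha, "config.json") is not None   # cached hit
assert resolve_cached(cache, sha, "tokenizer.json") is None    # unknown -> HTTP call
```

Only step 4 touches the network, so once a model is cached, repeated pipeline(...) calls resolve every optional file locally.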

@Wauplin (Contributor) commented Aug 11, 2022

A bit similar to .complete, we can imagine a .expected folder, which would be similar to the snapshots folder but populated only with empty files. The .expected folder would just keep track of the folder tree at a given commit id. It would be populated when calling either snapshot_download or list_repo_files.

The benefit I see: we could avoid downloading all the files just to know that a snapshot is complete.
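A rough sketch of how .expected could be populated and queried. Both helper names are hypothetical; only the idea of mirroring the repo tree with empty files under `.expected/<commit_sha>/` comes from the comment above:

```python
import os
import tempfile

def populate_expected(repo_cache_dir: str, commit_sha: str, filenames) -> None:
    """Mirror the repo tree at `commit_sha` as empty files under .expected/.
    `filenames` could come from a file-listing call made beforehand."""
    base = os.path.join(repo_cache_dir, ".expected", commit_sha)
    for name in filenames:
        path = os.path.join(base, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "a").close()  # empty file: only the tree structure matters

def expected_to_exist(repo_cache_dir: str, commit_sha: str, filename: str) -> bool:
    """Once .expected is populated, any filename missing from it is known
    not to exist at this commit, without any HTTP call."""
    return os.path.isfile(os.path.join(repo_cache_dir, ".expected", commit_sha, filename))

# Demo with a throwaway cache directory
cache = tempfile.mkdtemp()
populate_expected(cache, "abc123", ["config.json", "onnx/model.onnx"])
```

Unlike .no_exist, which accumulates knowledge one 404 at a time, .expected captures the whole tree in a single listing, at the cost of one extra API call.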

@sgugger (Contributor, Author) commented Aug 11, 2022

Yes, that was exactly the workflow I planned to implement next in the wrapper around hf_hub_download we have in Transformers. And yes, the first call would still be slower (mind you, an API call takes 100-200ms, so it's really not significant compared to the download time of the model); we're interested in optimizing the calls once the model/tokenizer is cached.

For the .expected folder, I like this idea too. It just requires a specific call to get the list of files in the repo, and we were strongly discouraged from making those calls in the past as they were not optimized. It might also be weird if hf_hub_download starts asking for all files instead of just the one required, in order to create this subfolder (snapshot_download wouldn't require it, as it downloads all files anyway).

@Wauplin (Contributor) commented Aug 11, 2022

> Yes, that was exactly the workflow I planned to implement next in the wrapper around hf_hub_download we have in Transformers. And yes, the first call would still be slower (mind you, an API call takes 100-200ms, so it's really not significant compared to the download time of the model); we're interested in optimizing the calls once the model/tokenizer is cached.

Makes sense.

> and we were strongly discouraged from making those calls in the past as they were not optimized

Do you remember if it was not optimized in huggingface_hub because there were multiple calls to make, or in moon-landing in general because the endpoint itself was not optimized? At the moment, both huggingface_hub.list_repo_files() and the HTTP endpoint are publicly documented without any mention of optimization issues.

> It might also be weird if hf_hub_download starts asking for all files instead of just the one required to create this subfolder
>
> since snapshot_download wouldn't require it, as it downloads all files anyway

I did not think of creating .expected directly from hf_hub_download, but before it, in the case where you know you will request a lot of non-existent files: either by calling snapshot_download or list_repo_files (or a wrapper around it). snapshot_download can still benefit from it, as it costs nothing compared to the download, and at least .expected would be populated even if the download is interrupted.

But in any case, it's maybe a non-essential feature for your use case.

@sgugger (Contributor, Author) commented Aug 11, 2022

The moon-landing call was not optimized. It may have changed since then with the migration to Gitaly.

@julien-c (Member)

IMO .complete can maybe be kept for a future PR (I know, I was the one who suggested it... 😅) because, as you note, it can only be set when using snapshot_download, which is not currently super frequent, so .complete would rarely be useful.

Plus, it's maybe a bit confusing that .complete implies "we have all the files" while the converse implication is not true.

@julien-c (Member)

> Do you remember if it was not optimized in huggingface_hub because there were multiple calls to make or on moon-landing in general because the endpoint itself is not optimized?

The endpoint was not optimized in moon-landing at the time (before we switched to Gitaly, notably), but especially because the transformers team was thinking of calling this endpoint at every model instantiation, not as a fallback (i.e. at least 50M or 100M times per day 🤯).
There's also the issue that a repo can have 100,000+ files (rare for model repos, but common for dataset repos), and the API output will take some time in that case.

All of this to say that:

  • feel free to use the list_repo_files API when needed; it shouldn't be an issue on the Hub side anymore
  • but I feel it's maybe simpler to stick to the "try to get a file" pattern we are using everywhere. It makes reasoning simpler.

@julien-c (Member)

TL;DR: I would advocate to just implement the .no_exist from your POC

WDYT? also cc @huggingface/moon-landing-back

@Pierrci (Member) commented Aug 11, 2022

> TL;DR: I would advocate to just implement the .no_exist from your POC

+1 on this - we'll probably add pagination to "listing" endpoints in the coming weeks/months (which would increase the cost of listing all files in a repo - though probably mostly for datasets and not really for models), so I would stick to the "try to get a file" logic and .no_exist.

@Wauplin (Contributor) commented Aug 11, 2022

Thanks @julien-c and @sgugger for the complete answers :) Let's stick to .no_exist then.

@sgugger marked this pull request as ready for review on August 11, 2022, 18:32
@sgugger (Contributor, Author) commented Aug 11, 2022

Thanks for the comments! I reverted the .complete part, to focus only on the .no_exist cache, and added some tests.

One question: the current code does not cache the reference when caching that a file does not exist, as doing so would add a lot of duplicate code. In practice in Transformers, the reference will be cached by some of the files that do exist, so I don't think it's the end of the world, but as seen in the added tests, we need an hf_hub_download of a file that does exist to get the refs folder.

(Inline review comments on src/huggingface_hub/file_download.py and tests/test_cache_layout.py: outdated, resolved.)
@Wauplin (Contributor) commented Aug 12, 2022

Thanks @sgugger for the changes and tests! I made a few comments on how the code could be more concise, but logic-wise that's perfect 👍

Also, did you plan to implement the logic that if a file is in the .no_exist folder, then hf_hub_download should not make the call to the server? This can be done in a separate PR if you want, but I'd definitely put it in huggingface_hub.

@sgugger (Contributor, Author) commented Aug 12, 2022

> Also, did you plan to implement the logic that if a file is in the .no_exist folder, then hf_hub_download should not make the call to the server? This can be done in a separate PR if you want, but I'd definitely put it in huggingface_hub.

I didn't plan to add this here. It's the same as for try_to_load_from_cache: I'm unsure whether hf_hub_download is supposed to use it directly in case of connection errors, or whether we should leave this to the downstream libraries. Both can definitely be addressed in a follow-up PR (though I'll be on vacation for the next two and a half weeks, so not before then ;-) )

Will address the other comments this morning.

@sgugger changed the title from "[PoC] Cache non-existence of files or completeness of repo" to "Cache non-existence of files or completeness of repo" on Aug 12, 2022
@Wauplin self-requested a review on August 16, 2022, 12:27
@Wauplin merged commit 3e4aea6 into main on Aug 16, 2022
@Wauplin deleted the cache_no_exist branch on August 16, 2022, 12:33
@julien-c (Member)

BTW, this should be documented (maybe in an "Advanced" doc), no?

@Wauplin (Contributor) commented Aug 24, 2022

Created an issue about it: #1011
