Cache non-existence of files or completeness of repo #986

Merged · 11 commits merged into main from cache_no_exist on Aug 16, 2022

Conversation

@sgugger (Contributor) commented Aug 10, 2022

In Transformers, we very often try to download files that do not exist: each tokenizer has a list of optional files, so we try to download them all and intercept the errors for those that do not exist. There is also the situation of very large models, which have an index file instead of a single weights file like regular models.

This all works fine, but some users are trying to optimize things like calls to pipeline, and each request for a file that does not exist takes a bit of time. This PR therefore proposes a way to cache the non-existence of files at a given commit, so we can use that cached information instead of trying to download a file we know does not exist at a specific commit.

I suggest using the same architecture as the snapshots folder, with a .no_exist folder: one level for the commit shas, then an empty file for each file we tried to download but that does not exist.

@julien-c also suggested adding a .complete folder which contains the commit hashes for which we know we have all files (as a result of snapshot_download for instance).

A picture is worth a thousand words:
[image: proposed cache folder layout]

I haven't implemented any tests yet, just want to see what everyone thinks and will add some if this is of interest.
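A minimal sketch of the marker scheme described above, using a throwaway directory. The helper names (`cache_no_exist`, `is_cached_non_existent`) are hypothetical, not a real huggingface_hub API; only the `.no_exist/<commit_sha>/<filename>` layout comes from the proposal:

```python
import os
import tempfile

def cache_no_exist(repo_cache_dir: str, commit_sha: str, filename: str) -> str:
    """Record that `filename` does not exist at `commit_sha` by touching
    an empty marker file under .no_exist/<commit_sha>/."""
    marker_dir = os.path.join(repo_cache_dir, ".no_exist", commit_sha)
    os.makedirs(marker_dir, exist_ok=True)
    marker = os.path.join(marker_dir, filename)
    open(marker, "a").close()  # empty file: its mere presence is the information
    return marker

def is_cached_non_existent(repo_cache_dir: str, commit_sha: str, filename: str) -> bool:
    """True if a previous lookup recorded a 404 for this file at this commit."""
    return os.path.isfile(os.path.join(repo_cache_dir, ".no_exist", commit_sha, filename))

# Demo with a throwaway cache directory
cache = tempfile.mkdtemp()
cache_no_exist(cache, "abc123", "added_tokens.json")
print(is_cached_non_existent(cache, "abc123", "added_tokens.json"))  # True
print(is_cached_non_existent(cache, "abc123", "config.json"))        # False
```

Because the markers live under the commit sha, nothing ever needs invalidating: pinning a new revision simply starts from an empty folder.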

@HuggingFaceDocBuilderDev commented Aug 10, 2022

The documentation is not available anymore as the PR was closed or merged.

@Wauplin (Contributor) commented Aug 11, 2022

At first glance, it seems legit. Just to be sure we are aligned, we would have a workflow like this, right?

  1. Try to get a file from local disk for a given commit (folder snapshots)
  2. If not there, check in .complete if the commit id is present. If yes, we know for sure our file doesn't exist.
  3. If the commit id is not in .complete, check in .no_exist/commit_id. If an empty file with same filename is there, we know for sure our file doesn't exist.
  4. Otherwise, make an HTTP call and either download to blob + snapshots or add a file in .no_exist.

This means the first call to pipeline(...) would still take a lot of time, since we will check each file individually unless a snapshot_download has been done before. But I guess it's fine if the first instantiation takes time?
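The four steps above could be sketched as one lookup helper. The folder names follow the proposal in this PR; `NO_EXIST` and `resolve_cached` are hypothetical illustrations, not actual huggingface_hub functions:

```python
import os
import tempfile

NO_EXIST = object()  # sentinel: the file is known to be absent at this commit

def resolve_cached(repo_cache_dir: str, commit_sha: str, filename: str):
    """Return a local path, NO_EXIST, or None (unknown: an HTTP call is needed)."""
    # 1. Look for the file under snapshots/<commit_sha>/
    snapshot_path = os.path.join(repo_cache_dir, "snapshots", commit_sha, filename)
    if os.path.isfile(snapshot_path):
        return snapshot_path
    # 2. A commit listed in .complete/ has all its files cached, so a miss
    #    in snapshots/ means the file does not exist remotely either
    if os.path.exists(os.path.join(repo_cache_dir, ".complete", commit_sha)):
        return NO_EXIST
    # 3. An empty marker under .no_exist/<commit_sha>/ records a previous 404
    if os.path.isfile(os.path.join(repo_cache_dir, ".no_exist", commit_sha, filename)):
        return NO_EXIST
    # 4. No cached knowledge: the caller makes the HTTP call, then writes to
    #    blobs/ + snapshots/ on success or touches a .no_exist marker on 404
    return None

# Demo with a throwaway cache directory
cache = tempfile.mkdtemp()
sha = "abc123"
os.makedirs(os.path.join(cache, "snapshots", sha), exist_ok=True)
open(os.path.join(cache, "snapshots", sha, "config.json"), "w").close()
assert resolve_cached(cache, sha, "config.json") is not None   # cached hit
assert resolve_cached(cache, sha, "tokenizer.json") is None    # unknown -> HTTP call
```

Only step 4 touches the network, so once a model is cached, repeated pipeline(...) calls resolve every optional file locally.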

@Wauplin (Contributor) commented Aug 11, 2022

A bit similar to .complete, we can imagine a .expected folder, which would be similar to the snapshots folder but populated only with empty files. The .expected folder would just keep track of the folder tree at a given commit id. It would be populated when calling either snapshot_download or list_repo_files.

The benefit I see: we could avoid downloading all the files just to know that a snapshot is complete.
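A rough sketch of how .expected could be populated and queried. Both helper names are hypothetical; only the idea of mirroring the repo tree with empty files under `.expected/<commit_sha>/` comes from the comment above:

```python
import os
import tempfile

def populate_expected(repo_cache_dir: str, commit_sha: str, filenames) -> None:
    """Mirror the repo tree at `commit_sha` as empty files under .expected/.
    `filenames` could come from a file-listing call made beforehand."""
    base = os.path.join(repo_cache_dir, ".expected", commit_sha)
    for name in filenames:
        path = os.path.join(base, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "a").close()  # empty file: only the tree structure matters

def expected_to_exist(repo_cache_dir: str, commit_sha: str, filename: str) -> bool:
    """Once .expected is populated, any filename missing from it is known
    not to exist at this commit, without any HTTP call."""
    return os.path.isfile(os.path.join(repo_cache_dir, ".expected", commit_sha, filename))

# Demo with a throwaway cache directory
cache = tempfile.mkdtemp()
populate_expected(cache, "abc123", ["config.json", "onnx/model.onnx"])
```

Unlike .no_exist, which accumulates knowledge one 404 at a time, .expected captures the whole tree in a single listing, at the cost of one extra API call.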

@sgugger (Contributor, Author) commented Aug 11, 2022

Yes, that was exactly the workflow I planned to implement next in the wrapper around hf_hub_download we have in Transformers. And yes, the first call would still be slower (mind you, an API call takes 100-200ms, so it's really not significant compared to the download time of the model); we're interested in optimizing the calls once the model/tokenizer is cached.

For the .expected folder, I like this idea too. It just requires a specific call to get the list of files in the repo, and we were strongly discouraged from making those calls in the past as they were not optimized. It might also be weird if hf_hub_download starts asking for all files instead of just the one required, in order to create this subfolder (snapshot_download wouldn't require it, as it downloads all files anyway).

@Wauplin (Contributor) commented Aug 11, 2022

> Yes, that was exactly the workflow I planned to implement next in the wrapper around hf_hub_download we have in Transformers. And yes, the first call would still be slower (mind you, an API call takes 100-200ms, so it's really not significant compared to the download time of the model); we're interested in optimizing the calls once the model/tokenizer is cached.

Makes sense.

> and we were strongly discouraged from making those calls in the past as they were not optimized

Do you remember if it was not optimized in huggingface_hub because there were multiple calls to make, or in moon-landing in general because the endpoint itself was not optimized? At the moment, both huggingface_hub.list_repo_files() and the HTTP endpoint are publicly documented without any mention of optimization issues.

> It might also be weird if hf_hub_download starts asking for all files instead of just the one required to create this subfolder
>
> since snapshot_download wouldn't require it, as it downloads all files anyway

I did not think of creating .expected directly from hf_hub_download, but before it, in the case where you know you will request a lot of non-existent files: either by calling snapshot_download or list_repo_files (or a wrapper around it). snapshot_download can still benefit from it, as it costs nothing compared to the download, and at least .expected would be populated even if the download is interrupted.

But in any case, it's maybe a non-essential feature for your use case.

@sgugger (Contributor, Author) commented Aug 11, 2022

The moon-landing call was not optimized. It may have changed since then with the migration to Gitaly.

@julien-c (Member)

IMO .complete can maybe be kept for a future PR (I know, I was the one who suggested it... 😅) because, as you note, it can only be set when using snapshot_download, which is not currently super frequent, so .complete would rarely be useful.

Plus, it's maybe a bit confusing that .complete implies "we have all the files" while the converse implication is not true.

@julien-c (Member)

> Do you remember if it was not optimized in huggingface_hub because there were multiple calls to make or on moon-landing in general because the endpoint itself is not optimized?

The endpoint was not optimized in moon-landing at the time (before we switched to Gitaly, notably), but especially because the transformers team was thinking of calling this endpoint at every model instantiation, not as a fallback (i.e. at least 50M or 100M times per day 🤯).
There's also the issue that a repo can have 100,000+ files (rare for model repos, but common for dataset repos), and the API output will take some time in that case.

All of this to say that:

  • feel free to use the list_repo_files API when needed; it shouldn't be an issue on the Hub side anymore
  • but I feel it's maybe simpler to stick to the "try to get a file" pattern we are using everywhere. It makes reasoning simpler.

@julien-c (Member)

TL;DR: I would advocate to just implement the .no_exist from your POC

WDYT? also cc @huggingface/moon-landing-back

@Pierrci (Member) commented Aug 11, 2022

> TL;DR: I would advocate to just implement the .no_exist from your POC

+1 on this - we'll probably add pagination to "listing" endpoints in the coming weeks/months (which would increase the cost of listing all files in a repo - though probably mostly for datasets and not really for models), so I would stick to the "try to get a file" logic and .no_exist.

@Wauplin (Contributor) commented Aug 11, 2022

Thanks @julien-c and @sgugger for the complete answers :) Let's stick to .no_exist then.

@sgugger marked this pull request as ready for review on August 11, 2022, 18:32
@sgugger (Contributor, Author) commented Aug 11, 2022

Thanks for the comments! I reverted the .complete part, to focus only on the .no_exist cache, and added some tests.

One question: the current code does not cache the reference when caching that a file does not exist, as doing so would add a lot of duplicate code. In practice in Transformers, the reference will be cached by some of the files that do exist, so I don't think it's the end of the world, but as seen in the added tests, we need an hf_hub_download of a file that does exist to get the refs folder.

(Inline review comments on src/huggingface_hub/file_download.py and tests/test_cache_layout.py: outdated, resolved.)
@Wauplin (Contributor) commented Aug 12, 2022

Thanks @sgugger for the changes and tests! I made a few comments on how the code could be more concise, but logic-wise that's perfect 👍

Also, did you plan to implement the logic that if a file is in the .no_exist folder, then hf_hub_download should not make the call to the server? This can be done in a separate PR if you want, but I'd definitely put it in huggingface_hub.

@sgugger (Contributor, Author) commented Aug 12, 2022

> Also, did you plan to implement the logic that if a file is in the .no_exist folder, then hf_hub_download should not make the call to the server? This can be done in a separate PR if you want, but I'd definitely put it in huggingface_hub.

I didn't plan to add this here. It's the same as for try_to_load_from_cache: I'm unsure whether hf_hub_download is supposed to use it directly in case of connection errors, or whether we should leave this to the downstream libraries. Both can definitely be addressed in a follow-up PR (though I'll be on vacation for the next two and a half weeks, so not before then ;-) )

Will address the other comments this morning.

@sgugger changed the title from "[PoC] Cache non-existence of files or completeness of repo" to "Cache non-existence of files or completeness of repo" on Aug 12, 2022
@Wauplin self-requested a review on August 16, 2022, 12:27
@Wauplin merged commit 3e4aea6 into main on Aug 16, 2022
@Wauplin deleted the cache_no_exist branch on August 16, 2022, 12:33
@julien-c (Member)

BTW, this should be documented (maybe in an "Advanced" doc), no?

@Wauplin (Contributor) commented Aug 24, 2022

Created an issue about it: #1011
