Cache non-existence of files or completeness of repo #986
The documentation is not available anymore as the PR was closed or merged.
At first glance, it seems legit. Just to be sure we are aligned, we would have a workflow like this, right?
This means the first call to
A bit similar to the The benefit I see is if we want to avoid downloading all the files just to know that a folder is
Yes that was exactly the workflow I planned to implement next in the wrapper around For the
Makes sense.
Do you remember if it was not optimized in
I did not think of creating But in any case, it may be a non-essential feature for your use case.
The
IMO Plus it's a bit confusing maybe that
The endpoint was not optimized on moon-landing at the time (before we switched to Gitaly, notably) but especially because the transformers team was thinking of doing a call to this endpoint at every model instantiation, not as a fallback (i.e. at least 50M or 100M times per day 🤯 ) All of this to say that:
TL;DR: I would advocate to just implement the WDYT? also cc @huggingface/moon-landing-back
+1 on this - we'll probably add pagination to "listing" endpoints in the coming weeks/months (which would increase the cost of listing all files in a repo - though probably mostly for datasets and not really for models), so I would stick to the "try to get a file" logic and
8154650 to e8165f4
Thanks for the comments! Reverted the complete part then, to only focus on the One question I have is that the current code does not cache the reference when caching that a file does not exist, as it would add a lot of dupe code. In practice in Transformers it will be cached by some of the files that do exist, so I don't think it's the end of the world, but as seen in the tests added, we need to add a
Thanks @sgugger for the changes and tests! I made a few comments on how the code could be more concise, but logic-wise that's perfect 👍 Also, did you plan to implement the logic that if a file is in
I didn't plan to add this here. It's the same as for Will address the other comments this morning.
e3e214d to ef9b4b9
BTW this should be documented (maybe in an "Advanced" doc) no?
Created an issue about it: #1011
In Transformers, we very often try to download files that do not exist: each tokenizer has a list of optional files, so we try to download them all and intercept the errors for those that do not exist. There is also the case of very large models, which do not have a weight filename like regular models but an index file instead.
This all works fine, but some users are trying to optimize things like calls to `pipeline`, and each of those requests where a file does not exist takes a bit of time. Therefore, this PR proposes a way to cache the non-existence of files at a given commit, so we can use that cached information and not try to download a file we know does not exist at a specific commit.
I suggest using the same architecture as the `snapshots` folder with a `.no_exist` folder: one level for the commit shas, then an empty file for each file we tried to download but that does not exist.
@julien-c also suggested adding a `.complete` folder which contains the commit hashes for which we know we have all files (as a result of `snapshot_download` for instance).
A picture is worth a thousand words:
I haven't implemented any tests yet, just want to see what everyone thinks and will add some if this is of interest.
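For illustration only, the `.no_exist` layout described above could be sketched roughly as follows. This is not the code from this PR: `mark_missing`, `is_known_missing`, `fetch_optional`, and the `download` callable are hypothetical names, and `FileNotFoundError` stands in for whatever error the real download raises when a file is absent from the repo.

```python
from pathlib import Path
from typing import Callable, Optional


def no_exist_marker(repo_cache: Path, commit_sha: str, filename: str) -> Path:
    """Path of the empty marker file: <repo_cache>/.no_exist/<sha>/<filename>."""
    return repo_cache / ".no_exist" / commit_sha / filename


def mark_missing(repo_cache: Path, commit_sha: str, filename: str) -> None:
    """Record that `filename` does not exist at `commit_sha`."""
    marker = no_exist_marker(repo_cache, commit_sha, filename)
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()  # empty file, mirroring the snapshots folder structure


def is_known_missing(repo_cache: Path, commit_sha: str, filename: str) -> bool:
    """True if a previous attempt already found the file missing at this commit."""
    return no_exist_marker(repo_cache, commit_sha, filename).is_file()


def fetch_optional(
    repo_cache: Path,
    commit_sha: str,
    filename: str,
    download: Callable[[str, str], Path],  # hypothetical downloader (assumption)
) -> Optional[Path]:
    """Return a local path for an optional file, or None if it does not exist."""
    snapshot = repo_cache / "snapshots" / commit_sha / filename
    if snapshot.is_file():
        return snapshot  # already cached
    if is_known_missing(repo_cache, commit_sha, filename):
        return None  # skip the network request entirely
    try:
        return download(commit_sha, filename)
    except FileNotFoundError:
        # Remember the miss so later calls at this commit do not retry.
        mark_missing(repo_cache, commit_sha, filename)
        return None
```

Because the markers are keyed by commit sha, a file recorded as missing at one commit is still looked up normally at any other commit.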