-
Notifications
You must be signed in to change notification settings - Fork 570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New git-aware cache file layout #801
Conversation
repo_info = _api.repo_info( | ||
repo_id=repo_id, repo_type=repo_type, revision=revision, token=token | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This uses the new repo_info()
function from #792
0616aaf
to
9049f06
Compare
Some additional context for this is in huggingface/transformers#15927 (comment) |
a05ad04
to
b0ca349
Compare
7fe19ec
to
1087e54
Compare
Ok I tentatively rebased this on |
The documentation is not available anymore as the PR was closed or merged. |
Ok this should be ready for a deeper review @LysandreJik 🎉
Also cc @patrickvonplaten and @sgugger |
Wow this is super cool! No more re-downloading all files when just the README is changed ❤️ |
# possible repos have <path/to/cache_dir>/<flatten_repo_id> prefix | ||
repo_folders_prefix = os.path.join(cache_dir, repo_id_flattened) | ||
|
||
# list all possible folders that can correspond to the repo_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All that beautiful code is gone 😢
repo_id=repo_id, repo_type=repo_type, revision=revision, token=token | ||
) | ||
filtered_repo_files = _filter_repo_files( | ||
repo_files=[f.rfilename for f in repo_info.siblings], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this also work with three directory folder layers deep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean?
it should support arbitrary nesting, but i haven't extensively tested yet :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super excited about this feature. We can advertise it well for all Wav2Vec2 + ngram !
Left only some nits, can all be disregarded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super nice new system! Can't wait to be able to use in Transformers :-)
except (requests.exceptions.SSLError, requests.exceptions.ProxyError): | ||
# Actually raise for those subclasses of ConnectionError | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Transformers side, we also raise the RepositoryNotFoundError
, EntryNotFoundError
and RevisionNotFoundError
here (which are raised by some code in the try block). There are different from connection/timeout errors and give explicit information to the user about what's wrong, so would be great to integrate those here too.
(Can go in a separate PR if it makes more sense.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes 👍 on adding them
No strong opinion on whether we want to add them here or in another PR. I'd say another PR given we might want to also update cached_download
, which this PR does not touch. WDYT @LysandreJik?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking care of that in #878
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on it, this is super nice and clean.
It downloads files under ~/.cache/huggingface/hub
. Is this the expected place for them to be downloaded at? Should we expect other libraries (transformers, datasets) to go tap in that folder instead of the current transformers/datasets folders?
PS: When playing around with it, I had several connection errors that I didn't seem to have previously when using snapshot_download
. Might be an isolated incident as I don't think this should put any more stress on the backend (quite the contrary).
src/huggingface_hub/file_download.py
Outdated
warnings.warn( | ||
"`filename_to_url` uses the legacy way cache file layout", | ||
FutureWarning, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for implementing warnings!
src/huggingface_hub/file_download.py
Outdated
force_filename: Optional[str] = None, | ||
proxies: Optional[Dict] = None, | ||
etag_timeout: Optional[float] = 10, | ||
resume_download: Optional[bool] = False, | ||
use_auth_token: Union[bool, str, None] = None, | ||
local_files_only: Optional[bool] = False, | ||
legacy_cache_layout: Optional[bool] = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick notes regarding the management of backward compatibility: you could add a **kwargs
at the end that would retrieve the force_filename
argument. It would still be passed to cached_download
in case the user chooses legacy_cache_layout
, and would otherwise print a warning that this value does not make sense with the new method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not, though it might be a bit superfluous because if the user adds the legacy_cache_layout=True
flag to their code (which is opt-in) and they want the actual filename, they might as well just remove the flag and they get the actual filename. No?
Yes, that was my idea – or we could also just go up one level and store in |
The API calls themselves are the same as before, so it shouldn't change much – let's investigate if it continues |
…h` test is not reliable
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
3042731
to
442b5b1
Compare
src/huggingface_hub/file_download.py
Outdated
warnings.warn( | ||
"`cached_download` is the legacy way to download files from the HF hub, please" | ||
" consider upgrading to `hf_hub_download`", | ||
FutureWarning, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@julien-c, cached_download
implements features that are not available with hf_hub_download
(namely, caching from a URL). Is this a feature that you think should be abandonned? I think there is room for both methods to exist, but this warning says otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a feature that you think should be abandonned?
I think it's a feature that does not really make sense in the context of the HF hub, and a HF Hub client library.
My concern is that if we don't add this warning, users will not upgrade to the new way.
As a middleground, maybe we could add a legacy_cache_layout
flag that just disables the warning? and mention in the warning "If you want to cache from an arbitrary URL instead, pass legacy_cache_layout=True"
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sounds good to me. I'll proceed this way.
commit_hash = revision | ||
if not REGEX_COMMIT_HASH.match(commit_hash): | ||
# rertieve commit_hash from file | ||
|
||
def resolve_ref(revision) -> str: | ||
# retrieve commit_hash from file | ||
ref_path = os.path.join(storage_folder, "refs", revision) | ||
with open(ref_path) as f: | ||
commit_hash = f.read() | ||
return f.read() | ||
|
||
commit_hash = ( | ||
revision if REGEX_COMMIT_HASH.match(revision) else resolve_ref(revision) | ||
) | ||
snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See side by side comparison of before/after @julien-c:
After | Before |
if local_files_only:
def resolve_ref(revision) -> str:
# retrieve commit_hash from file
ref_path = os.path.join(storage_folder, "refs", revision)
with open(ref_path) as f:
return f.read()
commit_hash = (
revision if REGEX_COMMIT_HASH.match(revision) else resolve_ref(revision)
)
snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
if os.path.exists(snapshot_folder):
return snapshot_folder
raise ValueError(
"Cannot find an appropriate cached snapshot folder for the specified"
" revision on the local disk and outgoing traffic has been disabled. To"
" enable repo look-ups and downloads online, set 'local_files_only' to"
" False."
) |
if local_files_only:
if REGEX_COMMIT_HASH.match(revision):
snapshot_folder = os.path.join(storage_folder, "snapshots", revision)
if os.path.exists(snapshot_folder):
return snapshot_folder
else:
ref_path = os.path.join(storage_folder, "refs", revision)
with open(ref_path) as f:
commit_hash = f.read()
snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
if os.path.exists(snapshot_folder):
return snapshot_folder
raise ValueError(
"Cannot find an appropriate cached folder for the specified revision on the"
" local disk and outgoing traffic has been disabled. To enable repo"
" look-ups and downloads online, set 'local_files_only' to False."
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, looks great to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same logic exists in hf_hub_download
no? (not 100% sure anymore)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some nits, feel free to ignore.
Very nice tests!
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Thank you, @patrickvonplaten, @sgugger, @julien-c and @thomwolf for the reviews! I'll go ahead and merge this now so that we can test it extensively while on |
* light typing (cherry picked from commit b2c8f9b970c505cdf2c685e645e9e36cc472b0d3) * remove this seminal comment (cherry picked from commit 12a841a605c94733154f3b22e812c0f5e69ef37b) * I don't understand why we don't early return here cc @patrickvonplaten care to take a look? cc @LysandreJik (cherry picked from commit 259ab36f03ab3eed6eeb4fc4984bc259619b442f) * following last commit, unnest this (cherry picked from commit 54957f3f049d887af21dd8f6950873a2823c4247) * [BIG] This should work for all repo_types not just models! (cherry picked from commit 9a3f96ccb2de6663cf4cf2d9a60dd7f415227c1b) * one more (cherry picked from commit b74871250616c44a2125b26d5de29b1189e82e12) * forgot a repo_type and reorder code (cherry picked from commit 3ef7d79a44087e971e10e35d3b9f5bea3474f297) * also rename this cache folder (cherry picked from commit 4c518b861723a6d28d59108403c37edf5208f2fe) * Use `hf_hub_download`, will be simpler later (cherry picked from commit c7478d58fe62da02625b8ca17796ad1419a048b1) * in this new version, `force_filename` does not make sense anymore (cherry picked from commit 9a674bc795d5c8a26aecf5429d391fff92e47e8d) * Just inline everything inside `hf_hub_download` for now (cherry picked from commit ee49f8f57ba4e7e66f237df8f64c804862fe3ee8) * Big prototype! it works! 🎉 (cherry picked from commit 7fe19ec66a2c5a7386a956cb9b65616cb209608a) * wip wip * do not touch `cached_download` * Prompt user to upgrade to `hf_hub_download` * Add a `legacy_cache_layout=True` to preserve old behavior, just in case * Create `relative symlinks` + add some doc * Fix behavior when no network * This test now is legacy * Fix-ish conflict-ish * minimize diff * refactor `repo_folder_name` * windows support + shortcut if user passes a commit hash * Rewrite `snapshot_download` and make it more robust * OOops * Create example-transformers-tf.py * Fix + add a way more complete example (running on Ubuntu) * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/huggingface_hub/file_download.py Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> * Update src/huggingface_hub/file_download.py Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> * Only allow full revision hashes otherwise the `revision != commit_hash` test is not reliable * add a little bit more doc + consistency * Update src/huggingface_hub/snapshot_download.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update snapshot download * First pass on tests * Wrap up tests * 🐺 Fix for bug reported by @thomwolf see huggingface/huggingface_hub#801 (comment) * Special case for Windows * Address comments and docs * Clean up with ternary cc @julien-c * Add argument to `cached_download` * Opt-in for filename_to-url * Opt-in for filename_to-url * Pass the flag * Update docs/source/package_reference/file_download.mdx Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/huggingface_hub/file_download.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Address review comments Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
A new way to layout cached files on disk, unifying what we do for
snapshot_download
andcached_download
, and laying the ground for new features (ability to query the disk to figure out which local files/revisions we have)See old comment from last year inside
snapshot_download
:This new layout will be git-aware (compatible with versioning)
One picture is worth 1,000 words
Here's a screenshot of the cache's file tree generated by a few file downloads from a model repo (sample code below)
Preliminary mini-spec
refs
is a list of the latest knownrevision => commit_hash
pairsblobs
contains the actual file blobs (identified by theirgit-sha
orsha256
, depending on whether they're LFS files or not)snapshots
contains one subfolder per commit, each "commit" contains the subset of the files that have been resolved at that particular commit. Each filename is a symlink to the blob at that particular commit.Explaining the result
In the sample code I'm downloading the same two filenames at two different commits.
README.md
is different at both commits so there's both versions of it inblobs
.pytorch_model.bin
however, is the same at both commits, so it's not downloaded again, and the disk space is shared between the two snapshots. 🎉 🎉