-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harmonize HF environment variables + other cleaning #27564
Conversation
# TODO: deprecate `get_file_from_repo` or document it differently? | ||
# Docstring is exactly the same as `cached_repo` but behavior is slightly different. If file is missing or if | ||
# there is a connection error, `cached_repo` will return None while `get_file_from_repo` will raise an error. | ||
# IMO we should keep only 1 method and have a single `raise_error` argument (to be discussed). | ||
def get_file_from_repo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be discussed. I tend to prefer to have a single cached_file
method with a raise_on_error
argument (and with the correct typing to let the user know when to expect a None value and when it's not possible). I can do it here or in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd vote to uniformize everything under hfh's banner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'd be ok for moving the logic to huggingface_hub
but this method doesn't exist at the moment.
In hfh we have hf_hub_download
that does the job to download/retrieve a file given a repo_id
+ filename
. In transformers
, cached_file
is slightly different as it accept either a repo_id
or a local path
(in path_or_repo_id
parameter), making it more versatile.
So for now I'd be more in favor of uniforming within transformers
and then later move it to huggingface_hub
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_file_from_repo
calls cached_file
(while that method has True
as default values
_raise_exceptions_for_missing_entries=False,
_raise_exceptions_for_connection_errors=False,
Not against to keep only one of them. But the changes might get larger. I would personally do this in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, let's wait for a separate PR 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this important work @Wauplin!
) | ||
from huggingface_hub.utils._deprecation import _deprecate_method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this a public method to leverage?
Not ideal to depend on private attributes from hfh 🫣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I realized about that while working on this PR 😭
I'm fine with either making them public (huggingface_hub.utils.deprecate_method
?) which would require to wait next hfh release or making an exception for this one since it's a quite internal method anyway (so using private attribute in transformers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in #27564 (comment) I suggest to still use the private attribute until it's made more official in hfh side. Worse case scenario, the plan is to remove this in 3 transformers releases.
# TODO: deprecate `get_file_from_repo` or document it differently? | ||
# Docstring is exactly the same as `cached_repo` but behavior is slightly different. If file is missing or if | ||
# there is a connection error, `cached_repo` will return None while `get_file_from_repo` will raise an error. | ||
# IMO we should keep only 1 method and have a single `raise_error` argument (to be discussed). | ||
def get_file_from_repo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd vote to uniformize everything under hfh's banner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the work @Wauplin !
Left a few comments / questions, but overall it's good to me
# TODO: deprecate `get_file_from_repo` or document it differently? | ||
# Docstring is exactly the same as `cached_repo` but behavior is slightly different. If file is missing or if | ||
# there is a connection error, `cached_repo` will return None while `get_file_from_repo` will raise an error. | ||
# IMO we should keep only 1 method and have a single `raise_error` argument (to be discussed). | ||
def get_file_from_repo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_file_from_repo
calls cached_file
(while that method has True
as default values
_raise_exceptions_for_missing_entries=False,
_raise_exceptions_for_connection_errors=False,
Not against to keep only one of them. But the changes might get larger. I would personally do this in a separate PR.
Thank you for the explanation. Just one more nit question, in huggingface/huggingface_hub#1786, there is a table of the re-mapped names, like |
Thanks for the reviews @LysandreJik and @ydshieh! I addressed 2 points:
About using Finally, regarding removing @LysandreJik @ydshieh could you re-review the last changes + message above and approve the PR? I think we are good to merge the current version if that is ok with you. Thanks in advance! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
The documentation is not available anymore as the PR was closed or merged. |
I'll review tomorrow! Thanks for this refactor 😉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - really nice tidy up!
Only Q is about existing HUGGINGFACE_HUB_CACHE
values in the environment
Nvm - I saw the discussion here: #27564 (comment)
# In code, use `HF_HUB_CACHE` as the default cache path. This variable is set by the library and is guaranteed | ||
# to be set to the right value. | ||
# | ||
# TODO: clean this for v5? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Yay! 3.5 approvals, that's enough! Thanks everyone 🤗 |
This PR aims at harmonizing environment variables in the HF ecosystem (following up on huggingface/huggingface_hub#1786). I also took the time to review some integration of
huggingface_hub
intotransformers
(trying to break nothing). The goal is simply to have less duplication in the codebase. The PR is not ready to be merged as is (see comments) but can be reviewed to discuss the few points. Feedback is very welcome @LysandreJik @amyeroberts @ArthurZucker @ydshieh as I'm necesarily missing some context sometimes.List of changes:
HF_HOME
is the preferred way to set a custom hf path. Second best solution isHF_HUB_CACHE
(hf_hub_cache is only for model/dataset caching while hf_home also contains the token and the cached modules -when trust_remote_code-). Therefore:PYTORCH_PRETRAINED_BERT_CACHE
is deprecated (v5) but still respectedPYTORCH_TRANSFORMERS_CACHE
is deprecated (v5) but still respectedTRANSFORMERS_CACHE
is deprecated (v5) but still respectedDISABLE_TELEMETRY
is deprecated in favor ofHF_HUB_DISABLE_TELEMETRY
get_cached_models
=> looks like a legacy unused method (and not compatible with new cache system)try_to_load_from_cache
(moved tohuggingface_hub
)get_file_from_repo
andcached_file
to keep only one of them. The first one raises if the file doesn't exist while the second returns None in such a case. I think having a single method and a publicraise_on_error
argument should be enough (but open to discussion). Also both methods have the exact same docstring which is misleading.huggingface_hub.send_telemetry
to send telemetry data. Result is exactly the same but the HTTP call in made in a separate thread (meaning better UX as the user is not blocked)download_url
(already deprecated): it's safer to close the file