Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harmonize HF environment variables + other cleaning #27564

Merged
merged 4 commits into from
Nov 21, 2023

Conversation

Wauplin
Copy link
Contributor

@Wauplin Wauplin commented Nov 17, 2023

This PR aims at harmonizing environment variables in the HF ecosystem (following up on huggingface/huggingface_hub#1786). I also took the time to review some integration of huggingface_hub into transformers (trying to break nothing). The goal is simply to have less duplication in the codebase. The PR is not ready to be merged as is (see comments) but can be reviewed to discuss the few points. Feedback is very welcome @LysandreJik @amyeroberts @ArthurZucker @ydshieh as I'm necesarily missing some context sometimes.

List of changes:

  • HF_HOME is the preferred way to set a custom hf path. Second best solution is HF_HUB_CACHE (hf_hub_cache is only for model/dataset caching while hf_home also contains the token and the cached modules -when trust_remote_code-). Therefore:
    • using PYTORCH_PRETRAINED_BERT_CACHE is deprecated (v5) but still respected
    • using PYTORCH_TRANSFORMERS_CACHE is deprecated (v5) but still respected
    • using TRANSFORMERS_CACHE is deprecated (v5) but still respected
  • DISABLE_TELEMETRY is deprecated in favor of HF_HUB_DISABLE_TELEMETRY
  • added a deprecation warning around get_cached_models => looks like a legacy unused method (and not compatible with new cache system)
  • removed try_to_load_from_cache (moved to huggingface_hub)
  • unsure: I think we should harmonize between get_file_from_repo and cached_file to keep only one of them. The first one raises if the file doesn't exist while the second returns None in such a case. I think having a single method and a public raise_on_error argument should be enough (but open to discussion). Also both methods have the exact same docstring which is misleading.
  • use huggingface_hub.send_telemetry to send telemetry data. Result is exactly the same but the HTTP call in made in a separate thread (meaning better UX as the user is not blocked)
  • (nit) in download_url (already deprecated): it's safer to close the file

Comment on lines +459 to 463
# TODO: deprecate `get_file_from_repo` or document it differently?
# Docstring is exactly the same as `cached_repo` but behavior is slightly different. If file is missing or if
# there is a connection error, `cached_repo` will return None while `get_file_from_repo` will raise an error.
# IMO we should keep only 1 method and have a single `raise_error` argument (to be discussed).
def get_file_from_repo(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be discussed. I tend to prefer to have a single cached_file method with a raise_on_error argument (and with the correct typing to let the user know when to expect a None value and when it's not possible). I can do it here or in a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd vote to uniformize everything under hfh's banner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'd be ok for moving the logic to huggingface_hub but this method doesn't exist at the moment.

In hfh we have hf_hub_download that does the job to download/retrieve a file given a repo_id + filename. In transformers, cached_file is slightly different as it accept either a repo_id or a local path (in path_or_repo_id parameter), making it more versatile.

So for now I'd be more in favor of uniforming within transformers and then later move it to huggingface_hub.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_file_from_repo calls cached_file (while that method has True as default values

        _raise_exceptions_for_missing_entries=False,
        _raise_exceptions_for_connection_errors=False,

Not against to keep only one of them. But the changes might get larger. I would personally do this in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, let's wait for a separate PR 👍

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this important work @Wauplin!

)
from huggingface_hub.utils._deprecation import _deprecate_method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this a public method to leverage?

Not ideal to depend on private attributes from hfh 🫣

Copy link
Contributor Author

@Wauplin Wauplin Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I realized about that while working on this PR 😭

I'm fine with either making them public (huggingface_hub.utils.deprecate_method ?) which would require to wait next hfh release or making an exception for this one since it's a quite internal method anyway (so using private attribute in transformers).

Copy link
Contributor Author

@Wauplin Wauplin Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #27564 (comment) I suggest to still use the private attribute until it's made more official in hfh side. Worse case scenario, the plan is to remove this in 3 transformers releases.

Comment on lines +459 to 463
# TODO: deprecate `get_file_from_repo` or document it differently?
# Docstring is exactly the same as `cached_repo` but behavior is slightly different. If file is missing or if
# there is a connection error, `cached_repo` will return None while `get_file_from_repo` will raise an error.
# IMO we should keep only 1 method and have a single `raise_error` argument (to be discussed).
def get_file_from_repo(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd vote to uniformize everything under hfh's banner.

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the work @Wauplin !

Left a few comments / questions, but overall it's good to me

src/transformers/utils/hub.py Show resolved Hide resolved
src/transformers/utils/hub.py Outdated Show resolved Hide resolved
src/transformers/utils/hub.py Show resolved Hide resolved
Comment on lines +459 to 463
# TODO: deprecate `get_file_from_repo` or document it differently?
# Docstring is exactly the same as `cached_repo` but behavior is slightly different. If file is missing or if
# there is a connection error, `cached_repo` will return None while `get_file_from_repo` will raise an error.
# IMO we should keep only 1 method and have a single `raise_error` argument (to be discussed).
def get_file_from_repo(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_file_from_repo calls cached_file (while that method has True as default values

        _raise_exceptions_for_missing_entries=False,
        _raise_exceptions_for_connection_errors=False,

Not against to keep only one of them. But the changes might get larger. I would personally do this in a separate PR.

src/transformers/utils/hub.py Outdated Show resolved Hide resolved
src/transformers/utils/hub.py Show resolved Hide resolved
@ydshieh
Copy link
Collaborator

ydshieh commented Nov 20, 2023

Thank you for the explanation. Just one more nit question, in huggingface/huggingface_hub#1786, there is a table of the re-mapped names, like HUGGINGFACE_HUB_CACHE -> HF_CACHE. I am wondering why we still have the usage of constants.HUGGINGFACE_HUB_CACHE.

@Wauplin
Copy link
Contributor Author

Wauplin commented Nov 20, 2023

Thanks for the reviews @LysandreJik and @ydshieh!

I addressed 2 points:

  • do not use use_auth_token in deprecated code (see here)
  • use huggingface_hub.constants.HF_HUB_CACHE instead of huggingface_hub.constants.HUGGINGFACE_HUB_CACHE (see here)

About using from huggingface_hub.utils._deprecation import _deprecate_method I would be keen to keep it as it is for now if that's ok with you @LysandreJik. It's not ideal but still ok I think. In next release of hfh I'll make it more official and in any case _deprecate_method will be used only in the next 3 versions of transformers (so EoL quite soon :) ).

Finally, regarding removing get_file_from_repo in favor of a unified cached_file` let's do it as a separate PR as suggested by @ydshieh. It is not really related to env variable harmonization.


@LysandreJik @ydshieh could you re-review the last changes + message above and approve the PR? I think we are good to merge the current version if that is ok with you. Thanks in advance!

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker
Copy link
Collaborator

I'll review tomorrow! Thanks for this refactor 😉

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - really nice tidy up!

Only Q is about existing HUGGINGFACE_HUB_CACHE values in the environment

Nvm - I saw the discussion here: #27564 (comment)

# In code, use `HF_HUB_CACHE` as the default cache path. This variable is set by the library and is guaranteed
# to be set to the right value.
#
# TODO: clean this for v5?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

src/transformers/utils/hub.py Show resolved Hide resolved
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Wauplin
Copy link
Contributor Author

Wauplin commented Nov 21, 2023

Yay! 3.5 approvals, that's enough! Thanks everyone 🤗

@Wauplin Wauplin merged commit d2a980e into main Nov 21, 2023
21 checks passed
@Wauplin Wauplin deleted the harmonize-environment-variables branch November 21, 2023 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants