Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite push_to_hub to use upload_files #18366

Merged
merged 3 commits into from
Aug 1, 2022
Merged

Rewrite push_to_hub to use upload_files #18366

merged 3 commits into from
Aug 1, 2022

Conversation

sgugger
Copy link
Collaborator

@sgugger sgugger commented Jul 29, 2022

What does this PR do?

As asked in #18299 this PR rewrites the PushToHubMixin completely to use upload_file behind the scenes. This is breaking since there is no more git repository involved, so while the user might have been left with a proper repository in the past, this is not the case anymore. The trade-off is that there is no need to clone anything now, so the upload should be faster.

Another breaking change is to default use_temp_files to None now, which then defaults to True if there is a local folder of the same name, False otherwise. This felt way more natural with this change, but we can discuss and revisit if needed. For the rest, normally backward compatibility is ensured with proper deprecation warnings when needed.

Using push_to_hub=True in save_pretrained is adapted to this rewrite but with 0 breaking change normally.

You can see in the tests how this simplifies a lot of things at the end of the day, so I'm quite happy with the final API.

Oh and one last change I made is to move the code specific to TensorFlow models in the TF modeling utils file.

@sgugger sgugger requested a review from LysandreJik July 29, 2022 18:42
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jul 29, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Member

@julien-c julien-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also cc'ing @Wauplin for visibility (and potential review)

Comment on lines 935 to 941
url = upload_file(
path_or_fileobj=os.path.join(working_dir, file),
path_in_repo=file,
repo_id=repo_id,
token=token,
commit_message=commit_message,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not upload all files in one commit, using create_commit (or upload_folder)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this could be updated to leverage either of the two so that a single commit is done.

The create_commit docs are available here.

The upload_folder docs are available here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using upload_folder, you won't need to track modified files with their timestamps: if the files have the same name as others on the Hub and contain the same values, they won't get pushed (so if you have local checkpoints that have already been pushed it won't push them again, only verify if they're on the Hub).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to use upload_folder instead, it seems like it's an easy enough drop-in replacement

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

upload_folder is not ideal here as we may be in a folder where the user has some other stuff (like training artifacts) they didn't explicitly asked us to push. But create_commit should work.

(I was asked to use upload_file looks like my instructions were not correct 😝 )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like you have to take your "instructions" from the right persons=)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🫣

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR @sgugger!

It looks great to me, and uses a non-deprecated API which is great. I've added pointers for the switch from upload_file to upload_folder, which is a drop-in replacement.

I've tested that all tests pass locally, the only change that will need to be done is in the commit message attributed to the upload.

>>> pt_model.push_to_hub("my-awesome-model", organization="my-awesome-org")
>>> pt_model.push_to_hub("my-awesome-org/my-awesome-model")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very welcome change

repo_id = f"{organization}/{repo_id}"

token = HfFolder.get_token() if use_auth_token is True else use_auth_token
url = create_repo(repo_id=repo_id, token=token, private=private, exist_ok=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this prepares transformers for huggingface_hub version 0.10.0 (was previously using deprecated behavior)

Comment on lines 935 to 941
url = upload_file(
path_or_fileobj=os.path.join(working_dir, file),
path_in_repo=file,
repo_id=repo_id,
token=token,
commit_message=commit_message,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this could be updated to leverage either of the two so that a single commit is done.

The create_commit docs are available here.

The upload_folder docs are available here.

Comment on lines 935 to 941
url = upload_file(
path_or_fileobj=os.path.join(working_dir, file),
path_in_repo=file,
repo_id=repo_id,
token=token,
commit_message=commit_message,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using upload_folder, you won't need to track modified files with their timestamps: if the files have the same name as others on the Hub and contain the same values, they won't get pushed (so if you have local checkpoints that have already been pushed it won't push them again, only verify if they're on the Hub).

Comment on lines 926 to 942
modified_files = [
f
for f in os.listdir(working_dir)
if f not in files_timestamps or os.path.getmtime(os.path.join(working_dir, f)) > files_timestamps[f]
]
for file in modified_files:
if commit_message is None:
commit_message = f"Upload {file}."
logger.info(f"Uploading {file} to {repo_id}")
url = upload_file(
path_or_fileobj=os.path.join(working_dir, file),
path_in_repo=file,
repo_id=repo_id,
token=token,
commit_message=commit_message,
)
logger.info(f"{file} was uploaded, you can find it at {url}.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
modified_files = [
f
for f in os.listdir(working_dir)
if f not in files_timestamps or os.path.getmtime(os.path.join(working_dir, f)) > files_timestamps[f]
]
for file in modified_files:
if commit_message is None:
commit_message = f"Upload {file}."
logger.info(f"Uploading {file} to {repo_id}")
url = upload_file(
path_or_fileobj=os.path.join(working_dir, file),
path_in_repo=file,
repo_id=repo_id,
token=token,
commit_message=commit_message,
)
logger.info(f"{file} was uploaded, you can find it at {url}.")
url = upload_folder(
repo_id=repo_id,
folder_path=working_dir,
path_in_repo='.',
commit_message=commit_message, token=token
)
logger.info(f"New version was uploaded, you can find it at {url}.")

@@ -36,12 +36,13 @@

import requests
from filelock import FileLock
from huggingface_hub import HfFolder, Repository, create_repo, list_repo_files, whoami
from huggingface_hub import HfFolder, create_repo, list_repo_files, upload_file, whoami
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from huggingface_hub import HfFolder, create_repo, list_repo_files, upload_file, whoami
from huggingface_hub import HfFolder, create_repo, list_repo_files, upload_folder, whoami

Comment on lines 935 to 941
url = upload_file(
path_or_fileobj=os.path.join(working_dir, file),
path_in_repo=file,
repo_id=repo_id,
token=token,
commit_message=commit_message,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to use upload_folder instead, it seems like it's an easy enough drop-in replacement

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! We'll likely adapt it for the Hub mixin in huggingface-hub.

LGTM!

@sgugger sgugger merged commit 01db72a into main Aug 1, 2022
@sgugger sgugger deleted the revamp_push_to_hub branch August 1, 2022 16:07
Comment on lines +2490 to 2491
Upload the model files to the 🤗 Model Hub while synchronizing a local clone of the repo in `repo_path_or_name`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small detail: does it make sense to refer to repo_id instead of repo_path_or_name since the later seems to be deprecated ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just realised it is a merged PR anyway 😬

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can still address ;-)

oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
* Rewrite push_to_hub to use upload_files

* Adapt the doc a bit

* Address review comments and clean doc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants