
[RFC] High-level and opinionated huggingface-cli push command #1352

Closed

Wauplin opened this issue Feb 21, 2023 · 4 comments

Comments
@Wauplin (Contributor) commented Feb 21, 2023

EDIT (before reading below): the motivation is unchanged but the plan has been slightly improved (see this comment).
Basically, the plan is:

  • add delete_patterns to upload_folder
  • implement commit_in_chunks to chain commits in a PR
  • implement huggingface-cli push on top of that

This is a proposal for "yet another upload method". My idea is to provide end users with an easy-to-use command that does everything needed to push changes from a local directory to the Hub. It is quite opinionated about how to do things (see below), which means "no need to worry about details" at the cost of less flexibility. Since it's designed more for end users, I think it makes sense to provide it as a CLI command (in addition to a public method).

The basic idea is to implement a push method that lists local files, lists remote files, and uploads the diff to a PR, potentially in several commits. Once done, the PR can be reviewed or merged directly. The workflow would be the same whether or not you have write permissions on the repo.

Here is what it would look like:

def push(  # To be renamed
    local_dir: Union[str, Path],
    repo_id: str,
    repo_type: Optional[str] = None,
    delete_missing: bool = False,  # If True => delete remote files that are missing locally
    allow_patterns: Optional[Union[List[str], str]] = None,  # or --include
    ignore_patterns: Optional[Union[List[str], str]] = None,  # or --exclude
    merge_pr: bool = False,  # by default, commits are pushed to a PR. If True => merge without verification
    revision: Optional[str] = None,  # set target branch (defaults to `main`)
    max_workers: int = 8,
    ...  # token, user-agent,...
) -> ...:  # To be defined
    pass

Or as a command line:

huggingface-cli push \
    --repo-id=... \
    --repo-type=... \
    --include=... \
    --exclude=... \
    --delete-missing \
    --merge \
    --max-workers=...

Process (TL;DR)

Here is a TL;DR of what would happen under the hood when calling this push(...) method. A more detailed version follows below.

  1. Create repo/list remote files.
  2. List local files to push.
  3. Check which files to upload or delete.
  4. Define a plan (i.e. the list of commits we want to do). In most cases, only 1 commit.
  5. Create a PR.
  6. Push commits 1 by 1 to this PR.
  7. (optional) Merge the PR.
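Steps 2–3 of the process above essentially reduce to a diff between two path→sha mappings. Here is a minimal sketch of that comparison; `compute_diff` and its signature are hypothetical helpers for illustration, not an existing huggingface_hub API:

```python
def compute_diff(local_shas: dict, remote_shas: dict, delete_missing: bool = False):
    """Compare local and remote file listings by content hash.

    Returns (to_upload, to_delete): paths that are new or whose content
    differs, and (only if delete_missing) remote paths with no local
    counterpart.
    """
    to_upload = [
        path for path, sha in local_shas.items()
        if remote_shas.get(path) != sha  # new file, or content changed
    ]
    to_delete = (
        [path for path in remote_shas if path not in local_shas]
        if delete_missing
        else []
    )
    return to_upload, to_delete
```

With this split in hand, the remaining steps only have to turn `to_upload`/`to_delete` into commit operations and push them to the PR.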

Goals / Advantages

The goal for such a push(...) method is to:

  1. Be as easy to use as git add ., git commit -m "...", git push. At the moment, git commands are not as fast as the HTTP methods, but they are quite practical to use. Kinda related to CLI interface for downloading files #1105, which asks for more CLI integrations.
  2. Be used by end-users. The primary goal is not to be integrated in third-party libraries but really to ease the workflow for users. Integrations using the existing methods are good enough IMO.
  3. Be more complete than upload_folder. At the moment, upload_folder is nice but it's not really a proper "sync" method.
    1. If some files are already uploaded, they will be uploaded again. It is at the user's discretion to make sure this doesn't happen. => EDIT: LFS files (i.e. the ones that matter) are never re-uploaded.
    2. If some files have been deleted locally, they are not deleted on the remote branch. The user has to use the create_commit method with CommitOperationAdd and CommitOperationDelete (quite manual).
  4. Handle arbitrary large commits. At the moment, create_commit can theoretically handle 10k LFS files and a 1GB payload for regular files. In practice, 1k LFS files might already result in a timeout. The idea of push(...) would be to create a PR, push multiple commits to it before reviewing + merging.
  5. Resume the upload if an exception happened. When uploading large files or many files, exceptions can occur along the way. At the moment, it is necessary to start over from the beginning. By pushing a series of smaller commits, we can resume more easily.
  6. (optional) Allow the same workflow whether or not you have write permission on the repo. In any case, changes are pushed to a PR. If you have write permission, you can directly merge the PR, but the workflow is still the same.
  7. (optional) Have an interactive/verbose TUI to help users push their stuff.
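Goals 4 and 5 (arbitrarily large commits, resumability) both rely on splitting the flat list of operations into commit-sized chunks, so that a failure midway only requires re-pushing the remaining chunks. A minimal sketch; the helper name and the per-commit cap are hypothetical, not part of huggingface_hub:

```python
def chunk_operations(operations: list, max_files_per_commit: int = 100) -> list:
    """Split a flat list of commit operations into commit-sized chunks.

    Each chunk would become one commit pushed to the PR; the cap keeps
    every commit well below the sizes where create_commit starts timing out.
    """
    return [
        operations[i : i + max_files_per_commit]
        for i in range(0, len(operations), max_files_per_commit)
    ]
```

In practice the cap could also account for payload size (regular files are limited per commit), not only file count.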

Drawbacks

  1. This would be yet another way to commit stuff on the Hub. We already have Repository, create_commit and upload_file/upload_folder. Each of them serves a different purpose. This would be the most high-level (and opinionated) method we could do.
  2. The closer we get to git commands, the more expectations users will have. This proposal is more of a "weak" version of git, in the sense that each time we call this push(...) method, we restart the whole process of listing remote files, listing local files, and computing and comparing SHAs.
  3. Am I reinventing the wheel when improving our git-CLI support would be enough? Creating a PR to play the role of a "temporary commit" feels hacky but can still be valid.

Alternatives

We have a few alternatives to creating "yet another upload method".

  • Make upload_folder more complete. For example, being able to check remote files before pushing.
  • Make create_commit more robust. For example, do not re-upload LFS files if they have already been uploaded to S3 but the commit operation itself timed out. => EDIT: already the case.
  • Make server-side changes to handle "multi-step commits". Instead of pushing multiple commits to a branch and making a PR, we could sequentially update an ongoing commit. It would be kind of like transactional DBs, where changes are not effective until a db.commit() call.

Process (detailed)

  1. Check if repo exists + list files. Using blobs=True we can know for each file if it's an LFS or a regular one.
  2. List local files. Use allow_patterns/ignore_patterns to filter them, but it would also make sense to read the .gitignore file.
  3. For each file:
    1. If file is on local and on remote:
      1. If the file is tracked as LFS: compute the sha of the local file. The sha of the remote file is already returned, so we can compare. If it differs from the remote, we want to upload the file.
      2. If the file is tracked as a regular file: download the remote file into the cache using hf_hub_download. Compute the sha of the local file and the cached file. If they differ, we want to upload it. Downloading regular files shouldn't be a problem in most cases (only "small" files, and not many of them).
    2. If file on local but not remote: we want to upload it.
    3. If the file is remote but not local: either leave it or delete it, depending on a delete_missing flag (defaulting to False for safety).
  4. Now that we know which files we want to upload/delete, we "have a plan". A plan is the list of commits we plan to make + the list of operations in each commit. From this plan we should be able to:
    1. Generate a plan_id, i.e. an id unique to the plan that stays the same if we restart the command.
    2. Generate a list of plan_commit_ids, i.e. unique ids representing the content we plan to add in each commit. These are different from the commit OIDs.
  5. Create a PR with title "Auto-push using huggingface_hub ({plan_id})".
    1. List all open PRs and scan the PR titles. If we cannot find the plan_id, create a PR (to be merged into main by default).
    2. If the PR already exists, use it. This acts as the "resume upload" feature.
  6. Now we have a PR ref (e.g. "refs/pr/4") to push commits to. With list_repo_commits we can list which commits have already been pushed. We can use the commit message to store the plan_commit_id.
  7. Push the remaining commits! Commit title: the plan_commit_id. Commit description: the list of operations. Use parent_commit to ensure we don't mess up the history.
  8. Once all commits have been pushed, we can either merge the PR directly (if the merge_pr flag is set) or return the PR url for human verification.
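The plan_id in step 4 just needs to be deterministic over the plan's content so that re-running the command on unchanged files produces the same id (which is what makes step 5's "resume" lookup work). A minimal sketch, assuming each operation is an (op, path, sha) tuple; `compute_plan_id` is a hypothetical helper, not an existing huggingface_hub function:

```python
import hashlib

def compute_plan_id(operations) -> str:
    """Derive a deterministic id for a plan.

    `operations` is an iterable of (op, path, sha) string tuples. Sorting
    before hashing makes the id independent of listing order, so restarting
    the command on identical content yields the same id.
    """
    payload = "\n".join("\t".join(op) for op in sorted(operations))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

The same recipe applied per-chunk would give the plan_commit_ids, letting step 6 match already-pushed commits by message.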
@julien-c (Member) commented
wow this writeup is very 🤯

my first hunch would be to already do this in upload_folder (at least the not-reuploading-stuff part) instead of adding a new method, but I guess the issue is when we also add:

  • remote deletes of missing local files
  • handle large commits by opening PRs (i do like the design of opening a PR)

All in all, I'm wondering if we shouldn't instead add this stuff to upload_folder, potentially as options... Even though for now upload_folder is a lower-level, ~single-endpoint call, so 🤷

Looking forward to seeing more people chime in on this!

@Wauplin (Author) commented Feb 21, 2023

Agreed about the "we shouldn't re-upload LFS files to S3" part in upload_folder (or any create_commit). I created a separate issue (#1355) since it's a bit orthogonal to the initial topic (the suggested huggingface-cli push-like command). => EDIT: LFS files are not re-uploaded, I was wrong about that (see code).

Agreed that it would be less maintenance work to implement this in upload_folder directly, but it's also not doing exactly the same thing. I'm not against changing the behavior of upload_folder if we think it's best.

@Wauplin (Author) commented Mar 2, 2023

@julien-c My plan so far:

  1. Add delete_patterns option to upload_folder #1370
    Add a delete_patterns argument to upload_folder => could simplify quite a few integrations by doing a "clean repo + upload local folder" in a single call (no need to iterate manually or use the low-level CommitOperationAdd/Delete stuff).
  2. Create a commit_by_chunks endpoint to create a PR + push multiple commits + merge it, as described in the issue description.
  3. Implement the huggingface-cli push command, which would be an alias for upload_folder(..., delete_patterns="*", chunked_commits=True).
  4. For huggingface-cli push (and even upload_folder?) => respect the .gitignore file if it exists in the folder.
  5. Implement optimizations? (do not re-upload regular files that have already been uploaded - requires a local copy - might not be worth it)
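The delete_patterns semantics from step 1 can be sketched with fnmatch-style glob matching: remote files matching any pattern are scheduled for deletion unless a local file re-uploads the same path. This is a rough sketch of the intended behavior, not the library's implementation; `files_to_delete` is a hypothetical helper:

```python
from fnmatch import fnmatch

def files_to_delete(remote_files, delete_patterns, local_files) -> list:
    """Remote paths matching a delete pattern and not re-uploaded locally."""
    if isinstance(delete_patterns, str):
        delete_patterns = [delete_patterns]
    local = set(local_files)
    return [
        path
        for path in remote_files
        if any(fnmatch(path, pattern) for pattern in delete_patterns)
        and path not in local  # a re-uploaded file wins over the delete
    ]
```

With delete_patterns="*", this degenerates to "delete everything remote that isn't in the local folder", which is exactly the "clean repo + upload" single call described above.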

@Wauplin (Author) commented Sep 29, 2023

It's done and released!! 🎉 🎉

See #1618

@Wauplin Wauplin closed this as completed Sep 29, 2023