[RFC] High-level and opinionated `huggingface-cli push` command #1352
Comments
wow this writeup is very 🤯 my first hunch would be to already this in […]
All in all, I'm wondering if we shouldn't instead add this stuff to […]. Looking forward to seeing more people chime in on this!
Agree on the fact that it would be less maintenance work to implement this in […]
@julien-c My plan so far: […]
It's done and released!! 🎉 🎉 See #1618
EDIT (before reading below): the motivation is the same but the plan has been slightly improved (see this comment).
Basically, the plan is:
- […] `delete_patterns` in `upload_folder`
- […] `commit_in_chunks` to chained commits in a PR
- […] `huggingface-cli push` from that

This is a proposal for "yet a new upload method". My idea is to provide end users with an easy-to-use command that does everything needed to push changes from a local directory to the Hub. It is quite opinionated about how to do things (see below), which means "no need to worry about details" at the cost of less flexibility. Since it is designed more for end users, I think it makes sense to provide it as a CLI command (in addition to a public method).
The basic idea is to implement a `push` method that lists local files, lists remote files and uploads the diff to a PR, potentially in several commits. Once done, the PR can be reviewed or merged directly. The workflow would be the same whether or not you have write permissions to the repo.
Here is what it would look like:
[…]
Or as a command line:
[…]
Process (TL;DR)
Here is a TL;DR of what would be done under the hood when calling this `push(...)` method. I made a more detailed version below.
[…]
Goals / Advantages
The goal for such a `push(...)` method is to:
- […] `git add .`, `git commit -m "..."`, `git push`. At the moment we know that git commands are not as fast as the HTTP methods but they are quite practical to use. Kinda related to "CLI interface for downloading files" (#1105), which asks for more CLI integrations.
- […] `upload_folder`. At the moment, `upload_folder` is nice but it's not really a proper "sync" method.
- If some files are already uploaded, they will be uploaded again. It is at the user's discretion to make sure this doesn't happen. => LFS files (i.e. the ones that matter) are never re-uploaded.
- […] `create_commit` method with `CommitOperationAdd` and `CommitOperationDelete` (quite manual).
- […] `create_commit` can theoretically handle 10k LFS files and a 1GB payload for regular files. In practice, 1k LFS files might already result in a timeout. The idea of `push(...)` would be to create a PR and push multiple commits to it before reviewing and merging.
Drawbacks
- […] `Repository`, `create_commit` and `upload_file`/`upload_folder`. Each of them serves a different purpose. This would be the most high-level (and opinionated) method we could do.
- […] `push(...)` method, we start the process again: listing remote files, listing local files, computing and comparing shas, ...
Alternatives
We have a few alternatives instead of creating "yet a new upload method".
- Make `upload_folder` more complete. For example, being able to check remote files before pushing.
- Make `create_commit` more robust. For example, do not re-upload LFS files if they have already been uploaded to S3 but the commit operation itself timed out. (Already the case.)
Process (detailed)
- […] `blobs=True` we can know for each file if it's an LFS or a regular one.
- […] `allow_patterns` / `ignore_patterns` to filter them, but it would also make sense to read the `.gitignore` attributes.
- […] `hf_hub_download`. Compute the sha of the local file and the cached file. If they differ, we want to upload it. Downloading regular files shouldn't be a problem in most cases (only "small" files and not a lot of them).
- […] `delete_missing` flag (defaults to `False` for safety).
- […] `plan_id`, i.e. an id unique to the plan that stays the same if we restart the command.
- […] `plan_commit_id`s, i.e. unique ids representing the content we plan to add in each commit. This is different from the commit OIDs.
- […] `"Auto-push using huggingface_hub ({plan_id})"`.
- […] `plan_id`, create a PR (to be merged to `main` by default).
- […] `"refs/pr/4"`) to push commits to. With `list_repo_commits` we can list which commits have already been pushed. We can use the commit message to set the `plan_commit_id`.
- […] `plan_commit_id`. Commit description: list of operations. Use `parent_commit` to ensure we don't mess with the history.
- […] `merge_pr` flag is set) or return the PR url for human verification.
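
To make the planning part of the process more concrete, here is a rough sketch of how the diff, the `plan_id` and the `plan_commit_id`s could be computed. This is not an implementation from `huggingface_hub`: every function name (`local_shas`, `plan_operations`, `build_plan`, `pending_commits`) is hypothetical, the remote listing is passed in as a plain `path -> sha` dict instead of being fetched from the Hub, and sha256 is used for every file as a simplification. It only uses the standard library.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List, Tuple

Operation = Tuple[str, str]  # ("add" | "delete", path_in_repo)
PlannedCommit = Tuple[str, List[Operation]]  # (plan_commit_id, operations)


def local_shas(folder: Path) -> Dict[str, str]:
    """Map each file's POSIX-style relative path to the sha256 of its content."""
    return {
        p.relative_to(folder).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def plan_operations(
    local: Dict[str, str], remote: Dict[str, str], delete_missing: bool = False
) -> List[Operation]:
    """Add files that are new or changed; optionally delete remote-only files."""
    ops = [("add", path) for path, sha in sorted(local.items()) if remote.get(path) != sha]
    if delete_missing:  # off by default for safety, as in the proposal
        ops += [("delete", path) for path in sorted(remote) if path not in local]
    return ops


def stable_id(payload) -> str:
    """Deterministic short id: the same payload gives the same id on every run."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def build_plan(
    local: Dict[str, str],
    remote: Dict[str, str],
    delete_missing: bool = False,
    chunk_size: int = 50,  # assumed default, not a value from the proposal
) -> Tuple[str, List[PlannedCommit]]:
    """Return (plan_id, [(plan_commit_id, operations), ...])."""
    ops = plan_operations(local, remote, delete_missing)
    # Include the local sha so that the ids change whenever the content changes.
    described = [(kind, path, local.get(path)) for kind, path in ops]
    plan_id = stable_id(described)
    commits = []
    for start in range(0, len(ops), chunk_size):
        end = start + chunk_size
        commits.append((stable_id([plan_id, described[start:end]]), ops[start:end]))
    return plan_id, commits


def pending_commits(
    commits: List[PlannedCommit], pushed_messages: List[str]
) -> List[PlannedCommit]:
    """Skip planned commits whose id already appears in a pushed commit message."""
    return [
        (commit_id, ops)
        for commit_id, ops in commits
        if not any(commit_id in message for message in pushed_messages)
    ]
```

Because the ids are derived only from the planned operations and the local shas, restarting the command on an unchanged folder yields the same `plan_id` and `plan_commit_id`s, so commits already present in the PR can be skipped by looking at their messages. Actually listing remote files, creating the PR and pushing each chunk are left out on purpose.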