add push_to_hub to pipeline #29172

Merged on Apr 16, 2024 (24 commits)

@not-lain
Contributor

not-lain commented Feb 21, 2024

What does this PR do?

This adds a push_to_hub method to pipelines, allowing people to push their custom pipelines to the Hugging Face Hub easily.

Fixes #28857 #28983

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Narsil @Rocketknight1

additional resources :

TODO:

  • fix documentation for push_to_hub
  • fix automap configuration for initial push (keeps adding a user/repo--file.module)
  • update save_pretrained method (check out the PreTrainedModel class for more info) (enhancement)
  • add tests

@not-lain not-lain mentioned this pull request Feb 22, 2024
5 tasks
@not-lain
Contributor Author

not-lain commented Feb 22, 2024

@ArthurZucker @Rocketknight1
Finally fixed the docs.

  • As for the tests, I will leave that part to you.
  • As for the configuration, I just wanted to highlight what needs to be fixed. It's consistent enough as is, and according to Sylvian, people do not always push the pipeline to the same repo that contains the model, which makes this tricky. My suggestion is to leave the configuration for now and open a separate issue for it.

TL;DR:
Assuming the model is in another remote repo is better than assuming it's in the same one we're pushing to and messing up the configuration.

any reviews, comments or ideas are much appreciated.

@not-lain
Contributor Author

Almost forgot: #29004 will fix most problems with remote pipeline configuration (it adds the remote repo flag user/repo--file.module to the custom-pipeline field). That leaves this as the final configuration inconsistency: it comes from the auto_config instead, which adds extra unnecessary remote flags. Maybe we should add an if/else there that checks whether the pipeline is being pushed to the same repo as the original model.
A rough estimate of the code at L938 of the same file could look like this:

if self.model.config._name_or_path != repo_id:
    custom_object_save(self, save_directory)

let me know if you approve of this

@not-lain
Contributor Author

@ArthurZucker @Rocketknight1 After careful investigation, it turns out that the extra flag comes from AutoModelForxxx and NOT from the new push_to_hub method, so I'm removing it from the TODO list since it's irrelevant to this pull request.
Reproduction:
https://colab.research.google.com/drive/1unFh3i5FyRRHcUO8Al7cLKkYXPhtr0lC?usp=sharing

@not-lain
Contributor Author

not-lain commented Feb 28, 2024

@Rocketknight1 I have updated save_pretrained a little, because it's coupled with the push_to_hub method. I have tried to cover as much ground as possible, and since this pull request is about the push_to_hub method, I will stop here.

Since I don't know which repo you want the tests to push to, I will leave adding the tests to you. This notebook should help when creating them: https://colab.research.google.com/drive/130IpVrScW8cNomEDY2Fa6-mA4_VrRmgT?usp=sharing

awaiting review

Member

@Rocketknight1 Rocketknight1 left a comment

Conditionally approving this PR too, except for one TODO, and the comment that I think we probably need to refactor / reconsider our model of custom pipelines and how they should be saved/loaded. See this comment for more.

Comment on lines 946 to 948
# TODO:
# deprecate the safe_serialization parameter and use kwargs instead
# or update the save_pretrained to get all the parameters such as max_shard_size, ...
Member

TODO not finished here!

Contributor Author

Since we are passing everything on to the model's save_pretrained as kwargs:

+ kwargs["safe_serialization"] = safe_serialization
+ self.model.save_pretrained(save_directory, **kwargs)

I felt we should switch to a kwargs-based signature.
Also, yeah, you are right: there is no need for a deprecation or any other change, it already works as is. Should I remove that comment?
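
For readers following along, the forwarding pattern discussed here can be sketched roughly as below. `FakeModel` and `FakePipeline` are hypothetical stand-ins, not the transformers classes, and the signature is an assumption. Only the two lines that fold `safe_serialization` into `kwargs` mirror the diff quoted above.

```python
class FakeModel:
    """Hypothetical stand-in for a model; records what save_pretrained received."""

    def __init__(self):
        self.saved_with = None

    def save_pretrained(self, save_directory, **kwargs):
        # A real model would write weights here; this stub only records the call.
        self.saved_with = {"save_directory": save_directory, **kwargs}


class FakePipeline:
    """Hypothetical stand-in for a pipeline that wraps a model."""

    def __init__(self, model):
        self.model = model

    def save_pretrained(self, save_directory, safe_serialization=True, **kwargs):
        # Keep the explicit parameter in the signature, then fold it into kwargs
        # so extra options (e.g. max_shard_size) reach the model unchanged.
        kwargs["safe_serialization"] = safe_serialization
        self.model.save_pretrained(save_directory, **kwargs)


pipe = FakePipeline(FakeModel())
pipe.save_pretrained("out_dir", max_shard_size="1GB")
print(pipe.model.saved_with)
```

Because the explicit parameter is merged into `kwargs` before forwarding, callers can keep passing `safe_serialization` by name while any additional options pass through untouched, which is why no deprecation is needed.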

Member

Yep - we don't want to leave unnecessary TODOs in the codebase!

Contributor Author

removed the extra comment ✅

Comment on lines +1258 to +1262
Pipeline.push_to_hub = copy_func(Pipeline.push_to_hub)
if Pipeline.push_to_hub.__doc__ is not None:
    Pipeline.push_to_hub.__doc__ = Pipeline.push_to_hub.__doc__.format(
        object="pipe", object_class="pipeline", object_files="pipeline file"
    ).replace(".from_pretrained", "")
Member

This bit confuses me - since you're inheriting PushToHubMixin.push_to_hub, __doc__ should always be defined, right? I can see it's a copy of the same code for the other classes that inherit from PushToHubMixin, though, I'm just not sure why it's coded this way. Not a blocker, just a comment!

Contributor Author

Since other people used this approach to copy the docs, I chose the same one to stay on the same page and avoid straying too far from the norm.

Member

Yep, that's totally fine! I was just pointing out my own confusion, I guess

Contributor Author

True. I think the reason they use this pattern is that they only need to change one method (the original one) to change the docs for all of the other classes using it.
One change updates them all, which is a really nice approach.
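
The docstring-templating pattern from this thread can be illustrated with a small self-contained sketch. The `copy_func` below is a simplified local version written for this example (the library has its own helper), and `PushMixin` and its docstring are hypothetical. One possible reason for the `is not None` guard is that Python strips docstrings when run with `-OO`, in which case `__doc__` is `None`.

```python
import types


def copy_func(f):
    """Return a copy of function `f` that has its own, independent __doc__."""
    g = types.FunctionType(
        f.__code__, f.__globals__, name=f.__name__,
        argdefs=f.__defaults__, closure=f.__closure__,
    )
    g.__dict__.update(f.__dict__)
    g.__kwdefaults__ = f.__kwdefaults__
    g.__doc__ = f.__doc__
    return g


class PushMixin:
    """Hypothetical mixin whose docstring is a template shared by subclasses."""

    def push_to_hub(self, repo_id):
        """Upload the {object_files} to the Hub.

        Example: {object} = {object_class}.from_pretrained("user/repo")
        """
        return f"pushed to {repo_id}"


class Pipeline(PushMixin):
    pass


# Give Pipeline its own copy so that specializing the docstring
# leaves the mixin's template untouched for other subclasses.
Pipeline.push_to_hub = copy_func(Pipeline.push_to_hub)
if Pipeline.push_to_hub.__doc__ is not None:
    Pipeline.push_to_hub.__doc__ = Pipeline.push_to_hub.__doc__.format(
        object="pipe", object_class="pipeline", object_files="pipeline file"
    ).replace(".from_pretrained", "")

print(Pipeline.push_to_hub.__doc__)
```

Without the copy, the assignment to `__doc__` would modify the single function object shared with the mixin, and every other class inheriting it would see the pipeline-specific wording.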

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@not-lain
Contributor Author

not-lain commented Mar 7, 2024

cc @Rocketknight1 @ArthurZucker
I added the new feature to the DynamicPipelineTester test.
Now it's finally ready for a review 🚀🤗

@not-lain
Contributor Author

cc @Rocketknight1 @ArthurZucker
any reviews on this one ?

@not-lain
Contributor Author

@Rocketknight1 friendly pinging you here.
Just wanted to say that the test that I added is working perfectly ✅

@Rocketknight1
Member

Sorry for the delay! I still feel like we might need to refactor our model for what custom pipelines actually do, but in the meantime this seems okay to add.

cc @amyeroberts for core maintainer review - this is basically a PR that adds push_to_hub() to custom pipelines. They already have a save_pretrained() method, so this just pushes the result of that.

We had some internal discussions about this, and at some point we might need to tackle the question of custom pipelines properly, including properly separating them from models (right now they're kind of attached at the hip to the model in their repo). Still, I think this fix is useful in the short-term!
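
The relationship described in this comment (push_to_hub pushes whatever save_pretrained produces) could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the library's implementation: `SaveThenPushMixin`, `ToyPipeline`, and the `upload_fn` callback are hypothetical names, and the upload is faked so the example runs offline.

```python
import os
import tempfile


class SaveThenPushMixin:
    """Hypothetical mixin: push_to_hub = save_pretrained into a temp dir, then upload."""

    def push_to_hub(self, repo_id, upload_fn):
        # `upload_fn(repo_id, folder)` is an assumed callback standing in for a real
        # Hub upload; it receives the folder that save_pretrained just populated.
        with tempfile.TemporaryDirectory() as tmp_dir:
            self.save_pretrained(tmp_dir)
            return upload_fn(repo_id, tmp_dir)


class ToyPipeline(SaveThenPushMixin):
    def save_pretrained(self, save_directory):
        # Stand-in for the pipeline's existing save logic.
        with open(os.path.join(save_directory, "config.json"), "w") as f:
            f.write("{}")


def fake_upload(repo_id, folder):
    # Pretend upload: report which files would be sent to which repo.
    return repo_id, sorted(os.listdir(folder))


result = ToyPipeline().push_to_hub("user/my-pipeline", fake_upload)
print(result)
```

Since the push step only depends on `save_pretrained`, any later refactor that separates pipelines from models would change what gets saved without changing how it is pushed.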

@not-lain
Contributor Author

@Rocketknight1

including properly separating them from models

I do agree with you on this point and I understand where you're coming from, but I don't think it's very relevant to this PR. Even if we do separate the model from the pipeline, we still need the push_to_hub method.

IMO we should create a separate issue for that.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for adding this!

Just one small comment


if self.modelcard is not None:
    self.modelcard.save_pretrained(save_directory)

@staticmethod
Collaborator

We should have a # Copied from comment here as it's the same as the implementation in configuration_utils.py

Contributor Author

Now that I am reading this, I realize this is an extra method. I will remove it now, since _set_token_in_kwargs is already defined in src\transformers\configuration_utils.py

Contributor Author

@not-lain not-lain Apr 4, 2024

Well, in src\transformers\modeling_utils.py, save_pretrained neither imports nor defines a function to add the token to the kwargs.
IMO _set_token_in_kwargs should be moved to PushToHubMixin instead. That would avoid changing every single class manually. For now I will settle for simply adding a comment, since that enhancement is out of scope for this PR. Let me know if you approve of this; if so, can you open another issue and tag me? I'll try to contribute to that.

EDIT:
The same goes for lots of other classes. I think we should definitely apply the DRY principle here and add _set_token_in_kwargs to PushToHubMixin instead, especially since this is repetitive and involves a parameter that will be deprecated.
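
As a rough illustration of the proposal (a single shared helper instead of per-class copies), the sketch below puts the token handling in one mixin. Everything here is an assumption made for the example: the mixin name `TokenKwargsMixin` is hypothetical, the deprecated parameter is assumed to be `use_auth_token`, and the body is a simplified approximation rather than the library's code.

```python
import warnings


class TokenKwargsMixin:
    """Hypothetical shared mixin; the name is illustrative only."""

    @staticmethod
    def _set_token_in_kwargs(kwargs, token=None):
        # Prefer an explicit `token`; otherwise take it from kwargs.
        if token is None:
            token = kwargs.pop("token", None)
        # Map the (assumed) deprecated `use_auth_token` onto `token`.
        use_auth_token = kwargs.pop("use_auth_token", None)
        if use_auth_token is not None:
            warnings.warn(
                "`use_auth_token` is deprecated; use `token` instead.", FutureWarning
            )
            if token is not None:
                raise ValueError("Set only `token`, not both `token` and `use_auth_token`.")
            token = use_auth_token
        if token is not None:
            kwargs["token"] = token


class ToyConfig(TokenKwargsMixin):
    pass


class ToyPipelineWithToken(TokenKwargsMixin):
    pass


kwargs = {"use_auth_token": "hf_example"}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the deprecation warning for the demo
    ToyPipelineWithToken._set_token_in_kwargs(kwargs)
print(kwargs)  # {'token': 'hf_example'}
```

Both toy classes inherit the same helper, so a future change to the deprecation handling would be made in one place only.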

Collaborator

_set_token_in_kwargs is only defined in the config class. In fact, looking at it, we shouldn't need it here at all. It is a workaround to account for the fact that some models' config classes have their own from_pretrained method, but this isn't the case for pipelines.

Contributor Author

I see, thanks for the clarification

@amyeroberts amyeroberts self-requested a review April 5, 2024 11:48
@not-lain
Contributor Author

Hi @amyeroberts
Any reviews on this PR?

@amyeroberts
Collaborator

@not-lain Per this conversation, the changes removing _set_token_in_kwargs should be made.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for iterating!

Make sure that the input arguments are consistent with the logic and docstrings

src/transformers/pipelines/base.py (outdated, resolved)
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Collaborator

@amyeroberts amyeroberts left a comment

Thanks for adding this feature!

@amyeroberts amyeroberts merged commit 0eaef0c into huggingface:main Apr 16, 2024
21 checks passed
@not-lain
Contributor Author

@amyeroberts @Rocketknight1 Thanks a lot guys ✨✨

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Apr 18, 2024
* add `push_to_hub` to pipeline

* fix docs

* format with ruff

* update save_pretrained

* update save_pretrained

* remove unnecessary comment

* switch to push_to_hub method in DynamicPipelineTester

* remove unused imports

* update docs for add_new_pipeline

* fix docs for add_new_pipeline

* add comment

* fix italien docs

* changes to token retrieval for pipelines

* Update src/transformers/pipelines/base.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
ArthurZucker pushed a commit that referenced this pull request Apr 22, 2024
ydshieh pushed a commit that referenced this pull request Apr 23, 2024
itazap pushed a commit that referenced this pull request May 14, 2024
Development

Successfully merging this pull request may close these issues.

add push_to_hub( ) method when working with pipelines
4 participants