-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add push_to_hub
to pipeline
#29172
add push_to_hub
to pipeline
#29172
Conversation
@ArthurZucker @Rocketknight1
TLDR; any reviews, comments or ideas are much appreciated. |
almost forgot #29004 will fix any problems with remote pipeline configuration for most of the cases (adds remote repo flags if self.model.config._name_or_path != repo_id :
custom_object_save(self, save_directory) let me know if you approve of this |
@ArthurZucker @Rocketknight1 after careful investigation it turns out that the extra flag is added due to the |
@Rocketknight1 I have updated the save_pretrained a little bit, the reason why i did this is that it's coupled with the push_to_hub method. I have done my part to cover as much ground as possible and this pull request is about the since i don't know the repo that you want to test push to, i will leave that part to you to add them, this notebook will help you out when creating tests https://colab.research.google.com/drive/130IpVrScW8cNomEDY2Fa6-mA4_VrRmgT?usp=sharing awaiting review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conditionally approving this PR too, except for one TODO, and the comment that I think we probably need to refactor / reconsider our model of custom pipelines and how they should be saved/loaded. See this comment for more.
src/transformers/pipelines/base.py
Outdated
# TODO: | ||
# depricate the safe_serialization parameter and use kwargs instead | ||
# or update the save_pretrained to get all the parameters such as max_shard_size, ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO not finished here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are passing everything to the
+ kwargs["safe_serialization"] = safe_serialization
+ self.model.save_pretrained(save_directory, **kwargs)
as kwargs i felt we should switch to a kwargs annotation
also yeh you are right, there is no need for deprecation or any changes, it already works perfectly as is, should i remove that comment ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep - we don't want to leave unnecessary TODOs in the codebase!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the extra comment ✅
Pipeline.push_to_hub = copy_func(Pipeline.push_to_hub) | ||
if Pipeline.push_to_hub.__doc__ is not None: | ||
Pipeline.push_to_hub.__doc__ = Pipeline.push_to_hub.__doc__.format( | ||
object="pipe", object_class="pipeline", object_files="pipeline file" | ||
).replace(".from_pretrained", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit confuses me - since you're inheriting PushToHubMixin.push_to_hub
, __doc__
should always be defined, right? I can see it's a copy of the same code for the other classes that inherit from PushToHubMixin
, though, I'm just not sure why it's coded this way. Not a blocker, just a comment!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since other people used this method to copy the docs i chose to use the same one as them, to stay in the same page as them, just to avoid straying too much from the norm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's totally fine! I was just pointing out my own confusion, I guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true that, i think the reason for them using this annotation is that they only need to change one method (the original one) to change the docs for all of the other classes using it.
meaning one changes all, which is a really nice approach
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
cc @Rocketknight1 @ArthurZucker |
cc @Rocketknight1 @ArthurZucker |
@Rocketknight1 friendly pinging you here. |
Sorry for the delay! I still feel like we might need to refactor our model for what custom pipelines actually do, but in the meantime this seems okay to add. cc @amyeroberts for core maintainer review - this is basically a PR that adds We had some internal discussions about this, and at some point we might need to tackle the question of custom pipelines properly, including properly separating them from models (right now they're kind of attached at the hip to the model in their repo). Still, I think this fix is useful in the short-term! |
I do agree with you on this point and I do understand where you're coming from but I don't think that this is relevant much to this pr, even if we do seperate the model from the pipeline we still need the push_to_hub method. imo we should create a separate issue for that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this!
Just one small comment
src/transformers/pipelines/base.py
Outdated
|
||
if self.modelcard is not None: | ||
self.modelcard.save_pretrained(save_directory) | ||
|
||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have a # Copied from
comment here as it's the same as the implementation in configuration_utils.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that I am reading this I realize that this is an extra method I will remove it now since the def _set_token_in_kwargs
is already defined in src\transformers\configuration_utils.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well in src\transformers\modeling_utils.py
in the def save_pretrained
they didn't even import nor create a function to add the token to the kwargs.
IMO the _set_token_in_kwargs
should be moved to the PushToHubMixin
instead, this should help a lot to avoid changing every single class manually, for now I will settle in simply adding a comment since that enhancement is out of scope in this pr, do let me know if you approve of this, if so can you open another issue and tag me, I'll try to contribute to that
EDIT:
same goes for lots of other classes, I think we definitely should implement the DRY principle here and add the _set_token_in_kwargs
to the PushToHubMixin
instead especially since this is repetitive and we have a parameter that will be deprecated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_set_token_in_kwargs
is only defined in the config class. In fact, looking at it - we shouldn't need it here at all. This is a work around to account for the fact some models' config classes have their own from_pretrained
method - but this isn't the case for pipelines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the clarification
Hi @amyeroberts |
@not-lain Per this conversation, the changes removing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating!
Make sure that the input arguments are consistent with the logic and docstrings
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this feature!
@amyeroberts @Rocketknight1 Thanks a lot guys ✨✨ |
* add `push_to_hub` to pipeline * fix docs * format with ruff * update save_pretrained * update save_pretrained * remove unnecessary comment * switch to push_to_hub method in DynamicPipelineTester * remove unused imports * update docs for add_new_pipeline * fix docs for add_new_pipeline * add comment * fix italien docs * changes to token retrieval for pipelines * Update src/transformers/pipelines/base.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* add `push_to_hub` to pipeline * fix docs * format with ruff * update save_pretrained * update save_pretrained * remove unnecessary comment * switch to push_to_hub method in DynamicPipelineTester * remove unused imports * update docs for add_new_pipeline * fix docs for add_new_pipeline * add comment * fix italien docs * changes to token retrieval for pipelines * Update src/transformers/pipelines/base.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* add `push_to_hub` to pipeline * fix docs * format with ruff * update save_pretrained * update save_pretrained * remove unnecessary comment * switch to push_to_hub method in DynamicPipelineTester * remove unused imports * update docs for add_new_pipeline * fix docs for add_new_pipeline * add comment * fix italien docs * changes to token retrieval for pipelines * Update src/transformers/pipelines/base.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* add `push_to_hub` to pipeline * fix docs * format with ruff * update save_pretrained * update save_pretrained * remove unnecessary comment * switch to push_to_hub method in DynamicPipelineTester * remove unused imports * update docs for add_new_pipeline * fix docs for add_new_pipeline * add comment * fix italien docs * changes to token retrieval for pipelines * Update src/transformers/pipelines/base.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
this will add
push_to_hub
method to the pipelines allowing people to push their custom pipelines to the huggingface hub easilyFixes #28857 #28983
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@Narsil @Rocketknight1
additional resources :
push_to_hub
method)push_to_hub
method)TODO:
push_to_hub
fix automap configuration for intial push (keeps adding auser/repo--file.module
)save_pretrained
method (checkoutPreTrainedModel
class for more info ) (enhancement)