
Rewritten batch support in pipelines. #4154

Merged 6 commits into master from pipeline-improve-batching on May 7, 2020
Conversation

mfuntowicz (Member)
Batch support in Pipeline was confusing and not well tested.

This PR rewrites all the content of DefaultArgumentHandler, which handles most of the input conversions (args, kwargs, batched, etc.), and adds unit tests for this specific class.

Signed-off-by: Morgan Funtowicz morgan@huggingface.co
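As a quick illustration of the batching this handler is meant to normalize, here is a minimal usage sketch; the task and texts are illustrative, assuming the standard pipeline factory rather than this PR's exact test cases:

from transformers import pipeline

# Task and texts are illustrative; any pipeline task goes through the
# same argument handling.
nlp = pipeline("sentiment-analysis")

# A single text and a batch of texts should both be accepted; the argument
# handler normalizes them into a list of inputs for the model.
print(nlp("Transformers pipelines are easy to use."))
print(nlp(["First example.", "Second example.", "Third example."]))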

@LysandreJik (Member) left a comment:

Nice!

if len(kwargs) == 1:
    output = list(kwargs.values())
else:
    output = list(chain(kwargs.values()))
A contributor commented:

Should this be chain(*kwargs.values())? Otherwise why do you need chain?

mfuntowicz (Member, Author) replied:
Adding the extra * will lead to unwanted behaviour in the following case:

Imagine you have (quite surprising, but not impossible) input such as {"data": "some str", "X": "some other str"}; unpacking will explode both strings and convert them into a flattened list of characters.

The actual benefit of chain here is to concatenate all the possible kwargs while keeping only the values, not the keys.
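To make the distinction concrete, a small standalone sketch of the two behaviours (the example dict mirrors the one above):

from itertools import chain

kwargs = {"data": "some str", "X": "some other str"}

# Without unpacking, chain receives a single iterable (the values view),
# so each string value is kept whole.
print(list(chain(kwargs.values())))
# ['some str', 'some other str']

# With unpacking, every string becomes its own iterable and chain
# flattens them into individual characters.
print(list(chain(*kwargs.values())))
# ['s', 'o', 'm', 'e', ' ', 's', 't', 'r', 's', 'o', ...]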

        return []

    def __call__(self, *args, **kwargs):
        if len(kwargs) > 0 and len(args) > 0:
A contributor commented:
Several of the calls to this function in pipelines.py look like:

inputs = self._args_parser(*texts, **kwargs)

Should those be changed to *args to comply with this line?
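For readers without the diff at hand, here is a simplified sketch of the handler behaviour discussed in the two snippets above; the class name, error message, and return values are illustrative, not the PR's exact code:

from itertools import chain


class SimpleArgumentHandler:
    """Illustrative stand-in for the argument handler discussed above."""

    def __call__(self, *args, **kwargs):
        # Mixing positional and keyword inputs is ambiguous, so reject it.
        if len(args) > 0 and len(kwargs) > 0:
            raise ValueError("Cannot mix positional and keyword inputs.")

        if kwargs:
            if len(kwargs) == 1:
                # One keyword argument: take its value(s) directly.
                return list(kwargs.values())
            # Several keyword arguments: keep only the values, without
            # unpacking them, so string values stay whole.
            return list(chain(kwargs.values()))

        return list(args)


handler = SimpleArgumentHandler()
print(handler("a", "b"))         # ['a', 'b']
print(handler(data="x", X="y"))  # ['x', 'y']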

@codecov-io commented May 6, 2020:

Codecov Report

Merging #4154 into master will increase coverage by 0.03%.
The diff coverage is 96.42%.


@@            Coverage Diff             @@
##           master    #4154      +/-   ##
==========================================
+ Coverage   78.79%   78.83%   +0.03%     
==========================================
  Files         114      114              
  Lines       18711    18726      +15     
==========================================
+ Hits        14743    14762      +19     
+ Misses       3968     3964       -4     
Impacted Files                           Coverage Δ
src/transformers/pipelines.py            76.27% <96.42%> (+1.32%) ⬆️
src/transformers/modeling_tf_utils.py    92.77% <0.00%> (+0.16%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mfuntowicz added 2 commits May 6, 2020 15:08
@julien-c (Member) commented May 6, 2020:

Ok so this should be ready to merge, right @mfuntowicz?

@sshleifer (Contributor) commented:
+1 Would love a merge here so I can more easily improve TranslationPipeline

julien-c merged commit 0a6cbea into master May 7, 2020
julien-c deleted the pipeline-improve-batching branch May 7, 2020 13:52
mfuntowicz added a commit that referenced this pull request May 8, 2020
* Rewritten batch support in pipelines.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix imports sorting 🔧

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Set pad_to_max_length=True by default on Pipeline.

* Set pad_to_max_length=False for generation pipelines.

Most generation models don't have a padding token.

* Address @joeddav review comment: Uniformized *args.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Address @joeddav review comment: Uniformized *args (second).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
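For context on the pad_to_max_length bullets in the commit message above: at the time of this PR, padding was controlled through the tokenizer's pad_to_max_length flag. A minimal sketch, assuming the transformers 2.x batch_encode_plus API; this is illustrative, not the pipeline's actual code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# pad_to_max_length=True (the default the commit sets for most pipelines)
# pads every sequence in the batch to the same length so the batch can be
# stacked into a single tensor. Generation pipelines disable it because
# their models typically define no padding token.
batch = tokenizer.batch_encode_plus(
    ["a short input", "a somewhat longer input sentence"],
    pad_to_max_length=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # both rows share one padded length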
tianleiwu pushed a commit to tianleiwu/transformers that referenced this pull request May 8, 2020