[BartTokenizer] add prepare s2s batch #6212
Conversation
LGTM, nice tests!
return_tensors: str = "None",
**kwargs,
) -> BatchEncoding:
    """Prepare a batch that can be passed directly to an instance of BartModel.
"""Prepare a batch that can be passed directly to an instance of BartModel. | |
""" | |
Prepare a batch that can be passed directly to an instance of :class:`~transformers.BartModel`. |
(nit)
Thanks for the PR!
Lots of nits on the docs: in general, if the argument you are documenting is passed along to another method, don't hesitate to copy-paste the docstring from that method. And when documenting an argument, don't use abbreviations and write full sentences :-)
    maximum length for the source text which defers to the config value of 1024 for facebook/bart*
max_target_length (:obj:`int`, `optional`):
    maximum length for the target text which defers to the config value of 1024 for facebook/bart*
padding (:obj:`str`, `optional`, defaults to "longest"):
This can be a bool, a string or a PaddingStrategy, I believe? See the documentation of PreTrainedTokenizerBase.__call__:
padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
Activates and controls padding. Accepts the following values:
* :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a
  single sequence is provided).
* :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
maximum acceptable input length for the model if that argument is not provided.
* :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
different lengths).
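A minimal sketch of the three padding modes described above, assuming the transformers library and the facebook/bart-large checkpoint are available; the variable names are illustrative only.

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
texts = ["A short sentence.", "A considerably longer sentence that the short one gets padded against."]

# True / "longest": pad every sequence to the longest one in the batch.
batch_longest = tokenizer(texts, padding=True)

# "max_length": pad every sequence to the length given by max_length.
batch_fixed = tokenizer(texts, padding="max_length", max_length=32)

# False / "do_not_pad": leave sequences at their natural lengths.
batch_unpadded = tokenizer(texts, padding=False)

print(len(batch_longest["input_ids"][0]), len(batch_fixed["input_ids"][0]), len(batch_unpadded["input_ids"][0]))
```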
    maximum length for the target text which defers to the config value of 1024 for facebook/bart*
padding (:obj:`str`, `optional`, defaults to "longest"):
    strategy for padding `input_ids` and `decoder_input_ids`. Should be "max_length" or "longest".
return_tensors (:obj:`str`, `optional`):
This can be a string or a TensorType (same as above, just copy from the documentation of PreTrainedTokenizerBase.__call__):
return_tensors (:obj:`str` or :class:`~transformers.tokenization_utils_base.TensorType`, `optional`):
If set, will return tensors instead of list of python integers. Acceptable values are:
* :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
* :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
* :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
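As a rough illustration of the return_tensors values listed above (a sketch, assuming PyTorch and NumPy are installed alongside transformers):

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
texts = ["Hello world.", "Another example sentence."]

# Plain Python lists by default, framework tensors when return_tensors is set.
batch_pt = tokenizer(texts, padding=True, return_tensors="pt")  # torch.Tensor values
batch_np = tokenizer(texts, padding=True, return_tensors="np")  # np.ndarray values

print(type(batch_pt["input_ids"]), type(batch_np["input_ids"]))
```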
return_tensors (:obj:`str`, `optional`):
    Can be set to 'tf', 'pt' or 'np' to return respectively TensorFlow `tf.constant`, PyTorch `torch.Tensor` or Numpy :obj:`np.ndarray` instead of a list of python integers.
**kwargs:
    passed to self.__call__
Suggested change:
-    passed to self.__call__
+    Additional keyword arguments passed along to :obj:`self.__call__`.
"""Prepare a batch that can be passed directly to an instance of BartModel. | ||
Args: | ||
src_texts (:obj:`List[str]`): | ||
list of src texts |
Suggested change:
-    list of src texts
+    List of input texts.
src_texts (:obj:`List[str]`):
    list of src texts
tgt_texts (:obj:`List[str]`, `optional`):
    list of tgt texts
Suggested change:
-    list of tgt texts
+    List of target texts.
tgt_texts (:obj:`List[str]`, `optional`):
    list of tgt texts
max_length (:obj:`int`, `optional`):
    maximum length for the source text which defers to the config value of 1024 for facebook/bart*
Suggested change:
-    maximum length for the source text which defers to the config value of 1024 for facebook/bart*
+    Maximum length for the source texts. If not provided, this will use the predefined model maximum length.
Don't mention a specific model here since several could be used.
max_length (:obj:`int`, `optional`):
    maximum length for the source text which defers to the config value of 1024 for facebook/bart*
max_target_length (:obj:`int`, `optional`):
    maximum length for the target text which defers to the config value of 1024 for facebook/bart*
Suggested change:
-    maximum length for the target text which defers to the config value of 1024 for facebook/bart*
+    Maximum length for the target texts. If not provided, this will use the predefined model maximum length.
Thanks @sgugger for these helpful suggestions! Will keep these in mind for future PRs.
@sgugger, can you help me with the build_doc failure? Thanks!
Fixed, you needed to have the beginning of the docstrings on a new line for Sphinx to understand the indentation.
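A small before/after sketch of that fix (hypothetical function names, trimmed docstrings):

```python
def prepare_batch_before():
    """Prepare a batch that can be passed directly to an instance of BartModel.
    Args:
        src_texts (:obj:`List[str]`): List of input texts.
    """


def prepare_batch_after():
    """
    Prepare a batch that can be passed directly to an instance of :class:`~transformers.BartModel`.

    Args:
        src_texts (:obj:`List[str]`): List of input texts.
    """
```

In the first form the text starts on the same line as the opening quotes, so Sphinx misreads the indentation of the following lines; in the second form the text starts on a new line and the indentation is unambiguous.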
Codecov Report
@@            Coverage Diff             @@
##           master    #6212      +/-   ##
==========================================
+ Coverage   78.29%   78.35%   +0.05%
==========================================
  Files         146      146
  Lines       26607    26619      +12
==========================================
+ Hits        20832    20856      +24
+ Misses       5775     5763      -12

Continue to review full report at Codecov.
Thanks @sgugger!
This looks useful! It would be nice to upstream it so that other sequence-to-sequence models may make use of it. Also, you added it to BartTokenizer and not the fast tokenizer, is there a reason for that?
If we consider this to be a conversion to an s2s task, I think this would be better suited in an s2s processor like we have for squad_convert_examples_to_features or glue_convert... . I don't see any reason for having it linked to BART especially.
Pinging @thomwolf
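A hypothetical sketch of what such a tokenizer-agnostic s2s processor could look like; the function name, the argument names and the decoder_input_ids key are assumptions for illustration, not an existing transformers API.

```python
from typing import List, Optional

from transformers import BatchEncoding, PreTrainedTokenizer


def seq2seq_convert_examples_to_features(
    tokenizer: PreTrainedTokenizer,
    src_texts: List[str],
    tgt_texts: Optional[List[str]] = None,
    max_length: Optional[int] = None,
    max_target_length: Optional[int] = None,
    return_tensors: str = "pt",
) -> BatchEncoding:
    # Encode the source side with whatever tokenizer is passed in.
    batch = tokenizer(
        src_texts,
        max_length=max_length,
        padding="longest",
        truncation=True,
        return_tensors=return_tensors,
    )
    if tgt_texts is not None:
        # Encode the target side the same way and attach it to the batch.
        targets = tokenizer(
            tgt_texts,
            max_length=max_target_length,
            padding="longest",
            truncation=True,
            return_tensors=return_tensors,
        )
        batch["decoder_input_ids"] = targets["input_ids"]
    return batch
```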
Hi @LysandreJik,
No, I just forgot to add that. Upstreaming will be useful, but we will need to handle a few cases differently for each seq2seq model, i.e. in the case of T5 we manually need to add the decoder_start_token_id, as T5 doesn't have a …
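A rough sketch of that T5-specific wrinkle (not part of this PR): T5-style models expect decoder inputs that start with decoder_start_token_id, so the target ids have to be shifted right and the start token prepended. The helper name and the token ids below are made up for illustration, and PyTorch is assumed.

```python
import torch


def prepend_decoder_start_token(target_ids: torch.Tensor, decoder_start_token_id: int) -> torch.Tensor:
    """Shift target ids one position to the right and put the start token first."""
    decoder_input_ids = target_ids.new_zeros(target_ids.shape)
    decoder_input_ids[:, 1:] = target_ids[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    return decoder_input_ids


# Example: a batch of two target sequences and a (made-up) start token id of 0.
targets = torch.tensor([[250, 251, 252], [260, 261, 262]])
print(prepend_decoder_start_token(targets, decoder_start_token_id=0))
```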
Hi @sshleifer, @LysandreJik, any update?
* :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
**kwargs:
    Additional keyword arguments passed along to :obj:`self.__call__`.
Returns:
There is a new docstring on master in tokenization_utils_base.py that you may want to (a) reuse or (b) modify.
@sshleifer, updated the docs.
src_texts: (:obj:`list`):
    list of documents to summarize or source language texts
tgt_texts: (:obj:`list`, `optional`):
    list of tgt language texts or summaries.
The type annotations here were better before. The docstrings should not have abbreviations (nit: they should also start with a capital letter and end with a full stop).
Aah, I blindly copy-pasted, will make the changes. Also, can you tell me where the doc error is coming from?
You're missing newlines before your lists, I'd say.
LGTM as-is (after the doc building test fixes), but we really should add the same method on the Fast tokenizer. Having parity on both tokenizers is one of our goals.
@LysandreJik I will add this for the fast tokenizer too once this PR is merged.
Sounds good!
@LysandreJik, the doc error is fixed; not sure if the current failure is related to this PR.
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
This reverts commit c0b35c9.
This PR adds the prepare_seq2seq_batch method to BartTokenizer, as per the proposal in #6080.
@sshleifer
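For reference, a usage sketch of the new method based on the signature shown in the review diff above; it assumes a transformers version that includes this PR, and the exact keys of the returned BatchEncoding are not spelled out here.

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["A long news article to summarize."],
    tgt_texts=["A short summary."],
    max_length=1024,
    max_target_length=56,
    padding="longest",
    return_tensors="pt",
)

# The resulting BatchEncoding is meant to be passed directly to BartModel
# or BartForConditionalGeneration.
print(batch.keys())
```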