[T5Tokenizer] add prepare_seq2seq_batch method #6122
Conversation
src/transformers/tokenization_t5.py (Outdated)

    def set_tgt_special_tokens(self) -> None:
        self.prefix_tokens = [self.pad_token_id]
        self.suffix_tokens = [self.eos_token_id]
Not entirely sure about adding eos automatically. What do you think @sshleifer?
- I wouldn't add eos in this PR. I think for that we need to either a) get to the bottom of why it impacts zero-shot translation performance, or b) add a flag to support not adding it (for backward compatibility / zero-shot tasks).
- Do we have evidence that adding a prefix token on the decoder side is helpful?
> Do we have evidence that adding a prefix token on the decoder side is helpful?

Yes, T5Model does this in its _shift_right method, and the original TF T5 implementation does the same. AFAIK, seq2seq decoders use a special start token: in BART the tokenizer automatically adds bos, while T5 has no bos and instead uses the pad token as the decoder start id.
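For reference, a minimal sketch of that shift-right behavior (this mirrors the idea only, not the library's exact code; the function name here is illustrative):

```python
import torch

def shift_right(labels: torch.Tensor, decoder_start_token_id: int) -> torch.Tensor:
    # Shift the target ids one position to the right and put the decoder start
    # token (the pad token id in T5's case) in the first position.
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    return shifted
```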
> get to the bottom of why it impacts zero shot translation performance

I will remove it for now and wait for this issue to be solved.
        ]
        expected_src_tokens = [71, 307, 8986, 21, 4505, 51, 52, 1707, 5, 1]
        batch = tokenizer.prepare_seq2seq_batch(
            src_text, tgt_texts=tgt_text, max_length=len(expected_src_tokens), return_tensors=FRAMEWORK
More cases to test (a rough sketch follows below):
- max_target_length kwarg: allow it to be passed through and affect decoder_input_ids.shape[1]
- empty tgt_texts
- empty src_texts -> raises something
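A rough sketch of what those tests might look like (assumptions: the kwarg names follow the PR's signature, return_tensors="pt" is used, and ValueError for empty src_texts is a guess at the intended behavior):

```python
import pytest
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
src_text = ["A long paragraph for summarization."]
tgt_text = ["Summary of the paragraph."]

# max_target_length should bound the target side independently of max_length.
batch = tokenizer.prepare_seq2seq_batch(
    src_text, tgt_texts=tgt_text, max_length=32, max_target_length=8, return_tensors="pt"
)
assert batch["decoder_input_ids"].shape[1] <= 8

# Without tgt_texts, only encoder-side tensors should be returned.
batch = tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt")
assert "input_ids" in batch and "decoder_input_ids" not in batch

# Empty src_texts should raise.
with pytest.raises(ValueError):
    tokenizer.prepare_seq2seq_batch([], tgt_texts=tgt_text, return_tensors="pt")
```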
Thanks, I will cover these cases.
> empty tgt_texts

For this, can I just check that input_ids and attention_mask are returned and no decoder_input_ids and decoder_attention_mask?
these tests look great now!
one nit, otherwise LGTM
src/transformers/tokenization_t5.py (Outdated)

        for k, v in decoder_inputs.items():
            model_inputs[f"decoder_{k}"] = v

        self.set_src_special_tokens()
(nit) Stylistically, I would just write self.prefix_tokens = [] and self.prefix_tokens = [self.pad_token_id] inline, to avoid adding a layer of abstraction.
Same, unless you expect people to have to subclass your work to inject some custom behavior.
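Concretely, the suggestion amounts to something like this (a rough sketch of the idea only, not the PR's exact code; argument handling is simplified):

```python
def prepare_seq2seq_batch(self, src_texts, tgt_texts=None, **kwargs):
    # Source side: no prefix token for T5 encoder inputs.
    self.prefix_tokens = []
    model_inputs = self(src_texts, **kwargs)
    if tgt_texts is None:
        return model_inputs
    # Target side: T5 uses the pad token as the decoder start token.
    self.prefix_tokens = [self.pad_token_id]
    decoder_inputs = self(tgt_texts, **kwargs)
    for k, v in decoder_inputs.items():
        model_inputs[f"decoder_{k}"] = v
    # Restore the source-side setting afterwards.
    self.prefix_tokens = []
    return model_inputs
```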
tests/test_tokenization_t5.py (Outdated)

        src_text = ["A long paragraph for summrization.", "Another paragraph for summrization."]
        batch = tokenizer.prepare_seq2seq_batch(src_text, return_tensors=FRAMEWORK)
        # check if input_ids are returned and no decoder_input_ids
        self.assertIn("input_ids", batch.keys())
(nit) don't think you need .keys()
Ah, right, in works on dict keys by default. Thanks 😀
        self.assertIsInstance(batch, BatchEncoding)
        self.assertEqual(batch.input_ids.shape, (2, 512))

    def test_eos_in_input(self):
would be cool to migrate one or more of the integration tests in test_modeling_t5.py to the new method.
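For example, something along these lines (a sketch only; the checkpoint, input text, and generation call are illustrative, not the existing test's exact content):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Use the new method instead of manually tokenizing and padding the inputs.
batch = tokenizer.prepare_seq2seq_batch(
    ["translate English to German: The house is wonderful."], return_tensors="pt"
)
generated = model.generate(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```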
Very nice, thanks! I have some nits on the docs.
src/transformers/tokenization_t5.py (Outdated)

        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks
        by concatenating and adding special tokens. The special tokens depend on calling source text or target text.
        An T5 sequence has the following format, where ``X`` represents the sequence:
Suggested change:
-        An T5 sequence has the following format, where ``X`` represents the sequence:
+        A T5 sequence has the following format, where ``X`` represents the sequence:
src/transformers/tokenization_t5.py (Outdated)

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs to which the special tokens will be added
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
Suggested change:
-            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
+            token_ids_1 (:obj:`List[int]`, `optional`):

(We only indicate real default values. If something is optional, the None default is implied.)
src/transformers/tokenization_t5.py
Outdated
Optional second list of IDs for sequence pairs. | ||
|
||
Returns: | ||
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens. |
Suggested change:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
src/transformers/tokenization_t5.py (Outdated)

        **kwargs,
    ) -> BatchEncoding:
        """Prepare a batch that can be passed directly to an instance of T5Model.
        Arguments:
Please specify the argument types with the same style as above, and make sure you document all arguments (return_tensors is not documented).
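That is, something in this style (a sketch only; the signature and the field descriptions here are illustrative, not the PR's actual docstring):

```python
def prepare_seq2seq_batch(self, src_texts, tgt_texts=None, max_length=None, return_tensors=None, **kwargs):
    """Prepare a batch that can be passed directly to an instance of T5Model.

    Arguments:
        src_texts (:obj:`List[str]`):
            List of source texts to encode.
        tgt_texts (:obj:`List[str]`, `optional`):
            List of target texts to encode.
        max_length (:obj:`int`, `optional`):
            Maximum length for the source sequences.
        return_tensors (:obj:`str`, `optional`):
            Type of tensors to return ('pt', 'tf' or 'np').
        **kwargs:
            Passed to self.__call__.
    """
```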
src/transformers/tokenization_t5.py (Outdated)

            **kwargs: passed to self.__call__

        Returns:
            :obj:`BatchEncoding`: with keys input_ids, attention_mask, decoder_input_ids, decoder_attention_mask.
Suggested change:
-            :obj:`BatchEncoding`: with keys input_ids, attention_mask, decoder_input_ids, decoder_attention_mask.
+            :class:`~transformers.BatchEncoding`: with keys input_ids, attention_mask, decoder_input_ids, decoder_attention_mask.
@sshleifer, @sgugger I have made the changes based on the suggestions. Thanks!
LGTM
        self.assertNotIn("decoder_attention_mask", batch)

    def test_max_target_length(self):
        tokenizer = T5Tokenizer.from_pretrained("t5-small")
Tip: to only initialize once, you can use

    @cached_property
    def default_tok(self):
        return T5Tokenizer.from_pretrained("t5-small")

This barely matters for tokenizers; it is more useful for models, where __init__ can take 20 seconds.
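Inside a test class, that tip would look roughly like this (a sketch; the class and test names are made up, and functools.cached_property stands in for whichever cached_property helper the repo uses):

```python
import unittest
from functools import cached_property

from transformers import T5Tokenizer


class T5TokenizationIntegrationTest(unittest.TestCase):
    @cached_property
    def default_tok(self):
        # Built lazily on first access and cached on the instance afterwards.
        return T5Tokenizer.from_pretrained("t5-small")

    def test_prepare_seq2seq_batch_keys(self):
        batch = self.default_tok.prepare_seq2seq_batch(["A test sentence."], return_tensors="pt")
        self.assertIn("input_ids", batch)
```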
Codecov Report
@@            Coverage Diff             @@
##           master    #6122      +/-   ##
==========================================
+ Coverage   78.51%   78.59%   +0.08%
==========================================
  Files         146      146
  Lines       26326    26347      +21
==========================================
+ Hits        20669    20708      +39
+ Misses       5657     5639      -18
Continue to review full report at Codecov.
@sshleifer, @patrickvonplaten, all green :)
)" This reverts commit 3dfafe6.
This PR adds the prepare_seq2seq_batch method to T5Tokenizer, as per the proposal in #6080. cc @sshleifer
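As a quick illustration of the new API (a sketch; the checkpoint and example texts are arbitrary, and the key set matches the docstring discussed above):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
batch = tokenizer.prepare_seq2seq_batch(
    ["translate English to German: A long paragraph."],
    tgt_texts=["Ein langer Absatz."],
    return_tensors="pt",
)
# Expected keys: input_ids, attention_mask, decoder_input_ids, decoder_attention_mask
print(sorted(batch.keys()))
```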