
Update retokenization tool to support RoBERTa; fix WSC #903

Merged
merged 33 commits into master from fix-retokenization on Sep 13, 2019

Conversation

sleepinyourhat
Contributor

It looks like the retokenization code wasn't touched in #890, which means that WSC isn't supported with RoBERTa models. This should fix that.

@sleepinyourhat
Contributor Author

@HaokunLiu pointed out a possible issue privately—not ready to merge until we figure that out...

@@ -307,14 +309,6 @@ def align_moses(text: Text) -> Tuple[TokenAligner, List[Text]]:
return ta, moses_tokens


def align_openai(text: Text) -> Tuple[TokenAligner, List[Text]]:
eow_tokens = space_tokenize_with_eow(text)
Collaborator


This is slightly different from the new version, because it added end-of-word markers (as used in the original GPT) instead of beginning-of-word markers to ensure correct character overlap. We should probably add something to align_wpm to detect this and do the appropriate padding?

Otherwise, it might be safer to just replace align_wpm with a much simpler implementation than the black-box character-based alignment we have now - just split into the original tokens, then split each one into wordpieces while tracking offsets. This is recommended for BERT (https://github.com/google-research/bert#tokenization), but not sure if it's compatible with SentencePiece or other subword models.
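For illustration, here is a rough sketch of that word-by-word approach (the function and helper names are hypothetical, not the actual align_wpm API, which returns a TokenAligner):

```python
from typing import Callable, List, Text, Tuple


def align_by_wordpieces(
    tokens: List[Text], wordpiece_tokenize_fn: Callable[[Text], List[Text]]
) -> Tuple[List[Tuple[int, int]], List[Text]]:
    """Split each original token into wordpieces, recording which slice of
    the wordpiece sequence each original token maps to."""
    spans: List[Tuple[int, int]] = []
    wordpieces: List[Text] = []
    for token in tokens:
        pieces = wordpiece_tokenize_fn(token)
        spans.append((len(wordpieces), len(wordpieces) + len(pieces)))
        wordpieces.extend(pieces)
    return spans, wordpieces
```

A span over the original tokens can then be projected to wordpieces by taking the start of the first token's slice and the end of the last token's slice, with no character-overlap heuristics involved.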

@HaokunLiu
Member

Well, I forgot to update this last time, sorry. I think this should work now. I didn't update the test script with all the new tokenizers, but I did manually run each new tokenizer on the cases in the test script and checked that the resulting spans select the right content.

@W4ngatang
Collaborator

Is this also something we need to worry about with WiC?

@HaokunLiu
Member

Is this also something we need to worry about with WiC?

Good catch.

The WiC preprocessing relies on the assumption that parts of a sentence can be tokenized separately and then combined. That is not the case for ByteBPE (RoBERTa and GPT-2). Compare:

bert
['Ah', '##hh', '##h', ',', 'what', "'", 's', 'in', 'the', 'box', '?']
roberta
['Ah', 'hhh', ',', 'Ġwhat', "'s", 'Ġin', 'Ġthe', 'Ġbox', '?']
xlnet
['▁Ah', 'hhh', ',', '▁what', "'", 's', '▁in', '▁the', '▁box', '?']
openai-gpt
['a', 'hhhh', ',', 'what', "'s", 'in', 'the', 'box', '?']
gpt2
['Ah', 'hhh', ',', 'Ġwhat', "'s", 'Ġin', 'Ġthe', 'Ġbox', '?']
transfo-xl
['Ahhhh', ',', 'what', "'s", 'in', 'the', 'box', '?']
xlm
['ah', 'h', 'hh', ',', 'what', "'s", 'in', 'the', 'box', '?']
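For example, with the pytorch_transformers GPT2Tokenizer (the same ByteBPE scheme RoBERTa uses), concatenating separately tokenized fragments does not reproduce the tokenization of the full sentence, because the leading-space marker Ġ is only emitted when the tokenizer actually sees the space. A minimal sketch of the mismatch:

```python
from pytorch_transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

sent = "Ahhhh, what's in the box?"
parts = ["Ahhhh,", "what's", "in the box?"]

# Tokenizing the full sentence in one pass:
whole = tokenizer.tokenize(sent)

# Tokenizing the parts separately and concatenating, as the old WiC
# preprocessing assumed was safe:
pieces = [tok for part in parts for tok in tokenizer.tokenize(part)]

# "what's" tokenized in isolation loses the "Ġ" prefix it carries inside
# the full sentence, so the two lists differ.
print(whole)
print(pieces)
```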

sent_mid = tokenize_and_truncate(self._tokenizer_name, word, self.max_seq_len)
sent_tok = sent_tok1 + sent_mid + sent_tok2
sent_tok = tokenize_and_truncate(self._tokenizer_name, sent, self.max_seq_len)
Contributor Author


Looks reasonable. Have you confirmed that no other tasks use the old logic?

Member


No more tasks use the old logic now.

placeholder_loc = len(
tokenize_and_truncate(self.tokenizer_name, sent_parts[0], self.max_seq_len)
)
sent_tok = tokenize_and_truncate(self.tokenizer_name, sent, self.max_seq_len)
Collaborator


This seems a little sketchy - prefer something like return sent_tok[:i] + ["@placeholder"] + sent_tok[i:] rather than mutating with insert?

Also, should this be truncated one token short of self.max_seq_len, to account for the placeholder?
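For illustration, a minimal sketch of that suggestion as a standalone function (the name is hypothetical; tokenize_and_truncate is the existing helper shown in the diff, with its import omitted here):

```python
from typing import List, Text


def insert_placeholder(
    tokenizer_name: Text, sent_parts: List[Text], sent: Text, max_seq_len: int
) -> List[Text]:
    # Truncate one token short of max_seq_len to leave room for the
    # placeholder token added below.
    i = len(tokenize_and_truncate(tokenizer_name, sent_parts[0], max_seq_len - 1))
    sent_tok = tokenize_and_truncate(tokenizer_name, sent, max_seq_len - 1)
    # Build a new list rather than mutating sent_tok with insert().
    return sent_tok[:i] + ["@placeholder"] + sent_tok[i:]
```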

Member


Thanks.

return ta, sentencepiece_tokens


def align_bpe(text: Text, tokenizer_name: str) -> Tuple[TokenAligner, List[Text]]:
Collaborator


Add function docstrings giving examples and stating which models these apply to?

Since these are fairly dense and involve a number of edge cases, should we write some unit tests for this module?

Member


I'm afraid it would not be very economical to do this.
The only occasions when we need to modify this are either adding a new tokenizer or refactoring retokenize. When the first happens, the old test cases will always pass, and we will always need to add new test cases (which have little reuse value). When the second happens, I don't expect we will keep the same interface.
What do you think?

Collaborator


I don't think it'll be hard - this module has a very minimal API so it should be easy to write tests. We are refactoring and adding new tokenizers here, and we'd want tests both to ensure there aren't regressions and that the new functionality is doing what we expect.

(This is partly my fault for not having any on the original version, but since it's grown quite a bit in scope and tokenization bugs can be very hard to detect otherwise, I think it's warranted now.)
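For example, a minimal pytest sketch along these lines (the import path and the use of align_bpe for the ByteBPE tokenizers are assumptions based on the diff above; it only checks the round-trip property that undoing the Ġ word-boundary markers recovers the original text):

```python
import pytest

from jiant.utils import retokenize  # path assumed


@pytest.mark.parametrize("tokenizer_name", ["roberta-base", "gpt2"])
def test_bytebpe_round_trip(tokenizer_name):
    text = "Members of the House clapped their hands"
    ta, tokens = retokenize.align_bpe(text, tokenizer_name)
    # Byte-level BPE marks word boundaries with "Ġ"; turning those back
    # into spaces should reconstruct the original sentence.
    assert "".join(tokens).replace("Ġ", " ").strip() == text
```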

Member


I see. I'll do it later today.


@sleepinyourhat added the "0.x.0 release on fix" label (put out a new 0.x.0 release when this is fixed) on Sep 6, 2019
@sleepinyourhat
Contributor Author

What's left to do here? We've been holding off on the RoBERTa release for a while now...

@HaokunLiu
Member

What's left to do here? We've been holding off on the RoBERTa release for a while now...

I have added docstrings and test cases. The only problem is that using the ByteBPE tokenizer requires downloading the spaCy "en" model, and I don't know if there is any way to do that in CircleCI. If we simply remove the test case for ByteBPE, then this PR is ready to go. Otherwise, I'm still waiting to see if @davidbenton can tell me something.

@sleepinyourhat
Contributor Author

Ah, that should be easy—I didn't realize you were waiting. CircleCI just runs a list of commands, so you can just add a shell command to download the data here:

https://github.com/nyu-mll/jiant/blob/master/.circleci/config.yml#L50

If it takes more than a few minutes to download, though, it's okay to skip.
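For instance, something along these lines in the run steps would work (a sketch only; the indentation and placement depend on the existing config, and the download command assumes the spaCy version pinned in the repo supports the "en" shorthand):

```yaml
      - run:
          name: download spaCy English model
          command: python -m spacy download en
```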

@HaokunLiu
Member

This should be ready now.

@sleepinyourhat
Contributor Author

Great! @iftenney, merge when ready.

@sleepinyourhat merged commit 2692f65 into master on Sep 13, 2019
@HaokunLiu deleted the fix-retokenization branch on September 15, 2019
phu-pmh pushed a commit that referenced this pull request Apr 17, 2020
* Rename namespaces to suppress warnings.

* Revert "Rename namespaces to suppress warnings."

This reverts commit 0cf7b23.

* Initial attempt.

* Fix WSC retokenization.

* Remove obnoxious newline.

* fix retokenize

* debug

* WiC fix

* add spaces in docstring

* update record task

* clean up

* "@Placeholder" fix

* max_seq_len fix

* black

* add docstring

* update docstring

* add test script for retokenize

* Revert "add test script for retokenize"

* Create test_retokenize.py

* update to pytorch_transformer 1.2.0

* package, download updates
@jeswan added the "jiant-v1-legacy" label (relevant to versions <= v1.3.2) on Sep 17, 2020