
Update retokenization tool to support RoBERTa; fix WSC #903

Merged
merged 33 commits into master from fix-retokenization on Sep 13, 2019

Conversation

sleepinyourhat
Contributor

It looks like the retokenization code wasn't touched in #890, which means that WSC isn't supported with RoBERTa models. This should fix that.

@sleepinyourhat
Contributor Author

@HaokunLiu pointed out a possible issue privately—not ready to merge until we figure that out...

@@ -307,14 +309,6 @@ def align_moses(text: Text) -> Tuple[TokenAligner, List[Text]]:
return ta, moses_tokens


def align_openai(text: Text) -> Tuple[TokenAligner, List[Text]]:
eow_tokens = space_tokenize_with_eow(text)
Collaborator


This is slightly different from the new version, because it added end-of-word markers (as used in the original GPT) instead of beginning-of-word markers to ensure correct character overlap. We should probably add something to align_wpm to detect this and do the appropriate padding?

Otherwise, it might be safer to just replace align_wpm with a much simpler implementation than the black-box character-based alignment we have now - just split into the original tokens, then split each one into wordpieces while tracking offsets. This is recommended for BERT (https://github.com/google-research/bert#tokenization), but not sure if it's compatible with SentencePiece or other subword models.
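For illustration, here is a rough sketch of that word-by-word approach (the function and helper names are hypothetical, not the actual align_wpm API, which returns a TokenAligner):

```python
from typing import Callable, List, Text, Tuple


def align_by_wordpieces(
    tokens: List[Text], wordpiece_tokenize_fn: Callable[[Text], List[Text]]
) -> Tuple[List[Tuple[int, int]], List[Text]]:
    """Split each original token into wordpieces, recording which slice of
    the wordpiece sequence each original token maps to."""
    spans: List[Tuple[int, int]] = []
    wordpieces: List[Text] = []
    for token in tokens:
        pieces = wordpiece_tokenize_fn(token)
        spans.append((len(wordpieces), len(wordpieces) + len(pieces)))
        wordpieces.extend(pieces)
    return spans, wordpieces
```

A span over the original tokens can then be projected to wordpieces by taking the start of the first token's slice and the end of the last token's slice, with no character-overlap heuristics involved.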

@HaokunLiu
Member

Well, I forgot to update this last time, sorry. I think this should work now. I didn't update the test script with all the new tokenizers, but I did manually run each new tokenizer on the cases in the test script and checked that the resulting spans select the right content.

@W4ngatang
Collaborator

Is this also something we need to worry about with WiC?

@HaokunLiu
Member

Is this also something we need to worry about with WiC?

Good catch.

The WiC preprocessing relies on the assumption that parts of a sentence can be tokenized separately and then combined. That is not the case for ByteBPE (RoBERTa and GPT-2). Compare:

bert
['Ah', '##hh', '##h', ',', 'what', "'", 's', 'in', 'the', 'box', '?']
roberta
['Ah', 'hhh', ',', 'Ġwhat', "'s", 'Ġin', 'Ġthe', 'Ġbox', '?']
xlnet
['▁Ah', 'hhh', ',', '▁what', "'", 's', '▁in', '▁the', '▁box', '?']
openai-gpt
['a', 'hhhh', ',', 'what', "'s", 'in', 'the', 'box', '?']
gpt2
['Ah', 'hhh', ',', 'Ġwhat', "'s", 'Ġin', 'Ġthe', 'Ġbox', '?']
transfo-xl
['Ahhhh', ',', 'what', "'s", 'in', 'the', 'box', '?']
xlm
['ah', 'h', 'hh', ',', 'what', "'s", 'in', 'the', 'box', '?']
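For example, with the pytorch_transformers GPT2Tokenizer (the same ByteBPE scheme RoBERTa uses), concatenating separately tokenized fragments does not reproduce the tokenization of the full sentence, because the leading-space marker Ġ is only emitted when the tokenizer actually sees the space. A minimal sketch of the mismatch:

```python
from pytorch_transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

sent = "Ahhhh, what's in the box?"
parts = ["Ahhhh,", "what's", "in the box?"]

# Tokenizing the full sentence in one pass:
whole = tokenizer.tokenize(sent)

# Tokenizing the parts separately and concatenating, as the old WiC
# preprocessing assumed was safe:
pieces = [tok for part in parts for tok in tokenizer.tokenize(part)]

# "what's" tokenized in isolation loses the "Ġ" prefix it carries inside
# the full sentence, so the two lists differ.
print(whole)
print(pieces)
```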

sent_mid = tokenize_and_truncate(self._tokenizer_name, word, self.max_seq_len)
sent_tok = sent_tok1 + sent_mid + sent_tok2
sent_tok = tokenize_and_truncate(self._tokenizer_name, sent, self.max_seq_len)
Contributor Author


Looks reasonable. Have you confirmed that no other tasks use the old logic?

Member


No more tasks use the old logic now.

placeholder_loc = len(
tokenize_and_truncate(self.tokenizer_name, sent_parts[0], self.max_seq_len)
)
sent_tok = tokenize_and_truncate(self.tokenizer_name, sent, self.max_seq_len)
Collaborator


This seems a little sketchy - prefer something like return sent_tok[:i] + ["@placeholder"] + sent_tok[i:] rather than mutating with insert?

Also, should this be truncated one token short of self.max_seq_len, to account for the placeholder?
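For illustration, a minimal sketch of that suggestion as a standalone function (the name is hypothetical; tokenize_and_truncate is the existing helper shown in the diff, with its import omitted here):

```python
from typing import List, Text


def insert_placeholder(
    tokenizer_name: Text, sent_parts: List[Text], sent: Text, max_seq_len: int
) -> List[Text]:
    # Truncate one token short of max_seq_len to leave room for the
    # placeholder token added below.
    i = len(tokenize_and_truncate(tokenizer_name, sent_parts[0], max_seq_len - 1))
    sent_tok = tokenize_and_truncate(tokenizer_name, sent, max_seq_len - 1)
    # Build a new list rather than mutating sent_tok with insert().
    return sent_tok[:i] + ["@placeholder"] + sent_tok[i:]
```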

Member


Thanks.

return ta, sentencepiece_tokens


def align_bpe(text: Text, tokenizer_name: str) -> Tuple[TokenAligner, List[Text]]:
Collaborator


Add function docstrings giving examples and stating which models these apply to?

Since these are fairly dense and involve a number of edge cases, should we write some unit tests for this module?

Member


I'm afraid it would not be very economical to do this.
The only occasions when we need to modify this are either adding a new tokenizer or refactoring retokenize. When the first happens, the old test cases will always pass, and we will always need to add new test cases (which have little reuse value). When the second happens, I don't expect we will keep the same interface.
What do you think?

Collaborator


I don't think it'll be hard - this module has a very minimal API so it should be easy to write tests. We are refactoring and adding new tokenizers here, and we'd want tests both to ensure there aren't regressions and that the new functionality is doing what we expect.

(This is partly my fault for not having any on the original version, but since it's grown quite a bit in scope and tokenization bugs can be very hard to detect otherwise, I think it's warranted now.)
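For example, a minimal pytest sketch along these lines (the import path and the use of align_bpe for the ByteBPE tokenizers are assumptions based on the diff above; it only checks the round-trip property that undoing the Ġ word-boundary markers recovers the original text):

```python
import pytest

from jiant.utils import retokenize  # path assumed


@pytest.mark.parametrize("tokenizer_name", ["roberta-base", "gpt2"])
def test_bytebpe_round_trip(tokenizer_name):
    text = "Members of the House clapped their hands"
    ta, tokens = retokenize.align_bpe(text, tokenizer_name)
    # Byte-level BPE marks word boundaries with "Ġ"; turning those back
    # into spaces should reconstruct the original sentence.
    assert "".join(tokens).replace("Ġ", " ").strip() == text
```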

Member


I see. I'll do it later today.


@sleepinyourhat added the "0.x.0 release on fix" label (put out a new 0.x.0 release when this is fixed) on Sep 6, 2019
@sleepinyourhat
Contributor Author

What's left to do here? We've been holding off on the RoBERTa release for a while now...

@HaokunLiu
Member

What's left to do here? We've been holding off on the RoBERTa release for a while now...

I have added docstrings and test cases. The only problem is that using the ByteBPE tokenizer requires downloading the spaCy "en" model, and I don't know if there is any way to do that in CircleCI. If we simply remove the test case for ByteBPE, then this PR is ready to go. Otherwise, I'm still waiting to see if @davidbenton can tell me something.

@sleepinyourhat
Contributor Author

Ah, that should be easy—I didn't realize you were waiting. CircleCI just runs a list of commands, so you can just add a shell command to download the data here:

https://github.com/nyu-mll/jiant/blob/master/.circleci/config.yml#L50

If it takes more than a few minutes to download, though, it's okay to skip.
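For instance, something along these lines in the run steps would work (a sketch only; the indentation and placement depend on the existing config, and the download command assumes the spaCy version pinned in the repo supports the "en" shorthand):

```yaml
      - run:
          name: download spaCy English model
          command: python -m spacy download en
```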

@HaokunLiu
Member

This should be ready now.

@sleepinyourhat
Contributor Author

Great! @iftenney, merge when ready.

@sleepinyourhat merged commit 2692f65 into master on Sep 13, 2019
@HaokunLiu deleted the fix-retokenization branch on September 15, 2019
phu-pmh pushed a commit that referenced this pull request Apr 17, 2020
* Rename namespaces to suppress warnings.

* Revert "Rename namespaces to suppress warnings."

This reverts commit 0cf7b23.

* Initial attempt.

* Fix WSC retokenization.

* Remove obnoxious newline.

* fix retokenize

* debug

* WiC fix

* add spaces in docstring

* update record task

* clean up

* "@Placeholder" fix

* max_seq_len fix

* black

* add docstring

* update docstring

* add test script for retokenize

* Revert "add test script for retokenize"

* Create test_retokenize.py

* update to pytorch_transformer 1.2.0

* package, download updates
@jeswan added the "jiant-v1-legacy" label (relevant to versions <= v1.3.2) on Sep 17, 2020