Update retokenization tool to support RoBERTa; fix WSC #903
Conversation
This reverts commit 0cf7b23.
@HaokunLiu pointed out a possible issue privately - not ready to merge until we figure that out...
@@ -307,14 +309,6 @@ def align_moses(text: Text) -> Tuple[TokenAligner, List[Text]]:
    return ta, moses_tokens


def align_openai(text: Text) -> Tuple[TokenAligner, List[Text]]:
    eow_tokens = space_tokenize_with_eow(text)
This is slightly different from the new version, because it added end-of-word markers (as used in the original GPT) instead of beginning-of-word markers to ensure correct character overlap. We should probably add something to align_wpm to detect this and do the appropriate padding?
Otherwise, it might be safer to just replace align_wpm with a much simpler implementation than the black-box character-based alignment we have now: just split into the original tokens, then split each one into wordpieces while tracking offsets. This is the approach recommended for BERT (https://github.com/google-research/bert#tokenization), but I'm not sure whether it's compatible with SentencePiece or other subword models.
Well, I forgot to update this last time, sorry. I think this should work now. I didn't update the test script with all the new tokenizers, but I did manually run and check each new tokenizer on the cases in the test script, making sure the resulting spans select the right content.
Is this also something we need to worry about with WiC?
Good catch. WiC preprocessing relies on the parts of a sentence being tokenizable separately and then combined, but that does not hold for ByteBPE (RoBERTa and GPT-2).
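To illustrate the point, here is a small sketch using the GPT-2 tokenizer from pytorch_transformers (the library version this PR updates to; the sentence and split point are arbitrary examples):

from pytorch_transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

sent = "The board approved the plan"
left, word, right = "The board approved", "the", "plan"

joint = tok.tokenize(sent)
pieced = tok.tokenize(left) + tok.tokenize(word) + tok.tokenize(right)

# Byte-level BPE folds the leading space into the token (e.g. "Ġthe" vs. "the"),
# so tokenizing the parts separately and concatenating them generally does not
# reproduce the tokenization of the full sentence.
print(joint == pieced)  # typically False for GPT-2 / RoBERTa tokenizers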
sent_mid = tokenize_and_truncate(self._tokenizer_name, word, self.max_seq_len)
sent_tok = sent_tok1 + sent_mid + sent_tok2
sent_tok = tokenize_and_truncate(self._tokenizer_name, sent, self.max_seq_len)
Looks reasonable. Have you confirmed that no other tasks use the old logic?
No other tasks use the old logic now.
jiant/tasks/qa.py (outdated)
placeholder_loc = len(
    tokenize_and_truncate(self.tokenizer_name, sent_parts[0], self.max_seq_len)
)
sent_tok = tokenize_and_truncate(self.tokenizer_name, sent, self.max_seq_len)
This seems a little sketchy - prefer something like return sent_tok[:i] + ["@placeholder"] + sent_tok[i:] rather than mutating with insert?
Also, should this be truncated to self.max_seq_len - to account for the placeholder?
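A minimal sketch of the suggested pattern (a hypothetical standalone helper, not the actual code in jiant/tasks/qa.py; reserving one slot for the marker is one way to address the truncation question above):

from typing import List

def insert_placeholder(sent_tok: List[str], i: int, max_seq_len: int) -> List[str]:
    # Truncate first, keeping one position free for the "@placeholder" token,
    # then build a new list by slicing instead of mutating with list.insert().
    sent_tok = sent_tok[: max_seq_len - 1]
    i = min(i, len(sent_tok))
    return sent_tok[:i] + ["@placeholder"] + sent_tok[i:]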
Thanks.
    return ta, sentencepiece_tokens


def align_bpe(text: Text, tokenizer_name: str) -> Tuple[TokenAligner, List[Text]]:
Add function docstrings giving examples and stating which models these apply to?
Since these are fairly dense and involve a number of edge cases, should we write some unit tests for this module?
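For example, a docstring along these lines might work (a sketch only; the example token strings are illustrative rather than verified tokenizer output, and imports follow the existing module):

def align_bpe(text, tokenizer_name):
    """Alignment fn for byte-level BPE tokenizers, e.g. GPT-2 and RoBERTa.

    Example (token strings are illustrative):
        >>> ta, tokens = align_bpe("Members of the House", "roberta-base")
        >>> tokens  # doctest: +SKIP
        ['Members', 'Ġof', 'Ġthe', 'ĠHouse']
    """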
I'm afraid it will not be very economical to do this.
The only occasions when we need to modify this are either adding a new tokenizer or refactoring retokenize. When the first happens, the old test cases will always pass, and we will always need to add new test cases (which have little reuse value). When the second happens, I don't expect we will keep the same interface.
What do you think?
I don't think it'll be hard - this module has a very minimal API so it should be easy to write tests. We are refactoring and adding new tokenizers here, and we'd want tests both to ensure there aren't regressions and that the new functionality is doing what we expect.
(This is partly my fault for not having any on the original version, but since it's grown quite a bit in scope and tokenization bugs can be very hard to detect otherwise, I think it's warranted now.)
I see. I'll do it later today.
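For reference, a minimal sketch of such a test (pytest style; the module path jiant.utils.retokenize and the tokenizer name are assumptions, and the assertions rely only on the align_bpe signature shown in the diff):

from jiant.utils import retokenize

def test_align_bpe_basic():
    text = "Members of the House clapped their hands"
    ta, tokens = retokenize.align_bpe(text, "roberta-base")
    # We should get back a TokenAligner and a non-empty subword token list.
    assert isinstance(ta, retokenize.TokenAligner)
    assert len(tokens) > 0
    # Each whitespace-delimited source token maps to at least one subword piece.
    assert len(tokens) >= len(text.split())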
What's left to do here? We've been holding off on the RoBERTa release for a while now...
I have added docstrings and test cases. The only problem is that using the ByteBPE tokenizer requires downloading the spaCy English model, and I don't know if there is any way to do that in CircleCI. If we simply remove the test case for ByteBPE, then this PR is ready to go. Otherwise, I'm still waiting to see if @davidbenton can tell me something.
Ah, that should be easy - I didn't realize you were waiting. CircleCI just runs a list of commands, so you can just add a shell command to download the data here: https://github.com/nyu-mll/jiant/blob/master/.circleci/config.yml#L50 If it takes more than a few minutes to download, though, it's okay to skip.
This should be ready now.
Great! @iftenney, merge when ready.
* Rename namespaces to suppress warnings.
* Revert "Rename namespaces to suppress warnings." This reverts commit 0cf7b23.
* Initial attempt.
* Fix WSC retokenization.
* Remove obnoxious newline.
* fix retokenize
* debug
* WiC fix
* add spaces in docstring
* update record task
* clean up
* "@Placeholder" fix
* max_seq_len fix
* black
* add docstring
* update docstring
* add test script for retokenize
* Revert "add test script for retokenize"
* Create test_retokenize.py
* update to pytorch_transformer 1.2.0
* package, download updates
It looks like the retokenization code wasn't touched in #890, which means that WSC isn't supported with RoBERTa models. This should fix that.