Update retokenization tool to support RoBERTa; fix WSC #903

Status: Merged (33 commits, merged Sep 13, 2019)
Changes shown from 15 commits.

Commits:
0cf7b23
Rename namespaces to suppress warnings.
sleepinyourhat Jul 12, 2019
38c5581
Revert "Rename namespaces to suppress warnings."
sleepinyourhat Jul 12, 2019
0c4546b
Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma…
sleepinyourhat Jul 15, 2019
4e2734b
Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma…
sleepinyourhat Jul 21, 2019
df3a271
Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma…
sleepinyourhat Jul 22, 2019
9c1ba46
Initial attempt.
sleepinyourhat Jul 24, 2019
57076f3
Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma…
sleepinyourhat Jul 24, 2019
0665933
Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma…
sleepinyourhat Aug 6, 2019
c6c30fa
Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma…
sleepinyourhat Aug 8, 2019
e41718d
Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma…
sleepinyourhat Aug 21, 2019
174e564
Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma…
sleepinyourhat Aug 25, 2019
9db0ddf
Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma…
sleepinyourhat Aug 26, 2019
a5638c2
Fix WSC retokenization.
sleepinyourhat Aug 27, 2019
c4e1fc3
Remove obnoxious newline.
sleepinyourhat Aug 27, 2019
e2121fa
Merge branch 'master' of https://github.com/nyu-mll/jiant into fix-re…
sleepinyourhat Aug 27, 2019
48defcb
fix retokenize
HaokunLiu Aug 27, 2019
9ab7b6d
debug
HaokunLiu Aug 27, 2019
dde2b9a
WiC fix
HaokunLiu Aug 28, 2019
afbed57
add spaces in docstring
HaokunLiu Aug 28, 2019
5b13e5d
update record task
HaokunLiu Aug 28, 2019
507f9d4
clean up
HaokunLiu Aug 28, 2019
cb11cb7
Merge branch 'master' into fix-retokenization
sleepinyourhat Aug 29, 2019
e863f65
"@placeholder" fix
HaokunLiu Sep 4, 2019
1894e10
max_seq_len fix
HaokunLiu Sep 4, 2019
cf422b3
black
HaokunLiu Sep 4, 2019
2e94cce
add docstring
HaokunLiu Sep 7, 2019
c28ce99
update docstring
HaokunLiu Sep 7, 2019
d8ae32b
add test script for retokenize
HaokunLiu Sep 7, 2019
7402be1
Revert "add test script for retokenize"
HaokunLiu Sep 7, 2019
59f9e50
Create test_retokenize.py
HaokunLiu Sep 7, 2019
4cdb3da
update to pytorch_transformer 1.2.0
HaokunLiu Sep 10, 2019
08b256a
package, download updates
HaokunLiu Sep 11, 2019
a8abb35
Merge branch 'master' into fix-retokenization
HaokunLiu Sep 11, 2019
18 changes: 5 additions & 13 deletions jiant/utils/retokenize.py
@@ -19,8 +19,10 @@
# install with: pip install python-Levenshtein
from Levenshtein.StringMatcher import StringMatcher

from .tokenizers import get_tokenizer
from .utils import unescape_moses
from jiant.pytorch_transformers_interface import input_module_uses_pytorch_transformers
from jiant.utils.tokenizers import get_tokenizer
from jiant.utils.utils import unescape_moses


# Tokenizer instance for internal use.
_SIMPLE_TOKENIZER = SpaceTokenizer()
@@ -307,14 +309,6 @@ def align_moses(text: Text) -> Tuple[TokenAligner, List[Text]]:
return ta, moses_tokens


def align_openai(text: Text) -> Tuple[TokenAligner, List[Text]]:
    eow_tokens = space_tokenize_with_eow(text)
    openai_utils = get_tokenizer("OpenAI.BPE")
    bpe_tokens = openai_utils.tokenize(text)
    ta = TokenAligner(eow_tokens, bpe_tokens)
    return ta, bpe_tokens

Collaborator comment on the removed align_openai:

This is slightly different from the new version, because it added end-of-word markers (as used in the original GPT) rather than beginning-of-word markers to ensure correct character overlap. We should probably add something to align_wpm to detect this case and do the appropriate padding.

Otherwise, it might be safer to replace align_wpm with a much simpler implementation than the black-box character-based alignment we have now: split the text into the original tokens, then split each token into wordpieces while tracking offsets. This is the approach recommended for BERT (https://github.com/google-research/bert#tokenization), but it is not clear whether it is compatible with SentencePiece or other subword models.
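The simpler offset-tracking alignment proposed in the comment above can be sketched as follows. This is a minimal illustration, not jiant code: `wordpiece_fn` is a pluggable stand-in for a real subword tokenizer, and `toy_wordpiece` is a hypothetical splitter used only to make the example self-contained.

```python
from typing import Callable, List, Tuple


def align_by_offsets(
    tokens: List[str], wordpiece_fn: Callable[[str], List[str]]
) -> Tuple[List[Tuple[int, int]], List[str]]:
    """Split each source token into wordpieces, recording for each source
    token the [start, end) span of its pieces in the flat wordpiece list."""
    spans: List[Tuple[int, int]] = []
    pieces: List[str] = []
    for tok in tokens:
        start = len(pieces)
        pieces.extend(wordpiece_fn(tok))
        spans.append((start, len(pieces)))
    return spans, pieces


def toy_wordpiece(tok: str) -> List[str]:
    # Toy stand-in for a real wordpiece tokenizer: split long tokens in two,
    # marking the continuation piece with BERT-style "##".
    if len(tok) <= 4:
        return [tok]
    return [tok[:4], "##" + tok[4:]]


spans, pieces = align_by_offsets(["The", "retokenization", "tool"], toy_wordpiece)
# spans  -> [(0, 1), (1, 3), (3, 4)]
# pieces -> ["The", "reto", "##kenization", "tool"]
```

Because alignment here is exact by construction, it avoids the fuzzy character-overlap heuristics of the current approach, at the cost of requiring a tokenizer that operates token-by-token.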


def align_wpm(text: Text, tokenizer_name: str) -> Tuple[TokenAligner, List[Text]]:
# If using lowercase, do this for the source tokens for better matching.
do_lower_case = tokenizer_name.endswith("uncased")
@@ -331,9 +325,7 @@ def align_wpm(text: Text, tokenizer_name: str) -> Tuple[TokenAligner, List[Text]]:
def get_aligner_fn(tokenizer_name: Text):
if tokenizer_name == "MosesTokenizer":
return align_moses
elif tokenizer_name == "OpenAI.BPE":
return align_openai
elif tokenizer_name.startswith("bert-") or tokenizer_name.startswith("xlnet-"):
elif input_module_uses_pytorch_transformers(tokenizer_name):
return functools.partial(align_wpm, tokenizer_name=tokenizer_name)
else:
raise ValueError(f"Unsupported tokenizer '{tokenizer_name}'")
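The revised dispatch in get_aligner_fn above can be sketched standalone as follows. The stub predicate and stub aligners are assumptions for illustration only; the real predicate lives in jiant.pytorch_transformers_interface and the real aligners build TokenAligner instances.

```python
import functools


def _uses_pytorch_transformers(name: str) -> bool:
    # Stub for input_module_uses_pytorch_transformers: accept any model-name
    # prefix handled by pytorch-transformers (illustrative list, not jiant's).
    prefixes = ("bert-", "roberta-", "xlnet-", "xlm-", "gpt2", "openai-gpt")
    return name.startswith(prefixes)


def align_moses(text):
    # Stub aligner: the real one returns (TokenAligner, moses_tokens).
    return ("moses", text)


def align_wpm(text, tokenizer_name):
    # Stub aligner: the real one returns (TokenAligner, wordpiece_tokens).
    return (tokenizer_name, text)


def get_aligner_fn(tokenizer_name: str):
    if tokenizer_name == "MosesTokenizer":
        return align_moses
    elif _uses_pytorch_transformers(tokenizer_name):
        return functools.partial(align_wpm, tokenizer_name=tokenizer_name)
    raise ValueError(f"Unsupported tokenizer '{tokenizer_name}'")


fn = get_aligner_fn("roberta-base")
```

The functools.partial binding mirrors the diff: one generic align_wpm serves every pytorch-transformers tokenizer, so adding RoBERTa support needs no new aligner, only a predicate that recognizes the model name.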