Update retokenization tool to support RoBERTa; fix WSC #903
Merged
33 commits
0cf7b23 Rename namespaces to suppress warnings. (sleepinyourhat)
38c5581 Revert "Rename namespaces to suppress warnings." (sleepinyourhat)
0c4546b Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma… (sleepinyourhat)
4e2734b Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma… (sleepinyourhat)
df3a271 Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma… (sleepinyourhat)
9c1ba46 Initial attempt. (sleepinyourhat)
57076f3 Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma… (sleepinyourhat)
0665933 Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma… (sleepinyourhat)
c6c30fa Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma… (sleepinyourhat)
e41718d Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma… (sleepinyourhat)
174e564 Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma… (sleepinyourhat)
9db0ddf Merge branch 'master' of https://github.com/nyu-mll/jiant into nyu-ma… (sleepinyourhat)
a5638c2 Fix WSC retokenization. (sleepinyourhat)
c4e1fc3 Remove obnoxious newline. (sleepinyourhat)
e2121fa Merge branch 'master' of https://github.com/nyu-mll/jiant into fix-re… (sleepinyourhat)
48defcb fix retokenize (HaokunLiu)
9ab7b6d debug (HaokunLiu)
dde2b9a WiC fix (HaokunLiu)
afbed57 add spaces in docstring (HaokunLiu)
5b13e5d update record task (HaokunLiu)
507f9d4 clean up (HaokunLiu)
cb11cb7 Merge branch 'master' into fix-retokenization (sleepinyourhat)
e863f65 "@placeholder" fix (HaokunLiu)
1894e10 max_seq_len fix (HaokunLiu)
cf422b3 black (HaokunLiu)
2e94cce add docstring (HaokunLiu)
c28ce99 update docstring (HaokunLiu)
d8ae32b add test script for retokenize (HaokunLiu)
7402be1 Revert "add test script for retokenize" (HaokunLiu)
59f9e50 Create test_retokenize.py (HaokunLiu)
4cdb3da update to pytorch_transformer 1.2.0 (HaokunLiu)
08b256a package, download updates (HaokunLiu)
a8abb35 Merge branch 'master' into fix-retokenization (HaokunLiu)
This is slightly different from the new version: it added end-of-word markers (as used in the original GPT) instead of beginning-of-word markers to ensure correct character overlap. We should probably add something to `align_wpm` to detect this and do the appropriate padding?

Otherwise, it might be safer to just replace `align_wpm` with a much simpler implementation than the black-box character-based alignment we have now: split into the original tokens, then split each one into wordpieces while tracking offsets. This is recommended for BERT (https://github.com/google-research/bert#tokenization), but I'm not sure whether it's compatible with SentencePiece or other subword models.
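
For concreteness, the simpler per-token approach could look roughly like the sketch below. This is not the existing `align_wpm` code: `align_by_token`, `project_span`, and `toy_wordpiece` are hypothetical names, and the toy tokenizer only stands in for a real wordpiece model. It assumes the subword tokenizer can be applied to one source token at a time, which may not hold for models that mark word boundaries based on surrounding whitespace.

```python
from typing import Callable, List, Tuple


def align_by_token(
    tokens: List[str],
    subword_tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Split each source token into subwords and record where it lands.

    Returns the flat subword list plus, for each source token, the
    half-open [start, end) range of subword indices it maps to.
    """
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    for token in tokens:
        start = len(pieces)
        pieces.extend(subword_tokenize(token))
        spans.append((start, len(pieces)))
    return pieces, spans


def project_span(
    spans: List[Tuple[int, int]], start: int, end: int
) -> Tuple[int, int]:
    """Map a half-open source-token span [start, end) to a subword span."""
    return spans[start][0], spans[end - 1][1]


def toy_wordpiece(token: str) -> List[str]:
    """Hypothetical stand-in tokenizer: 3-character chunks, '##' on continuations."""
    chunks = [token[i:i + 3] for i in range(0, len(token), 3)]
    return chunks[:1] + ["##" + chunk for chunk in chunks[1:]]


if __name__ == "__main__":
    pieces, spans = align_by_token(["The", "trophy", "fits"], toy_wordpiece)
    print(pieces)
    print(spans)
    print(project_span(spans, 1, 2))  # subword range covering "trophy"
```

Because each span is built from list lengths rather than character matching, the mapping does not depend on which marker convention (beginning-of-word or end-of-word) the tokenizer uses. It does depend on the tokenizer giving the same pieces for a token in isolation as in context, which is the open compatibility question.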