XLNet support and overhaul/cleanup of BERT support #845
Conversation
Hello @sleepinyourhat! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues and found some. You can repair most issues by installing black and running it over the changed files.
Comment last updated at 2019-08-07 21:30:58 UTC
@@ -106,7 +106,7 @@ def del_field_tokens(instance):
         del field.tokens


-def _index_split(task, split, indexers, vocab, record_file):
+def _index_split(task, split, indexers, vocab, record_file, boundary_token_fn):
This method now needs to be passed around all over the place, but it helps simplify some really hairy task-specific logic surrounding the placement of [cls] and [sep].
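For readers of this thread, a minimal sketch (an editorial assumption, not the PR's actual code) of the kind of callback process_split now receives: it wraps one or two token sequences with model-specific boundary tokens, which is exactly the logic that used to be duplicated across tasks.

    def bert_boundary_token_fn(s1, s2=None):
        # BERT-style placement: [CLS] s1 [SEP] (s2 [SEP])
        if s2 is None:
            return ["[CLS]"] + s1 + ["[SEP]"]
        return ["[CLS]"] + s1 + ["[SEP]"] + s2 + ["[SEP]"]

    def xlnet_boundary_token_fn(s1, s2=None):
        # XLNet-style placement: special tokens come at the end:
        # s1 <sep> (s2 <sep>) <cls>
        if s2 is None:
            return s1 + ["<sep>", "<cls>"]
        return s1 + ["<sep>"] + s2 + ["<sep>", "<cls>"]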
It's not urgent, but I'm a bit worried about this. How to handle boundary tokens, and how to join multiple input sentences together, are model-specific. Right now, implementing these at the data-processing stage doesn't make much difference. But as people come up with all sorts of unusual input formats, process_split will become more bloated, making it harder to add new models or tasks.
Yeah, it's a bit messy. I think we could do better, but it's not obvious how: there are lots of tasks for which it's not precisely clear how to set up BERT. This PR isn't changing the messy setup; it's just making it more visible.
Mind making an issue?
@@ -2123,7 +2143,7 @@ def __init__(self, path, max_seq_len, name="ccg", **kw):
         self.val_data_text = None
         self.test_data_text = None

-    def process_split(self, split, indexers) -> Iterable[Type[Instance]]:
+    def process_split(self, split, indexers, boundary_token_fn) -> Iterable[Type[Instance]]:
CCG and other tagging tasks are likely to be especially tricky because of the need to align the tags to the tokens (see the sketch after this list).
- There's subtle logic here, in the retokenization code, and in this script that this change may have broken. @pruksmhc @iftenney, would either of you be willing to take a pass, and maybe even push an update to this branch? 😬
- I'm not sure I see what the differences are between wordpiece and sentencepiece tokenization. @iftenney, do you know sentencepiece? Is there anything in that difference that we should worry about? The main thing that I can see is that spaces are explicitly included in the tokens.
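For context, a hypothetical sketch of the alignment problem mentioned above (the function and the tagging convention are editorial assumptions, not jiant's code): after subword retokenization, one original token can become several pieces, and its tag has to be mapped onto them somehow.

    def align_tags_to_subwords(tokens, tags, subword_tokenize):
        # subword_tokenize is assumed to map one token to a list of pieces,
        # e.g. "playing" -> ["play", "##ing"] for a BERT-style tokenizer.
        pieces, piece_tags = [], []
        for token, tag in zip(tokens, tags):
            subwords = subword_tokenize(token)
            pieces.extend(subwords)
            # One common convention: repeat the tag on every piece. Another is
            # to tag only the first piece and mask the rest during training.
            piece_tags.extend([tag] * len(subwords))
        return pieces, piece_tags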
Yep, I'll take a pass early tomorrow.
I don't know the CCG code at all.
For the edge probing retokenize code, the only tokenization-specific thing is some postprocessing to encourage closer alignment; I'll have to check what sentencepiece looks like there. (Alternatively, I think sentencepiece might give you byte offsets, so I could just use those directly.)
This shouldn't block anything; there's already an exception that will be thrown if the tokenizer isn't whitelisted (jiant/jiant/utils/retokenize.py, line 339 in 7ff148e):

    raise ValueError(f"Unsupported tokenizer '{tokenizer_name}'")
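To make the wordpiece/sentencepiece difference discussed above concrete, a quick illustration (the checkpoint names are the standard pytorch_transformers ones; the exact piece splits shown in the comments are typical outputs, not verified):

    # Wordpiece (BERT) marks word-internal continuations with "##";
    # sentencepiece (XLNet) marks word-initial pieces with "▁" (U+2581),
    # so spaces are encoded explicitly in the tokens themselves.
    from pytorch_transformers import BertTokenizer, XLNetTokenizer

    bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
    xlnet_tok = XLNetTokenizer.from_pretrained("xlnet-base-cased")

    print(bert_tok.tokenize("unbelievable results"))
    # e.g. ['un', '##bel', '##iev', '##able', 'results']
    print(xlnet_tok.tokenize("unbelievable results"))
    # e.g. ['▁un', 'believ', 'able', '▁results']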
Looks like the major problems/bugs were resolved, and the changes seem sane. Of course, test and make sure no major regressions in any tasks remain before merging.
@pruksmhc Thanks! I'm testing everything I can, but do make sure to have a close look at CCG. I'm not fully set up to test that.
CCG has tokenization as an offline preprocessing step (which should honestly be changed). I would say just note in the documentation that CCG is not yet set up for XLNet, and leave CCG for another PR.
The offline preprocessing step should honestly be made part of load_data too.
@pruksmhc - Mind making an issue?
This should be ready to go after one last test. @iftenney Any last comments before I merge?
    APIs from pytorch_transformers.
    """

    def __init__(self, args):
nit: this only seems to depend on args.exp_dir and args.pytorch_transformers_embedding_mode. For cleaner abstractions, can we make this explicit? This should work:

    def __init__(self, exp_dir, pytorch_transformers_embedding_mode, **unused_kw):
        # do stuff

and call as module(**args.to_dict()) (similarly with parameter_setup() below).
The inheriting classes add a few more args, so this would either expand the code a fair bit or else add an inconsistency. Weak/lazy preference to leave as is for now.
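As an aside for readers, a hypothetical sketch (class and argument names are invented for illustration, not taken from the PR) of the tension described here: with explicit keyword arguments, each inheriting class has to restate and forward a growing signature, which passing a single args object avoids.

    class PytorchTransformersEmbedder:
        def __init__(self, exp_dir, pytorch_transformers_embedding_mode, **unused_kw):
            self.exp_dir = exp_dir
            self.embedding_mode = pytorch_transformers_embedding_mode

    class XLNetEmbedder(PytorchTransformersEmbedder):
        # Adding one model-specific argument forces this subclass to repeat
        # the parent's parameters (or forward **kwargs), keeping them in sync
        # by hand as the hierarchy grows.
        def __init__(self, exp_dir, pytorch_transformers_embedding_mode,
                     mem_len=None, **unused_kw):
            super().__init__(exp_dir, pytorch_transformers_embedding_mode, **unused_kw)
            self.mem_len = mem_len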
* Rename namespaces to suppress warnings.
* Revert "Rename namespaces to suppress warnings." This reverts commit 0cf7b23.
* Initial working-ish attempt.
* Intermediate check-in...
* More partial progress.
* Another pass...
* Fix sep/cls handling, cleanup.
* Further cleanup.
* Keyword name fix.
* Another flag fix.
* Pull debug print.
* Line length cleanup.
* WiC fix.
* Two task setup bugs.
* BoolQ typo
* Improved segment handling.
* Delete unused is_pair_task, other cleanup/fixes.
* Fix deleted path from merge.
* Fix cache path.
* Address (spurious?) tokenization warning.
* Select pool_type automatically to match model. h/t Haokun Liu
* Config updates.
* Path fix
* Fix XLNet UNK handling.
* Internal temporary MNLI alternate.
* Revert "Internal temporary MNLI alternate." This reverts commit 455792a.
* Add helper fn tests
* Finish merge
* Remove unused argument.
* Possible ReCoRD bug fix
* Cleanup
* Fix merge issues.
* Revert "Remove unused argument." This reverts commit 96a7c37.
* Assorted responses to Alex's comments.
* Further ReCoRD fix.
* @iftenney's comments.
* Fix/simplify segment logic.
* @W4ngatang's comments
* Cleanup.
* Cleanup
* Fix issues with alternative embeddings_mode settings, max_layer.
* More mix cleanup.
* Masking fix.
* Address (most of) @iftenney's comments
* Tidying.
* Misc cleanup.
* Comment.
There's a lot going on here, and I'm still debugging. Suggestions for tests to add are very welcome!
I'm adding a few semi-related changes that are meant to help with clarity/maintainability:
I caught a bug along the way:
Note to self: