SQuAD update #1232

Merged
merged 5 commits into from
Nov 11, 2020

Conversation

zphang
Collaborator

@zphang zphang commented Nov 11, 2020

  1. Fix the SQuAD v2 task config so that it properly supplies kwargs
  2. Update the SQuAD tokenization logic based on huggingface/transformers#7387 ("Fix tokenization in SQuAD for RoBERTa, Longformer, BART"), which was subsequently refactored in huggingface/transformers#7970 ("[tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups")

In short, for item 2: previously (including in the Hugging Face implementation), the context was tokenized word by word rather than as a single string. This is a problem for tokenizers that treat start-of-word tokens differently, since a word tokenized in isolation no longer carries the information that it was preceded by a space.
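To make the difference concrete, here is a minimal sketch using a toy tokenizer. `toy_tokenize` is a hypothetical stand-in written for this illustration, not a jiant or Transformers function; it only mimics the convention, used by byte-level BPE vocabularies, of marking a token that follows a space with `Ġ`.

```python
import re


def toy_tokenize(text):
    """Toy, space-sensitive tokenizer (hypothetical; for illustration only).

    Each whitespace-delimited word becomes one token. A word preceded by a
    space gets a "Ġ" prefix, mimicking start-of-word markers.
    """
    pieces = re.findall(r" ?\S+", text)
    return ["Ġ" + p[1:] if p.startswith(" ") else p for p in pieces]


context = "The quick brown fox"

# Old behavior: split on whitespace first, then tokenize each word alone.
# Every word looks like the start of a string, so the space markers are lost.
per_word = [tok for word in context.split() for tok in toy_tokenize(word)]

# New behavior: tokenize the context as a single string.
whole_string = toy_tokenize(context)

print(per_word)      # ['The', 'quick', 'brown', 'fox']
print(whole_string)  # ['The', 'Ġquick', 'Ġbrown', 'Ġfox']
```

With a real subword vocabulary, `quick` and `Ġquick` would generally be different vocabulary entries, so the two approaches can feed different token IDs to the model.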

Although the cited PR states that there is a large performance improvement from this fix, I did not observe this myself in my testing (both in jiant as well as in the transformers examples). Both versions appeared to perform comparably.

For jiant users, this will change tokenization, but training a fresh model should work with either the old or the new tokenization. I recommend deleting and recreating any SQuAD-based caches. However, if you consistently use only caches built with the old tokenization, that should be fine too.

@codecov

codecov bot commented Nov 11, 2020

Codecov Report

Merging #1232 (99fc599) into master (e2e85c9) will decrease coverage by 0.01%.
The diff coverage is 20.00%.

@@            Coverage Diff             @@
##           master    #1232      +/-   ##
==========================================
- Coverage   56.57%   56.56%   -0.02%     
==========================================
  Files         147      147              
  Lines       10578    10582       +4     
==========================================
+ Hits         5985     5986       +1     
- Misses       4593     4596       +3     
Impacted Files Coverage Δ
...t/scripts/download_data/dl_datasets/files_tasks.py 7.49% <ø> (ø)
jiant/tasks/lib/templates/squad_style/core.py 30.71% <20.00%> (-0.09%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -21,6 +21,9 @@

import logging

# Store the tokenizers which insert 2 separators tokens
MULTI_SEP_TOKENS_TOKENIZERS_SET = {"roberta", "camembert", "bart"}
Collaborator

Thoughts on moving tokenizer specific wrapping constants to a dedicated file (ext/tokenizers/constants.py or similar)? Worried about the logic being scattered throughout files.

Collaborator Author

This constant comes from Transformers: https://github.com/huggingface/transformers/blob/969859d5f67c7106de4d1098c4891c9b03694bbe/src/transformers/data/processors/squad.py#L17

Unfortunately, it was added in a later version of Transformers than the one we currently require, so we can't import it yet. We could bump the version, but I'd like to wait until after Fall because of the v3.0.2 issue. The upside is that it comes from the same file as the rest of the SQuAD code, so it will be consistent for anyone looking to update the SQuAD implementation in the future.
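
For readers unfamiliar with the constant, here is a rough sketch of how a set like this can be consulted. The helper names `tokenizer_family` and `uses_double_separator` are hypothetical, and deriving the family from the tokenizer's class name is an assumption made for this illustration rather than a description of the code in this PR. The two classes are empty stand-ins so the sketch runs without Transformers installed.

```python
# Copied from the diff above.
MULTI_SEP_TOKENS_TOKENIZERS_SET = {"roberta", "camembert", "bart"}


def tokenizer_family(tokenizer):
    """Hypothetical helper: map e.g. a RobertaTokenizer instance to 'roberta'.

    Assumes the class name has the form '<Family>Tokenizer'; names with other
    suffixes would need extra handling.
    """
    return type(tokenizer).__name__.replace("Tokenizer", "").lower()


def uses_double_separator(tokenizer):
    """Return True if the tokenizer's family is in the multi-separator set."""
    return tokenizer_family(tokenizer) in MULTI_SEP_TOKENS_TOKENIZERS_SET


# Empty stand-in classes (not the real Transformers tokenizers).
class RobertaTokenizer:
    pass


class BertTokenizer:
    pass


print(uses_double_separator(RobertaTokenizer()))  # True
print(uses_double_separator(BertTokenizer()))     # False
```

A check like this matters when computing how many special tokens a question/context pair adds, since that count determines how much room is left for context tokens in each window.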

@jeswan jeswan merged commit 838cdd2 into nyu-mll:master Nov 11, 2020
@zphang zphang deleted the squad_update branch November 11, 2020 20:59
@zphang zphang restored the squad_update branch November 11, 2020 20:59
@zphang zphang deleted the squad_update branch November 11, 2020 20:59
@zphang zphang restored the squad_update branch November 11, 2020 20:59
@zphang zphang deleted the squad_update branch November 11, 2020 20:59