SQuAD update #1232
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1232 +/- ##
==========================================
- Coverage 56.57% 56.56% -0.02%
==========================================
Files 147 147
Lines 10578 10582 +4
==========================================
+ Hits 5985 5986 +1
- Misses 4593 4596 +3
Continue to review full report at Codecov.
@@ -21,6 +21,9 @@
import logging

# Store the tokenizers which insert 2 separators tokens
MULTI_SEP_TOKENS_TOKENIZERS_SET = {"roberta", "camembert", "bart"}
Thoughts on moving tokenizer-specific wrapping constants to a dedicated file (ext/tokenizers/constants.py or similar)? I'm worried about this logic being scattered across files.
This constant comes from Transformers: https://github.com/huggingface/transformers/blob/969859d5f67c7106de4d1098c4891c9b03694bbe/src/transformers/data/processors/squad.py#L17
Unfortunately, it was added in a later version of transformers than the one we currently require, so we can't import it yet (unless we bump the version; I'd like to wait until after Fall to do that because of the v3.0.2 issue). The upside is that it's defined in the same file that the rest of the SQuAD code comes from, so it will stay consistent for anyone looking to update the SQuAD implementation in the future.
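For context, here is a rough sketch of how this kind of constant is typically consumed when budgeting special tokens for question/context pairs. This is a paraphrase for illustration, not the exact jiant or transformers code, and attribute names such as `model_max_length` / `max_len_single_sentence` vary across transformers versions:

```python
# Hedged sketch: which tokenizers insert two separator tokens between segments.
MULTI_SEP_TOKENS_TOKENIZERS_SET = {"roberta", "camembert", "bart"}


def added_tokens_for_single_sequence(tokenizer):
    """Illustration of the branch the constant enables (pseudocode for the idea)."""
    # e.g. "RobertaTokenizer" -> "roberta"
    tokenizer_type = type(tokenizer).__name__.replace("Tokenizer", "").lower()
    base = tokenizer.model_max_length - tokenizer.max_len_single_sentence
    # RoBERTa-style tokenizers place "</s></s>" between question and context,
    # so one extra slot is reserved when budgeting the maximum sequence length.
    return base + 1 if tokenizer_type in MULTI_SEP_TOKENS_TOKENIZERS_SET else base
```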
The summary for point 2 is that previously (including in the Hugging Face implementation), the context was tokenized word by word rather than as a single string. This is problematic for tokenizers that treat start-of-word tokens differently.
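To make the issue concrete, here is a small, hypothetical illustration (using a RoBERTa tokenizer from transformers; it is not the code changed in this PR) of how per-word tokenization drops the start-of-word marker:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
context = "The quick brown fox"

# Tokenizing the context as one string keeps RoBERTa's start-of-word
# marker ("Ġ") on every word that follows a space.
as_string = tokenizer.tokenize(context)
# -> ['The', 'Ġquick', 'Ġbrown', 'Ġfox']

# Tokenizing word by word loses that marker, so every word is encoded
# as if it began the text, which changes the resulting token ids.
per_word = [t for word in context.split() for t in tokenizer.tokenize(word)]
# -> ['The', 'quick', 'brown', 'fox']

print(as_string != per_word)  # True: the two schemes disagree
```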
Although the cited PR states that there is a large performance improvement from this fix, I did not observe this myself in my testing (both in `jiant` and in the `transformers` examples). Both versions appeared to perform comparably.
For `jiant` users, this will impact tokenization, but training a fresh model with either the old or the new tokenization should work. I recommend deleting and recreating any SQuAD-based caches. However, if you consistently use only the old tokenization caches, that should be fine too.