This repository has been archived by the owner on Nov 3, 2023. It is now read-only.
[torchscript] Fix tokenization error for special tokens #4489
Patch description
The cairaoke model uses a somewhat new format, which revealed a bug in tokenizing and parsing the special tokens.
This change fixes that bug. See the testing steps below for a reproduction of the issue and the result after the fix.
Testing steps
Before the fix, the command below produced the wrong prediction shown after it:
buck run //deeplearning/projects/parlai:parlaicmd -- torchscript --model-file manifold://cair_models/tree/cairaoke/experimental/faiar_coco/f333577911/cstudio_r0 --model fb:bart/cairaoke_bart --scripted-model-file manifold://cair_models/tree/cairaoke/experimental/test/script_debug.pt --no_cuda --input 'produce timer for 45 minutes and 60 minutes|api_resp: get_entity.time = 2021-04-23t07:11:00.000-07:00 , 2021-04-23t07:11:00.000-07:00'
api_call: create_timer.time.0 = 2021-04-23t07:11:00.000-07:00 ; create_timer.time.1 = 2021-04-23t07:11:00.000-07:00 ; create_timer.time.2 = 2021-04-23t07:11:00.000-07:00
The same wrong prediction also occurred with the unscripted model.
produced tokens
[33416, 33423, 33415, 33415, 1053, 40, 19263, 17, 4447, 17, 20, 740, 2775, 2190, 1053, 40, 19263, 17, 4447, 17, 19, 740, 302, 3590, 2190]
expected tokens
[33416, 33423, 33415, 1053, 40, 19263, 17, 4447, 17, 19, 740, 302, 3590, 2190, 33415, 1053, 40, 19263, 17, 4447, 17, 20, 740, 2775, 2190]
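The difference between the two lists is easier to see programmatically. The following is a minimal sketch, not part of the patch: it copies the two token lists from above and reports where they first diverge and where the ID 33415 appears in each. The helper names `first_divergence` and `positions_of` are hypothetical, and treating 33415 as one of the special tokens is an assumption based only on its position in the lists; the description does not say which ID maps to which token.

```python
from collections import Counter

# Token lists copied verbatim from the testing steps above.
produced = [33416, 33423, 33415, 33415, 1053, 40, 19263, 17, 4447, 17, 20,
            740, 2775, 2190, 1053, 40, 19263, 17, 4447, 17, 19, 740, 302,
            3590, 2190]
expected = [33416, 33423, 33415, 1053, 40, 19263, 17, 4447, 17, 19, 740,
            302, 3590, 2190, 33415, 1053, 40, 19263, 17, 4447, 17, 20, 740,
            2775, 2190]

# Assumption: 33415 is a special token that introduces each segment.
ASSUMED_SPECIAL_ID = 33415


def first_divergence(a, b):
    """Return the first index where a and b differ, or None if they are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


def positions_of(tokens, token_id):
    """Return every index at which token_id occurs in tokens."""
    return [i for i, t in enumerate(tokens) if t == token_id]


print(first_divergence(produced, expected))        # 3
print(positions_of(produced, ASSUMED_SPECIAL_ID))  # [2, 3]
print(positions_of(expected, ASSUMED_SPECIAL_ID))  # [2, 14]
print(Counter(produced) == Counter(expected))      # True
```

Both lists contain the same tokens with the same counts. In the produced list the two occurrences of 33415 sit next to each other at the front, while in the expected list each one precedes its own segment, and the two segments appear in the opposite order.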
after fix
api_call: create_timer.time.0 = 2021-04-23t07:11:00.000-07:00 ; create_timer.time.1 = 2021-04-23t07:11:00.000-07:00