Skip to content
This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[torchscript] Fix tokenization error for special tokens #4489

Merged
merged 1 commit into from
Apr 8, 2022

Conversation

maryamdaneshi
Copy link
Contributor

Patch description
The cairaoke model has a somewhat new format that revealed a bug in tokenizing and parsing the special tokens.
This change fix that, please see the test plan for the issue and the fix.

Testing steps

The command
buck run //deeplearning/projects/parlai:parlaicmd -- torchscript --model-file manifold://cair_models/tree/cairaoke/experimental/faiar_coco/f333577911/cstudio_r0 --model fb:bart/cairaoke_bart --scripted-model-file manifold://cair_models/tree/cairaoke/experimental/test/script_debug.pt --no_cuda --input 'produce timer for 45 minutes and 60 minutes|api_resp: get_entity.time = 2021-04-23t07:11:00.000-07:00 , 2021-04-23t07:11:00.000-07:00'

api_call: create_timer.time.0 = 2021-04-23t07:11:00.000-07:00 ; create_timer.time.1 = 2021-04-23t07:11:00.000-07:00 ; create_timer.time.2 = 2021-04-23t07:11:00.000-07:00
the wrong prediction was also happening for the unscripted model
produced token
[33416, 33423, 33415, 33415, 1053, 40, 19263, 17, 4447, 17, 20, 740, 2775, 2190, 1053, 40, 19263, 17, 4447, 17, 19, 740, 302, 3590, 2190]
expected tokens
[33416, 33423, 33415, 1053, 40, 19263, 17, 4447, 17, 19, 740, 302, 3590, 2190, 33415, 1053, 40, 19263, 17, 4447, 17, 20, 740, 2775, 2190]
after fix
api_call: create_timer.time.0 = 2021-04-23t07:11:00.000-07:00 ; create_timer.time.1 = 2021-04-23t07:11:00.000-07:00

Copy link
Contributor

@EricMichaelSmith EricMichaelSmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, great find!

@maryamdaneshi
Copy link
Contributor Author

is the pre-commit issue known? seems to be on all other pull requests as well

@klshuster
Copy link
Contributor

is the pre-commit issue known? seems to be on all other pull requests as well

yep, known issue

@maryamdaneshi maryamdaneshi merged commit 1d70b0e into main Apr 8, 2022
@maryamdaneshi maryamdaneshi deleted the torchscript_fix branch April 8, 2022 21:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants