This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[torchscript] Support special tokens in torchscript module. #3644

Merged · 1 commit merged into master on May 18, 2021

Conversation

stephenroller (Contributor)

Patch description
Add special-token support for TorchScript. The implementation doesn't mirror the original, because the original relies on recursion, which TorchScript doesn't support.

Testing steps
New CI. Internal testing
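The core of the change described above can be sketched as follows. This is a hedged illustration, not ParlAI's actual code: `split_recursive` stands in for the original recursive splitter that TorchScript cannot compile, and `split_iterative` shows the same splitting done with a work list instead.

```python
# Hypothetical sketch of the rewrite described in the patch notes.
# Function names and structure are illustrative, not ParlAI's real API.
from typing import List


def split_recursive(text: str, specials: List[str]) -> List[str]:
    """Original style: recurse on each remaining special token."""
    if not specials:
        return [text] if text else []
    head, rest = specials[0], specials[1:]
    out: List[str] = []
    pieces = text.split(head)
    for i, piece in enumerate(pieces):
        out += split_recursive(piece, rest)
        if i + 1 < len(pieces):
            out.append(head)
    return out


def split_iterative(text: str, specials: List[str]) -> List[str]:
    """TorchScript-friendly rewrite: a flat work list, no recursion."""
    segments: List[str] = [text]
    for tok in specials:
        next_segments: List[str] = []
        for seg in segments:
            if seg in specials:
                # Already split out as a special token; keep verbatim.
                next_segments.append(seg)
                continue
            pieces = seg.split(tok)
            for i, piece in enumerate(pieces):
                if piece:
                    next_segments.append(piece)
                if i + 1 < len(pieces):
                    next_segments.append(tok)
        segments = next_segments
    return segments
```

On the example from the PR's test, both versions produce `["Don't have a ", 'Q00', ', man! Have a ', 'Q01', ' instead.']`, so the iterative rewrite is behavior-preserving for this case.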

@EricMichaelSmith (Contributor) left a comment

Makes sense, minor nits

@@ -481,7 +488,41 @@ def encode(self, text: str) -> List[str]:
"""
if self.add_prefix_space:
text = f' {text}'
return self.helper_encode(text)

# constants for readability
Contributor:

hmm perhaps it'd help to have a 1-sentence comment about how the special tokens code works at a high level, and maybe also what FINAL and SPLITABLE are?
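One plausible reading of the scheme this comment asks about is a tagged work list: every piece is marked either FINAL (a special token, emitted verbatim) or SPLITABLE (ordinary text still to be BPE-encoded). The sketch below is a guess at that high-level idea; the flag values, names, and the trivial whitespace "BPE" stand-in are mine, not ParlAI's.

```python
# Illustrative only: FINAL/SPLITABLE as tags on a flat work list,
# avoiding the recursion TorchScript cannot compile.
from typing import List, Tuple

FINAL = True       # piece is a special token; never split it further
SPLITABLE = False  # piece is ordinary text; still needs BPE encoding


def encode_with_specials(text: str, specials: List[str]) -> List[str]:
    work: List[Tuple[bool, str]] = [(SPLITABLE, text)]
    for tok in specials:
        nxt: List[Tuple[bool, str]] = []
        for flag, piece in work:
            if flag == FINAL:
                nxt.append((flag, piece))
                continue
            parts = piece.split(tok)
            for i, part in enumerate(parts):
                if part:
                    nxt.append((SPLITABLE, part))
                if i + 1 < len(parts):
                    nxt.append((FINAL, tok))
        work = nxt
    out: List[str] = []
    for flag, piece in work:
        if flag == FINAL:
            out.append(piece)
        else:
            out.extend(piece.split())  # stand-in for the real BPE step
    return out
```

The tag keeps already-extracted special tokens from being re-split or BPE-encoded on later passes, which is what the recursion handled implicitly in the original.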

@@ -62,6 +62,72 @@ def test_token_splitter(self):
if idx + 1 == num_examples:
break

def test_special_tokenization(self):
from parlai.core.dict import DictionaryAgent
Contributor:

nit: these imports could go to the top, right? Or at least the first 2?

from parlai.torchscript.modules import ScriptableDictionaryAgent

SPECIAL = ['Q00', 'Q01']
text = "Don't have a Q00, man! Have a Q01 instead."
Contributor:

😆

assert len(tokenized) == 15
assert sda.vec2txt(tokenized) == text
nice_tok = [sda.ind2tok[i] for i in tokenized]

Contributor:

Nit: a few variables are assigned but never used, per the lint messages

special_tokenized = sda.txt2vec(text)
assert len(special_tokenized) == 15
assert sda.vec2txt(special_tokenized) == text
assert special_tokenized != tokenized
Contributor:

Thought: it could be even more explicit to check the actual strings of the output tokens, instead of just their length and whether they match with/without special tokens. No strong opinion on this either way, though

@stephenroller (Contributor, Author):

Sorry, landing in expediency.

@stephenroller stephenroller merged commit 3bf87ea into master May 18, 2021
@stephenroller stephenroller deleted the torchspecial branch May 18, 2021 17:46
3 participants