Updated characters, underscore and comma preprocessors to be TorchScriptable. #3602

Merged
7 commits merged into master from update-basic-preprocessors-to-be-torchscriptable on Sep 14, 2023

martindavis (Contributor) commented Sep 13, 2023

The comma, underscore and characters preprocessors are not currently TorchScriptable because they do not implement the TorchScript Module methods. This PR adds a general StringSplitTokenizer that can be reused across the various default tokenizers we support. Unfortunately, since TorchScript only supports basic data types, CharactersToListTokenizer had to be written as a separate class instead of using a lambda function.

This will help users run into this error less often:

ValueError: comma is not supported by torchscript. Please use one of {'sentencepiece', 'space_punct', 'clip', 
'gpt2bpe', 'bert', 'space'}.

This is covered by the tests/integration_tests/test_torchscript.py::test_torchscript_e2e_text test, because this PR updates TORCHSCRIPT_COMPATIBLE_TOKENIZERS to include the new TorchScriptable tokenizers.
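
For readers unfamiliar with where the error above comes from, here is a minimal sketch of a membership check against TORCHSCRIPT_COMPATIBLE_TOKENIZERS. It is an illustration under assumptions, not Ludwig's actual code: the helper name check_torchscript_tokenizer is hypothetical, and the set contents are taken only from the names mentioned in this PR.

```python
# Illustrative only: the set mirrors the names mentioned in this PR, and
# check_torchscript_tokenizer is a hypothetical helper, not Ludwig's API.
TORCHSCRIPT_COMPATIBLE_TOKENIZERS = {
    "space", "space_punct", "sentencepiece", "clip", "gpt2bpe", "bert",
    # Names this PR is described as adding:
    "comma", "underscore", "characters",
}


def check_torchscript_tokenizer(name: str) -> str:
    """Return `name` if it is in the compatible set, else raise ValueError."""
    if name not in TORCHSCRIPT_COMPATIBLE_TOKENIZERS:
        raise ValueError(
            f"{name} is not supported by torchscript. "
            f"Please use one of {TORCHSCRIPT_COMPATIBLE_TOKENIZERS}."
        )
    return name
```

With the three new names in the set, a config that requests the comma tokenizer passes this kind of check instead of raising.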

github-actions bot commented Sep 13, 2023

Unit Test Results

  6 files  ±0    6 suites  ±0   47m 17s ⏱️ + 2m 17s
31 tests ±0  26 ✔️ ±0    5 💤 ±0  0 ±0 
82 runs  ±0  66 ✔️ ±0  16 💤 ±0  0 ±0 

Results for commit 7401aa6. ± Comparison against base commit d15a0c5.

♻️ This comment has been updated with latest results.

ludwig/utils/tokenizers.py
if isinstance(v, torch.Tensor):
    raise ValueError(f"Unsupported input: {v}")

inputs: List[str] = []

I see that you are adapting an existing implementation, though this seems more complicated than I would expect (for example, why do we have a get_tokens() function that returns its own input?).

@geoffreyangus, ooc does this also look strange to you, or is this imposed on us by torchscript?

geoffreyangus (Contributor) commented Sep 14, 2023

It looks like NgramTokenizer, which subclasses SpaceStringToListTokenizer (which in turn subclasses the new StringSplitTokenizer), overrides get_tokens: https://github.com/ludwig-ai/ludwig/pull/3602/files#diff-5cbace55f4f4fd07725c061b9f981b83fe43cb53b0045cf1257c9fb5d4931f0dR132-R142
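
To make the role of get_tokens concrete, here is a rough sketch of the hook pattern described in this thread. It is written in plain Python rather than as torch.nn.Module subclasses so that it stays self-contained, and the class bodies (including the n-gram logic) are assumptions for illustration, not the code in this PR. Only the class names and the get_tokens name come from the discussion.

```python
from typing import List


class StringSplitTokenizer:
    """Splits on a fixed separator (plain-Python stand-in, not a torch Module)."""

    def __init__(self, split_string: str):
        self.split_string = split_string

    def __call__(self, text: str) -> List[str]:
        pieces = text.strip().split(self.split_string)
        # get_tokens is the extension hook; empty tokens are dropped afterwards.
        return [t for t in self.get_tokens(pieces) if len(t) > 0]

    def get_tokens(self, tokens: List[str]) -> List[str]:
        # Identity by default: the base class has nothing to add.
        return tokens


class SpaceStringToListTokenizer(StringSplitTokenizer):
    def __init__(self):
        super().__init__(split_string=" ")


class NgramTokenizer(SpaceStringToListTokenizer):
    """Overrides the hook to emit unigrams followed by longer n-grams (assumed logic)."""

    def __init__(self, n: int = 2):
        super().__init__()
        self.n = n

    def get_tokens(self, tokens: List[str]) -> List[str]:
        out: List[str] = []
        for size in range(1, self.n + 1):
            for i in range(len(tokens) - size + 1):
                out.append(" ".join(tokens[i : i + size]))
        return out
```

In this sketch, StringSplitTokenizer(",")("a,b,,c") gives ["a", "b", "c"], while NgramTokenizer(2)("a b c") gives ["a", "b", "c", "a b", "b c"]. The base class returning its own input is what lets a subclass change the token stream without reimplementing the splitting and filtering.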

martindavis merged commit 2365de7 into master Sep 14, 2023
martindavis deleted the update-basic-preprocessors-to-be-torchscriptable branch September 14, 2023 20:34
3 participants