[Torchscript] Parallelized Text/Sequence Preprocessing #2206
This PR parallelizes tokenization for sequence and text features. This gives us inference throughput that is better than or equal to vanilla Ludwig model preprocessing.
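As a rough illustration only (not the code in this PR), here is a minimal sketch of how per-feature tokenization can be parallelized in TorchScript with `torch.jit.fork` / `torch.jit.wait`. The `_tokenize` helper and the `ParallelPreprocessor` module are hypothetical stand-ins for the real feature preprocessors:

```python
from typing import Dict, List

import torch


@torch.jit.script
def _tokenize(texts: List[str]) -> List[List[str]]:
    # Placeholder whitespace tokenizer standing in for a real sequence/text tokenizer.
    return [t.split(" ") for t in texts]


class ParallelPreprocessor(torch.nn.Module):
    def forward(self, inputs: Dict[str, List[str]]) -> Dict[str, List[List[str]]]:
        # Launch one asynchronous tokenization task per text/sequence feature.
        names: List[str] = []
        futures: List[torch.jit.Future[List[List[str]]]] = []
        for name, texts in inputs.items():
            names.append(name)
            futures.append(torch.jit.fork(_tokenize, texts))

        # Collect results; TorchScript may run the forked tasks on its inter-op thread pool.
        results: Dict[str, List[List[str]]] = {}
        for i in range(len(names)):
            results[names[i]] = torch.jit.wait(futures[i])
        return results


module = torch.jit.script(ParallelPreprocessor())
print(module({"review": ["great movie", "terrible plot"], "summary": ["watch it"]}))
```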
One thing to note: there were/are strange interactions between `torch.no_grad` and the added parallelism. It seems that `torch.no_grad` affects a global flag that activates/deactivates gradient computation (link), and that this flag is not properly reset after some scripted, parallelized operations.

The workaround rests on the insight that gradients are not computed during preprocessing, so the extraneous `torch.no_grad` statements could be removed (particularly around preprocessing and postprocessing). Applying the `torch.no_grad` context exclusively at the predictor stage of inference is enough to ensure that the module output tensors carry no gradients. Added tests confirm this. That said, we should keep this issue in mind if we decide to introduce parallelism in our ECD architecture.