-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for serializing modules involved in LambdaOp execution by value #1741
Conversation
This commit adds an optional parameter to Workflow.save so that users can indicate that certain modules should be serialized by value. This is necessary if a LambdaOp in a workflow depends on a module whose source file will not be available in the deployment environment. Related to NVIDIA-Merlin#1737.
This commit adds code to automatically infer LambdaOp module dependencies in several common cases: 1. in which a function is passed to LambdaOp by name, 2. in which the argument to LambdaOp is a lambda expression that refers to a function by a fully-qualified name, and 3. in which the argument to LambdaOp is a lambda expression that refers to a function via another variable The current implementation does not inspect the bodies of any function passed to LambdaOp, and many corner cases are (necessarily) omitted. However, this support should be complete enough to be useful for many users. Automatic inference is optional (via a parameter to Workflow.save) but it could be the default in the future. Related to issue NVIDIA-Merlin#1737.
Documentation preview |
The CI tests on this PR appear to hang during |
Thanks, @karlhigley — I’ll try and figure out what’s going on here. |
Thanks, @karlhigley! |
Thank you, @willb! Sorry it took us so long to get it merged, but this looks great. We literally ran into this problem again today, so the timing is fortuitous. |
These commits address issue #1737 in two ways:
Workflow.save
to serialize an explicit list of named modules by value, andWorkflow.save
to use a heuristic to automatically infer the external modules involved in a workflow.Taken together, these commits will make it easier to serialize workflows that are resilient to execution in environments without source files for all of the modules that were available when the workflow was created, e.g., Docker containers.
The main contribution is the addition of a
modules_byvalue
parameter toWorkflow.save
. If this is passed an explicit list of modules,Workflow.save
will directcloudpickle
to serialize these modules by value. If it is passed the string"auto"
,Workflow.save
will employ a heuristic approach that will be useful for many real-world uses of LambdaOp.I believe that it would be harmless and useful to make
"auto"
the default but have not done so in this PR.For more background on some of the tradeoffs that would be involved in making the heuristic more precise, please see this blog post. I do not believe that many of the cases I identified in that post (explicit imports within functions, list comprehensions, etc.) are likely to present in realistic NVTabular LambdaOp client code, and thus took a relatively simple approach.