Import submodules accessed by pickled functions #80
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master      #80     +/-  ##
==========================================
+ Coverage   78.97%   79.44%   +0.46%
==========================================
  Files           2        2
  Lines         490      501      +11
  Branches       97      102       +5
==========================================
+ Hits          387      398      +11
  Misses         75       75
  Partials       28       28
==========================================
Thank you for the PR and for continuing on while things are quiet here. Happy to bring this in once it passes CI if you're willing to help us maintain the package in general.
tests/cloudpickle_test.py (Outdated)

    # uses fork to preserve the loaded environment.
    assert not subprocess.call(['python', '-c',
                                'import pickle; (pickle.loads(' +
                                s.__str__() + '))()'])
I am not sure all the characters in a pickle are safe to pass as an argument to a subprocess call. Maybe you should try to base64 encode the payload.
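Base64-encoding sidesteps the problem: the encoded payload contains only shell-safe characters, so it can be embedded directly in a `-c` command string. A minimal sketch of the idea (plain `pickle` stands in for cloudpickle here, and the variable names are illustrative, not taken from the PR):

```python
import base64
import pickle
import subprocess
import sys

# Serialize some payload; the PR pickles a function with cloudpickle,
# but the shell-safety concern is identical for any pickle bytes.
payload = pickle.dumps(7 * 7)

# base64 makes the raw bytes safe to embed in a command-line string.
encoded = base64.b64encode(payload).decode('ascii')

child_code = (
    "import base64, pickle; "
    "print(pickle.loads(base64.b64decode('%s')))" % encoded
)

# Round-trip the payload through a fresh interpreter.
out = subprocess.run([sys.executable, '-c', child_code],
                     capture_output=True, text=True)
print(out.stdout.strip())  # expected: 49
```

The same encode-in-parent / decode-in-child dance works for arbitrary pickled functions, which is what the test ultimately does.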
OK, so this change to cloudpickle tries to detect whether a function (being pickled) might access, via attribute lookup, a module contained within a package. If so, it instructs the unpickler to re-import that module. (Cloudpickle already instructs the unpickler to re-import the top-level package, but this is not always sufficient to re-import the required child modules.)

An alternative would be simply to document that all functions to be pickled must import dependent modules as top-level objects in the namespace (i.e. ask the user never to reach submodules through attribute access on the parent package).

There is a test that the function can be unpickled and executed in a fresh new process. (Thanks @ogrisel for suggesting the encoding.) There is also a pair of tests that use the original process, which is trickier to test because the tests must undo imports. It is a pair because cloudpickle handles globals and closures slightly differently.

Hope this is helpful.
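The underlying issue can be observed with the standard library alone: importing a package does not import its submodules, so attribute access on the package fails in a fresh process even though the top-level name exists. A small sketch (using `xml` purely as an example package):

```python
import subprocess
import sys

# In a fresh interpreter, `import xml` does not import `xml.etree`,
# so the submodule is not reachable as an attribute of the package.
probe = "import xml; print(hasattr(xml, 'etree'))"
out = subprocess.run([sys.executable, '-c', probe],
                     capture_output=True, text=True)
print(out.stdout.strip())  # expected: False
```

This is why re-importing only the top-level package on unpickle is not always sufficient: a function body that evaluates `xml.etree.ElementTree` would hit an `AttributeError` unless the submodule import is replayed too.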
Besides my comments, this LGTM.
cloudpickle/cloudpickle.py (Outdated)

    tokens = set(name[len(prefix):].split('.'))
    if not tokens - set(code.co_names):
        self.save(module)  # ensure the unpickler executes import of this submodule
        self.write(pickle.POP)  # then discard the reference to it
Cosmetic: could you please move the comments onto their own lines (before the matching statements), so as to avoid long (80+ column) lines?
Note, this file does not adopt that convention elsewhere.
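The save-then-POP trick works because unpickling a global reference triggers the import machinery, and the `POP` opcode then discards the pushed object, leaving only the import side effect. This can be illustrated with a hand-written protocol-0 pickle stream (the byte string below is constructed for illustration and is not taken from the PR):

```python
import pickle
import sys

# Protocol-0 opcodes used below:
#   c<module>\n<name>\n  GLOBAL: import the module, push the named attribute
#   0                    POP: discard the top of the stack
#   N                    NONE: push None
#   .                    STOP: return the top of the stack
payload = b'cxml.etree.ElementTree\nComment\n0N.'

result = pickle.loads(payload)
print(result)  # expected: None

# Side effect: the unpickler imported the submodule while resolving GLOBAL.
print('xml.etree.ElementTree' in sys.modules)  # expected: True
```

The patch emits the same shape of stream: pickle the module (forcing its import on the receiving side), then write `pickle.POP` so the module object itself never ends up in the reconstructed value.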
    # deserialise
    f = pickle.loads(s)
    f()  # perform test for error
I assume that your fix should also handle a case such as the following:

    global etree
    import xml.etree.ElementTree as etree

    def example():
        x = etree.Comment
        ...

Maybe it would still be worth adding a test to make sure that this import pattern is also supported.
Note, such a test would have passed even prior to this pull request.
Alright, but thanks for having added the test as a non-regression test anyway :)
This test would already pass prior to the changes for supporting submodules (per @ogrisel).
Squash merged. Any volunteer for a bugfix release?
@@ -307,6 +328,8 @@ def save_function_tuple(self, func):
     save(_fill_function)  # skeleton function updater
     write(pickle.MARK)    # beginning of tuple that _fill_function expects

+    self._save_subimports(code, set(f_globals.values()) | set(closure))
This line creates a regression: #86
This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
I'm reviewing a new issue in the parallel-tutorial and I believe it may be related to this PR. The error I'm seeing is:
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the current released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using the stock cloudpickle. Some notable changes:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard __transient__ check (cloudpipe/cloudpickle#110)** -- while we don't use this internally, and have no tests or documentation for its use, downstream code may use __transient__; although it has never been part of the API, if we merge this we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.
Addresses issue #78.