
Import submodules accessed by pickled functions #80

Merged (8 commits) on Feb 24, 2017

Conversation

@benjimin (Contributor) commented Feb 7, 2017

Addresses issue #78.

@codecov-io commented Feb 7, 2017

Codecov Report

Merging #80 into master will increase coverage by 0.46%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #80      +/-   ##
==========================================
+ Coverage   78.97%   79.44%   +0.46%     
==========================================
  Files           2        2              
  Lines         490      501      +11     
  Branches       97      102       +5     
==========================================
+ Hits          387      398      +11     
  Misses         75       75              
  Partials       28       28
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 79.31% <100%> (+0.46%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rgbkrk (Member) commented Feb 10, 2017

Thank you for the PR and for continuing on while things are quiet here.

Happy to bring this in once it passes CI if you're willing to help us maintain the package in general.

# uses fork to preserve the loaded environment.
assert not subprocess.call(['python', '-c',
                            'import pickle; (pickle.loads(' +
                            s.__str__() + '))()'])
@ogrisel (Contributor) commented on this snippet:

I am not sure all the characters in a pickle are safe to pass as an argument to a subprocess call. Maybe you should try to base64 encode the payload.
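
A minimal sketch of that suggestion (illustrative names, not the PR's actual test code; it assumes cloudpickle is importable in the child interpreter):

import base64
import subprocess

import cloudpickle

def example():
    return 42

s = cloudpickle.dumps(example)

# base64 keeps the payload safe to embed in a command line: the
# encoded form contains no quotes, newlines, or null bytes.
payload = base64.b64encode(s).decode('ascii')
assert not subprocess.call([
    'python', '-c',
    'import base64, pickle; '
    'pickle.loads(base64.b64decode("%s"))()' % payload,
])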

@benjimin (Contributor, author) commented:

OK, so this change to cloudpickle tries to detect whether a function being pickled might access, via attribute lookup, a module contained within a package. If so, it instructs the unpickler to re-import that module. (Cloudpickle already instructs it to re-import the top-level package, but that is not always sufficient to make the required child modules available.)
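
To illustrate the failure mode (a hypothetical reproduction, not code from this PR):

import xml.etree.ElementTree  # binds only the top-level name 'xml' here

def example():
    # The submodule is reached by attribute access, so the unpickling
    # process must have imported xml.etree.ElementTree itself;
    # re-importing the top-level 'xml' package alone leaves the
    # 'etree.ElementTree' attribute chain unset and this call fails.
    return xml.etree.ElementTree.Comment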

An alternative would simply be to document that all functions to be pickled must import dependent modules as top-level names in their namespace (i.e. ask the user never to write import foo.bar without aliasing it as, say, foobar). However, it is in the spirit of cloudpickle to be more magic than ordinary pickle (e.g. it does not ask the user to refactor out lambdas).

There is a test that the function can be unpickled and executed in a fresh new process (thanks @ogrisel for suggesting the encoding). There is also a pair of tests that use the original process, which is trickier because they must undo imports; it is a pair because cloudpickle handles globals and closures slightly differently.

Hope this is helpful.

@ogrisel (Contributor) left a review:

Besides my comments, this LGTM.

tokens = set(name[len(prefix):].split('.'))
if not tokens - set(code.co_names):
    self.save(module)       # ensure the unpickler executes import of this submodule
    self.write(pickle.POP)  # then discard the reference to it
@ogrisel (Contributor) commented:

Cosmetic: could you please move the comments onto their own lines (before the matching statements), so as to avoid lines longer than 80 columns?
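
The requested layout would look like this (illustrative):

# ensure the unpickler executes import of this submodule
self.save(module)
# then discard the reference to it
self.write(pickle.POP)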

@benjimin (Contributor, author) replied:

Note, this file does not adopt that convention elsewhere.


# deserialise
f = pickle.loads(s)
f() # perform test for error
@ogrisel (Contributor) commented:

I assume that your fix should also handle a case such as the following:

global etree
import xml.etree.ElementTree as etree
def example():
    x = etree.Comment
...

Maybe it would still be worth adding a test to make sure that this import pattern is also supported.
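
A sketch of such a test (hypothetical names; the actual test added to the PR may differ):

import pickle

import cloudpickle
import xml.etree.ElementTree as etree  # submodule bound to a plain global name

def example():
    return etree.Comment

def test_submodule_aliased_as_global():
    # Round-trip the function and check that the aliased submodule is
    # still reachable through the function's globals.
    f = pickle.loads(cloudpickle.dumps(example))
    assert f() is etree.Comment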

@benjimin (Contributor, author) replied:

Note, such a test would have passed even prior to this pull request.

@ogrisel (Contributor) replied:

Alright, but thanks for having added the test as a non-regression test anyway :)

Commit "per ogrisel": This test would already pass prior to the changes for supporting submodules.

Commit "per ogrisel"

@ogrisel merged commit 938fc0d into cloudpipe:master on Feb 24, 2017
@ogrisel (Contributor) commented Feb 24, 2017

Squash merged. Any volunteer for a bugfix release?

@@ -307,6 +328,8 @@ def save_function_tuple(self, func):
     save(_fill_function)  # skeleton function updater
     write(pickle.MARK)    # beginning of tuple that _fill_function expects

+    self._save_subimports(code, set(f_globals.values()) | set(closure))
A member later commented on this line:

This line creates a regression: #86
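
A plausible reproduction of that regression (hypothetical, inferred from the line itself: building a set over f_globals.values() requires every global value to be hashable):

import cloudpickle

unhashable_global = []  # lists are not hashable, so set() rejects them

def f():
    return unhashable_global

# Under the line above, this raised:
# TypeError: unhashable type: 'list'
cloudpickle.dumps(f)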

rgbkrk added a commit to rgbkrk/spark that referenced this pull request Jun 12, 2017
This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
@ahmadia commented Jul 9, 2017

I'm reviewing a new issue in the parallel-tutorial and I believe it may be related to this PR. The error I'm seeing is:

Traceback (most recent call last):
  File "prep.py", line 64, in <module>
    dask.compute(values)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/base.py", line 204, in compute
    results = get(dsk, keys, **kwargs)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/multiprocessing.py", line 177, in get
    raise_exception=reraise, **kwargs)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/compatibility.py", line 59, in reraise
    raise exc.with_traceback(tb)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/local.py", line 289, in execute_task
    task, data = loads(task_info)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 840, in subimport
    __import__(name)
ImportError: No module named '_pandasujson'
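
For context, the subimport helper in the traceback re-imports a recorded module name in the unpickling process. A sketch of its behaviour (assuming the implementation matches what the traceback shows):

import sys

def subimport(name):
    # Re-import the recorded (sub)module by name in the unpickling
    # process. If that name is not importable there -- e.g. the
    # dynamically registered '_pandasujson' above -- this raises
    # ImportError.
    __import__(name)
    return sys.modules[name]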

ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 22, 2017
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the currently released cloudpickle and to minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using the stock cloudpickle.

Some notable changes:
* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard __transient__ check (cloudpipe/cloudpickle#110)** -- while we don't use this internally and have no tests or documentation for its use, downstream code may use __transient__ even though it has never been part of the API; if we merge this we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.