
Import submodules accessed by pickled functions #80

Merged (8 commits) on Feb 24, 2017

Conversation

@benjimin (Contributor) commented Feb 7, 2017

Addresses issue #78.

@codecov-io commented Feb 7, 2017

Codecov Report

Merging #80 into master will increase coverage by 0.46%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #80      +/-   ##
==========================================
+ Coverage   78.97%   79.44%   +0.46%     
==========================================
  Files           2        2              
  Lines         490      501      +11     
  Branches       97      102       +5     
==========================================
+ Hits          387      398      +11     
  Misses         75       75              
  Partials       28       28
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 79.31% <100%> (+0.46%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rgbkrk (Member) commented Feb 10, 2017

Thank you for the PR and for continuing on while things are quiet here.

Happy to bring this in once it passes CI if you're willing to help us maintain the package in general.

# uses fork to preserve the loaded environment.
assert not subprocess.call(['python', '-c',
                            'import pickle; (pickle.loads(' +
                            s.__str__() + '))()'])
@ogrisel (Contributor) commented on this snippet:

I am not sure all the characters in a pickle are safe to pass as an argument to a subprocess call. Maybe you should try to base64 encode the payload.
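
A minimal sketch of that suggestion (illustrative names, not the PR's actual test code; it assumes cloudpickle is importable in the child interpreter):

import base64
import subprocess

import cloudpickle

def example():
    return 42

s = cloudpickle.dumps(example)

# base64 keeps the payload safe to embed in a command line: the
# encoded form contains no quotes, newlines, or null bytes.
payload = base64.b64encode(s).decode('ascii')
assert not subprocess.call([
    'python', '-c',
    'import base64, pickle; '
    'pickle.loads(base64.b64decode("%s"))()' % payload,
])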

@benjimin (Contributor, author) commented:

OK, so this change to cloudpickle tries to detect whether a function being pickled might access, via attribute lookup, a module contained within a package. If so, it instructs the unpickler to re-import that module. (Cloudpickle already instructs it to re-import the top-level package, but that is not always sufficient to make the required child modules available.)
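
To illustrate the failure mode (a hypothetical reproduction, not code from this PR):

import xml.etree.ElementTree  # binds only the top-level name 'xml' here

def example():
    # The submodule is reached by attribute access, so the unpickling
    # process must have imported xml.etree.ElementTree itself;
    # re-importing the top-level 'xml' package alone leaves the
    # 'etree.ElementTree' attribute chain unset and this call fails.
    return xml.etree.ElementTree.Comment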

An alternative would simply be to document that all functions to be pickled must import dependent modules as top-level names in their namespace (i.e. ask the user never to write import foo.bar without aliasing it as, say, foobar). However, it is in the spirit of cloudpickle to be more magic than ordinary pickle (e.g. it does not ask the user to refactor out lambdas).

There is a test that the function can be unpickled and executed in a fresh new process (thanks @ogrisel for suggesting the encoding). There is also a pair of tests that use the original process, which is trickier because they must undo imports; it is a pair because cloudpickle handles globals and closures slightly differently.

Hope this is helpful.

@ogrisel (Contributor) left a review:

Besides my comments, this LGTM.

tokens = set(name[len(prefix):].split('.'))
if not tokens - set(code.co_names):
    self.save(module)       # ensure the unpickler executes import of this submodule
    self.write(pickle.POP)  # then discard the reference to it
@ogrisel (Contributor) commented:

Cosmetic: could you please move the comments onto their own lines (before the matching statements), so as to avoid lines longer than 80 columns?
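
The requested layout would look like this (illustrative):

# ensure the unpickler executes import of this submodule
self.save(module)
# then discard the reference to it
self.write(pickle.POP)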

@benjimin (Contributor, author) replied:

Note, this file does not adopt that convention elsewhere.


# deserialise
f = pickle.loads(s)
f() # perform test for error
@ogrisel (Contributor) commented:

I assume that your fix should also handle a case such as the following:

global etree
import xml.etree.ElementTree as etree
def example():
    x = etree.Comment
...

Maybe it would still be worth adding a test to make sure that this import pattern is also supported.
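
A sketch of such a test (hypothetical names; the actual test added to the PR may differ):

import pickle

import cloudpickle
import xml.etree.ElementTree as etree  # submodule bound to a plain global name

def example():
    return etree.Comment

def test_submodule_aliased_as_global():
    # Round-trip the function and check that the aliased submodule is
    # still reachable through the function's globals.
    f = pickle.loads(cloudpickle.dumps(example))
    assert f() is etree.Comment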

@benjimin (Contributor, author) replied:

Note, such a test would have passed even prior to this pull request.

@ogrisel (Contributor) replied:

Alright, but thanks for having added the test as a non-regression test anyway :)

Commit "per ogrisel": This test would already pass prior to the changes for supporting submodules.

Commit "per ogrisel"

@ogrisel merged commit 938fc0d into cloudpipe:master on Feb 24, 2017
@ogrisel (Contributor) commented Feb 24, 2017

Squash merged. Any volunteer for a bugfix release?

@@ -307,6 +328,8 @@ def save_function_tuple(self, func):
     save(_fill_function)  # skeleton function updater
     write(pickle.MARK)    # beginning of tuple that _fill_function expects

+    self._save_subimports(code, set(f_globals.values()) | set(closure))
A member later commented on this line:

This line creates a regression: #86
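
A plausible reproduction of that regression (hypothetical, inferred from the line itself: building a set over f_globals.values() requires every global value to be hashable):

import cloudpickle

unhashable_global = []  # lists are not hashable, so set() rejects them

def f():
    return unhashable_global

# Under the line above, this raised:
# TypeError: unhashable type: 'list'
cloudpickle.dumps(f)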

rgbkrk added a commit to rgbkrk/spark that referenced this pull request Jun 12, 2017
This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
@ahmadia commented Jul 9, 2017

I'm reviewing a new issue in the parallel-tutorial and I believe it may be related to this PR. The error I'm seeing is:

Traceback (most recent call last):
  File "prep.py", line 64, in <module>
    dask.compute(values)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/base.py", line 204, in compute
    results = get(dsk, keys, **kwargs)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/multiprocessing.py", line 177, in get
    raise_exception=reraise, **kwargs)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/compatibility.py", line 59, in reraise
    raise exc.with_traceback(tb)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/local.py", line 289, in execute_task
    task, data = loads(task_info)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 840, in subimport
    __import__(name)
ImportError: No module named '_pandasujson'
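
For context, the subimport helper in the traceback re-imports a recorded module name in the unpickling process. A sketch of its behaviour (assuming the implementation matches what the traceback shows):

import sys

def subimport(name):
    # Re-import the recorded (sub)module by name in the unpickling
    # process. If that name is not importable there -- e.g. the
    # dynamically registered '_pandasujson' above -- this raises
    # ImportError.
    __import__(name)
    return sys.modules[name]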

ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 22, 2017
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the currently released cloudpickle and to minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using the stock cloudpickle.

Some notable changes:
* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard __transient__ check (cloudpipe/cloudpickle#110)** -- while we don't use this internally and have no tests or documentation for its use, downstream code may use __transient__ even though it has never been part of the API; if we merge this we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.