[SPARK-21070][PYSPARK] Upgrade cloudpickle #18282
Conversation
This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with `__file__` attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
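For example, the recursive-closure fix above means a local function that refers to itself can now round-trip through cloudpickle. A minimal sketch, assuming cloudpickle 0.3+ is installed; the `make_fact` helper is made up for illustration:

```python
import pickle

import cloudpickle


def make_fact():
    # A recursive function defined inside a closure: before
    # cloudpipe/cloudpickle#89 and #90 this could not be pickled, because
    # the function's closure cell refers back to the function itself.
    def fact(n):
        return 1 if n <= 1 else n * fact(n - 1)

    return fact


# Serialize with cloudpickle; deserialize with plain pickle,
# mirroring how the worker side only needs the standard library.
payload = cloudpickle.dumps(make_fact())
fact = pickle.loads(payload)
assert fact(5) == 120
```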
This fails one test, looking into it now.
@HyukjinKwon thanks for bringing that up :) We had a PR to consider using py4j as a dependency rather than a copied zip file, but @rxin was against the idea. For now I think simply upgrading cloudpickle as-is makes sense, and we can revisit dependency management if the support burden of this approach becomes difficult (or as we start taking on additional Python libraries).
Yea, my only worry is that I think we already have some customized code in Spark's copy, and I wonder if it will be difficult to resolve the conflicts here (IIRC, it took me a while to check each conflict before). I would be +1 if those changes are easy to carry over. I will do my best to help review. cc @davies too.
I am not sure if this affects PySpark, but any function object serialized with cloudpickle 0.2.2 or earlier cannot be deserialized with 0.3.0 or greater. I haven't investigated making this compatible because I am not sure it is needed, but I wanted to make sure you are aware.
Thanks for that point @llllllllll. Since cloudpickle isn't used for any data that is persisted between different versions of Spark, we don't need the format to be backwards compatible.
Jenkins, ok to test. Jenkins add to whitelist. |
Test build #78102 has finished for PR 18282 at commit
Note: I don't think I'll have time to finish this PR for a few weeks so this is freely available for someone else to take on. |
@rgbkrk Shall we close this and reopen when it's ready? |
Works for me, sorry I don't have cycles at the moment. |
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the current released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using stock cloudpickle. Some notable changes:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with `__file__` attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard `__transient__` check (cloudpipe/cloudpickle#110)** -- while we don't use this internally and have no tests or documentation for its use, downstream code may use `__transient__`; although it has never been part of the API, if we merge this we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests, plus the unit tests from the cloudpickle project run on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.
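One of the features in the list above, pickling dynamically created modules (cloudpipe/cloudpickle#52), can be sketched as follows. This assumes cloudpickle is installed; the module name `dyn_mod` and the `add` function are made up for illustration:

```python
import pickle
import types

import cloudpickle

# Build a module at runtime; it has no source file on disk, so the
# standard pickle module cannot serialize it by reference.
mod = types.ModuleType("dyn_mod")
exec("def add(a, b):\n    return a + b", mod.__dict__)

# cloudpickle serializes the dynamic module's contents by value,
# so the round trip works even where the module was never importable.
restored = pickle.loads(cloudpickle.dumps(mod))
assert restored.add(2, 3) == 5
```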
## What changes were proposed in this pull request?

This brings in fixes and upgrades from the cloudpickle module. Notable fixes:

* Assume modules with `__file__` attribute are not dynamic (cloudpipe/cloudpickle#85)

## How was this patch tested?

The test suite for cloudpickle tests the internal implementation. There are integration tests here in PySpark that use the `CloudPickleSerializer`.
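PySpark's `CloudPickleSerializer` essentially serializes with cloudpickle (so lambdas and closures survive) and deserializes with the standard `pickle` module on the worker side. A minimal sketch of the round trip those integration tests exercise, assuming cloudpickle is installed; the class below is a simplified stand-in, not PySpark's actual implementation:

```python
import pickle

import cloudpickle


class CloudPickleSerializerSketch:
    """Hypothetical stand-in for PySpark's CloudPickleSerializer:
    dumps with cloudpickle, loads with the standard pickle module."""

    def dumps(self, obj):
        return cloudpickle.dumps(obj)

    def loads(self, data):
        return pickle.loads(data)


ser = CloudPickleSerializerSketch()
# A lambda is unpicklable with stock pickle but round-trips here.
add_one = ser.loads(ser.dumps(lambda x: x + 1))
assert add_one(41) == 42
```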