Remove non-standard `__transient__` support #110
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #110      +/-   ##
==========================================
+ Coverage   81.34%   83.12%    +1.77%
==========================================
  Files           2        2
  Lines         563      551       -12
  Branches      113      107        -6
==========================================
  Hits          458      458
+ Misses         75       65       -10
+ Partials       30       28        -2
```

Continue to review full report at Codecov.
awesome analysis btw @Peque
I have no objection to removing support for this non-standard protocol
Anyone mind if I give this a quick spin in Spark to see if we run into any issues first? We don't directly use the library, but I'm eternally hopeful I'll find a way to migrate us to using the common cloudpickle.
Happy to wait to see.
@holdenk unrelated to this particular pull request, but I'm curious: do you know if this is in any way related to the use of regular Python pickle + my pyrolite library (which provides a Java implementation of the pickle protocol) in Spark?
@irmen I'm not sure I'm resolving the "this" in "do you know if this is in any way related to the use of regular Python pickle" correctly - what are you referencing?
@holdenk not really sure myself, tbh... I'm not familiar with [py]Spark. I guess I'm looking for the reason cloudpickle is used in Spark when pickle (+ pyrolite for Java connectivity) is used as well. Maybe I'm confusing PySpark and Spark itself.
So from memory the... Of course, this is based off of fuzzy memories of long-ago discussions, so don't hold me to being 100% accurate, and before you make any decisions on this, reach out and we can sort out the real reasons (or a close enough approximation for whatever you need).
I got a little curious about whether it would be helpful if PySpark and Spark (in Scala) used pickles as the interchange format, using pyrolite on the JVM side.
So at the end of the day I think we're going to start seeing different formats being used for interchange in Spark (like Arrow), so I think the best path forward is seeing what happens there. I'm also going to perhaps take this opportunity to resurrect @rgbkrk's work on getting Spark onto a current version of cloudpickle, since I need to do some of that to test this PR w/ Spark anyways.
@rgbkrk as far as I know, Spark already uses pickle + pyrolite.
Oh silly me. |
So I'm seeing some failures with this in Spark, but my build was in a bit of a bad state, so I'm going to kick off a fully clean build and see where this takes us. Sorry for my slowness at looking at this on the Spark side; I've been running around a lot the past few days.
no worries
LGTM, there are some outstanding issues but they aren't impacted by this PR.
Thanks for checking!
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the currently released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using stock cloudpickle. Some notable changes:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with a `__file__` attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptors (cloudpipe/cloudpickle#46)
* No more pickling of closed files, which was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard `__transient__` check (cloudpipe/cloudpickle#110)** -- while we don't use this internally, and have no tests or documentation for its use, downstream code may use `__transient__` even though it has never been part of the API. If we merge this, we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.
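For context on why Spark bundles cloudpickle at all: the standard library's `pickle` serializes functions by reference (module plus qualified name), so lambdas and closures that are not importable by name cannot be pickled, while cloudpickle serializes their code and captured state. A minimal stdlib-only sketch of the gap:

```python
import pickle

# Stock pickle serializes functions by reference (module + qualified name),
# so a lambda, whose qualname "<lambda>" is not importable, cannot be pickled.
# This is the gap cloudpickle fills when Spark ships closures to executors.
add5 = lambda x: x + 5

try:
    pickle.dumps(add5)
except Exception as exc:  # PicklingError (or AttributeError for local functions)
    print("stock pickle failed:", type(exc).__name__)
```

`cloudpickle.dumps(add5)` would succeed here, producing a byte stream that plain `pickle.loads` can restore on the receiving side.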
The `__transient__` dunder attribute is not the standard way to prevent attributes from being pickled. Instead, the standard approach is to use the `__getstate__` and `__setstate__` magic methods.

Considering that cloudpickle has no tests or documentation for the `__transient__` attribute (see "`__transient__` attribute. What is it? Could it be thrown away?" #108), I propose to remove any support for the non-standard approach.
As mentioned in #108, some other projects might be using this attribute. But when looking at those projects, they either vendor a copy of the `cloudpickle.py` file (hence the match when searching for `__transient__`), or they implement `__transient__` support themselves without depending on `cloudpickle` as an external module dependency.

I think that even though this change may break others' code, it is an unlikely scenario. Anyway, if that were the case, I think they should fix their code rather than have `cloudpickle` carry ugly fixes. Also, they can always choose to use an older `cloudpickle` version from PyPI.

Fixes #108.
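To illustrate the behavior change downstream code might notice: standard pickle (and cloudpickle after this change) simply ignores a `__transient__` marker, so attributes listed there are serialized like any other. The `Cache` class below is hypothetical:

```python
import pickle

class Cache:
    # Non-standard marker that old cloudpickle honored; standard pickle
    # (and cloudpickle after this PR) ignores it entirely.
    __transient__ = ("scratch",)

    def __init__(self):
        self.data = [1, 2, 3]
        self.scratch = "recomputable"

restored = pickle.loads(pickle.dumps(Cache()))
# 'scratch' survives the round trip despite __transient__; code relying on
# the old cloudpickle behavior must migrate to __getstate__/__setstate__.
print("scratch" in restored.__dict__)  # -> True
```

This is the compatibility note the PR description suggests calling out in the release notes.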