[SPARK-21070][PYSPARK] Attempt to update cloudpickle again #18734
Conversation
This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with `__file__` attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
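As an illustration of the "pickle dynamically created modules" item (cloudpipe/cloudpickle#52): a module built at runtime has no importable source file, so it has to be serialized by value. The sketch below shows the idea with made-up helper names; it is not cloudpickle's actual implementation, just the capture-and-rebuild pattern.

```python
import types

# Build a module at runtime; the default pickler cannot reference it
# by name, since it was never imported from a file.
mod = types.ModuleType("dynmod")
mod.answer = 42

def reduce_dynamic_module(m):
    # Capture the module's name and its non-dunder attributes.
    state = {k: v for k, v in vars(m).items() if not k.startswith("__")}
    return m.__name__, state

def rebuild_dynamic_module(name, state):
    # Recreate a fresh module object and restore its attributes.
    m = types.ModuleType(name)
    vars(m).update(state)
    return m

name, state = reduce_dynamic_module(mod)
clone = rebuild_dynamic_module(name, state)
```

In the real library, the `state` dict itself is pickled recursively, so the module's functions and values travel along with it.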
…ent__ support from PR#110
…__transient__ support from PR#110" This reverts commit 1cfd38f.
…ful cloudpickle fixes. Revert "Revert "Copy over support work weakset, dynamic classes, and remove __transient__ support from PR#110"" This reverts commit cff6bfb.
Test build #79947 has finished for PR 18734 at commit
cc @rgbkrk since this is based on your original PR. + @HyukjinKwon
You are my hero, thank you.
I guess this should be fine in general if it simply ports the changes in cloudpickle. Will double check next week (to me, it takes a while to double check). @rgbkrk, would you mind double checking this one and seeing if it looks good to you?
Yeah, this is looking great; it also brings in the latest from cloudpickle that wasn't in my original patch.
@holdenk, LGTM with a few questions.
What I checked were a few custom changes after 04e44b3, assuming these were started from both cloudpipe/cloudpickle@c4f8851 and cloudpipe/cloudpickle@7aebb7e.
I checked e044705, 5520418, ee913e6, d489354, dbfc7aa, 20e6280 and 6297697, which I assume we added after that.
python/pyspark/cloudpickle.py
modname = pickle.whichmodule(obj, name)
except Exception:
    modname = None
modname = pickle.whichmodule(obj, name)
@holdenk, would it be safe to remove? It looks like a custom fix we did to deal with https://issues.apache.org/jira/browse/SPARK-16077, and it sounds like we should keep this, as https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling describes for Python 2.7.
It's a good question: the underlying issue was marked resolved in 2014, and from looking at the commit it seems like it should actually be resolved. That being said, it's true some people might be on a system with an old version of six installed, so perhaps we should keep this workaround.
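For context, the workaround under discussion amounts to a defensive wrapper like the sketch below (`safe_whichmodule` is an illustrative name; the real code inlines the try/except): any exception from `pickle.whichmodule`, such as the six-induced breakage tracked in SPARK-16077, is treated as "module unknown", which makes cloudpickle fall back to pickling the function by value.

```python
import pickle

def safe_whichmodule(obj, name):
    # Look up which module an object lives in, but never let a broken
    # module (e.g. old versions of six that patch sys.modules) crash
    # serialization: on any error, report the module as unknown.
    try:
        return pickle.whichmodule(obj, name)
    except Exception:
        return None
```

Returning `None` is the conservative choice here: it only costs a by-value pickle of the function, whereas a raised exception would abort the whole job.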
def inject_numpy(self):
I assume removing this is fine per cloudpipe/cloudpickle@1e91fa7.
d.pop('__doc__', None)
# handle property and staticmethod
dd = {}
for k, v in d.items():
I assume we are fine per the tests added in e044705.
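For context, the snippet above is where a dynamically defined class's dict gets prepared for pickling: descriptors like `property` and `staticmethod` are not directly picklable, so they are unwrapped into plain functions and rebuilt on load. A rough, self-contained sketch of that unwrapping (an illustration of the pattern, not cloudpickle's exact code):

```python
class Demo:
    """Hypothetical class with the two descriptor kinds handled above."""

    @staticmethod
    def twice(x):
        return 2 * x

    @property
    def name(self):
        return "demo"

d = dict(Demo.__dict__)
d.pop('__doc__', None)

# Replace unpicklable descriptors with tagged, picklable pieces;
# the loader side would reverse this by re-wrapping the functions.
dd = {}
for k, v in d.items():
    if isinstance(v, staticmethod):
        dd[k] = ('staticmethod', v.__func__)
    elif isinstance(v, property):
        dd[k] = ('property', v.fget, v.fset, v.fdel)
    else:
        dd[k] = v
```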
Let's double check this part with @davies.
Gentle re-ping to @davies - do you have an opinion on this?
BTW, I also checked that it passes the tests with Python 3.6 locally.
If we can reach agreement on this, I'll see about trying to get our local workarounds upstreamed into cloudpickle.
Workarounds welcomed on cloudpickle. 😄
Thanks for doing this @holdenk, LGTM!
Just a note that we just shipped the fixes from @HyukjinKwon within cloudpickle (as v0.4.0), so we're at least roughly in sync now. 😄
Huzzah! I'm in the middle of getting some code working for a talk tomorrow, so I'll circle back on this on Friday. If @davies has any opinions, though, it would be great to hear them.
cc @ueshin who I believe is also appropriate to review this.
Thank you for cc-ing me.
cloudpickle now has this workaround, so it can stay.
# if func is lambda, def'ed at prompt, is in main, or is nested, then
# we'll pickle the actual function object rather than simply saving a
# reference (as is done in default pickler), via save_function_tuple.
if islambda(obj) or obj.__code__.co_filename == '<stdin>' or themodule is None:
    #print("save global", islambda(obj), obj.__code__.co_filename, modname, themodule)
if (islambda(obj)
Just as a side note, it looks like this PR includes cloudpipe/cloudpickle#51 too (SPARK-21753).
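As background for the guard in the snippet above: the default pickler can only save functions by reference (module name plus attribute name), and that lookup always fails for lambdas, which is why cloudpickle pickles them by value via save_function_tuple. A quick stdlib-only demonstration of the failure mode:

```python
import pickle

# A lambda's __qualname__ is '<lambda>', which can never be found as an
# attribute of its module, so the default by-reference save must fail.
square = lambda x: x * x

try:
    pickle.dumps(square)
    pickled_by_reference = True
except Exception:
    pickled_by_reference = False
```

A named, module-level function would serialize fine here; it is exactly the lambda / REPL / nested cases listed in the comment above that force the by-value path.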
Test build #80826 has finished for PR 18734 at commit
Test build #80828 has finished for PR 18734 at commit
retest this please
Test build #80842 has finished for PR 18734 at commit
Let's go with this. I am quite sure it is worth a go. We can almost sync it, and we can deduplicate review costs, etc. I will merge this one in a few days, with another double check, if there is no objection here and it is not merged by anyone else.
Sounds good @HyukjinKwon :) I think we should consider whether we want to upstream the named tuple workaround as well.
But for now, yes, I think we are in a good position. If no one objects, of course :)
Yea, it looks so. The named tuple one reminds me of the workaround we have to make namedtuple picklable: spark/python/pyspark/serializers.py, Lines 395 to 446 in d03aebb.
Maybe we could take a look and see if we could get rid of this or port it. Anyway, let me take a final look.
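The namedtuple workaround mentioned above boils down to rebuilding the class from its name and field list at load time, since a namedtuple class created at runtime cannot be found by the default pickler. A minimal sketch of the idea (the helper names are illustrative, not Spark's actual serializers.py code):

```python
from collections import namedtuple

def _restore(name, fields, values):
    # Recreate the namedtuple class from its description, then the instance.
    return namedtuple(name, fields)(*values)

def _reduce_namedtuple(obj):
    # Describe the instance as (class name, field names, field values),
    # all of which are plain picklable data.
    return _restore, (type(obj).__name__, obj._fields, tuple(obj))

Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)

func, args = _reduce_namedtuple(p)
restored = func(*args)
```

Spark's real workaround patches `collections.namedtuple` itself so that every class it produces carries a `__reduce__` along these lines.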
I am merging this because cloudpickle looks initially ported from cloudpipe/cloudpickle@7aebb7e and cloudpipe/cloudpickle@c4f8851 (-> 04e44b3), where I see both are identical. After 04e44b3, we have these diffs: e044705, 5520418, ee913e6, d489354, dbfc7aa, 20e6280 and 6297697.

* [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in main, e044705: I think this part is the only one we are worried about. It looks like it supports spark/python/pyspark/sql/tests.py Lines 141 to 173 in 9660831 and spark/python/pyspark/sql/tests.py Lines 898 to 927 in 9660831.
* [SPARK-10542] [PYSPARK] fix serialize namedtuple, 5520418: We keep the changes, and the related test passes: Lines 211 to 219 in 77cc0d6.
* [SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunctionSerializer.loads, ee913e6: We keep this change, and the related test passes: Lines 233 to 237 in 77cc0d6. We should probably port this one to cloudpickle.
* [SPARK-16077] [PYSPARK] catch the exception from pickle.whichmodule(), d489354: We keep this change. This patch should be even safer, as @rgbkrk and I verified it with some tests.
* [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python, dbfc7aa: We keep this change. Probably, we should port this change into cloudpickle.
* [SPARK-19019] [PYTHON] Fix hijacked collections.namedtuple, 20e6280: This change was ported from cloudpickle.
* [SPARK-19505] [PYTHON] AttributeError on Exception.message in Python3, 6297697: We keep this change.
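On the SPARK-19505 item: `Exception.message` exists only in Python 2, so any code that reads it must fall back on Python 3. A minimal sketch of the compatible pattern (this illustrates the issue, not the exact Spark patch):

```python
try:
    raise ValueError("boom")
except ValueError as e:
    # Exception.message was removed in Python 3; prefer it when present
    # (Python 2), otherwise fall back to the string form of the error.
    msg = getattr(e, "message", str(e))
```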
retest this please
Test build #80950 has finished for PR 18734 at commit
LGTM. I am going to credit this to @rgbkrk per http://spark.apache.org/contributing.html.
I just double checked that the tests pass with Python 3.6.0, and that I can run the pi example with PyPy manually (SPARK-21753), with the current status.
Merged to master.
Can you guys please exclude me from the loop?
-
Best,
Daniel Biesiada
… On 22 Aug 2017, at 04:22, asfgit wrote:
Closed #18734 via 751f513.
Oh, it was me cc'ing you mistakenly. Sorry for the noise.
What changes were proposed in this pull request?
Based on #18282 by @rgbkrk, this PR attempts to update to the currently released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using stock cloudpickle.
Some notable changes:
How was this patch tested?
Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.