Remove non-standard `__transient__` support #110
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #110      +/-   ##
==========================================
+ Coverage   81.34%   83.12%    +1.77%
==========================================
  Files           2        2
  Lines         563      551       -12
  Branches      113      107        -6
==========================================
  Hits          458      458
+ Misses         75       65       -10
+ Partials       30       28        -2
```

Continue to review full report at Codecov.
awesome analysis btw @Peque
I have no objection to removing support for this non-standard protocol
Anyone mind if I give this a quick spin in Spark to see if we run into any issues first? We don't directly use the library, but I'm eternally hopeful I'll find a way to migrate us to using the common cloudpickle.
Happy to wait to see.
@holdenk unrelated to this particular pull request, but I'm curious: do you know if this is in any way related to the use of regular Python pickle + my pyrolite library (which provides a Java implementation of the pickle protocol) in Spark?
@irmen I'm not sure I'm resolving the "this" in "do you know if this is in any way related to the use of regular Python pickle" correctly - what are you referencing?
@holdenk not really sure myself, tbh... I'm not familiar with [py]Spark. I guess I'm looking for the reason cloudpickle is used in Spark when pickle (+ pyrolite for Java connectivity) is used as well. Maybe I'm confusing PySpark and Spark itself.
So from memory the... Of course, this is based off of fuzzy memories of long-ago discussions, so don't hold me to being 100% accurate, and before you make any decisions on this, reach out and we can sort out the real reasons (or a close enough approximation for whatever you need).
I got a little curious about whether it would be helpful if PySpark and Spark (in Scala) used pickles as the interchange format, using pyrolite on the JVM side.
So at the end of the day I think we're going to start seeing different formats being used for interchange in Spark (like Arrow), so I think the best path forward is seeing what happens there. I'm also going to perhaps take this opportunity to resurrect @rgbkrk's work on getting Spark onto a current version of cloudpickle, since I need to do some of that to test this PR w/ Spark anyways.
@rgbkrk as far as I know, Spark already uses pickle + pyrolite.
Oh silly me. |
So I'm seeing some failures with this in Spark, but my build was in a bit of a bad state, so I'm going to kick off a fully clean build and see where this takes us. Sorry for my slowness at looking at this on the Spark side; I've been running around a lot the past few days.
no worries
LGTM, there are some outstanding issues but they aren't impacted by this PR.
Thanks for checking!
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the currently released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using stock cloudpickle. Some notable changes:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with a `__file__` attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptors (cloudpipe/cloudpickle#46)
* No more pickling of closed files, which was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard `__transient__` check (cloudpipe/cloudpickle#110)** -- while we don't use this internally, and have no tests or documentation for its use, downstream code may use `__transient__` even though it has never been part of the API. If we merge this, we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.
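For context on why Spark bundles cloudpickle at all: the standard library's `pickle` serializes functions by reference (module plus qualified name), so lambdas and closures that are not importable by name cannot be pickled, while cloudpickle serializes their code and captured state. A minimal stdlib-only sketch of the gap:

```python
import pickle

# Stock pickle serializes functions by reference (module + qualified name),
# so a lambda, whose qualname "<lambda>" is not importable, cannot be pickled.
# This is the gap cloudpickle fills when Spark ships closures to executors.
add5 = lambda x: x + 5

try:
    pickle.dumps(add5)
except Exception as exc:  # PicklingError (or AttributeError for local functions)
    print("stock pickle failed:", type(exc).__name__)
```

`cloudpickle.dumps(add5)` would succeed here, producing a byte stream that plain `pickle.loads` can restore on the receiving side.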
The `__transient__` dunder attribute is not the standard way to prevent attributes from being pickled. Instead, the standard approach is to use the `__getstate__` and `__setstate__` magic methods.

Considering that cloudpickle has no tests or documentation for the `__transient__` attribute (see "`__transient__` attribute. What is it? Could it be thrown away?" #108), I propose to remove any support for the non-standard approach.
As mentioned in #108, some other projects might be using this attribute. But when looking at those projects, they either vendor a copy of the `cloudpickle.py` file (hence the match when searching for `__transient__`), or they implement `__transient__` support themselves without depending on `cloudpickle` as an external module dependency.

I think that even though this change may break others' code, it is an unlikely scenario. Anyway, if that were the case, I think they should fix their code rather than have `cloudpickle` carry ugly fixes. Also, they can always choose to use an older `cloudpickle` version from PyPI.

Fixes #108.
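To illustrate the behavior change downstream code might notice: standard pickle (and cloudpickle after this change) simply ignores a `__transient__` marker, so attributes listed there are serialized like any other. The `Cache` class below is hypothetical:

```python
import pickle

class Cache:
    # Non-standard marker that old cloudpickle honored; standard pickle
    # (and cloudpickle after this PR) ignores it entirely.
    __transient__ = ("scratch",)

    def __init__(self):
        self.data = [1, 2, 3]
        self.scratch = "recomputable"

restored = pickle.loads(pickle.dumps(Cache()))
# 'scratch' survives the round trip despite __transient__; code relying on
# the old cloudpickle behavior must migrate to __getstate__/__setstate__.
print("scratch" in restored.__dict__)  # -> True
```

This is the compatibility note the PR description suggests calling out in the release notes.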