-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Support out-of-band pickling (protocol 5) #34244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks, looks interesting. At a glance, it looks like we're successfully using pickle5 protocol when pickling underlying ndarrays. import pandas as pd
import numpy as np
import pickle
import pickletools
a = np.arange(4)
b = pd.Series(a)
pickletools.dis(pickletools.optimize(pickle.dumps(a, protocol=5)))
pickletools.dis(pickletools.optimize(pickle.dumps(b, protocol=5))) So the primary work to do here are
|
Good point! Yeah if the objects Pandas uses for data storage already support pickle protocol 5 then it should just work. NumPy arrays are a good example (since they already support pickle protocol 5). Not sure what other objects might be used. Certainly testing would help build confidence :) My guess is size of objects shouldn't matter unless Pandas does something different with data representation of large objects. |
The other potentially large objects would be extension arrays (Categorical, etc.). All of pandas' extension arrays do consistent of one or more NumPy ndarrays. |
Ok, so this may already just work then. FWIW this seems to be the case with In [1]: import pickle
In [2]: import numpy
In [3]: import pandas
In [4]: d = pandas.DataFrame({"a": [1, 2, 3], "b": [0.5, 0.2, 0.3]})
In [5]: f = []
...: h = pickle.dumps(d, protocol=5, buffer_callback=f.append)
In [6]: [numpy.asarray(e) for e in f]
Out[6]: [array([[0.5, 0.2, 0.3]]), array([[1, 2, 3]])] Where are the current pickle tests? |
Should all be in |
One other observation is if a column is represented with many small NumPy arrays, this will be true of the pickled form as well. During unpickling would Pandas keep the small NumPy arrays or would it consolidate them into a single one? |
Is your feature request related to a problem?
It would be nice if Pandas objects supported pickle's protocol 5 for out-of-band serialization. This would allow the underlying data to be captured in
PickleBuffer
s (specializedmemoryview
). For libraries using pickle's protocol 5 to transmit data over the wire, this would allow for zero-copy data transmission.Describe the solution you'd like
Pandas objects implement
__reduce_ex__
and if theprotocol
argument is5
or greater, they constructPickleBuffer
s out of any data arguments.API breaking implications
NA as it should be possible to fallback to existing behavior for older pickle protocols. Users have to actively opt-in at a higher level API (through pickle) to see any effect.
Describe alternatives you've considered
NA
Additional context
This would be useful in libraries that support distributed dataframes ;)
The text was updated successfully, but these errors were encountered: