serialization of file-like objects #830
Comments
I don't think serializing streams is even theoretically possible in general. Or rather, where it is possible, it is the business of the file-like object itself to support Python's pickle protocol, serializing its internal stream state somehow. But open to ideas, CC @mpenkov :)
I would suggest reading into a tempfile (or shared_memory if the file size allows), and sharing the filename/memory pointer across processes.
Good points, to be sure. I'm not proposing storage of the bytes so much as passing around the file-like objects as references (perhaps keeping seek information, but not even necessarily). This would enable the things opened and then potentially passed to xarray to be moved between machines inside of Dask/Spark/etc. clusters nicely. Obviously this wouldn't work for disk-local file access, but for cloud providers, things online, etc., serializing the appropriate configs should be sufficient to realize the file-like objects on the other side, to then seek into and read byte ranges or what have you.
You could try serialising with dill. AFAIK Dask uses/used it. Maybe you can adopt it in xarray?
For sure, dill can solve the issue in some instances, but having the file-like objects support the pickle protocol themselves would look something like:

```python
def __getstate__(self):
    # Called when pickling
    return {'url': self.url, 'position': self._position}

def __setstate__(self, state):
    # Called when unpickling
    self.__init__(state['url'])
    self.seek(state['position'])
```
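Fleshing that pattern out, here is a minimal self-contained sketch. The class name, the `url` attribute, and the in-memory byte source are all hypothetical stand-ins for whatever real object would adopt this; the point is that pickling captures only a reference plus the seek position, never the bytes:

```python
import io
import pickle

class PickleableReader:
    """Hypothetical sketch: serialize a reference (url) plus the seek
    position, not the underlying bytes."""

    def __init__(self, url):
        self.url = url
        # Stand-in for re-opening the resource named by `url`.
        self._buf = io.BytesIO(b"pretend these bytes came from url")
        self._position = 0

    def read(self, n=-1):
        self._buf.seek(self._position)
        data = self._buf.read(n)
        self._position = self._buf.tell()
        return data

    def seek(self, position):
        self._position = position
        return position

    def __getstate__(self):
        # Called when pickling: keep only the reference and position.
        return {'url': self.url, 'position': self._position}

    def __setstate__(self, state):
        # Called when unpickling: re-open from the reference, then seek.
        self.__init__(state['url'])
        self.seek(state['position'])

reader = PickleableReader("s3://bucket/key")
reader.read(8)
clone = pickle.loads(pickle.dumps(reader))
assert clone.read() == reader.read()  # both resume from position 8
```

On the receiving side the clone re-opens the resource from its reference and seeks to the saved position, which is the behaviour the xarray/Dask scenario above needs.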
Problem description
I'd be curious to get opinions on whether serialization/deserialization should be supported for the file-like objects at the core of this library. This would be useful for distributed processing workflows that pass around either the file-like objects themselves or objects constructed from them, which is the case for xarray, the use case I'm specifically interested in. Obviously, if xarray datasets are hanging onto file-like objects that are not serializable, they are then not serializable themselves.
Steps/code to reproduce the problem
The above throws
NotImplementedError: object proxy must define __reduce_ex__()
This one throws
TypeError: cannot pickle '_io.BufferedReader' object
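The original snippets were not captured on this page, so the following is only a sketch of the second failure, under the assumption that it comes from pickling the ordinary buffered reader you get when opening a plain local path:

```python
import pickle
import tempfile

# Assumed reproduction: a plain local open() yields an
# _io.BufferedReader, which the pickle module rejects outright.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    name = tmp.name

fobj = open(name, "rb")
try:
    pickle.dumps(fobj)
except TypeError as err:
    print(err)  # cannot pickle '_io.BufferedReader' object
finally:
    fobj.close()
```

The first error (`object proxy must define __reduce_ex__()`) suggests a wrapped/proxied object was being pickled instead, but without the original snippet that path is left as-is.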
Versions
macOS-14.4.1-arm64-arm-64bit
Python 3.11.9 (main, May 22 2024, 12:34:58) [Clang 15.0.0 (clang-1500.3.9.4)]
smart_open 7.0.4