-
-
Notifications
You must be signed in to change notification settings - Fork 727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialization with "msgpack"
doesn't preserve list
s
#3716
Comments
It's worth noting msgpack itself by default will always return a In [1]: import msgpack
In [2]: t = (0, 1, 2)
In [3]: msgpack.loads(msgpack.dumps(t))
Out[3]: [0, 1, 2]
In [4]: l = [0, 1, 2]
In [5]: msgpack.loads(msgpack.dumps(l))
Out[5]: [0, 1, 2] That said, we actually force it to return a In [6]: msgpack.loads(msgpack.dumps(l), use_list=False)
Out[6]: (0, 1, 2) This choice goes back to PR ( #2000 ). Though it seems we knew this could present a problem ( #2000 (comment) ). Not sure if the reasoning behind that change still holds today or IOW what our constraints are now. Edit: Also related is this upstream discussion ( msgpack/msgpack-python#98 ). |
It would be nice if this was resolved and there were no auto conversions, it also bit us when we were implementing the scheduler in Rust. |
Yeah I'm not sure how to do this as MessagePack doesn't have such a distinction. Much like JSON it just has arrays. Am curious how this affected Rust. Are there |
I know that the upstream messagepack Python crate makes handling this difficult. In general, in our Rust implementation we somehow had to make sure that we were generating the same messages as the rest of Dask (clients/workers) assumed, sometimes the code was implicitly relying on assumptions like a specific thing is a list/tuple or a specific message is wrapped in a list of size one etc. It was not problematic to implement these things, but it was error-prone to find them. So for me personally, consistency and documented behaviour would be the most important thing here. We don't really care if it returns lists or tuples, but it's annoying when you send a list with 5 elements and it returns a list and then you send a list with 6 elements and it returns a tuple or something akin to that. It would be nice if it always returned a list or always returned a tuple, although I'm not sure if that can be handled in general, because small/non-serialized objects are left as-is AFAIK. To be honest, I don't remember what specifically was the problem regarding tuples, but we had to handle it in our protocol changes (maybe it only needed to be handled because of Dask unit tests or the rest of Dask code). I'll try to poke in our protocol changes to see if something breaks when I change tuples/lists in the encoding. Although it seems that this is a problem on the receiver side, because msgpack-python simply deserializes all flat iterables to a list OR all to a tuple? |
Yeah that makes sense. Agree it is error-prone currently and it would be good to get away from that. Maybe we can come up with something using import msgpack
def default(obj):
if isinstance(obj, tuple):
return msgpack.ExtType(10, msgpack.packb(list(obj)))
else:
raise TypeError("Unknown type: %s" % repr(type(obj)))
def ext_hook(code, data):
if code == 10:
return tuple(msgpack.unpackb(data))
else:
return msgpack.ExtType(code, data)
data = (2, 3)
packed = msgpack.packb(data, default=default, strict_types=True, use_bin_type=True)
unpacked = msgpack.unpackb(packed, ext_hook=ext_hook)
print((data, unpacked))
data = [2, 3]
packed = msgpack.packb(data, default=default, strict_types=True, use_bin_type=True)
unpacked = msgpack.unpackb(packed, ext_hook=ext_hook)
print((data, unpacked)) |
This looks reasonable to me, if it doesn't add too much complexity (I don't know on how many places it needs to be changed). |
Actually had a better idea so that we don't have to call import msgpack
def encode_tuple(obj):
if isinstance(obj, tuple):
obj = ["__tuple__", *obj]
return obj
def decode_tuple(obj):
if obj[0] == "__tuple__":
obj = tuple(obj[1:])
return obj
data = (2, 3)
packed = msgpack.packb(data, default=encode_tuple, strict_types=True, use_bin_type=True)
unpacked = msgpack.unpackb(packed, list_hook=decode_tuple)
print((data, unpacked))
data = [2, 3]
packed = msgpack.packb(data, default=encode_tuple, strict_types=True, use_bin_type=True)
unpacked = msgpack.unpackb(packed, list_hook=decode_tuple)
print((data, unpacked)) |
Addressing with PR ( #4575 ) |
cc @jrbourbeau @madsbk (also please see comment above) |
Currently if a
list
is serialized (and is small enough), it will be handled correctly.However for larger
list
s, this breaks down and atuple
is returned instead.The text was updated successfully, but these errors were encountered: