Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialization with "msgpack" doesn't preserve lists #3716

Open
jakirkham opened this issue Apr 16, 2020 · 9 comments
Open

Serialization with "msgpack" doesn't preserve lists #3716

jakirkham opened this issue Apr 16, 2020 · 9 comments

Comments

@jakirkham
Copy link
Member

Currently if a list is serialized (and is small enough), it will be handled correctly.

In [1]: from distributed.protocol import serialize, deserialize                 

In [2]: t = (0, 1, 2)                                                           

In [3]: deserialize(*serialize(t, serializers=["msgpack"]))                     
Out[3]: (0, 1, 2)

In [4]: l = [0, 1, 2]                                                           

In [5]: deserialize(*serialize(l, serializers=["msgpack"]))                     
Out[5]: [0, 1, 2]

However for larger lists, this breaks down and a tuple is returned instead.

In [1]: from distributed.protocol import serialize, deserialize                 

In [2]: t = (0, 1, 2, 3, 4, 5, 6)                                               

In [3]: deserialize(*serialize(t, serializers=["msgpack"]))                     
Out[3]: (0, 1, 2, 3, 4, 5, 6)

In [4]: l = [0, 1, 2, 3, 4, 5, 6]                                               

In [5]: deserialize(*serialize(l, serializers=["msgpack"]))                     
Out[5]: (0, 1, 2, 3, 4, 5, 6)
@jakirkham
Copy link
Member Author

jakirkham commented Apr 16, 2020

It's worth noting msgpack itself by default will always return a list.

In [1]: import msgpack                                                          

In [2]: t = (0, 1, 2)                                                           

In [3]: msgpack.loads(msgpack.dumps(t))                                         
Out[3]: [0, 1, 2]

In [4]: l = [0, 1, 2]                                                           

In [5]: msgpack.loads(msgpack.dumps(l))                                         
Out[5]: [0, 1, 2]

That said, we actually force it to return a tuple, which presents its own set of issues.

In [6]: msgpack.loads(msgpack.dumps(l), use_list=False)                         
Out[6]: (0, 1, 2)

This choice goes back to PR ( #2000 ). Though it seems we knew this could present a problem ( #2000 (comment) ). Not sure if the reasoning behind that change still holds today or IOW what our constraints are now.

Edit: Also related is this upstream discussion ( msgpack/msgpack-python#98 ).

@Kobzol
Copy link
Contributor

Kobzol commented May 18, 2020

It would be nice if this was resolved and there were no auto conversions, it also bit us when we were implementing the scheduler in Rust.

@jakirkham
Copy link
Member Author

Yeah I'm not sure how to do this as MessagePack doesn't have such a distinction. Much like JSON it just has arrays.

Am curious how this affected Rust. Are there tuple/list equivalents that need this distinction? Or does this have something to do with how Dask is encoding data? Or is there some other issue?

@Kobzol
Copy link
Contributor

Kobzol commented May 18, 2020

I know that the upstream messagepack Python crate makes handling this difficult. In general, in our Rust implementation we somehow had to make sure that we were generating the same messages as the rest of Dask (clients/workers) assumed, sometimes the code was implicitly relying on assumptions like a specific thing is a list/tuple or a specific message is wrapped in a list of size one etc. It was not problematic to implement these things, but it was error-prone to find them.

So for me personally, consistency and documented behaviour would be the most important thing here. We don't really care if it returns lists or tuples, but it's annoying when you send a list with 5 elements and it returns a list and then you send a list with 6 elements and it returns a tuple or something akin to that. It would be nice if it always returned a list or always returned a tuple, although I'm not sure if that can be handled in general, because small/non-serialized objects are left as-is AFAIK.

To be honest, I don't remember what specifically was the problem regarding tuples, but we had to handle it in our protocol changes (maybe it only needed to be handled because of Dask unit tests or the rest of Dask code). I'll try to poke in our protocol changes to see if something breaks when I change tuples/lists in the encoding. Although it seems that this is a problem on the receiver side, because msgpack-python simply deserializes all flat iterables to a list OR all to a tuple?

@jakirkham
Copy link
Member Author

jakirkham commented May 19, 2020

Yeah that makes sense. Agree it is error-prone currently and it would be good to get away from that.

Maybe we can come up with something using ExtType?

import msgpack


def default(obj):
    if isinstance(obj, tuple):
        return msgpack.ExtType(10, msgpack.packb(list(obj)))
    else:
        raise TypeError("Unknown type: %s" % repr(type(obj)))

def ext_hook(code, data):
    if code == 10:
        return tuple(msgpack.unpackb(data))
    else:
        return msgpack.ExtType(code, data)


data = (2, 3)
packed = msgpack.packb(data, default=default, strict_types=True, use_bin_type=True)
unpacked = msgpack.unpackb(packed, ext_hook=ext_hook)
print((data, unpacked))


data = [2, 3]
packed = msgpack.packb(data, default=default, strict_types=True, use_bin_type=True)
unpacked = msgpack.unpackb(packed, ext_hook=ext_hook)
print((data, unpacked))

@Kobzol
Copy link
Contributor

Kobzol commented May 19, 2020

This looks reasonable to me, if it doesn't add too much complexity (I don't know on how many places it needs to be changed).

@jakirkham
Copy link
Member Author

Actually had a better idea so that we don't have to call msgpack.packb in the default function, but can just rely on msgpack to serialize the result for us. Also this just overloads lists a bit by prefixing them with __tuple__ if that is what they are. We could do something similar by placing them in a dict instead. Anyways should be a bit cleaner approach overall.

import msgpack


def encode_tuple(obj):
    if isinstance(obj, tuple):
        obj = ["__tuple__", *obj]
    return obj

def decode_tuple(obj):
    if obj[0] == "__tuple__":
        obj = tuple(obj[1:])
    return obj


data = (2, 3)
packed = msgpack.packb(data, default=encode_tuple, strict_types=True, use_bin_type=True)
unpacked = msgpack.unpackb(packed, list_hook=decode_tuple)
print((data, unpacked))

data = [2, 3]
packed = msgpack.packb(data, default=encode_tuple, strict_types=True, use_bin_type=True)
unpacked = msgpack.unpackb(packed, list_hook=decode_tuple)
print((data, unpacked))

@jakirkham
Copy link
Member Author

Addressing with PR ( #4575 )

@jakirkham
Copy link
Member Author

cc @jrbourbeau @madsbk (also please see comment above)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants