msgpack.Encoder flag to skip any coercion of data types not native to the encoding #734

Open
nfcampos opened this issue Sep 14, 2024 · 1 comment

Comments

@nfcampos

Description

Hi

This library is great, well documented and well thought-out, thanks for the work you've put into it!

I'm the lead maintainer of https://github.com/langchain-ai/langgraph and was considering using msgspec to power serialization of the checkpoints. Checkpoints snapshot the current state of the computation in LangGraph, and as such contain custom Python objects for which we don't have a schema. We serialize these as a tuple containing a reference to the constructor and the arguments needed to recreate the object after deserialization.
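
To make the constructor-reference idea concrete, here is a minimal stdlib-only sketch. It is not LangGraph's actual implementation; the helper names `to_constructor_tuple` and `from_constructor_tuple` and the `(module, qualname, args)` layout are assumptions chosen for illustration.

```python
import importlib
import uuid
from typing import Any


def to_constructor_tuple(obj: Any, args: tuple) -> tuple:
    # Hypothetical helper: store where the class lives plus the
    # positional arguments needed to rebuild the object.
    cls = type(obj)
    return (cls.__module__, cls.__qualname__, args)


def from_constructor_tuple(ref: tuple) -> Any:
    # Resolve the class by import path, then call it with the stored args.
    module_name, qualname, args = ref
    target: Any = importlib.import_module(module_name)
    for part in qualname.split("."):
        target = getattr(target, part)
    return target(*args)


original = uuid.UUID("12345678-1234-5678-1234-567812345678")
packed = to_constructor_tuple(original, (str(original),))
restored = from_constructor_tuple(packed)
```

The tuple itself contains only strings and a nested tuple, so it can be carried by any encoding. What needs to survive the round trip is the marker saying "this is a constructor reference", which is where extension types come in.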

Would you consider adding a flag to the msgpack Encoder that sends any type that doesn't map 1:1 to a msgpack type to enc_hook? This would enable us to use your library, since we could then encode those types as msgpack extension types. Without this flag, for instance, when a user's values contain a UUID they get back a string, when they contain a set they get back a list, and when they contain an enum value they get back the underlying value (e.g. a string).

You can see how we currently serialize to msgpack in langchain-ai/langgraph#1716, using a slower library that doesn't coerce types.

We wouldn't need any changes to the Decoder interface; it already covers our needs with ext_hook.

Thanks
Nuno

@nfcampos nfcampos changed the title Encoder flag to skip any coercion of data types not native to the encoding msgpack.Encoder flag to skip any coercion of data types not native to the encoding Sep 14, 2024
@trim21
Contributor

trim21 commented Sep 21, 2024

+1 for this.

We previously used msgpack-python, which gave us fine-grained control over how Python types such as Decimal, datetime, and set are encoded.

With enc_hook and ext_hook, we can get the original Python values back after encoding and decoding.

But with msgspec.msgpack, these types are handled by msgspec itself, and enc_hook is not called for many of them, which makes this impossible.

A full example would look like this:

from datetime import datetime
from decimal import Decimal
from typing import Any

from msgspec import msgpack


_ext_id_set = 1
_ext_id_dt = 2
_ext_id_decimal = 3
_ext_id_userdata = 4


class UserData:
    def __init__(self, msg: str):
        self.msg = msg

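    def __repr__(self):
        return f"<UserData msg={self.msg!r}>"

    def __eq__(self, other: object) -> bool:
        # Needed so the round-trip assertion below compares by value, not identity.
        return isinstance(other, UserData) and other.msg == self.msg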


def _default(obj: Any) -> msgpack.Ext:
    if isinstance(obj, set):
        return msgpack.Ext(_ext_id_set, msgpack.encode(list(obj)))

    if isinstance(obj, datetime):
        return msgpack.Ext(_ext_id_dt, obj.isoformat(timespec="microseconds").encode())

    if isinstance(obj, Decimal):
        return msgpack.Ext(_ext_id_decimal, str(obj).encode())

    if isinstance(obj, UserData):
        return msgpack.Ext(_ext_id_userdata, obj.msg.encode())

    raise TypeError(f"Unknown type: {obj!r}")


def _ext_hook(code: int, data: memoryview) -> Any:
    if code == _ext_id_set:
        return set(msgpack.decode(data.tobytes()))
    if code == _ext_id_dt:
        return datetime.fromisoformat(str(data, "utf8"))
    if code == _ext_id_decimal:
        return Decimal(str(data, "utf8"))

    if code == _ext_id_userdata:
        return UserData(str(data, "utf8"))

    return msgpack.Ext(code, data)


encoder = msgpack.Encoder(enc_hook=_default)
decoder = msgpack.Decoder(ext_hook=_ext_hook)


def dumps(v: Any) -> bytes:
    return encoder.encode(v)


def loads(b: bytes) -> Any:
    return decoder.decode(b)


py_data = {
    "0": Decimal("233"),
    "1": datetime.now().astimezone(),
    "2": {1, 2, 3},
    "3": UserData("hello world"),
}

after_serialization = loads(dumps(py_data))
print(after_serialization)

assert after_serialization == py_data
