Nesting custom tags #252

Open
1 task done
sam-t-pratt opened this issue Jan 22, 2025 · 11 comments

Comments

@sam-t-pratt

Things to check first

  • I have searched the existing issues and didn't find my feature already requested there

Feature description

I am working on a project where I'd like to serialize nested custom tags, or collections of custom tags. I have been trying to work out from the documentation which approach to use, especially since I'd also like to be able to use those tags separately. It feels like it should already be possible, but I can't find the right way to do it. I suspect it's just a documentation/user knowledge problem, so if there is an obvious way to do this that I am not understanding, I would be more than happy to take that lesson and write up a self-contained example and paragraph for a documentation PR.

Use case

As an example, I have an Interval type, which is a wrapper around a 2-tuple of floats with operators defined for interval arithmetic. Separately, I have a Vector[T] type, which is a generic numeric vector (similar to a numpy vector, glm, etc.). One common use case in my project is Vector[Interval]. I've defined encoders and decoders that can be passed to default and tag_hook for the standalone types, but I haven't found the right way to handle the case where a Vector is a collection of Intervals (rather than floats or ints). Is there a way to do this that I am missing? Any pointers would be much appreciated!

@agronholm
Owner

I'm not sure. A series of CBOR byte codes that such a structure would encode into would surely help me figure out if this is feasible.

@sam-t-pratt
Author

I am working on building out the byte codes that match my idea, and doing so is showing me that this perhaps isn't a documentation issue with cbor2 but rather a gap in my knowledge of the specification. The specification states:

A tag applies semantics to the data item it encloses. Tags can nest: if tag A encloses tag B, which encloses data item C, tag A applies to the result of applying tag B on data item C.

Based on that idea, my concept is effectively to have a structure that looks like the following:

CBORTag(
    VectorTagNumber,
    (
        CBORTag(IntervalTagNumber, (float, float)),
        CBORTag(IntervalTagNumber, (float, float)),
        CBORTag(IntervalTagNumber, (float, float))
    )
)

This is (amusingly) exactly the scenario defined in the spec; I have a data item (float) contained in a tag (Interval) which has some custom semantics, followed by enclosing that in another tag (Vector) which applies yet more semantics.

Hopefully that helps clarify the intent. It seems from the spec like this should be possible, but both fortunately and unfortunately, cbor2 has worked so well that I haven't needed to learn the machinery of the spec in great detail, so it's taking me some time to figure out how to construct the correct sequence of byte codes.

@sam-t-pratt
Author

sam-t-pratt commented Jan 22, 2025

After spending more time reading through the spec, I think that one reasonable sequence of bytes might look like the following. I used the CBOR Playground to test this.

D9 057A                 # tag(1402) <- tag number is a stand-in for vector. 
   83                   # array(3)
      D9 0579           # tag(1401) <- tag number is a stand-in for interval.
         82             # array(2)
            FA 3F99999A # float(1.2) <- floats are just arbitrary values.
            FA 4059999A # float(3.4)
      D9 0579           # tag(1401)
         82             # array(2)
            FA 40B33333 # float(5.6)
            FA 40F9999A # float(7.8)
      D9 0579           # tag(1401)
         82             # array(2)
            FA 41100000 # float(9.0)
            FA 40490FD8 # float(3.141592)

For easier copy-paste, here is the raw byte string:

D9057A83D9057982FA3F99999AFA4059999AD9057982FA40B33333FA40F9999AD9057982FA41100000FA40490FD8
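
As a cross-check of the layout above, the same bytes can be assembled with only the standard library. This is a sketch, not part of cbor2: the helper names (tag_header, array_header, float32, interval, vector) are made up for illustration, and it assumes tag numbers that need a 2-byte argument and arrays with fewer than 24 items, which holds for this example.

```python
import struct


def tag_header(number: int) -> bytes:
    """Major type 6 (tag) with a 2-byte argument: initial byte 0xD9."""
    if not 0x100 <= number <= 0xFFFF:
        raise ValueError("this sketch only handles tags with a 2-byte argument")
    return b"\xd9" + struct.pack(">H", number)


def array_header(length: int) -> bytes:
    """Major type 4 (array); lengths below 24 fit in the initial byte."""
    if not 0 <= length < 24:
        raise ValueError("this sketch only handles short arrays")
    return bytes([0x80 | length])


def float32(value: float) -> bytes:
    """Single-precision float: initial byte 0xFA plus 4 big-endian bytes."""
    return b"\xfa" + struct.pack(">f", value)


def interval(lower: float, upper: float) -> bytes:
    # tag(1401) wrapping array(2) of two floats
    return tag_header(1401) + array_header(2) + float32(lower) + float32(upper)


def vector(*intervals: bytes) -> bytes:
    # tag(1402) wrapping an array of already-encoded intervals
    return tag_header(1402) + array_header(len(intervals)) + b"".join(intervals)


encoded = vector(interval(1.2, 3.4), interval(5.6, 7.8), interval(9.0, 3.141592))
print(encoded.hex().upper())
```

The printed hex string should match the raw byte string above, which confirms that each tag header simply prefixes the data item it encloses.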

It seems like the array specifiers are perhaps a bit of a cheat, since I am using them to wrap existing well-defined CBOR types (floats), but if I can figure out how to define this structure in cbor2, then I can reconstruct the original object (with its python semantics) from the CBOR object after deserialization. My goal for injecting the tags into the stream (rather than just serializing 6 floats and saving myself ~15 bytes of overhead) is to enforce type requirements when the stream is serialized and deserialized; if the tag is unrecognized during decoding, the decoder will reject the tag and raise an error, so that types are enforced even across the encode/decode boundary.

I am not able to get python3 -m cbor2.tool to give me a reasonable output with hex or base64 strings (which I suspect is operator error, not the fault of cbor2.tool) so I stuck to the playground website linked above.

@agronholm
Owner

Ok, so have you read this yet?

@sam-t-pratt
Author

Yes, and that's where I think I am misunderstanding things. I have simple implementations of default and tag_hook for these types. In the case of these two, they're both wrappers around tuple with extra methods to suit their purposes, but I have other types where this nested tag property would also be useful.

def vector_default(encoder, value):
    encoder.encode(CBORTag(1402, *value))

def vector_hook(decoder, tag, _shareable_index = 0):
    if tag.tag != 1402:
        return tag

    return Vector3(*tag.value)

# interval_default and _hook are functionally identical

I can then construct an object to encode, and encode it:

foo = Vector3(Interval(1.2, 3.4), Interval(5.6, 7.8), Interval(9.0, 3.1))
bits = cbor2.dumps(foo, default=vector_default)

The resultant byte string is below:

8382fb3ff3333333333333fb400b33333333333382fb4012000000000000fb401acccccccccccd82fb4008cccccccccccdfb4021cccccccccccd

This leads me to two questions. First, the tags I've defined (0x0579, 0x057A) don't appear in the byte string at all: cbor2 is outsmarting me and encoding plain arrays rather than injecting tags. I hadn't noticed this behavior before, but now that I have been inspecting the encoded values directly it's apparent, and surprising. Am I doing something wrong with my default function?

The second question is what actually caused me to ask for help. If I have a sequence of bytes that means vector_tag, array[3], interval_tag, array[2], float, float, interval_tag, ... as in my previous comment, how do I pass both tag_hook functions to loads? When I pass tag_hook=vector_hook, it seems like it will encounter the nested interval tag and return it unchanged, leaving the decoder with a tag nobody has handled.

If I pass the byte string from my comment above (the one I constructed by hand) into loads, it correctly constructs a Vector object, but the contained values are just the raw tagged interval values:

test = bytes.fromhex('D9057C83D9057982FA3F99999AFA4059999AD9057982FA40B33333FA40F9999AD9057982FA41100000FA40490FD8')

print(cbor2.loads(test, tag_hook=Vector3.decode))
Vector3(CBORTag(1401, [1.2000000476837158, 3.4000000953674316]), CBORTag(1401, [5.599999904632568, 7.800000190734863]), CBORTag(1401, [9.0, 3.141592025756836]))

Hopefully that makes sense - it really feels like I am just reading the docs wrong here, since it seems like this is the expected behavior from the library, but not what I expected.

@agronholm
Owner

What does your Interval class look like? You seem to be unpacking it into the CBORTag which would certainly eliminate the Interval type from the encoding process.

@sam-t-pratt
Author

To simplify things, I've created a small example script that is just the bare bones of what I'm trying to do:

from __future__ import annotations
import cbor2


class Interval(tuple):
    TAG_NUMBER = 1402

    def __new__(cls, lower, upper):
        return super().__new__(cls, (lower, upper))

    @staticmethod
    def encode(encoder: cbor2.CBOREncoder, value: Interval) -> None:
        encoder.encode(cbor2.CBORTag(Interval.TAG_NUMBER, value))

    @staticmethod
    def decode(decoder: cbor2.CBORDecoder, tag: cbor2.CBORTag, _shareable_index: int = 0) -> Interval | cbor2.CBORTag:
        if tag.tag != Interval.TAG_NUMBER:
            return tag

        return Interval(*tag.value)

    def __repr__(self) -> str:
        return f"interval({self[0]}, {self[1]})"

    # Lots of other methods specific to interval


class Vector(tuple):
    TAG_NUMBER = 1401

    def __new__(cls, x, y, z):
        return super().__new__(cls, (x, y, z))

    @staticmethod
    def encode(encoder: cbor2.CBOREncoder, value: Vector) -> None:
        encoder.encode(cbor2.CBORTag(Vector.TAG_NUMBER, value))

    @staticmethod
    def decode(decoder: cbor2.CBORDecoder, tag: cbor2.CBORTag, _shareable_index: int = 0) -> Vector | cbor2.CBORTag:
        if tag.tag != Vector.TAG_NUMBER:
            return tag

        return Vector(*tag.value)

    def __repr__(self) -> str:
        return f"vector({self[0]}, {self[1]}, {self[2]})"

    # Lots of other methods specific to vector


if __name__ == "__main__":
    print("CBOR2 Nested Tag Test Script")

    # Construct an arbitrary test value
    test_value = Vector(Interval(0.1, 2.3), Interval(4.5, 6.7), Interval(8.9, 1.0))

    print(f"Initial test value: {test_value}")
    # outputs "Initial test value: vector(interval(0.1, 2.3), interval(4.5, 6.7), interval(8.9, 1.0))"

    # Serialize it with the vector encoder
    bits = cbor2.dumps(test_value, default=Vector.encode)

    print(f"Test value hex string:\n{bits.hex()}")


    # Deserialize it and compare
    reconstructed = cbor2.loads(bits, tag_hook=Vector.decode)

    print(f"Reconstructed test value: {reconstructed}")
    # outputs "Reconstructed test value: [[0.1, 2.3], [4.5, 6.7], [8.9, 1.0]]"

The goal is to end up with a reconstructed value that matches the test value, in both the numeric values and the types. So the two areas where I'm doing something wrong are:

  1. What am I doing wrong to cause cbor2 to not emit custom tags?
  2. How would I handle the case where Vector and Interval both need to pass their encoder and decoder methods to default and tag_hook respectively?

Thank you for helping out; I apologize if this feels like a really dumb sequence of back-and-forth - I really appreciate your time.

@agronholm
Owner

This is probably caused by inheriting from tuple, which cbor2 can handle natively. What if you don't do that?

@sam-t-pratt
Author

sam-t-pratt commented Jan 23, 2025

That works for Vector, but that leads back to the original question - the encoder (and decoder) functions needed for Vector and Interval are different, so the encoding fails as soon as we hit the Interval inside of the vector. What is the right way to handle nested custom types like this?

On a related note; I would have expected CBOREncoder.encode(CBORTag(...)) to emit a custom tag even when cbor2 can handle the value natively, since the default_encoder lookup seems like it should return encode_semantic, which would then encode the tag before calling encode on tag.value, right?

@agronholm
Owner

> That works for Vector, but that leads back to the original question - the encoder (and decoder) functions needed for Vector and Interval are different, so the encoding fails as soon as we hit the Interval inside of the vector. What is the right way to handle nested custom types like this?
>
> On a related note; I would have expected CBOREncoder.encode(CBORTag(...)) to emit a custom tag even when cbor2 can handle the value natively, since the default_encoder lookup seems like it should return encode_semantic, which would then encode the tag before calling encode on tag.value, right?

You're only telling cbor2.dumps() how to encode Vector, but not how to encode Interval. Your decoding process has the same issue. Once you give the respective functions a callback that can handle both classes, it should work.

As for the semantics of handling custom types inherited from natively handled types, I'll think about changing the semantics in the next major release (as such a change would be potentially backwards incompatible). The ability to override the encoding and decoding of any type has been requested before.

@sam-t-pratt
Author

Got it, thanks. I'll see how I can work around the inheritance issue. I think I can handle the callbacks by building functions that can handle nested types at runtime, but I'll have to play with that for a while - thank you for helping me understand!
