Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it possible to encode recursive types? #35

Open
artegoser opened this issue Aug 31, 2024 · 10 comments
Open

Is it possible to encode recursive types? #35

artegoser opened this issue Aug 31, 2024 · 10 comments
Labels
documentation Improvements or additions to documentation

Comments

@artegoser
Copy link

image

image

image

In serde version there is error: type changed

@finnbear
Copy link
Member

finnbear commented Aug 31, 2024

Recursive types are not implemented (edit: for Encode) and would be tricky to implement based on how bitcode@0.6 encodes nested values. You can use bitcode@0.5 instead, which works very differently and has a #[bitcode(recursive)] attribute for Encode.

Edit: for serde, see #35 (comment)

@finnbear finnbear added the documentation Improvements or additions to documentation label Aug 31, 2024
@artegoser
Copy link
Author

Thx. So, is this feature not planned? Or will it be possible again someday?

@finnbear
Copy link
Member

Not currently planned, but the documentation could be more clear about this so I'll leave the issue open.

@caibear
Copy link
Collaborator

caibear commented Sep 1, 2024

Can use 0.6 with serde if you remove serde untagged from your enums. Serde untagged breaks most binary formats even if some don't complain right away.

@artegoser
Copy link
Author

Yes, it works, only the purpose in reducing the amount of data is not fulfilled. In this case enum identifier is written, which significantly increases the file size. I compared with dlhn, however it cannot decode this. The size becomes larger than my daletpack format, which does not compress well. Is it possible to implement serde(untagged) support in your case?

@finnbear
Copy link
Member

finnbear commented Sep 1, 2024

Is it possible to implement serde(untagged) support in your case?

No, because #[serde(untagged)] works by trying all possible variants until one deserializes. No version of bitcode is self-describing, so it's possible that the wrong variant would deserialize without an error but the data would be corrupt.

Consider the case of #[serde(untagged)] enum { Big(u16), Small(u8) }. If you serialized Small(5) with serde_json, it would deserialize as Big(5). In bitcode, #[serde(untagged)] enum { Int(usize), Str(String) } could have the worse problem that bytes from a string encoding could be interpreted as an integer. Data corruption is not something we tolerate in bitcode, although most likely the entire message would fail to deserialize.

Making bitcode self-describing (per-instance, to support #[serde(untagged)]) would probably cost much more data than the enum variants that #[serde(untagged)] tries to omit.

The #[serde(untagged)] feature is better for textual formats like JSON, where different types have disjoint prefixes like [ or {.

The size becomes larger than my daletpack format

If you can provide specific details (schema + encoding) of how you can outperform bitcode on size, that would be interesting 👀

@artegoser
Copy link
Author

artegoser commented Sep 1, 2024

If you can provide specific details (schema + encoding) of how you can outperform bitcode on size, that would be interesting 👀

This format is specifically designed for my schema, but it doesn't compress well (I'm not very knowledgeable about how to optimize this), so I'm looking for solutions that are already out there. It's not for all data types, so it won't help you.

For my format, I had the idea of exposing each type (if it's an enum) and writing a separate identifier for each one.

For example

enum Enum {
  First(Other),
  Second(Other)
}

enum Other {
  String(String),
  Num(u8)
}

Enum in this case has 4 binary identifiers for each type

Enum::FirstString
Enum::FirstNum

Enum::SecondString
Enum::SecondNum

I don't know how procedurally realizable this is, but it should work. I don't know if this structure compresses well either.

I partially implemented this in the daletpack scheme (but still eventually would like to do it for all types and procedurally)

@caibear
Copy link
Collaborator

caibear commented Sep 1, 2024

After running your benchmark it looks like various formats take ~300 bytes. I would recommend testing on a larger dataset if you want to accurately benchmark compression. Compression algorithms are generally inefficient in terms of size and speed on input smaller than tens of kilobytes.

Also zlib is just a wrapper of deflate.

@DontBreakAlex
Copy link

Hi ! I have a few types that contain serde_json::Valuess, and I also get a panic with type changed. I checked and it seems that serde_json is not using #[serde(untagged)]. What's going on ?

@finnbear
Copy link
Member

I recently added a warning to the README about this; serde_json::Value has a custom Serialize implementation that is very similar to an untagged enum.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

4 participants