[Enhancement] Allow custom string dictionary + use location of repeated strings #122

Open
joetex opened this issue Nov 9, 2023 · 3 comments


joetex commented Nov 9, 2023

I didn't know this package existed and wrote my own, doh. But mine is scoped too heavily to my project. I want to switch to something more flexible that has community support, and I would love to keep some of the size reductions I implemented. For strings in msgpackr, I've been unable to get any benefit from bundleStrings, which was odd.

Enhancements:

  1. Allow a custom string dictionary: an array of commonly used strings that is supplied identically to both Packr and Unpackr. Looking up a string in this table should take only two bytes per occurrence for a dictionary length of 255.

  2. Store the location of repeated strings instead of encoding the same string twice. If "hello" gets encoded at byte position 53 and the serializer sees "hello" again later, it should just encode position 53 for the second "hello". Again, this takes only 2 bytes, or more if the distance is greater than 255.
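
To make the two proposals concrete, here is a minimal sketch of both ideas applied to a flat list of strings. It is not msgpackr's format or API: the tag bytes, the `encodeStrings` / `decodeStrings` names, and the wire layout are all hypothetical, chosen only to show how a shared dictionary and back-references to earlier byte offsets could work together.

```javascript
// Toy wire format (hypothetical, NOT msgpackr's):
//   0x00 len <utf8 bytes>  literal string, len < 256
//   0x01 index             entry in the shared dictionary (2 bytes total)
//   0x02 offset            back-reference to an earlier literal, offset < 256
//   0x03 hi lo             back-reference to an earlier literal, offset < 65536
const LITERAL = 0x00, DICT = 0x01, REF8 = 0x02, REF16 = 0x03;

function encodeStrings(strings, dictionary = []) {
  if (dictionary.length > 256) throw new RangeError("dictionary too large for a 1-byte index");
  const dictIndex = new Map(dictionary.map((s, i) => [s, i]));
  const seenAt = new Map(); // string -> byte offset of its first literal
  const out = [];
  for (const s of strings) {
    if (dictIndex.has(s)) {
      out.push(DICT, dictIndex.get(s));
    } else if (seenAt.has(s)) {
      const off = seenAt.get(s);
      if (off < 256) out.push(REF8, off);
      else out.push(REF16, off >> 8, off & 0xff);
    } else {
      const bytes = Buffer.from(s, "utf8");
      if (bytes.length > 255) throw new RangeError("string too long for this sketch");
      // Only remember offsets that a 2-byte reference can address.
      if (out.length < 65536) seenAt.set(s, out.length);
      out.push(LITERAL, bytes.length, ...bytes);
    }
  }
  return Buffer.from(out);
}

function decodeStrings(buf, dictionary = []) {
  // Reads the literal whose tag byte sits at `pos`.
  const readLiteral = (pos) => buf.toString("utf8", pos + 2, pos + 2 + buf[pos + 1]);
  const result = [];
  let pos = 0;
  while (pos < buf.length) {
    const tag = buf[pos];
    if (tag === LITERAL) {
      result.push(readLiteral(pos));
      pos += 2 + buf[pos + 1];
    } else if (tag === DICT) {
      result.push(dictionary[buf[pos + 1]]);
      pos += 2;
    } else if (tag === REF8) {
      result.push(readLiteral(buf[pos + 1]));
      pos += 2;
    } else if (tag === REF16) {
      result.push(readLiteral((buf[pos + 1] << 8) | buf[pos + 2]));
      pos += 3;
    } else {
      throw new Error("unknown tag " + tag);
    }
  }
  return result;
}
```

Both sides must be given the same dictionary array, as the first proposal describes. In this sketch a dictionary hit or a nearby back-reference costs two bytes, and a back-reference to an offset of 256 or more costs three.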

Feel free to see my own awful implementation, acos-json-encoder.
Edit: the link goes to the line where I implemented this.


lmachens commented Apr 8, 2024

I am looking for this enhancement too.
@kriszyp can you check this request?

I think bundleStrings could be optimized by not saving the same string multiple times.

kriszyp (Owner) commented Apr 8, 2024

You might consider using CBOR packing, which was designed for this purpose:
https://github.com/kriszyp/cbor-x?tab=readme-ov-file#cbor-packing
However, this will only find exact string value duplicates. It does not look for duplicates within strings, so it won't do any compression of {foo: 'hello', bar: 'hello world'}. More general string deduping is kind of the whole point of RLE compression, and there are plenty of great compression formats and tools which are much better than anything msgpack could offer.
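
As a rough sketch of how to try this, the helper below compares encoded sizes with and without packing. It assumes that `Encoder` is the class exported by cbor-x and that it accepts the `useRecords` and `pack` options named in this thread and in the linked README section; the `compareSizes` helper itself is hypothetical and not part of either library.

```javascript
// Hypothetical helper: takes an Encoder class (assumed to be cbor-x's)
// and a value, and returns the encoded size in bytes for each configuration.
function compareSizes(Encoder, value) {
  const recordsOnly = new Encoder({ useRecords: true });
  const recordsAndPack = new Encoder({ useRecords: true, pack: true });
  return {
    recordsOnly: recordsOnly.encode(value).length,
    recordsAndPack: recordsAndPack.encode(value).length,
  };
}

// Usage, assuming cbor-x is installed:
//   const { Encoder } = require("cbor-x");
//   console.log(compareSizes(Encoder, { foo: "hello", bar: "hello" }));
```

Since packing only matches whole string values, the gain depends on how many exact duplicates the data contains.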


lmachens commented Apr 8, 2024

@kriszyp Thank you very much!
This is exactly what I was looking for. I only need to find exact string duplicates.

Great results!
original size: 386275 bytes
msgpackr (with useRecords): 101464 bytes
cbor (with useRecords and pack): 61865 bytes
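
For a sense of scale, the figures above can be turned into ratios with a few lines of arithmetic. The `pct` helper is just an illustrative name; the byte counts are the ones reported in this comment.

```javascript
// Byte counts copied from the results above.
const sizes = { original: 386275, msgpackr: 101464, cborPacked: 61865 };

// Percentage of `whole` that `part` represents, rounded to one decimal place.
const pct = (part, whole) => Math.round((part / whole) * 1000) / 10;

const msgpackrVsOriginal = pct(sizes.msgpackr, sizes.original);   // 26.3
const cborVsOriginal = pct(sizes.cborPacked, sizes.original);     // 16
const cborVsMsgpackr = pct(sizes.cborPacked, sizes.msgpackr);     // 61
```

So for this data set the packed CBOR output is about 16% of the original size, and about 39% smaller than the msgpackr output with useRecords.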
