Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should differential encoding be used? #15

Open
qwertie opened this issue Nov 9, 2021 · 1 comment
Open

Should differential encoding be used? #15

qwertie opened this issue Nov 9, 2021 · 1 comment

Comments

@qwertie
Copy link

qwertie commented Nov 9, 2021

I'm not sure what the general attitude in the CG is about compression, but I remember that early on there was a strong focus on minimizing file size. With that in mind, standard compression algorithms like deflate don't like arbitrary numbers - they are unable, for instance, to predict that an increasing number sequence will continue to increase, so a workaround for this is to store numbers as differences, e.g. instead of [5, 8, 11, 21], store [5, 3, 3, 10]. This makes the numbers smaller, which tends to increase the compression ratio because it concentrates probability mass (e.g. in arithmetic/huffman encoding) toward smaller numbers.

The current format is quite natural:

the u32 byte offset of the hinted instruction from the first instruction of the function.

But since the list must be sorted, it would be straightforward to use differential encoding instead.

I haven't been very active in the Wasm community but I remember in the beginning there was an idea of having higher-level Wasm binary encodings, outside the core spec and implemented by 3rd parties, that could be used to optimize file size. Did that ever actually happen? If not, optimizing items in the core spec for compressibility starts to look more worthwhile.

@yuri91
Copy link
Collaborator

yuri91 commented Nov 18, 2021

I think that differential encoding was mentioned at some point in the design, but to keep it more similar to other sections (like dwarf data) we kept it simpler for the moment.

I agree that it would make sense, even more so because the offsets are 32 bit integers in LEB format, which makes it even more advantageous to have small numbers, even without further compression on top.

On the other hand, I expect branch hints to be very sparse, so I am not sure if there will be significant gains in practice.
I will try to estimate it once I will have data on a real codebase using this.

NOTE: The Branch hinting section follows the Code Annotation format: https://github.com/WebAssembly/tool-conventions/blob/main/CodeAnnotations.md , so this change would need to happen there as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants