Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

impl Format for i128/u128 #199

Closed
Dirbaio opened this issue Oct 20, 2020 · 4 comments · Fixed by #284
Closed

impl Format for i128/u128 #199

Dirbaio opened this issue Oct 20, 2020 · 4 comments · Fixed by #284
Assignees
Labels
breaking change fix / feature / improvement involves a breaking change and needs to wait until next minor version difficulty: easy Pretty easy to solve priority: low Low priority for the Knurling team status: needs PR Issue just needs a Pull Request implementing the changes type: enhancement Enhancement or feature request
Milestone

Comments

@Dirbaio
Copy link
Contributor

Dirbaio commented Oct 20, 2020

Would also be nice to have i128/u128 while at it.

Originally posted by @MathiasKoch in #160 (comment)

@jonas-schievink
Copy link
Contributor

I wonder if we should use LEB128 encoding for some of the larger types...

@jonas-schievink jonas-schievink added the type: enhancement Enhancement or feature request label Oct 20, 2020
@Dirbaio
Copy link
Contributor Author

Dirbaio commented Oct 20, 2020

Yes... Definitely for u64 and IMO it's worth it for u32 as well.

@japaric
Copy link
Member

japaric commented Oct 20, 2020

I wonder if we should use LEB128 encoding for some of the larger types...

warning this section contains math

We can compute how many bytes on average , say, a u16 will take when it's LEB128 if we assume, that in practice, the value will be uniformely distributed across the entire u16 range (and that's a pretty big if):

  • the range 0..(1<<7) (128 values) is LEB128 encoded as 1 byte
  • the range (1<<7)..(1<<14) (16,256 values) is LEB128 encoded as 2 bytes
  • the range (1<<14)..(1<<16) (49,152 values) is LEB128 encoded as 3 bytes

If we assume a uniform distribution then each u16 value has a probability of 1 / (1<<16) of being chosen. We can then compute the average encoding size (in bytes) as follows:

const PROB: f64 = 1. / (1<<16) as f64;

let avg = 128. * 1. * PROB + // `0..(1<<7)` encoded as 1 byte
    16_256. * 2. * PROB +    // `(1<<7)..(1<<14)` encoded as 2 bytes
    49_152. * 3. * PROB;     // `(1<<14)..(1<<16)` encoded as 3 bytes
dbg!(avg);

Prints 2.748046875. On average u16 is encoded as 2.748 bytes (again assuming a uniform distribution of values).
If you don't use LEB128 encoding then u16 is always encoded as 2 bytes.
Under this assumption, it's better to not use LEB128 encoding.

If you repeat this exercise for u32 and u64 you get (please someone check the numbers):

  • u32 is LEB128 encoded as 4.937 bytes on average
  • u64 is LEB128 encoded as 9.496 bytes on average

Under this assumption LEB128 encoding doesn't pay off for u32 or u64.


(this section is free from math)

"Will all u16, u32 and u64 variables, in practice, follow a uniform distribution?"

Very likely no. Each variable in a single application is going to follow a different probability distribution, e.g. a u32 counter variable that's usually reset to 0 vs a u32 hash value vs a u64 monotonic counter variable.

"Why are we bothering with LEB128 encoding at all if the above math says that, an average, it's worse than using no encoding?"

The length of a (string) slice (usize) on a 32-bit microcontroller is very likely going to be LEB128 encoded as 1, 2 or 3 bytes (i.e. be less than 1<<21) because the system likely doesn't have enough memory for a [u8; 1<<21] (that's 2MB) or larger arrays / slices. Because we know that, we have high confidence that LEB128 encoding the length will pay off (compared to the 4-byte bandwidth usage of not encoding the 32-bit usize).

"But why are we using LEB128 for usize?"

This is more hand wavy but the hope is that the usize value will more often than not come from slice.len(), str.len() or be some other kind of collection length. If that's the case then LEB128 encoding should pay off (on memory constrained systems).

"And ... isize?"

whistles innocently

"So, should we use LEB128 for other integer types other than usize?"

I don't know. Doing it in a blanket fashion, either never LEB128 encoding u32s or always LEB128 encoding u32s, is going to be good in some cases and bad in others. LEB128 encoding a u32 only saves bandwidth if the value is less than 1 << 21 (that's 2 million values); if the value is equal or greater than 1 << 28 (that's 4 billion (9 zeros) values) then LEB128 encoding actually wastes bandwidth. Ultimately, only the programmer knows on which range a particular variable in their application is going to fall and the answer is going to vary from variable to variable ...

"But clearly u32 counters are more commonly used / printed / logged than u32 hashes!"

Maybe? I would tend to agree but I don't have any data other than my own biased personal experience.

@Dirbaio
Copy link
Contributor Author

Dirbaio commented Oct 21, 2020

What about applying some simple "zero-removing" compression to the encoded byte stream, instead of trying to compress each integer on its own?

Something like this: https://capnproto.org/encoding.html#packing

The max overhead of that is 12.5%. A u32 can take max "4.5 bytes" instead of 5 bytes. An upside is it'd also compress sparse [u8]s and such. A downside is it'd bloat things that typically don't have zero bytes, such as strings. Printing strings shouldn't be common in defmt though.

signed integers could be encoded by flipping all non-sign bits if the sign bit is set, to turn 0xFF's into 0x00s, so small negative numbers compress better. (for example, -1 would be 0x8000000, -2 would be 0x80000001, i32::MIN would be 0xFFFFFFFF)

@jonas-schievink jonas-schievink added difficulty: easy Pretty easy to solve priority: low Low priority for the Knurling team status: needs PR Issue just needs a Pull Request implementing the changes labels Nov 10, 2020
@jonas-schievink jonas-schievink added this to the v0.1.1 milestone Nov 17, 2020
@japaric japaric modified the milestones: v0.1.2, v0.2.0 Nov 27, 2020
@japaric japaric added the breaking change fix / feature / improvement involves a breaking change and needs to wait until next minor version label Nov 27, 2020
@Dirbaio Dirbaio mentioned this issue Apr 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change fix / feature / improvement involves a breaking change and needs to wait until next minor version difficulty: easy Pretty easy to solve priority: low Low priority for the Knurling team status: needs PR Issue just needs a Pull Request implementing the changes type: enhancement Enhancement or feature request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants