look into using Ryu instead of Errol3 for floating point printing #1299
Comments
Note if porting the reference C implementation: we should port the |
I'll likely do an initial implementation of this in the next few days. |
Will be committing here for the moment. https://github.com/tiehuis/zig-ryu Also, it would be nice to implement proper |
Quick update:
This leaves the formatted printing of values remaining before I'd be happy to make a PR. There is an upstream issue for that as well: ulfjack/ryu#27. |
|
Not really. The main requirement here is to put in the time and effort to get the formatted float output modes done: alignment/precision, scientific/decimal modes, etc. It'll probably tie in with #1358 since they both touch a lot of the formatting code. |
This is a response to #1290 (comment). Related moreso to this issue so bringing the discussion here. Here are some actual numbers in regards to the size savings. First, we can construct two simple examples: format-std.zig
format-ryu.zig
Compiling both of these under
format-std.o: Total size is 29163 bytes (excluding
format-ryu.o: Total size is 3007 bytes. Note also that ryu only requires integer operations while errol operates on floats directly.
An important thing to note is that internally our errol implementation only works on f64 values. Even when printing an f32 we cast up to an f64 and back down once done. Likewise for f128, where we lose precision during printing. The current ryu implementation has 32-, 64- and 128-bit backends, one for each of the float types, which any narrower type can use without loss of precision.

That being said, the above example is slightly apples to oranges, since we are comparing a 32-bit printing backend to a 64-bit one. Modifying our original examples we get the following:

format-std.o: Same size as previously. Slightly smaller by a few bytes since the cast from f32 to f64 is no longer done.

format-ryu.o: Total size is 4176 bytes. Slightly larger than the f32 version but not by a huge amount.

The increase in size is not linear as the types widen, since the precomputed values differ between the engines. Storing lots of tables is only required for increasing runtime performance. We can precompute a smaller number of values and derive the rest from them at runtime to keep the table sizes reasonable. This is done as per the original implementation.
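To make the point about width-specific backends concrete, here is a minimal Python sketch (Python rather than Zig, purely for illustration). It assumes Python's `repr` as a stand-in for a shortest-round-trip f64 printer, and `shortest_f32` is a hypothetical brute-force helper standing in for an f32 backend. Neither is the zig-ryu code.

```python
import struct


def to_f32(x: float) -> float:
    """Round an f64 to the nearest f32 and return it widened back to f64."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def shortest_f32(x: float) -> str:
    """Brute-force the shortest scientific string that round-trips through f32.

    Illustration only: Ryu finds the same digits without trial and error.
    """
    target = to_f32(x)
    for digits in range(1, 10):  # 9 significant digits always suffice for f32
        s = f"{target:.{digits - 1}e}"
        if to_f32(float(s)) == target:
            return s
    raise AssertionError("unreachable for finite f32 values")


v = to_f32(0.1)
print(repr(v))          # digits needed to identify the value as an f64
print(shortest_f32(v))  # digits needed to identify the value as an f32
```

Printing the f32 through an f64 printer emits many more digits than an f32 reader needs, which is the mismatch a dedicated 32-bit backend avoids.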
Summary
|
Wow, huge improvements! |
An additional note: there is now some preliminary support upstream for formatting to a specific precision with ryu. This is done for the f64 backend. It is a separate implementation, distinct from the existing one (which is focused on the shortest unique representation). I haven't yet looked at it in depth, but this was the main remaining issue on the "backend" side of things. |
So does that need to be re-written too, to work specifically with ryu's output, so we can be round-trip accurate for all values (including subnormals)? Note that the Ryu paper specifies that different rounding modes are available, so parse_float.zig must match the rounding mode of the ryu implementation. |
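A round-trip check of the kind described here can be sketched as follows. This is Python for illustration only: `repr` and `float` stand in for the printer and for parse_float.zig, and the helper names are hypothetical. The idea is to compare bit patterns (so that -0.0 and subnormals are covered) over edge cases plus random bit patterns.

```python
import math
import random
import struct


def f64_to_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_f64(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def round_trips(x: float, fmt=repr, parse=float) -> bool:
    """True if parse(fmt(x)) reproduces x bit for bit."""
    return f64_to_bits(parse(fmt(x))) == f64_to_bits(x)


def count_failures(fmt, n: int = 2000, seed: int = 0) -> int:
    rng = random.Random(seed)
    # Smallest subnormal, largest subnormal, smallest normal, largest finite.
    samples = [5e-324, 2.2250738585072009e-308, 2.2250738585072014e-308,
               1.7976931348623157e308, -0.0, 0.1]
    while len(samples) < n:
        x = bits_to_f64(rng.getrandbits(64))
        if math.isfinite(x):
            samples.append(x)
    return sum(not round_trips(x, fmt) for x in samples)


print(count_failures(repr))                     # shortest round-trip output
print(count_failures(lambda x: f"{x:.17g}"))    # 17 digits always suffice
print(count_failures(lambda x: f"{x:.15g}"))    # 15 digits lose information
```

The same harness shape would apply to any printer/parser pair, provided both agree on the rounding mode.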
@tiehuis is there something ready to merge? |
No. I haven't done anything on this since my last message here so it is still in the same state. You could technically merge right now (updating the code for latest zig), but you would only have the default format provided by ryu and would be missing precision and different notation options. |
I ported the 32-bit version to D if anyone wants to see it for reference. It's all in one file in D (tests as well): https://github.com/dragon-lang/mar/blob/master/src/mar/ryu.d |
|
I've implemented the fixed precision modes for scientific/fixed in a new branch: https://github.com/tiehuis/zig-ryu/tree/dev I'll likely make a PR soon once I finish up some comparisons and do some testing. Do note that we need more tables to handle this formatting so I'll need to benchmark the increase compared to just preferring shortest unique. Also, since I only have a 64-bit variant for these, the plan is to upcast f16 and f32 to f64 for these modes. This is fine since it will preserve accuracy, however we will need to downcast f128 until a dedicated implementation is added. I don't consider this a blocker since errol currently does this now so it remains in line at least with the status quo. |
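The claim that upcasting preserves accuracy while downcasting does not can be checked exhaustively for f16. A minimal Python sketch (illustration only, not the zig-ryu code; helper names are hypothetical) using the `struct` module's half-precision format, with f64 to f32 as a stand-in for the f128 to f64 downcast:

```python
import math
import struct


def f16_bits_to_float(bits: int) -> float:
    """Widen an f16 bit pattern to a Python float (f64)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_f16_bits(x: float) -> int:
    return struct.unpack("<H", struct.pack("<e", x))[0]


def f16_upcast_is_lossless() -> bool:
    """Every finite f16 survives f16 -> f64 -> f16 unchanged."""
    for bits in range(1 << 16):
        x = f16_bits_to_float(bits)
        if not math.isfinite(x):
            continue
        if float_to_f16_bits(x) != bits:
            return False
    return True


def f64_to_f32(x: float) -> float:
    return struct.unpack("<f", struct.pack("<f", x))[0]


print(f16_upcast_is_lossless())
# Fixed-precision output of an f16 via the f64 path.
print(f"{f16_bits_to_float(0x3555):.6f}")
# Narrowing is where information is lost (analogous to f128 -> f64).
print(f64_to_f32(0.1) == 0.1)
```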
I consider this an interim workaround/hack until ziglang#1299 is finished. There is a bug in the original C implementation of the errol3 (and errol4) algorithm that can result in undefined behavior or an obviously incorrect result (leading ':' in the output). This change checks for those two problems and uses a slower fallback path if they occur. I can't guarantee that this will always produce the correct result, but since the workaround is only used if the original algorithm is guaranteed to fail, it should never turn a previously-correct result into an incorrect one. Fixes ziglang#11283
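For readers wondering where a ':' in the output comes from: it is the ASCII character right after '9', so it shows up when a digit value of 10 is written without carrying. The toy Python sketch below shows the general "validate the fast path, else fall back" pattern. Both `buggy_fast_digits` and `slow_digits` are hypothetical stand-ins and do not reproduce errol's actual logic or the actual workaround.

```python
def buggy_fast_digits(digit_values: list[int]) -> str:
    # Hypothetical fast path that forgets to propagate carries.
    return "".join(chr(ord("0") + d) for d in digit_values)


def slow_digits(digit_values: list[int]) -> str:
    # Hypothetical fallback: accumulate into an integer so carries resolve.
    n = 0
    for d in digit_values:
        n = n * 10 + d
    return str(n)


def digits_are_valid(s: str) -> bool:
    return s != "" and all("0" <= c <= "9" for c in s)


def format_with_fallback(digit_values: list[int]) -> str:
    out = buggy_fast_digits(digit_values)
    # Only take the slow path when the fast result is provably wrong,
    # so a previously correct result is never changed.
    return out if digits_are_valid(out) else slow_digits(digit_values)


print(buggy_fast_digits([10, 5]))     # shows the stray ':'
print(format_with_fallback([10, 5]))  # fallback repairs it
print(format_with_fallback([1, 2]))   # fast path result kept as-is
```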
@tiehuis Thanks for sharing your port. I moved everything into a single folder ( tiehuis/zig-ryu@master...frmdstryr:zig-ryu:single-folder ), updated the tests and used it to format an f32 on a cortex-m4f (@192mhz).

With ryu:

    const t = mcu.system.time(.s);
    var buf: [20]u8 = undefined;
    const result = ryu(f32).printFixed(&buf, t, 1);
    try lcd.writeLine(3, result);

Size is ~116k (a lot larger) but it only takes about 9us to format!

With zig's current fmt:

    const t = mcu.system.time(.s);
    var buf: [20]u8 = undefined;
    const result = std.fmt.bufPrint(&buf, "{d:.1}", .{t}) catch unreachable;
    try lcd.writeLine(3, result);

Size is ~46k (a lot smaller) but it takes about 120us to format the time. (Oddly, for small numbers it takes almost 500us, while ryu seems to be faster in that case.)

Both output the correct value and are compiled with release small. It is unfortunate that the tables are so large, as that makes it unusable for a lot of MCUs with smaller flash sizes. For PCs the size is negligible.
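As a reference for what a fixed-precision call such as the one above is expected to produce, here is a naive Python sketch. It is not Ryu-printf and uses no tables: it takes the exact rational value of the float and does the rounding with big-integer arithmetic, which is slow but easy to check. `format_fixed` is a hypothetical name, and round-half-to-even on the exact value is an assumption about the desired rounding mode.

```python
import math


def format_fixed(x: float, precision: int) -> str:
    """Format x with `precision` digits after the point, exactly rounded."""
    if not math.isfinite(x):
        raise ValueError("only finite values are handled in this sketch")
    sign = "-" if math.copysign(1.0, x) < 0 else ""
    num, den = abs(x).as_integer_ratio()  # exact: x == num / den
    q, r = divmod(num * 10 ** precision, den)
    # Round half to even on the exact remainder.
    if 2 * r > den or (2 * r == den and q % 2 == 1):
        q += 1
    digits = str(q).rjust(precision + 1, "0")
    if precision == 0:
        return sign + digits
    return sign + digits[:-precision] + "." + digits[-precision:]


print(format_fixed(123.456, 1))
print(format_fixed(0.25, 1))  # exact tie, resolved to the even digit
```

A table-driven implementation should agree with this output digit for digit; the tables exist to avoid the big-integer work at runtime.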
|
I wonder if table values could be generated lazily for -OReleaseSmall builds 🤔 |
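A rough Python sketch of what lazily generated table entries could look like, assuming the entries are leading bits of powers of 5 computed with big integers. The entry width `TABLE_BITS` and the function names are hypothetical and do not match the layout of the reference C tables; the point is only the size versus compute-on-demand trade-off.

```python
from functools import lru_cache

TABLE_BITS = 64  # hypothetical entry width, chosen for illustration


def compute_pow5_entry(i: int) -> int:
    """Leading TABLE_BITS bits of 5**i, computed with big integers."""
    p = 5 ** i
    shift = p.bit_length() - TABLE_BITS
    return p >> shift if shift > 0 else p << -shift


@lru_cache(maxsize=None)
def lazy_pow5_entry(i: int) -> int:
    # Computed on first use and then cached, instead of stored in flash.
    return compute_pow5_entry(i)


def eager_table_bytes(entries: int) -> int:
    """Storage cost of precomputing every entry up front."""
    return entries * TABLE_BITS // 8


print(eager_table_bytes(326))
print(lazy_pow5_entry(3), lazy_pow5_entry(300))
print(lazy_pow5_entry.cache_info().currsize)
```

On a microcontroller the cache itself would cost RAM and the big-integer step would need a substitute, so this only shows the shape of the idea.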
The tables for the fixed printing variant are pretty large. Something to note as well is that the fixed print for f32 up-casts internally to f64, so it uses larger tables than is strictly required. There isn't an upstream f32 fixed print, however, so I didn't implement one in my original code. As for smaller tables or lazy computation, I do know that computing accurate values for these tables uses big integers and needs higher precision than native values, so the performance hit could be significant. I am unsure, however, whether you could compute partial tables and get a better trade-off. On your edit, Ryu doesn't use floating point values internally to compute the string, so this matches my expectation. |
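To show what "no floating point values internally" means in practice, here is a small Python sketch of the first step such an algorithm takes: reinterpret the float's bits and pull out an integer mantissa and a power-of-two exponent. Everything after that can be integer arithmetic. The function name is hypothetical and this is only the decoding step, not the digit generation.

```python
import struct
from fractions import Fraction


def decode_f32(x: float) -> tuple[int, int, int]:
    """Return (sign, m, e2) such that the f32 value is (-1)**sign * m * 2**e2."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exp_bits = (bits >> 23) & 0xFF
    frac_bits = bits & 0x7FFFFF
    if exp_bits == 0xFF:
        raise ValueError("inf and nan are handled separately")
    if exp_bits == 0:
        # Subnormal: no implicit leading 1, fixed minimum exponent.
        return sign, frac_bits, 1 - 127 - 23
    # Normal: restore the implicit leading 1 bit.
    return sign, frac_bits | (1 << 23), exp_bits - 127 - 23


sign, m, e2 = decode_f32(0.5)
print(sign, m, e2)
# Exact check using rationals, no float arithmetic involved.
print(Fraction(m) * Fraction(2) ** e2 == Fraction(1, 2))
```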
It is in fact possible to do fixed-precision formatting with a much smaller table: https://jk-jeon.github.io/posts/2022/12/fixed-precision-formatting/ For small enough precisions I think it will be faster than Ryu-printf. Actually, the algorithm operates in different ways for small precisions and large precisions, and the portion for small precisions was recently adapted into fmtlib (fmtlib/fmt#3269). That portion reuses the table used for the shortest roundtrip formatting (which in the case of Dragonbox is of size 9~10KB, or 500~600 bytes with a bit of runtime performance cost). The large-precision case (which should be very rare in practice) can still be done with an additional table of size 580 bytes with 2x-3x slower performance than Ryu-printf, or with a table of size 3~4 KB with similar performance to Ryu-printf, or something in between those two, according to the benchmark I've done. |
This replaces the errol backend with one based on ryu. Only the 128-bit backend is implemented. This supports all floating-point types and does not use fp logic to print. Closes ziglang#1181. Closes ziglang#1299. Closes ziglang#3612.
Bold claim:
paper: https://dl.acm.org/citation.cfm?id=3192369
C implementation: https://github.com/ulfjack/ryu
things to measure: