Implement support for uint64_t values in ICU backend #246

Flamefire · 2024-12-05T19:54:36Z

ICU doesn't support uint64_t directly but provides access to formatting and parsing of decimal number strings.

Use Boost.Charconv to interface with that such that values larger than INT64_MAX can be formatted correctly and parsed at all.

Fixes #235
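
For illustration, here is a minimal sketch of the decimal-string round trip this relies on. It is not the code from this PR: it uses `std::to_chars`/`std::from_chars` from C++17 as a stand-in for the Boost.Charconv functions of the same name, and the helper names `to_decimal_string` and `from_decimal_string` are hypothetical. The ICU side (handing the string to, or receiving it from, the number formatter) is left out.

```cpp
#include <charconv>
#include <cstdint>
#include <string>
#include <system_error>

// Hypothetical helper: format an unsigned 64-bit value as a plain decimal string.
std::string to_decimal_string(std::uint64_t value) {
    char buf[20]; // UINT64_MAX has 20 decimal digits
    const auto res = std::to_chars(buf, buf + sizeof(buf), value);
    return std::string(buf, res.ptr);
}

// Hypothetical helper: parse a plain decimal string.
// Fails on empty input, trailing characters and values above UINT64_MAX.
bool from_decimal_string(const std::string& s, std::uint64_t& out) {
    const char* const first = s.data();
    const char* const last = first + s.size();
    const auto res = std::from_chars(first, last, out);
    return res.ec == std::errc() && res.ptr == last;
}
```

Values above INT64_MAX survive the round trip because they never pass through a signed 64-bit integer.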

mborland · 2025-01-13T14:40:59Z

try_scientific_to_int looks good to me. The one recommendation I would have is to add fuzzing to it. The charconv fuzzer is in this folder: https://github.com/boostorg/charconv/tree/develop/fuzzing, and the associated CI run is https://github.com/boostorg/charconv/blob/develop/.github/workflows/fuzz.yml. In the fuzzing folder the Jamfile is already set up to run every .cpp file in the folder for 30 seconds, and it looks for a text file of the same name in the seed corpus folder. In that seed corpus text file, add examples of what properly formatted input looks like to seed the fuzzer. ChatGPT can probably spit out a good Python program to generate that text file for you. I found lots of small errors throughout charconv with fuzzing that I never would have found myself.

Flamefire · 2025-01-14T19:11:14Z

> try_scientific_to_int looks good to me. The one recommendation I would have is to add fuzzing to it.
> [...]
> I found lots of small errors throughout charconv with fuzzing I never would have found myself.

Not sure what that should show. The input is pretty restricted as it is the intermediate result of ICU, so should already be well formed. A problem I anticipate is a logic bug in the code yielding wrong results but the fuzzer wouldn't find that, would it?
AFAIK fuzzing just triggers asserts and memory errors. Or what kind of issues have you found?

Anyway working on that in a new branch to try it. Can't hurt.

mborland · 2025-01-14T19:35:27Z

> > try_scientific_to_int looks good to me. The one recommendation I would have is to add fuzzing to it.
> > [...]
> > I found lots of small errors throughout charconv with fuzzing I never would have found myself.
>
> Not sure what that should show. The input is pretty restricted as it is the intermediate result of ICU, so should already be well formed. A problem I anticipate is a logic bug in the code yielding wrong results but the fuzzer wouldn't find that, would it? AFAIK fuzzing just triggers asserts and memory errors. Or what kind of issues have you found?
>
> Anyway working on that in a new branch to try it. Can't hurt.

Fuzzing helped me fix logic errors. In the old_crashes folder you'll see things that previously broke the logic or hit asserts. One that might impact you is something like this: https://github.com/boostorg/charconv/blob/develop/fuzzing/old_crashes/fuzz_from_chars_float/crash-9a5e44f5d633b9e38560345546d787a8f6b23ba6. You could try related items like 5E, 5e, 5.5e etc. since those should be malformed for you.
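
To make the "logic errors" point concrete, here is a rough sketch of a libFuzzer-style harness that checks a round-trip property instead of only waiting for crashes. `LLVMFuzzerTestOneInput` is the standard libFuzzer entry point. `parse_candidate` is a hypothetical stand-in for the function under test and simply wraps `std::from_chars` here; the real `try_scientific_to_int` signature is not assumed.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <system_error>

// Hypothetical stand-in for the parser under test.
bool parse_candidate(const std::string& s, std::uint64_t& out) {
    const char* const first = s.data();
    const char* const last = first + s.size();
    const auto res = std::from_chars(first, last, out);
    return res.ec == std::errc() && res.ptr == last;
}

extern "C" int LLVMFuzzerTestOneInput(const std::uint8_t* data, std::size_t size) {
    const std::string input(reinterpret_cast<const char*>(data), size);
    std::uint64_t value = 0;
    if (parse_candidate(input, value)) {
        // Property check: whatever was accepted must format back to a string
        // that parses to the same value. A violation is a logic bug, and
        // abort() makes the fuzzer report it like a crash.
        char buf[20];
        const auto res = std::to_chars(buf, buf + sizeof(buf), value);
        std::uint64_t again = 0;
        if (res.ec != std::errc() ||
            !parse_candidate(std::string(buf, res.ptr), again) ||
            again != value) {
            std::abort();
        }
    }
    return 0; // malformed inputs such as "5E" only need to be rejected cleanly
}
```

Inputs like `5E`, `5e` and `5.5e` would go into the seed corpus so the fuzzer starts from the malformed shapes mentioned above.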

As reported in #235, formatting the first number that no longer fits into int64_t fails to add the thousands separators, i.e.:
`9223372036854775807` -> `9,223,372,036,854,775,807`
`9223372036854775808` -> `9223372036854775808`
Add a test reproducing that for all backends.
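
A sketch of the expected behaviour, using only the standard library rather than Boost.Locale: a custom `std::numpunct` facet groups digits in threes, and both values on either side of the int64_t boundary should come out grouped. The names `comma_grouping` and `format_grouped` are made up for this example and are not part of the PR's tests.

```cpp
#include <cstdint>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: comma as thousands separator, groups of three digits.
struct comma_grouping : std::numpunct<char> {
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Format an unsigned 64-bit value through a stream using the facet above.
std::string format_grouped(std::uint64_t value) {
    std::ostringstream os;
    // The locale takes ownership of the facet.
    os.imbue(std::locale(std::locale::classic(), new comma_grouping));
    os << value;
    return os.str();
}
```

A backend test along these lines would assert that `9223372036854775808` is formatted as `9,223,372,036,854,775,808` and not passed through ungrouped.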

ICU might return 9223372036854775810 as 9.22337203685477581E+18 Use the internal parser of Boost.Charconv to handle this.

`boost::charconv::detail::parser` is not made for parsing (large) integers in exponential notation. It is mainly tested for parsing floating-point numbers in hexadecimal format. Given we know ICU will output either an integer string or a number in "E notation" (1.2E2), we can convert that rather easily to a "regular" integer string by "moving" the dot to the right according to the exponent. The trailing gap is filled with zeros before passing it to `from_chars`, which is then able to handle the range checks for us. This avoids overflows that can happen when multiplying the significand by the power of ten given by the exponent which, due to integer arithmetic, would be cumbersome to guard against. Any situation that could yield a fractional or a too large value can be caught early.
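
A minimal sketch of the "move the dot" idea, under a few assumptions: the input is either a plain digit string or `<digits>[.<digits>]E[+]<digits>` with an uppercase `E`, `std::from_chars` stands in for the Boost.Charconv `from_chars`, and the function names are hypothetical. It is not the implementation from this PR.

```cpp
#include <charconv>
#include <cstdint>
#include <string>
#include <system_error>

// Parse a digit string completely; from_chars does the range check.
bool parse_all(const std::string& digits, std::uint64_t& out) {
    const char* const first = digits.data();
    const char* const last = first + digits.size();
    const auto res = std::from_chars(first, last, out);
    return res.ec == std::errc() && res.ptr == last;
}

// Hypothetical helper: convert "x.yyyEz" into an integer string, then parse it.
bool scientific_to_uint64(const std::string& s, std::uint64_t& out) {
    const auto e_pos = s.find('E');
    if (e_pos == std::string::npos) return parse_all(s, out);

    // Exponent: optional '+', then digits up to the end of the input.
    std::size_t exp_begin = e_pos + 1;
    if (exp_begin < s.size() && s[exp_begin] == '+') ++exp_begin;
    int exponent = 0;
    const char* const exp_first = s.data() + exp_begin;
    const char* const exp_last = s.data() + s.size();
    const auto exp_res = std::from_chars(exp_first, exp_last, exponent);
    if (exp_res.ec != std::errc() || exp_res.ptr != exp_last) return false;
    // 10^20 exceeds UINT64_MAX. Simplification: a zero significand with a
    // larger exponent is rejected as well.
    if (exponent < 0 || exponent > 19) return false;

    const std::string mantissa = s.substr(0, e_pos);
    const auto dot = mantissa.find('.');
    const std::string int_part = mantissa.substr(0, dot);
    std::string frac = (dot == std::string::npos) ? std::string() : mantissa.substr(dot + 1);
    while (!frac.empty() && frac.back() == '0') frac.pop_back(); // "1.50E1" == 15
    if (int_part.empty()) return false;
    // More fractional digits than the exponent covers: not an integer.
    if (frac.size() > static_cast<std::size_t>(exponent)) return false;

    // "Move" the dot to the right and fill the trailing gap with zeros.
    const std::string digits = int_part + frac + std::string(exponent - frac.size(), '0');
    return parse_all(digits, out);
}
```

For `9.22337203685477581E+18` this builds the string `9223372036854775810` and leaves overflow detection entirely to `from_chars`.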

Instead of filling a temporary buffer we can decompose a number like "x.yyyEz" to "(x * 10^3 + yyy) * 10^(z - 3)", i.e. we subtract from the exponent the number of fractional digits, which is the exponent required to turn the fractional significand into an integer. For the simple case of "xEz" we just do "x * 10^z". This requires additional range checks before multiplying but avoids extra memory accesses.
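
The arithmetic part of that decomposition could look roughly like the sketch below. It assumes the significand digits ("xyyy") were already parsed into an integer and that the number of fractional digits is known. The name `apply_exponent` and its parameters are hypothetical and not taken from the PR.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: compute significand * 10^(exponent - frac_digits),
// where significand is "xyyy" read as an integer and frac_digits is the
// number of digits that were behind the dot.
bool apply_exponent(std::uint64_t significand, int frac_digits, int exponent,
                    std::uint64_t& out) {
    // A remaining exponent below zero would yield a fractional value.
    if (frac_digits < 0 || exponent < frac_digits) return false;
    int remaining = exponent - frac_digits;

    if (significand == 0) { // zero stays zero for any exponent
        out = 0;
        return true;
    }
    constexpr std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
    for (; remaining > 0; --remaining) {
        // Range check before each multiplication so it can never wrap around.
        if (significand > max / 10) return false;
        significand *= 10;
    }
    out = significand;
    return true;
}
```

For "1.234E5" this would be called as `apply_exponent(1234, 3, 5, out)`, giving `1234 * 10^2`. The loop exits after at most 20 iterations for a non-zero significand because the range check fails first.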

codecov · 2025-01-16T17:29:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.97%. Comparing base (5f24abe) to head (d56bad6).
Report is 11 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #246      +/-   ##
===========================================
+ Coverage    95.87%   95.97%   +0.09%     
===========================================
  Files          119      119              
  Lines        10410    10611     +201     
===========================================
+ Hits          9981    10184     +203     
+ Misses         429      427       -2

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/icu/formatter.cpp | `82.45% <100.00%> (+1.93%)` | ⬆️ |
| src/icu/formatter.hpp | `100.00% <ø> (ø)` | |
| src/icu/numeric.cpp | `95.68% <100.00%> (-0.15%)` | ⬇️ |
| src/util/numeric_conversion.hpp | `100.00% <100.00%> (ø)` | |
| test/formatting_common.hpp | `100.00% <100.00%> (ø)` | |
| test/show_config.cpp | `100.00% <ø> (ø)` | |
| test/test_formatting.cpp | `100.00% <100.00%> (ø)` | |
| test/test_posix_formatting.cpp | `100.00% <100.00%> (ø)` | |
| test/test_std_formatting.cpp | `100.00% <100.00%> (ø)` | |
| test/test_util_numeric_convert.cpp | `100.00% <100.00%> (ø)` | |

... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Flamefire force-pushed the fix-large-number-icu branch 4 times, most recently from 4e98bb5 to 0936dae on January 6, 2025 08:43

Flamefire force-pushed the fix-large-number-icu branch 5 times, most recently from da2e86d to b7b933b on January 12, 2025 16:32

Flamefire added 9 commits January 15, 2025 13:57

GHA: Show output of all runs of binaries in test folder

b017126

Implement support for uint64_t values in ICU backend

42e65d0

ICU doesn't support uint64_t directly but provides access to formatting and parsing of decimal number strings. Use Boost.Charconv to interface with that. Fixes #235

Handle ICU version that keep parsed number in scientific format

3536525

ICU might return 9223372036854775810 as 9.22337203685477581E+18 Use the internal parser of Boost.Charconv to handle this.

Remove left-over condition

8019e88

Fix test failure for POSIX formatting when locale is not available

d03c4ed

Falls back to C locale where output doesn't match.

Reduce verbosity of formatting_common test

86e59b6

Flamefire force-pushed the fix-large-number-icu branch from f20a688 to b1c49f6 on January 15, 2025 13:41

boostorg deleted a comment from codecov bot Jan 15, 2025

Add some tests for missed cases

d56bad6

- Date formatting for UInt64
- Error cases
- Practically unreachable cases

Flamefire force-pushed the fix-large-number-icu branch from b1c49f6 to d56bad6 on January 16, 2025 08:11

boostorg deleted a comment from codecov bot Jan 16, 2025

Flamefire merged commit ba5ff7b into develop Jan 16, 2025
54 checks passed

Flamefire deleted the fix-large-number-icu branch January 16, 2025 18:10
