Implement support for uint64_t values in ICU backend #246

Merged
merged 10 commits into from
Jan 16, 2025

Conversation

Flamefire
Collaborator

ICU doesn't support uint64_t directly but provides access to formatting and parsing of decimal number strings.

Use Boost.Charconv to interface with that such that values larger than INT64_MAX can be formatted correctly and parsed at all.

Fixes #235

@Flamefire Flamefire force-pushed the fix-large-number-icu branch 4 times, most recently from 4e98bb5 to 0936dae Compare January 6, 2025 08:43
@Flamefire Flamefire force-pushed the fix-large-number-icu branch 5 times, most recently from da2e86d to b7b933b Compare January 12, 2025 16:32
@mborland
Member

try_scientific_to_int looks good to me. The one recommendation I would have is to add fuzzing to it. The charconv fuzzer is in this folder: https://github.com/boostorg/charconv/tree/develop/fuzzing, and the associated CI run is https://github.com/boostorg/charconv/blob/develop/.github/workflows/fuzz.yml. In the fuzzing folder, the Jamfile is already set up to run every .cpp file in the folder for 30 seconds, and it looks for a text file of the same name in the seed corpus folder. In that seed corpus text file, add examples of properly formatted input to seed the fuzzer. ChatGPT can probably spit out a good Python program to generate that text file for you. With fuzzing I found lots of small errors throughout charconv that I never would have found myself.

@Flamefire
Collaborator Author

try_scientific_to_int looks good to me. The one recommendation I would have is to add fuzzing to it.
[...]
I found lots of small errors throughout charconv with fuzzing I never would have found myself.

Not sure what that should show. The input is pretty restricted as it is the intermediate result of ICU, so should already be well formed. A problem I anticipate is a logic bug in the code yielding wrong results but the fuzzer wouldn't find that, would it?
AFAIK fuzzing just triggers asserts and memory errors. Or what kind of issues have you found?

Anyway working on that in a new branch to try it. Can't hurt.

@mborland
Member

Fuzzing helped me fix logic errors. In the old_crashes folder you'll see things that previously broke the logic or hit asserts. One that might impact you is something like this: https://github.com/boostorg/charconv/blob/develop/fuzzing/old_crashes/fuzz_from_chars_float/crash-9a5e44f5d633b9e38560345546d787a8f6b23ba6. You could try related items like 5E, 5e, 5.5e etc. since those should be malformed for you.
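To illustrate how a fuzzer can surface logic errors and not only crashes, here is a minimal libFuzzer-style harness that checks a round-trip property. It is a generic sketch and not the charconv fuzzer: it uses `std::from_chars` and `std::to_chars` as stand-ins for the function under test, and the choice of property (parse, print, compare with the input minus leading zeros) is an assumption made for this example.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string_view>
#include <system_error>

// Entry point that libFuzzer calls once per generated input.
extern "C" int LLVMFuzzerTestOneInput(const std::uint8_t* data, std::size_t size)
{
    const char* const first = reinterpret_cast<const char*>(data);
    const char* const last = first + size;

    std::uint64_t value = 0;
    const auto parsed = std::from_chars(first, last, value);
    if (parsed.ec != std::errc() || parsed.ptr != last)
        return 0; // Input was rejected as a whole: nothing further to check.

    // Property: printing the parsed value must reproduce the input,
    // ignoring leading zeros. A wrong parse result violates this.
    char buffer[32];
    const auto printed = std::to_chars(buffer, buffer + sizeof(buffer), value);
    if (printed.ec != std::errc())
        std::abort();

    std::string_view input(first, size);
    while (input.size() > 1 && input.front() == '0')
        input.remove_prefix(1);

    const std::string_view output(buffer, static_cast<std::size_t>(printed.ptr - buffer));
    if (input != output)
        std::abort(); // Reported by the fuzzer like any other crash.
    return 0;
}
```

The `std::abort()` on a violated property is what turns a wrong result into something the fuzzer reports, so the oracle does the work that an assert would do in a unit test.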

As reported in #235, formatting the first number that no longer fits into
int64_t fails to add the thousands separators, i.e.:
`9223372036854775807` -> `9,223,372,036,854,775,807`
`9223372036854775808` -> `9223372036854775808`
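
For reference, the grouping expected for both values can be sketched with a standalone helper that works on the decimal digits and so does not depend on the value fitting into int64_t. `group_thousands` is a hypothetical name for this example, not a Boost.Locale function, and the fixed `,` separator with groups of three is an assumption matching the output shown above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Boost.Locale): inserts `sep` between
// groups of three digits, counted from the right.
std::string group_thousands(std::uint64_t value, char sep = ',')
{
    const std::string digits = std::to_string(value);
    std::string result;
    result.reserve(digits.size() + digits.size() / 3);
    for (std::size_t i = 0; i < digits.size(); ++i) {
        // A separator goes before every digit whose distance to the end is
        // a multiple of three, except before the very first digit.
        if (i != 0 && (digits.size() - i) % 3 == 0)
            result += sep;
        result += digits[i];
    }
    return result;
}
```

With this, `group_thousands(9223372036854775808ULL)` yields `9,223,372,036,854,775,808`, which is what the new test expects from every backend.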

Add a test reproducing that for all backends.
ICU doesn't support uint64_t directly but provides access to formatting
and parsing of decimal number strings.
Use Boost.Charconv to interface with that.

Fixes #235
ICU might return 9223372036854775810 as 9.22337203685477581E+18
Use the internal parser of Boost.Charconv to handle this.
`boost::charconv::detail::parser` is not made for parsing (large)
integers in exponential notation.
It is mainly tested for parsing floating point numbers in hexadecimal format.

Given we know ICU will output either an integer string or a number in
"E notation" (1.2E2) we can convert that rather easily to a "regular"
integer string by "moving" the dot to the right according to the
exponent. The trailing gap is filled with zeros before passing it to
`from_chars` which is now able to handle the range checks for us.

This avoids the overflows that can happen when multiplying the
significand by the power of ten given by the exponent, which would be
cumbersome to guard against in integer arithmetic.
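
A minimal sketch of this "move the dot" idea, using `std::from_chars` from the standard library in place of the Boost.Charconv one. All function names are hypothetical and this is not the code from the PR. Assumptions for the sketch: the exponent marker is an uppercase `E`, the exponent is non-negative with at most two digits, and anything else is rejected.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// True if `s` consists solely of ASCII decimal digits (also true when empty).
bool is_all_digits(std::string_view s)
{
    for (const char c : s) {
        if (c < '0' || c > '9')
            return false;
    }
    return true;
}

// Parses a plain digit string; fails on empty input, stray characters or overflow.
std::optional<std::uint64_t> parse_plain_uint64(std::string_view s)
{
    std::uint64_t value = 0;
    const char* const end = s.data() + s.size();
    const auto res = std::from_chars(s.data(), end, value);
    if (res.ec != std::errc() || res.ptr != end)
        return std::nullopt;
    return value;
}

std::optional<std::uint64_t> sci_string_to_uint64(std::string_view s)
{
    const std::size_t e_pos = s.find('E');
    if (e_pos == std::string_view::npos)
        return parse_plain_uint64(s); // Plain integer string.

    const std::string_view mantissa = s.substr(0, e_pos);
    std::string_view exp_str = s.substr(e_pos + 1);
    if (!exp_str.empty() && exp_str.front() == '+')
        exp_str.remove_prefix(1);
    // Simplification: only non-negative exponents with at most two digits.
    if (exp_str.empty() || exp_str.size() > 2 || !is_all_digits(exp_str))
        return std::nullopt;
    const std::optional<std::uint64_t> exponent = parse_plain_uint64(exp_str);
    if (!exponent)
        return std::nullopt;

    const std::size_t dot = mantissa.find('.');
    const std::string_view int_part = mantissa.substr(0, dot);
    std::string_view frac_part =
        (dot == std::string_view::npos) ? std::string_view{} : mantissa.substr(dot + 1);
    if (int_part.empty() || !is_all_digits(int_part) || !is_all_digits(frac_part))
        return std::nullopt;

    // Trailing zeros of the fraction do not change the value.
    while (!frac_part.empty() && frac_part.back() == '0')
        frac_part.remove_suffix(1);
    // More fractional digits than the exponent can shift: not an integer.
    if (frac_part.size() > *exponent)
        return std::nullopt;

    // "Move" the dot: concatenate the digits and fill the gap with zeros.
    std::string digits(int_part);
    digits += frac_part;
    digits.append(static_cast<std::size_t>(*exponent) - frac_part.size(), '0');
    return parse_plain_uint64(digits); // from_chars does the range check.
}
```

For example, `sci_string_to_uint64("9.22337203685477581E+18")` builds the string `9223372036854775810` and parses it, while `2E19` is rejected by `from_chars` as out of range and `1.25E1` is rejected early because it would be fractional.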

Any situation that could yield a fractional or a too large value can be caught early.
Instead of filling a temporary buffer we can decompose a number like
"x.yyyEz" into "(x * 10^3 + yyy) * 10^(z - 3)".
I.e. we subtract from the exponent the number of fractional digits, which is
the power of ten required to turn the fractional significand into an integer.
For the simple case of "xEz" we just compute "x * 10^z".

This requires additional range checks before multiplying but avoids
extra memory accesses.
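
The decomposition and its range checks can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the input is taken as already split into integer digits, fractional digits and a non-negative exponent, trailing zeros in the fraction are not stripped, and both function names are hypothetical. The multiplication by a power of ten is done one digit at a time so that each step can be checked.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <limits>
#include <optional>
#include <string_view>

// Computes value * 10 + digit in place; returns false instead of overflowing.
bool append_digit(std::uint64_t& value, unsigned digit)
{
    constexpr std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
    if (value > (max - digit) / 10)
        return false;
    value = value * 10 + digit;
    return true;
}

// Evaluates "x.yyyEz" as (x * 10^len(yyy) + yyy) * 10^(z - len(yyy)).
std::optional<std::uint64_t> compose_uint64(std::string_view int_digits,
                                            std::string_view frac_digits,
                                            unsigned exponent)
{
    // Caught early: malformed input, or a result that would be fractional.
    if (int_digits.empty() || frac_digits.size() > exponent)
        return std::nullopt;

    // Build the integer significand x * 10^len(yyy) + yyy.
    std::uint64_t value = 0;
    for (const std::string_view part : {int_digits, frac_digits}) {
        for (const char c : part) {
            if (c < '0' || c > '9' || !append_digit(value, static_cast<unsigned>(c - '0')))
                return std::nullopt;
        }
    }

    // Apply the remaining exponent z - len(yyy), checking every step.
    for (std::size_t i = frac_digits.size(); i < exponent; ++i) {
        if (value == 0)
            break; // Zero stays zero no matter how large the exponent is.
        if (!append_digit(value, 0))
            return std::nullopt; // Too large for uint64_t.
    }
    return value;
}
```

For `9.22337203685477581E+18` the call would be `compose_uint64("9", "22337203685477581", 18)`: the significand becomes 922337203685477581 and the remaining exponent of 1 gives 9223372036854775810.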
Falls back to C locale where output doesn't match.
@Flamefire Flamefire force-pushed the fix-large-number-icu branch from f20a688 to b1c49f6 Compare January 15, 2025 13:41
@boostorg boostorg deleted a comment from codecov bot Jan 15, 2025
- Date formatting for UInt64
- Error cases
- Practically unreachable cases
@Flamefire Flamefire force-pushed the fix-large-number-icu branch from b1c49f6 to d56bad6 Compare January 16, 2025 08:11
@boostorg boostorg deleted a comment from codecov bot Jan 16, 2025

codecov bot commented Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.97%. Comparing base (5f24abe) to head (d56bad6).
Report is 11 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #246      +/-   ##
===========================================
+ Coverage    95.87%   95.97%   +0.09%     
===========================================
  Files          119      119              
  Lines        10410    10611     +201     
===========================================
+ Hits          9981    10184     +203     
+ Misses         429      427       -2     

| Files with missing lines | Coverage Δ |
|---|---|
| src/icu/formatter.cpp | 82.45% <100.00%> (+1.93%) ⬆️ |
| src/icu/formatter.hpp | 100.00% <ø> (ø) |
| src/icu/numeric.cpp | 95.68% <100.00%> (-0.15%) ⬇️ |
| src/util/numeric_conversion.hpp | 100.00% <100.00%> (ø) |
| test/formatting_common.hpp | 100.00% <100.00%> (ø) |
| test/show_config.cpp | 100.00% <ø> (ø) |
| test/test_formatting.cpp | 100.00% <100.00%> (ø) |
| test/test_posix_formatting.cpp | 100.00% <100.00%> (ø) |
| test/test_std_formatting.cpp | 100.00% <100.00%> (ø) |
| test/test_util_numeric_convert.cpp | 100.00% <100.00%> (ø) |

... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Flamefire Flamefire merged commit ba5ff7b into develop Jan 16, 2025
54 checks passed
@Flamefire Flamefire deleted the fix-large-number-icu branch January 16, 2025 18:10
Successfully merging this pull request may close these issues.

std::uint64_t numbers above a certain value are not formatted correctly
2 participants