feat: add rounding logic and scale zero fix parse_decimal to match parse_string_to_decimal_native behavior #7179

Open
wants to merge 3 commits into
base: main

Conversation

himadripal
Contributor

@himadripal himadripal commented Feb 23, 2025

Which issue does this PR close?

A few important considerations:

  • The existing string-to-decimal conversion uses parse_string_to_decimal_native.
  • parse_string_to_decimal_native does not support e-notation.
  • parse_string_to_decimal_native rounds at the target scale rather than truncating.
  • parse_decimal, an existing method, supports e-notation and is used elsewhere.
  • Fix: Support for e notation using parse_decimal in string to decimal conversion #6905 added rounding support to parse_decimal.
  • #6905 also moved string-to-decimal conversion to parse_decimal to get e-notation support.

This PR is the second one split out of #6905; it adds rounding logic to parse_decimal to match the behavior of the existing parse_string_to_decimal_native.

Closes #.

Rationale for this change

At present, string-to-decimal conversion does not support e-notation: in arrow, parse_string_to_decimal_native is called for generic string-to-decimal conversion. parse_decimal, on the other hand, is used by the generic parse method and supports e-notation. This PR adds rounding and scale-0 handling to parse_decimal to match the behavior of parse_string_to_decimal_native. We can then replace the parse_string_to_decimal_native call with parse_decimal and get e-notation support as well.

What changes are included in this PR?

Are there any user-facing changes?

@himadripal himadripal force-pushed the fix_parse_decimal_for_rounding_scale_zero branch from 7e598c9 to bef2992 Compare February 27, 2025 07:53
@alamb alamb changed the title feat: add rounding logic and scale zero fix fro parse_decimal to match parse_string_to_decimal_native behavior feat: add rounding logic and scale zero fix parse_decimal to match parse_string_to_decimal_native behavior Mar 17, 2025
Contributor

@alamb alamb left a comment

The title of this PR now says it adds rounding logic and makes the logic consistent which sounds good. However, I don't see a ticket that describes the problem.

This PR's description still says it

Closes apache/datafusion#10315

and I did see code changes in the e parsing code but I didn't see any tests 🤔 I don't think we can merge code without tests.

I am sorry to be so pedantic, but arrow-rs is used by many projects now and so evaluating and minimizing downstream impacts is very important. I am trying to avoid the overhead of dealing with releasing regressions like

And I also apologize for the length between review cycles, but as we have mentioned many times, our review bandwidth is very limited.

@@ -850,7 +850,16 @@ fn parse_e_notation<T: DecimalType>(
 }

 if exp < 0 {
-result = result.div_wrapping(base.pow_wrapping(-exp as _));
+let result_with_scale = result.div_wrapping(base.pow_wrapping(-exp as _));
Contributor

does this change the behavior of parsing e notation? If so I didn't see any tests

Contributor Author

@himadripal himadripal Mar 18, 2025

Yes. I missed porting the tests while splitting the large PR. It now rounds instead of truncating (the current behavior). I'll add the tests.

@@ -598,7 +599,20 @@ mod tests {
0_i128
);
assert_eq!(
parse_decimal::<Decimal128Type>("0", 38, 0)?,
parse_string_to_decimal_native::<Decimal128Type>("0", 0)?,
Contributor

Having the same behavior in these two functions seems like a reasonable change to me

Contributor Author

@himadripal himadripal Mar 18, 2025

Once we are able to move casting to parse_decimal and deprecate parse_string_to_decimal_native, these tests will be changed to assert the value in the message section of the assert.

@@ -1286,7 +1286,7 @@ mod tests {
 assert_eq!("53.002666", lat.value_as_string(1));
 assert_eq!("52.412811", lat.value_as_string(2));
 assert_eq!("51.481583", lat.value_as_string(3));
-assert_eq!("12.123456", lat.value_as_string(4));
+assert_eq!("12.123457", lat.value_as_string(4));
Contributor Author

@himadripal himadripal Mar 18, 2025

@alamb you can see the behavior change in this test of arrow-csv reader which uses parse_decimal

@himadripal
Contributor Author

> The title of this PR now says it adds rounding logic and makes the logic consistent which sounds good. However, I don't see a ticket that describes the problem.
>
> This PR's description still says it
>
> Closes apache/datafusion#10315

Added an issue in arrow-rs #7355

@himadripal
Contributor Author

> and I did see code changes in the e parsing code but I didn't see any tests 🤔 I don't think we can merge code without tests.

Added e-notation tests

@himadripal
Contributor Author

himadripal commented Mar 28, 2025

> I am sorry to be so pedantic, but arrow-rs is used by many projects now and so evaluating and minimizing downstream impacts is very important. I am trying to avoid the overhead of dealing with releasing regressions like

I apologize for creating this extra overhead. I will be more careful in the future.

@himadripal
Contributor Author

> And I also apologize for the length between review cycles, but as we have mentioned many times, our review bandwidth is very limited.

I understand and will keep this in mind in the future.

Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Thanks @himadripal

-result = result.div_wrapping(base.pow_wrapping(-exp as _));
+let result_with_scale = result.div_wrapping(base.pow_wrapping(-exp as _));
+let result_with_one_scale_up =
+    result.div_wrapping(base.pow_wrapping(-exp.add_wrapping(1) as _));
Contributor

I assume this logic is correct; this is just for my understanding. E.g. for 12345e-5, would exp be -5? Why is this adding 1?

Contributor Author

@himadripal himadripal Apr 1, 2025

exp in the parse_e_notation method is overridden a couple of times, depending on which direction the decimal point needs to shift and whether the original string has a fractional part (e.g. 1.23e-2 has 2 fractional digits).

Before this check, exp represents the number of digits to be removed or added. In this case, exp = -3.

Now:
result_with_scale = 12
result_with_one_scale_up = 123

To round up or down, we need to capture the digit right after the last digit kept in the result, in this case 3. We get it as follows:
rounding_digit = result_with_one_scale_up - result_with_scale * 10
rounding_digit = 123 - 12 * 10 = 3

If rounding_digit >= 5, we add 1 to the result;
otherwise the result remains intact.

[Image: 3-31-25 at 5 30 PM]
I added a debugging screenshot to help understand it more.
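
The arithmetic described above can be sketched in plain Rust. This is only an illustration under stated assumptions: it uses i128 with ordinary division instead of the DecimalType native type and the wrapping operations in the PR, it handles non-negative values only, and the helper name drop_digits_rounded is hypothetical.

```rust
/// Hypothetical helper (not part of arrow-rs): drops `drop` trailing decimal
/// digits from a non-negative `value` and rounds half up.
/// Assumes `drop` is small enough that 10^drop fits in an i128.
fn drop_digits_rounded(value: i128, drop: u32) -> i128 {
    if drop == 0 {
        // Nothing to remove, so nothing to round.
        return value;
    }
    let base: i128 = 10;
    // Quotient at the target scale, e.g. 12345 / 10^3 = 12.
    let result_with_scale = value / base.pow(drop);
    // Quotient with one extra digit kept, e.g. 12345 / 10^2 = 123.
    let result_with_one_scale_up = value / base.pow(drop - 1);
    // The extra digit is the rounding digit, e.g. 123 - 12 * 10 = 3.
    let rounding_digit = result_with_one_scale_up - result_with_scale * 10;
    if rounding_digit >= 5 {
        result_with_scale + 1
    } else {
        result_with_scale
    }
}
```

With the values from this comment, drop_digits_rounded(12345, 3) keeps 12 because the rounding digit is 3.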

result = result.div_wrapping(base.pow_wrapping(fractionals as u32))
}
//add one if >=5
if rounding_digit >= 5 {
Contributor

Wondering where >= 5 came from?

Contributor Author

@himadripal himadripal Apr 1, 2025

First we determine the rounding_digit: the digit right after the last digit of the final result (before any rounding is applied). If rounding_digit >= 5, we add 1 to round the result up; otherwise it stays the same.

"1265E-4" -> with scale 3 -> 0.127
At scale 3 the truncated number is 0.126 and the rounding digit is 5; since the rounding digit is >= 5, the result becomes 0.127.

"1264E-4" -> with scale 3 -> 0.126
Here rounding_digit is 4, which is less than 5, so there is no need to add 1.

Contributor Author

@himadripal himadripal Apr 1, 2025

>= 5 is used to round to the nearest value at the target scale (a rounding digit of 5 or more rounds up).

With scale 1:
2.47 -> 2.5
2.44 -> 2.4
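
The examples in this thread can be reproduced with a small self-contained parser. This is a toy sketch, not the parse_decimal implementation in arrow-rs: the function name parse_scaled is hypothetical, it accepts unsigned input only, and it returns the value as an i128 scaled by 10^scale (so 0.127 at scale 3 is 127).

```rust
/// Toy parser (not arrow-rs's `parse_decimal`): converts an unsigned decimal
/// string with an optional exponent, e.g. "1265E-4", into an integer holding
/// the value at `scale` fractional digits, rounding half up.
/// Returns None on malformed input or i128 overflow.
fn parse_scaled(s: &str, scale: u32) -> Option<i128> {
    // Split off the exponent, if any.
    let (mantissa, exp) = match s.find(|c: char| c == 'e' || c == 'E') {
        Some(i) => (&s[..i], s[i + 1..].parse::<i32>().ok()?),
        None => (s, 0),
    };
    // Split the mantissa into integer and fractional digits.
    let (int_part, frac_part) = match mantissa.find('.') {
        Some(i) => (&mantissa[..i], &mantissa[i + 1..]),
        None => (mantissa, ""),
    };
    let digits = format!("{int_part}{frac_part}");
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let value: i128 = digits.parse().ok()?;
    // Positive shift: zeros to append. Negative shift: digits to drop.
    let shift = scale as i64 + exp as i64 - frac_part.len() as i64;
    if shift >= 0 {
        return value.checked_mul(10i128.checked_pow(shift as u32)?);
    }
    let divisor = 10i128.checked_pow((-shift) as u32)?;
    let truncated = value / divisor;
    // First dropped digit decides the rounding direction.
    let rounding_digit = (value / (divisor / 10)) % 10;
    Some(if rounding_digit >= 5 { truncated + 1 } else { truncated })
}
```

For instance, parse_scaled("1265E-4", 3) yields Some(127) and parse_scaled("1264E-4", 3) yields Some(126), matching the 0.127 and 0.126 results above.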

Contributor

Ah ok, makes sense

Contributor

Perhaps we could add a comment in the code to explain this point to future readers who may have the same question

Contributor

@comphead comphead left a comment

Thanks @himadripal. I do like the tests; perhaps we can also add tests for rounding negative decimals?

Tests for very large or very small numbers would also be beneficial. Here is what comes to mind, with the help of ChatGPT:

// **Very Large Numbers**
        assert_eq!(round_to_places(1e15, 2), 1e15); // Large integer should remain unchanged
        assert_eq!(round_to_places(9999999999999.987, 2), 10000000000000.00);
        assert_eq!(round_to_places(-1e15, 3), -1e15);

        // **Very Small Numbers (Near Zero)**
        assert_eq!(round_to_places(1e-15, 10), 0.0000000000); // Rounds to zero at precision 10
        assert_eq!(round_to_places(-1e-15, 10), -0.0000000000); // Rounds to zero
        assert_eq!(round_to_places(0.000000000123456, 12), 0.000000000123); // Should retain up to 12 decimal places
        
        // **Extreme Edge Cases**
        assert_eq!(round_to_places(f64::MAX, 2), f64::MAX); // Maximum f64 value should remain the same
        assert_eq!(round_to_places(f64::MIN, 2), f64::MIN); // Minimum f64 value should remain the same
        assert!(round_to_places(f64::NAN, 2).is_nan()); // NaN should remain NaN
        assert_eq!(round_to_places(f64::INFINITY, 2), f64::INFINITY); // Infinity should remain Infinity
        assert_eq!(round_to_places(f64::NEG_INFINITY, 2), f64::NEG_INFINITY); // Negative Infinity should remain unchanged
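
The suggestions above target an f64 helper. For the integer arithmetic discussed in this thread, negative and boundary cases could be exercised along the following lines. This is only a sketch: round_half_away is a hypothetical helper on i128, and rounding the magnitude before restoring the sign (half away from zero) is an assumption here, not confirmed behavior of parse_decimal.

```rust
/// Hypothetical helper: drops `drop` trailing decimal digits from `value`,
/// rounding the magnitude half up and then restoring the sign
/// (i.e. half away from zero). Returns None if 10^drop or |value|
/// does not fit in an i128.
fn round_half_away(value: i128, drop: u32) -> Option<i128> {
    if drop == 0 {
        return Some(value);
    }
    let divisor = 10i128.checked_pow(drop)?;
    // i128::MIN has no positive counterpart, so this can fail.
    let magnitude = value.checked_abs()?;
    let truncated = magnitude / divisor;
    // First dropped digit decides the rounding direction.
    let rounding_digit = (magnitude / (divisor / 10)) % 10;
    let rounded = if rounding_digit >= 5 { truncated + 1 } else { truncated };
    Some(if value < 0 { -rounded } else { rounded })
}
```

Under that assumption, -2.47 at scale 1 (round_half_away(-247, 1)) gives -25, and very small magnitudes such as round_half_away(4, 1) collapse to 0.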

@alamb
Contributor

alamb commented Apr 1, 2025

@himadripal -- I am preparing to create a new release hopefully tomorrow. Can you please address @comphead 's testing suggestions soon so we can get this PR into that release?

@himadripal
Contributor Author

himadripal commented Apr 1, 2025

> @himadripal -- I am preparing to create a new release hopefully tomorrow. Can you please address @comphead 's testing suggestions soon so we can get this PR into that release?

@comphead and @alamb, there are existing edge-case tests here

I'll add more from @comphead's list today.

One more clarifying point: although the two do not have to go in together, my goal for this change was to support scientific notation in DataFusion (datafusion#10315). For that we also need to merge #7191, which moves the cast from parse_string_to_decimal_native to parse_decimal.

Labels
api-change Changes to the arrow API arrow Changes to the arrow crate next-major-release the PR has API changes and it waiting on the next major version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add rounding logic and scale zero fix in parse_decimal to match parse_string_to_decimal_native behavior
5 participants