-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add rounding logic and scale zero fix parse_decimal to match parse_string_to_decimal_native behavior #7179
base: main
Are you sure you want to change the base?
Conversation
bbd54d4
to
7b33527
Compare
7e598c9
to
bef2992
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The title of this PR now says it adds rounding logic and makes the logic consistent which sounds good. However, I don't see a ticket that describes the problem.
This PR's description still says it
Closes apache/datafusion#10315
and I did see code changes in the e
parsing code but I didn't see any tests 🤔 I don't think we can merge code without tests.
I am sorry to be so pedantic, but arrow-rs is used by many projects now and so evaluating and minimizing downstream impacts is very important. I am trying to avoid the overhead of dealing with releasing regressions like
And I also apologize for the length between review cycles, but as we have mentioned many times, our review bandwidth is very limited.
@@ -850,7 +850,16 @@ fn parse_e_notation<T: DecimalType>( | |||
} | |||
|
|||
if exp < 0 { | |||
result = result.div_wrapping(base.pow_wrapping(-exp as _)); | |||
let result_with_scale = result.div_wrapping(base.pow_wrapping(-exp as _)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this change the behavior of parsing e notation? If so I didn't see any tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I missed porting the tests while splitting the large PR. It rounds instead of current behavior - truncate. I'll add the tests.
@@ -598,7 +599,20 @@ mod tests { | |||
0_i128 | |||
); | |||
assert_eq!( | |||
parse_decimal::<Decimal128Type>("0", 38, 0)?, | |||
parse_string_to_decimal_native::<Decimal128Type>("0", 0)?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the same behavior in these two functions seems like a reasonable change to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once we are able to move to using parse_decimal for casting and deprecate parse_string_to_decimal_native , these tests will be changed to assert the value in the message section of the assert.
@@ -1286,7 +1286,7 @@ mod tests { | |||
assert_eq!("53.002666", lat.value_as_string(1)); | |||
assert_eq!("52.412811", lat.value_as_string(2)); | |||
assert_eq!("51.481583", lat.value_as_string(3)); | |||
assert_eq!("12.123456", lat.value_as_string(4)); | |||
assert_eq!("12.123457", lat.value_as_string(4)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb you can see the behavior change in this test of arrow-csv reader which uses parse_decimal
Added an issue in arrow-rs #7355 |
Added e-notation tests |
I apologize for making this extra overhead. will be careful in future |
I understand, will keep this in mind in future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @himadripal
result = result.div_wrapping(base.pow_wrapping(-exp as _)); | ||
let result_with_scale = result.div_wrapping(base.pow_wrapping(-exp as _)); | ||
let result_with_one_scale_up = | ||
result.div_wrapping(base.pow_wrapping(-exp.add_wrapping(1) as _)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this logic is correct, but just for me to understand. E.g. for 12345e-5
, would exp
be -5
? why is this adding 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exp
in the parse_e_notation
method is being overriden couple of times based on which direction the decimal needs to shift and if the original string has fractional in it. ( i.e 1.23e-2 has 2 fractional digits).
before this check, exp
represents number of digits to be removed or added. In this case, exp = -3
now, result_with_scale = 12
result_with_one_scale_up=123
to round up or down, we need to capture the digit next to last digit in the result, in this case 3. How we get it is
rounding_digit= result_with_one_scale_up - result_with_scale * 10
rounding_digit=123- 12*10 = 3
if rounding_digit >=5 then we add +1 to the result
else result remains intact.
result = result.div_wrapping(base.pow_wrapping(fractionals as u32)) | ||
} | ||
//add one if >=5 | ||
if rounding_digit >= 5 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering where >= 5
came from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first we figure out what is the rounding_digit - digit which is next to the last digit in the final result (without rounding logic applied), if the value of the rounding_digit is >=5, then we add +1 to round up the result, else it remains same.
"1265E-4" -> with scale 3 -> 0.127
in scale 3 the number would be 0.126 and rounding digit will be 5, as rounding digit >= 5, the result becomes 0.127
1264E-4" -> with scale 3 -> 0.126
here rounding_digit is 4, which is less than 5, so no need to add 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>= 5
is being used for rounding to the nearest integer
with scale 1
2.47 -> 2.5
2.44 -> 2.4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could add a comment in the code to explain this point to future readers who may have the same question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @himadripal I do like tests, perhaps we can also add the tests for negative decimals roundings?
Also tests for very big numbers, or very small would be beneficial. What comes to my mind with help of chatGpt
// **Very Large Numbers**
assert_eq!(round_to_places(1e15, 2), 1e15); // Large integer should remain unchanged
assert_eq!(round_to_places(9999999999999.987, 2), 10000000000000.00);
assert_eq!(round_to_places(-1e15, 3), -1e15);
// **Very Small Numbers (Near Zero)**
assert_eq!(round_to_places(1e-15, 10), 0.0000000000); // Rounds to zero at precision 10
assert_eq!(round_to_places(-1e-15, 10), -0.0000000000); // Rounds to zero
assert_eq!(round_to_places(0.000000000123456, 12), 0.000000000123); // Should retain up to 12 decimal places
// **Extreme Edge Cases**
assert_eq!(round_to_places(f64::MAX, 2), f64::MAX); // Maximum f64 value should remain the same
assert_eq!(round_to_places(f64::MIN, 2), f64::MIN); // Minimum f64 value should remain the same
assert!(round_to_places(f64::NAN, 2).is_nan()); // NaN should remain NaN
assert_eq!(round_to_places(f64::INFINITY, 2), f64::INFINITY); // Infinity should remain Infinity
assert_eq!(round_to_places(f64::NEG_INFINITY, 2), f64::NEG_INFINITY); // Negative Infinity should remain unchanged
@himadripal -- I am preparing to create a new release hopefully tomorrow. Can you please address @comphead 's testing suggestions soon so we can get this PR into that release? |
@comphead and @alamb there are existing edge case tests here I'll add more from @comphead list today. One more clarifying points - Although it is not mandatory to go together, my goal for this change was to make scientific notation support in datafusion - datafusion#10315. For that we need to also merge #7191 - this is moving the cast to use |
Which issue does this PR close?
Few important consideration -
parse_string_to_decimal_native
parse_string_to_decimal_native
does not have support for e-notationparse_string_to_decimal_native
does rounding at scale, not truncateparse_decimal
an existing method has e-notation support and use elsewhereparse_decimal
parse_decimal
to get support for e-notation.This PR is a 2nd one to break up #6905 , this one add rounding logic to parse_decimal to match the behavior in existing
parse_string_to_decimal_native
.Closes #.
Rationale for this change
At present, string to decimal conversion does not support e-notation, in arrow,
parse_string_to_decimal_native
is called to get generic string to decimal. parse_decimal on the other hand is used from generic parse method and it has e-notation support. This PR is adding rounding and scale 0 handling to match the behavior orparse_string_to_decimal_native
method. Then we can replaceparse_string_to_decimal_native
call withparse_decimal
. This way, we will get e-notation support too.What changes are included in this PR?
Are there any user-facing changes?