
Fixed bug where sum returns wrong values with long data type #6984

Closed

Conversation

matthewryanwells

Description

Fixes a bug where sum returns an incorrect value when working with large long values, because the current code uses the double data type. In addition, we extracted the null-source check into its own class so the check no longer needs to be repeated.
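For context, a minimal standalone snippet (illustrative only, not OpenSearch code) showing how routing large long values through double arithmetic produces a wrong sum:

```java
public class LongSumPrecisionDemo {
    public static void main(String[] args) {
        long a = 9_007_199_254_740_993L; // 2^53 + 1, not exactly representable as a double

        double viaDouble = (double) a + 1.0; // (double) a rounds down to 2^53; adding 1.0 leaves it at 2^53
        long exact = a + 1L;

        System.out.println((long) viaDouble); // 9007199254740992 (wrong)
        System.out.println(exact);            // 9007199254740994 (correct)
    }
}
```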

Issues Resolved

#5537
opensearch-project/sql#1052

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…null values to clean code and solve bug

Signed-off-by: Matthew Wells <matthew.wells@improving.com>
@github-actions
Contributor

github-actions bot commented Apr 4, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

github-actions bot commented Apr 4, 2023

Gradle Check (Jenkins) Run Completed with:

…' for normal or high precision but slower summation

Signed-off-by: Matthew Wells <matthew.wells@improving.com>
@github-actions
Contributor

github-actions bot commented Jun 8, 2023

Gradle Check (Jenkins) Run Completed with:

@reta
Collaborator

reta commented Jun 9, 2023

Thanks a lot @matthewryanwells for moving it forward

@matthewryanwells
Author

Thanks a lot @matthewryanwells for moving it forward

Thank you! I had some other work to finish, but I can focus on this now. With the optional parameter implemented (a few updates are still needed), I can fully focus on a solution that fixes the bug.

…ional parameter, added tests, set correct default value, changed method to enum

Signed-off-by: Matthew Wells <matthew.wells@improving.com>
@matthewryanwells
Author

There is still some work I need to do; specifically, it seems the precise sum aggregator sometimes does not return the correct result, but we are getting close to being done.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@acarbonetto
Contributor

@reta I don't know if you have any suggestions to solve this:
SumPreciseAggregator.java only aggregates data within a shard. The fetch phase runs across several different shards and reduces the results using InternalAggregations.java. This means that if the index is split between multiple shards and processed by multiple threads, the longs are cast to doubles before finally being reduced by the InternalSum processor.
We observed this when running the IT tests: since the IT tests run on 5 shards, the data was aggregated correctly around 20% of the time and failed around 80% of the time because it was cast before aggregation.
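As a concrete illustration of that behaviour, here is a toy reproduction (plain Java, not the actual OpenSearch classes): each shard's partial sum is exact, but carrying it as a double before the final reduce already loses precision.

```java
import java.math.BigInteger;
import java.util.List;

public class ShardReduceDemo {
    public static void main(String[] args) {
        // Two "shards" of long values; shard 0 sums to 2^53 + 1, which a double cannot represent exactly.
        List<long[]> shards = List.of(
            new long[] { 4_503_599_627_370_497L, 4_503_599_627_370_496L },
            new long[] { 1L, 1L }
        );

        // Simplified version of the current behaviour: partial sums become doubles before the final reduce.
        double reducedViaDouble = 0;
        for (long[] shard : shards) {
            long partial = 0;
            for (long v : shard) partial += v;
            reducedViaDouble += (double) partial; // precision is lost here for the large partial sum
        }

        // Exact reduction for comparison.
        BigInteger exact = BigInteger.ZERO;
        for (long[] shard : shards) {
            for (long v : shard) exact = exact.add(BigInteger.valueOf(v));
        }

        System.out.println("reduced via double: " + (long) reducedViaDouble); // 9007199254740994
        System.out.println("exact sum:          " + exact);                   // 9007199254740995
    }
}
```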
To fix this, we would need to return long values from the precise sum aggregation, and update the InternalAggregation to aggregate using a similar mechanism as the SumPreciseAggregator. Right?

@reta
Collaborator

reta commented Jun 16, 2023

To fix this, we would need to return long values from the precise sum aggregation, and update the InternalAggregation to aggregate using a similar mechanism as the SumPreciseAggregator. Right?

@matthewryanwells I think the exact approach may look more complicated. We basically have 3 data types as of today: long, double, and unsigned long (which was recently merged). The valuesSource has two relevant properties:

  • isFloatingPoint - means the aggregation is over floating point, so we need to use double
  • isBigInteger - means the aggregation is over unsigned long, so we need to use BigInteger
  • all other cases - the aggregation is over other types (long / integer / ...), so we need to use BigInteger (since long could overflow)

The SumAggregator / SumPreciseAggregator still use InternalSum to represent the result, which only carries a double value - we would lose precision there. One option that comes to mind: what if InternalSum carried a BigDecimal instead? (InternalSum::reduce still uses CompensatedSum.)
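As a rough sketch of those three cases (the class and method names here are illustrative, not the actual ValuesSource / aggregator API):

```java
import java.math.BigInteger;

// Illustrative only: one accumulator per case described above.
final class SumCasesSketch {

    // isFloatingPoint: keep the double path (the existing code uses CompensatedSum here)
    static double sumFloatingPoint(double[] values) {
        double acc = 0;
        for (double v : values) acc += v;
        return acc;
    }

    // isBigInteger (unsigned long): interpret each long as unsigned and accumulate in BigInteger
    static BigInteger sumUnsignedLong(long[] values) {
        BigInteger acc = BigInteger.ZERO;
        for (long v : values) acc = acc.add(new BigInteger(Long.toUnsignedString(v)));
        return acc;
    }

    // all other integral types (long / integer / ...): BigInteger, since a long accumulator could overflow
    static BigInteger sumIntegral(long[] values) {
        BigInteger acc = BigInteger.ZERO;
        for (long v : values) acc = acc.add(BigInteger.valueOf(v));
        return acc;
    }
}
```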

Does it make sense?

@matthewryanwells
Author

To fix this, we would need to return long values from the precise sum aggregation, and update the InternalAggregation to aggregate using a similar mechanism as the SumPreciseAggregator. Right?

@matthewryanwells I think the exact approach may look more complicated. We basically have 3 data types as of today: long, double, and unsigned long (which was recently merged). The valuesSource has two relevant properties:

  • isFloatingPoint - means the aggregation is over floating point, so we need to use double
  • isBigInteger - means the aggregation is over unsigned long, so we need to use BigInteger
  • all other cases - the aggregation is over other types (long / integer / ...), so we need to use BigInteger (since long could overflow)

The SumAggregator / SumPreciseAggregator still use InternalSum to represent the result, which only carries a double value - we would lose precision there. One option that comes to mind: what if InternalSum carried a BigDecimal instead? (InternalSum::reduce still uses CompensatedSum.)

Does it make sense?

The explanation is slightly confusing for me.

For this improvement to work out, we would need SumPreciseAggregator to return a BigInteger value and have the aggregation at the shard level (I am assuming this is InternalSum) also do its arithmetic using BigIntegers (because both double and BigDecimal lose precision).

Do you know if something like this would be possible/a reasonable thing we could implement?

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity. Remove stalled label or comment or this will be closed in 7 days.

@opensearch-trigger-bot added the stalled label on Jul 24, 2023
@reta
Collaborator

reta commented Jul 24, 2023

Apologies, @matthewryanwells, I missed your reply

Do you know if something like this would be possible/a reasonable thing we could implement?

I think there is a way to implement that (taking into account that the precise calculation is opt-in), but it would basically mean transferring both double and BigInteger/BigDecimal values. I will try to prototype the solution this week.

@opensearch-trigger-bot removed the stalled label on Jul 25, 2023
@reta
Collaborator

reta commented Jul 28, 2023

@matthewryanwells some updates

Do you know if something like this would be possible/a reasonable thing we could implement?

I was hacking around and I think we could implement that by passing both the double and BigDecimal representations of the sum in InternalSum & related classes.
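A loose sketch of what that could look like (hypothetical class, not the actual InternalSum):

```java
import java.math.BigDecimal;
import java.util.List;

// Hypothetical shape of the idea above: carry the usual double alongside an optional
// BigDecimal that is populated only when the precise (opt-in) path is enabled.
final class DualRepresentationSum {
    final double doubleSum;
    final BigDecimal preciseSum; // null when the precise path is not enabled

    DualRepresentationSum(double doubleSum, BigDecimal preciseSum) {
        this.doubleSum = doubleSum;
        this.preciseSum = preciseSum;
    }

    // Reduce partial results from all shards; fall back to the double value
    // if any shard did not provide a precise sum.
    static DualRepresentationSum reduce(List<DualRepresentationSum> partials) {
        double d = 0;
        BigDecimal precise = BigDecimal.ZERO;
        boolean allPrecise = true;
        for (DualRepresentationSum p : partials) {
            d += p.doubleSum;
            if (p.preciseSum == null) {
                allPrecise = false;
            } else {
                precise = precise.add(p.preciseSum);
            }
        }
        return new DualRepresentationSum(d, allPrecise ? precise : null);
    }
}
```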

For this improvement to work out, we would need SumPreciseAggregator to return a BigInteger value and have the aggregation at the shard level (I am assuming this is InternalSum) also do its arithmetic using BigIntegers (because both double and BigDecimal lose precision).

I have trouble projecting the precise sum to BigInteger; aggregation over the double values would not be possible, I think. Maybe you could clarify the trick for me here? Thank you.

@acarbonetto
Contributor

@matthewryanwells has unfortunately paused work on this because of the increased scope and the experimental flag requirements. There isn't a huge use case for the 'fix' (it's becoming more of an enhancement now). Hopefully he will find some time to work on it in the near future and we can re-open the ticket.

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity. Remove stalled label or comment or this will be closed in 7 days.

@opensearch-trigger-bot added the stalled label on Sep 1, 2023
@opensearch-trigger-bot
Contributor

This PR was closed because it has been stalled for 7 days with no activity.

Labels
stalled
4 participants