Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, decimal arithmetic may return an overflow error if an operation produces an output that exceeds the maximum precision of the storage type.
Describe the solution you'd like
Spark and similar systems instead truncate the precision of the output if it would exceed the maximum of the storage type - https://github.com/apache/arrow/blob/36ddbb531cac9b9e512dfa3776d1d64db588209f/java/gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/DecimalTypeUtil.java#L83
To achieve this, the arithmetic kernels would need to detect this case and instead perform the arithmetic on double-width decimal primitives, before truncating the result down to an appropriate scale/precision.
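For reference, Spark caps an overflowed result type at the storage maximum and sacrifices fractional digits first, while preserving a minimum scale (the linked Gandiva DecimalTypeUtil follows the same rule). Below is a minimal sketch of that adjustment, assuming Spark's constants (maximum precision 38, minimum adjusted scale 6); the function name is illustrative, not an arrow-rs API:

```rust
// Illustrative constants matching Spark's Decimal128 limits.
const MAX_PRECISION: u8 = 38;
const MIN_ADJUSTED_SCALE: i8 = 6;

/// Clamp a (precision, scale) pair that overflowed the storage precision,
/// Spark-style: keep the integral digits where possible by giving up
/// fractional digits, but never reduce the scale below MIN_ADJUSTED_SCALE.
fn adjust_precision_scale(precision: u8, scale: i8) -> (u8, i8) {
    if precision <= MAX_PRECISION {
        (precision, scale)
    } else {
        let int_digits = precision as i32 - scale as i32;
        // Keep at least min(scale, MIN_ADJUSTED_SCALE) fractional digits.
        let min_scale = (scale as i32).min(MIN_ADJUSTED_SCALE as i32);
        let adjusted_scale = (MAX_PRECISION as i32 - int_digits).max(min_scale);
        (MAX_PRECISION, adjusted_scale as i8)
    }
}

fn main() {
    // Fits the storage type: returned unchanged.
    assert_eq!(adjust_precision_scale(20, 5), (20, 5));
    // Overflows by 2 digits: scale gives way, integral digits are preserved.
    assert_eq!(adjust_precision_scale(40, 20), (38, 18));
    // Heavily overflowed: scale is clamped to the minimum of 6.
    assert_eq!(adjust_precision_scale(45, 10), (38, 6));
}
```

Note the result is lossy by construction: the caller must then round the wide intermediate value to the adjusted scale, which is where the double-width arithmetic comes in.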
Describe alternatives you've considered
We could continue to return an error, but this is likely to be surprising for users.
Additional context
#4640 proposes increasing the decimal output scale, which in turn would make this error case more likely.
The fundamental primitive necessary for this is N-digit division, which will likely be implemented as part of #4663
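The double-width idea can be illustrated one size down: two i64 mantissas multiplied in i128, with the intermediate divided back down by a power of ten using half-up rounding. Doing the same for Decimal256 needs 256/512-bit division, which is why N-digit division is the fundamental primitive. A toy sketch (names illustrative, not an arrow-rs API):

```rust
/// Multiply two decimal mantissas, widening to i128, then drop
/// `drop_digits` fractional digits with half-up rounding.
/// Returns None if the rescaled result still overflows i64.
fn mul_and_truncate(a: i64, b: i64, drop_digits: u32) -> Option<i64> {
    let wide = a as i128 * b as i128; // cannot overflow: i64 * i64 fits in i128
    let div = 10i128.pow(drop_digits);
    // Half-up rounding, away from zero for negative values.
    let rounded = if wide >= 0 {
        (wide + div / 2) / div
    } else {
        (wide - div / 2) / div
    };
    i64::try_from(rounded).ok()
}

fn main() {
    // 2.0 * 3.5 at scale 9: the exact product has scale 18;
    // dropping 9 digits brings it back to scale 9, i.e. 7.0.
    assert_eq!(mul_and_truncate(2_000_000_000, 3_500_000_000, 9), Some(7_000_000_000));
    // 1.5 * 1.5 = 2.25 at scale 9.
    assert_eq!(mul_and_truncate(1_500_000_000, 1_500_000_000, 9), Some(2_250_000_000));
}
```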
Having spent a significant amount of time working on this, I'm inclined not to move forward with it.
Decimal256 supports up to 76 decimal digits, which is more than most commercial systems, and makes me think that supporting precision-loss arithmetic, i.e. arithmetic whose exact results would require even greater precision than this, is of limited practical utility:
MySQL - 65
MariaDB - 65
SQL Server - 38
DuckDB - 38
Oracle - 38
Redshift - 38
Spark - 38
Postgres - 131072!! (postgres decimals appear to be rather special)
Silently truncating precision is potentially surprising and is inconsistent with the rest of the arithmetic kernels
Users can easily increase/decrease the precision of a calculation by casting the inputs
Performant precision-loss arithmetic is extremely complex
Ensuring good test coverage is very hard given the size of the numbers involved
I personally find it hard to justify spending significant amounts of time on decimals when they don't show up in my workloads
TLDR: the current decimal support, whilst likely not perfect, covers most bases and is at least as feature-full as many commercial DBMSs, so I struggle to justify spending more time working on it.