[SPARK-25413] Precision value is going for toss when Avg is done #22401
Conversation
 * Precision : max(s1, s2) + max(p1 - s1, p2 - s2) + 1
 * Scale : max(s1, s2)
 */
case _ @ DecimalType.Fixed(p, s) => DecimalType.adjustPrecisionScale(s + (p - s) + 1, s)
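For readers following along, here is a minimal standalone sketch (not the actual Spark code) of the addition rule quoted above, together with the precision/scale adjustment that caps results at 38 digits; the constant names are assumptions mirroring Spark's DecimalType internals.

```scala
// Standalone sketch, not Spark's implementation: the "+" typing rule from the
// comment above plus a rough equivalent of DecimalType.adjustPrecisionScale.
object DecimalTypingSketch {
  val MaxPrecision = 38
  val MinAdjustedScale = 6 // assumption: mirrors Spark's MINIMUM_ADJUSTED_SCALE

  /** Result (precision, scale) of e1 + e2 per the rule quoted in the diff. */
  def addResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val scale = math.max(s1, s2)
    val precision = scale + math.max(p1 - s1, p2 - s2) + 1
    adjust(precision, scale)
  }

  /** Cap at 38 digits, giving up scale but keeping at least 6 fractional digits. */
  def adjust(precision: Int, scale: Int): (Int, Int) =
    if (precision <= MaxPrecision) (precision, scale)
    else {
      val intDigits = precision - scale
      val adjustedScale = math.max(MaxPrecision - intDigits, math.min(scale, MinAdjustedScale))
      (MaxPrecision, adjustedScale)
    }
}

// Example: decimal(2,1) + decimal(2,1) => decimal(3,1), exactly what the
// patched case clause above produces for same-typed operands (p + 1, s).
```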
The point is that here the sum operation is executed many times, not only once. So I am not sure that this is the right way to deal with it. It would be great to check what other RDBMSs do in this case.
No, this is what SQLServer does for the + operation, not for the avg result. There is a big difference between the intermediate result of avg and +, as here the + operation is executed once per row (the exact number of times is not known in advance).
Yes, I agree. But the point is that arbitrarily increasing the precision by 10 can cause loss of scale more often, and computing the number of rows is costly; even if we knew the count, the calculation may not be precise until we know the exact data. For example, take a column of type decimal(2,1): the actual data matters, since 2.2+2.2 and 9.9+9.9 may produce results of different precision and scale. As avg = sum(data)/count, can we restrict the precision and scale of sum(data) to what the + operation describes?
No, we can't, because we would risk (well, we would likely hit) an overflow. Indeed, I am not sure whether you ran all the UTs with your change, but I'd expect many failures due to overflow after this.
But the division operation will readjust the precision again in average. Can you please give me an example query that can cause the overflow you describe?
Well, in your example, with input data of decimal(2,1), this "buffer" with your change would be a decimal(3,1). If your input contains 21 items of 9.1, it would overflow (191.1 doesn't fit in a decimal(3,1)).
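To make the arithmetic behind this comment concrete, a quick check (plain Scala, nothing Spark-specific):

```scala
// 21 rows of 9.1 sum to 191.1, which needs precision 4 / scale 1,
// so it cannot fit the decimal(3, 1) buffer derived from decimal(2, 1) input.
val sum = BigDecimal("9.1") * BigDecimal(21)   // 191.1
println(s"sum = $sum, digits needed = ${sum.precision}, fits decimal(3,1) = ${sum.precision <= 3}")
```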
Well, I tested as per your suggestion with my PR:
sql("create table if not exists table1(salary decimal(2,1))")
(1 to 22).foreach(_ => sql("insert into table1 values(9.1)"))
sql("select avg(salary) from table1").show(false)
+-----------+
|avg(salary)|
+-----------+
|9.10000 |
+-----------+
which is the expected result, and I don't see an overflow, since the divide will readjust the precision. Can you test with my patch for an overflow specifically in the case of average?
You're right: because we are not checking for overflow in the Add operation, we don't detect the error condition even though we hit one. But it doesn't sound great to me to rely on a currently missing check, and I am not sure whether in special cases this could still cause an issue even with the check missing.
Moreover, as you can see from the link I posted, SQLServer - which is the reference for the way we handle decimals here - uses decimal(38, s) divided by decimal(10, 0). I think this is what we should do eventually, but it implies changing the result type.
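For reference, a rough sketch of what that SQLServer-style typing would compute, assuming the division rule documented in DecimalPrecision (precision = p1 - s1 + s2 + max(6, s1 + p2 + 1), scale = max(6, s1 + p2 + 1)); this is an illustration, not a proposed patch:

```scala
// Sketch only: sum buffer widened to decimal(38, s), count treated as
// decimal(10, 0), then the division typing rule applied to the pair.
def avgResultTypeSqlServerStyle(inputScale: Int): (Int, Int) = {
  val (p1, s1) = (38, inputScale) // widened sum buffer
  val (p2, s2) = (10, 0)          // count as decimal(10, 0)
  val scale = math.max(6, s1 + p2 + 1)
  val precision = p1 - s1 + s2 + scale
  (precision, scale) // would still go through the 38-digit cap afterwards
}

// e.g. decimal(2, 1) input: (38, 1) / (10, 0) => raw (49, 12) before adjustment
```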
@ajithme it would be great if you could update the description following Spark's PR template. Moreover, please add relevant UTs for this change. Thanks.
What SQLServer does is explained here: https://docs.microsoft.com/en-us/sql/t-sql/functions/avg-transact-sql?view=sql-server-2017. The point is that we would need to change the result type, which I am not sure we can do before 3.0. cc @cloud-fan @dongjoon-hyun @gatorsmile for advice on this.
It's not only about ... I don't think the decision was made randomly; IIRC we did check other databases and picked the best option we could.
I agree with the resolution on + vs sum, but I also see that the avg precision and scale cannot be calculated well in advance in a way that satisfies all scenarios. I am just suggesting this as a temporary solution until we decide on changing the result type, so can you suggest how to handle this? Our use case is broken beyond 2.3.1.
@ajithme I don't think this is a good solution. We had to change because of bugs in the previous implementation which could lead to wrong results. Here there is no wrong result; the only difference is a lower precision (which we anyway guarantee to be > 6 digits after the comma in any case/any operation). Do you really need a higher precision? One thing you may try is to set
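The suggestion above is cut off; assuming the setting being referred to is spark.sql.decimalOperations.allowPrecisionLoss (an assumption, since the name is missing from the comment), toggling it would look like this:

```scala
// Assumption: the truncated suggestion refers to this SQL config. Setting it
// to false restores the pre-2.3 decimal typing, which keeps the full scale at
// the cost of possibly returning NULL when a result overflows.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql("select avg(salary) from table1").show(false)
```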
It's difficult for the end user to set this depending on his query (queries similar to SPARK-25413 and SPARK-24957, where the need is on the precision too), but yes, I agree that rounding off the scale is better than getting wrong results. So, out of curiosity, is there documentation or a pointer on why the current sumDataType in average is taken as (p+10, s)?
Not that I know of, sorry. It was introduced a long time ago (#7605) and never changed. Anyway, I think it is a valid question how we manage decimals in aggregates. I think we should revisit all these aggregation operations to match SQLServer's behavior for decimals. Probably we can target this for 3.0. WDYT @cloud-fan?
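For context on the (p+10, s) question, a standalone sketch paraphrasing what Average#sumDataType does today (the real code lives in org.apache.spark.sql.catalyst.expressions.aggregate.Average; this is a simplified restatement, not the source):

```scala
// Simplified restatement, not the actual Spark source: for a decimal(p, s)
// input column, the avg sum buffer today is the bounded type decimal(p + 10, s),
// i.e. ten extra headroom digits, capped at the 38-digit maximum.
def averageSumBufferType(p: Int, s: Int): (Int, Int) =
  (math.min(p + 10, 38), s)

// e.g. decimal(2, 1) input => decimal(12, 1) sum buffer
```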
so to summarize the discussion,
According to the PR that introduced the ... Whatever we want to propose, let's clearly write down the tradeoffs, e.g. to keep a larger precision, we are more likely to hit overflow, etc.
Just to be clear:
I think setting ... Moreover, I don't think the major issue is the ...
Can one of the admins verify this patch?
Closes apache#22567
Closes apache#18457
Closes apache#21517
Closes apache#21858
Closes apache#22383
Closes apache#19219
Closes apache#22401
Closes apache#22811
Closes apache#20405
Closes apache#21933

Closes apache#22819 from srowen/ClosePRs.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?
As per the definition in org.apache.spark.sql.catalyst.analysis.DecimalPrecision, for the sum of two decimal types e1 (with precision p1 and scale s1) and e2 (with precision p2 and scale s2):
Operation : e1 + e2
Result Precision : max(s1, s2) + max(p1 - s1, p2 - s2) + 1
Result Scale : max(s1, s2)
However, org.apache.spark.sql.catalyst.expressions.aggregate.Average#sumDataType ignores this and always increments the precision by 10, leading to a precision adjustment when it is not actually needed (the result precision is < 38 but precision + 10 is > 38).
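To illustrate the description with numbers, a small sketch comparing the quoted + rule with the current p + 10 buffer (plain Scala; the helper name is hypothetical):

```scala
// Hypothetical helper implementing the quoted "+" rule.
def plusRuleType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
  (math.max(s1, s2) + math.max(p1 - s1, p2 - s2) + 1, math.max(s1, s2))

println(plusRuleType(2, 1, 2, 1))     // (3, 1): well under 38; p + 10 = 12 is also fine
println(plusRuleType(30, 10, 30, 10)) // (31, 10): still fits, but p + 10 = 40 > 38,
                                      // which is exactly the unnecessary-adjustment case above
```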
How was this patch tested?
Added a test case as per the submitter's scenario and verified manually.
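A minimal sketch of what such a test might look like, assuming the usual Spark SQL test harness (withTable, checkAnswer, Row and the sql helper available in the suite); the table name and data mirror the scenario from the conversation:

```scala
// Sketch only; assumes a suite mixing in the standard Spark SQL test helpers.
test("SPARK-25413: avg on a decimal column keeps the expected value") {
  withTable("table1") {
    sql("create table table1(salary decimal(2,1)) using parquet")
    (1 to 22).foreach(_ => sql("insert into table1 values(9.1)"))
    checkAnswer(sql("select avg(salary) from table1"), Row(BigDecimal("9.1")))
  }
}
```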