
Conversation

@ajithme (Contributor) commented Sep 12, 2018

What changes were proposed in this pull request?

As per the definition in org.apache.spark.sql.catalyst.analysis.DecimalPrecision, the sum of two decimal expressions e1 (precision p1, scale s1) and e2 (precision p2, scale s2) is typed as follows:

Operation : e1 + e2
Result Precision : max(s1, s2) + max(p1 - s1, p2 - s2) + 1
Result Scale : max(s1, s2)

However, org.apache.spark.sql.catalyst.expressions.aggregate.Average#sumDataType ignores this rule and always increases the precision by 10, which triggers a precision/scale adjustment even when none is needed (the result precision is < 38, but precision + 10 exceeds 38).
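For illustration, a minimal standalone sketch (not Spark code; SumBufferTypeSketch, currentBuffer and plusRuleBuffer are hypothetical names) contrasting the (precision, scale) pair Average#sumDataType picks today with what the + rule above would give:

    // Minimal sketch, not Spark code: compares the two typings as (precision, scale) pairs.
    object SumBufferTypeSketch {
      val MaxPrecision = 38

      // What Average#sumDataType effectively does today: widen the precision by 10.
      def currentBuffer(p: Int, s: Int): (Int, Int) =
        (math.min(p + 10, MaxPrecision), s)

      // What the DecimalPrecision "+" rule gives when both operands have type decimal(p, s):
      // max(s, s) + max(p - s, p - s) + 1 = p + 1, and the scale stays s.
      def plusRuleBuffer(p: Int, s: Int): (Int, Int) =
        (math.min(p + 1, MaxPrecision), s)

      def main(args: Array[String]): Unit = {
        println(currentBuffer(2, 1))   // (12,1)
        println(plusRuleBuffer(2, 1))  // (3,1)
      }
    }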

How was this patch tested?

Added a test case for the scenario reported by the submitter and verified it manually.

* Precision : max(s1, s2) + max(p1 - s1, p2 - s2) + 1
* Scale : max(s1, s2)
*/
case _ @ DecimalType.Fixed(p, s) => DecimalType.adjustPrecisionScale(s + (p - s) + 1, s)
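For readers following the diff: when both operands of the running sum have the same type decimal(p, s), the rule quoted in the comment above reduces to

    Precision : max(s, s) + max(p - s, p - s) + 1 = s + (p - s) + 1 = p + 1
    Scale     : max(s, s) = s

so the patched line computes a buffer type of decimal(p + 1, s).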
Contributor:

The point is that the sum operation here is executed many times, not just once, so I am not sure this is the right way to deal with it. It would be great to check what other RDBMSs do in this case.

Contributor:

No, this is what SQLServer does for the + operation, not for the avg result. There is a big difference between the intermediate result of avg and +: here the + operation is executed once per row, and the exact number of rows is not known in advance.

@ajithme (Contributor Author), Sep 12, 2018:

Yes, I agree. But the point is that arbitrarily increasing the precision by 10 causes loss of scale more often, and calculating the number of additions is costly; even if we knew that count, the calculation would not be precise until we know the exact data. For example, take a column of type decimal(2,1): the actual data matters, since 2.2 + 2.2 and 9.9 + 9.9 can require result types of different precision and scale. Since avg = sum(data) / count, can we restrict the precision and scale of sum(data) as described by the + operation?
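Spelling out the example values above (a worked restatement, nothing new):

    2.2 + 2.2 =  4.4   -> still fits decimal(2, 1)
    9.9 + 9.9 = 19.8   -> needs decimal(3, 1)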

Contributor:

No, we can't, because we would risk (well, we would likely hit) an overflow. Indeed, I am not sure whether you ran all the UTs with your change, but I'd expect many failures due to overflow after this.

Contributor Author:

But the division in average will readjust the precision again. Can you please give me an example query that can cause the overflow you describe?

Contributor:

Well, in your example, with input data of decimal(2,1), this "buffer" with your change would be a decimal(3,1). If your input contains 21 items of 9.1, this would overflow (191.1 doesn't fit in a decimal(3,1)).
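The arithmetic can be checked outside Spark with plain java.math.BigDecimal (a standalone sketch; OverflowCheck is just an illustrative name):

    import java.math.BigDecimal

    // Sums 21 copies of 9.1 and shows the result needs precision 4 at scale 1,
    // so it cannot be represented as a decimal(3, 1), whose maximum value is 99.9.
    object OverflowCheck {
      def main(args: Array[String]): Unit = {
        val sum = (1 to 21).map(_ => new BigDecimal("9.1")).reduce((a, b) => a.add(b))
        println(sum)             // 191.1
        println(sum.precision()) // 4 > 3, i.e. an overflow for decimal(3, 1)
      }
    }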

Contributor Author:

Well, I tested as per your suggestion with my PR:

 sql("create table if not exists table1(salary decimal(2,1))")
 (1 to 22).foreach(_ => sql("insert into table1 values(9.1)"))
 sql("select avg(salary) from table1").show(false)

+-----------+
|avg(salary)|
+-----------+
|9.10000    |
+-----------+

which is the expected result, and I don't see an overflow since the divide readjusts the precision. Can you test with my patch for an overflow specifically in the case of average?

Contributor:

You're right: since we don't check for overflow in the Add operation, we never detect that we are in an error condition. But relying on a currently missing check doesn't sound great to me, and I am not sure whether there are special cases that can cause an issue even with the check missing.

Moreover, as you can see from the link I posted, SQLServer - which is the reference for the way we handle decimals here - uses decimal(38, s) divided by decimal(10, 0). I think this is what we should do eventually, but it implies changing the result type.
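To make the SQLServer-style typing concrete, here is a rough sketch (avgResultType is a hypothetical helper; it applies the division rule precision = p1 - s1 + s2 + max(6, s1 + p2 + 1), scale = max(6, s1 + p2 + 1), and simply caps the precision at 38, ignoring the scale reduction SQLServer applies when the raw precision exceeds 38):

    // Rough sketch: type avg(decimal(p, s)) as decimal(38, s) divided by decimal(10, 0).
    object SqlServerStyleAvg {
      def avgResultType(s: Int): (Int, Int) = {
        val (p1, s1) = (38, s)  // the sum, as SQLServer types it
        val (p2, s2) = (10, 0)  // the count
        val scale = math.max(6, s1 + p2 + 1)
        val precision = p1 - s1 + s2 + scale
        (math.min(precision, 38), scale)   // simplified cap at the 38-digit maximum
      }

      def main(args: Array[String]): Unit = {
        println(avgResultType(1))  // for decimal(2, 1) input: (38, 12)
      }
    }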

@mgaido91 (Contributor):

@ajithme it would be great if you could update the description to follow Spark's PR template. Also, please add relevant UTs for this change. Thanks.

@mgaido91 (Contributor):

What SQLServer does is explained here: https://docs.microsoft.com/en-us/sql/t-sql/functions/avg-transact-sql?view=sql-server-2017. The point is that we would need to change the result type, which I am not sure we can do before 3.0.

cc @cloud-fan @dongjoon-hyun @gatorsmile for advice on this.

@cloud-fan (Contributor):

It's not only about avg; it's also about sum.

I don't think the decision was made randomly; IIRC we did check other databases and picked the best option we could.

sum is definitely different from +, so we can't just follow + here.

@ajithme (Contributor Author) commented Sep 12, 2018:

I agree with the resolution on + vs sum, but I also see that the avg precision and scale cannot be computed ahead of time in a way that satisfies all scenarios. I am only suggesting this as a temporary solution until we decide on changing the result type, so can you suggest how I should handle this? Our use case is broken beyond 2.3.1.

@mgaido91 (Contributor):

@ajithme I don't think this is a good solution. We had to change the previous implementation because of bugs that could lead to wrong results. Here there is no wrong result; the only difference is a lower precision (and we guarantee at least 6 digits after the decimal point in any case, for any operation). Do you really need a higher precision?

One thing you could try is setting spark.sql.decimalOperations.allowPrecisionLoss to false, but I am not sure it will help.
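For completeness, setting the flag mentioned above looks like this (assuming a SparkSession named spark; as noted, whether it actually helps for this query still needs to be verified):

    // With precision-loss adjustment disabled, a result that does not fit the
    // computed decimal type becomes NULL instead of being rounded.
    spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
    spark.sql("select avg(salary) from table1").show(false)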

@ajithme (Contributor Author) commented Sep 12, 2018:

It's difficult for an end user to set this depending on the query (for queries similar to SPARK-25413 and SPARK-24957, the precision is needed too), but yes, I agree that rounding off the scale is better than getting wrong results. So, just out of curiosity, is there any documentation or pointer on why the current sumDataType in Average is taken as (p+10, s)?

@mgaido91 (Contributor):

> is there any documentation or pointer on why the current sumDataType in Average is taken as (p+10, s)?

Not that I know of, sorry. It was introduced a long time ago (#7605) and has never changed since.

Anyway, I think how we manage decimals in aggregates is a valid question. We should revisit all these aggregation operations to match SQLServer's behavior for decimals. Probably we can target this for 3.0. WDYT @cloud-fan?

@ajithme (Contributor Author) commented Sep 12, 2018:

So, to summarize the discussion:

  1. Why the current code uses (p+10, s) is still a mystery, but it breaks the use case mentioned in SPARK-25413.
  2. The changes in this PR only work because the add operation's overflow check is missing, so relying on that is not OK. That said, we do not have an example where the existing code plus this PR actually breaks.
  3. It is better to follow SQLServer behaviour, agreed. But isn't 3.0 too late?

@cloud-fan (Contributor):

According to the PR that introduced the +10 behavior, it follows Hive.

Whatever we propose, let's clearly write down the tradeoffs, e.g. to keep a larger precision we are more likely to hit overflow, etc.

@mgaido91 (Contributor):

Just to be clear:

> the current code, which does p+10, breaks the use case mentioned in SPARK-25413

I think setting spark.sql.decimalOperations.allowPrecisionLoss to false can solve that issue; the reason the flag exists is exactly for situations where a precision loss is not acceptable (i.e. you prefer to risk a NULL in the result rather than losing any precision). If this is your case, please use it.

Moreover, I don't think the major issue is the +10 compared to SQLServer's behavior (SQLServer uses decimal(38, s)). Our major difference from SQLServer is the cast to the result type, which SQLServer does not perform.

@AmplabJenkins:

Can one of the admins verify this patch?

@srowen srowen mentioned this pull request Oct 24, 2018
@asfgit asfgit closed this in 65c653f Oct 25, 2018
Closes apache#22567
Closes apache#18457
Closes apache#21517
Closes apache#21858
Closes apache#22383
Closes apache#19219
Closes apache#22401
Closes apache#22811
Closes apache#20405
Closes apache#21933

Closes apache#22819 from srowen/ClosePRs.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>