[SPARK-25413] Precision value is going for toss when Avg is done #22401
Conversation
 * Precision : max(s1, s2) + max(p1 - s1, p2 - s2) + 1
 * Scale : max(s1, s2)
 */
case _ @ DecimalType.Fixed(p, s) => DecimalType.adjustPrecisionScale(s + (p - s) + 1, s)
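For readers following along, here is a minimal standalone sketch (not the actual Spark code) of the addition rule quoted above, together with the precision/scale adjustment that caps results at 38 digits; the constant names are assumptions mirroring Spark's DecimalType internals.

```scala
// Standalone sketch, not Spark's implementation: the "+" typing rule from the
// comment above plus a rough equivalent of DecimalType.adjustPrecisionScale.
object DecimalTypingSketch {
  val MaxPrecision = 38
  val MinAdjustedScale = 6 // assumption: mirrors Spark's MINIMUM_ADJUSTED_SCALE

  /** Result (precision, scale) of e1 + e2 per the rule quoted in the diff. */
  def addResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val scale = math.max(s1, s2)
    val precision = scale + math.max(p1 - s1, p2 - s2) + 1
    adjust(precision, scale)
  }

  /** Cap at 38 digits, giving up scale but keeping at least 6 fractional digits. */
  def adjust(precision: Int, scale: Int): (Int, Int) =
    if (precision <= MaxPrecision) (precision, scale)
    else {
      val intDigits = precision - scale
      val adjustedScale = math.max(MaxPrecision - intDigits, math.min(scale, MinAdjustedScale))
      (MaxPrecision, adjustedScale)
    }
}

// Example: decimal(2,1) + decimal(2,1) => decimal(3,1), exactly what the
// patched case clause above produces for same-typed operands (p + 1, s).
```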
The point is that here the sum operation is executed many times, not only once. So I am not sure that this is the right way to deal with it. It would be great to check what other RDBMSs do in this case.
No, this is what SQLServer does for the + operation, not for the avg result. There is a big difference between the intermediate result of avg and +, as here the + operation is executed once per row (the exact number of times is not known in advance).
Yes, I agree. But the point is that arbitrarily increasing the precision by 10 can cause loss of scale more often, and computing the number of rows is costly; even if we knew the count, the calculation may not be precise until we know the exact data. For example, take a column of type decimal(2,1): the actual data matters, since 2.2+2.2 and 9.9+9.9 may produce results of different precision and scale. As avg = sum(data)/count, can we restrict the precision and scale of sum(data) to what the + operation describes?
No, we can't, because we would risk (well, we would likely hit) an overflow. Indeed, I am not sure whether you ran all the UTs with your change, but I'd expect many failures due to overflow after this.
But the division operation will readjust the precision again in average. Can you please give me an example query that can cause the overflow you describe?
Well, in your example, with input data of decimal(2,1), this "buffer" with your change would be a decimal(3,1). If your input contains 21 items of 9.1, it would overflow (191.1 doesn't fit in a decimal(3,1)).
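To make the arithmetic behind this comment concrete, a quick check (plain Scala, nothing Spark-specific):

```scala
// 21 rows of 9.1 sum to 191.1, which needs precision 4 / scale 1,
// so it cannot fit the decimal(3, 1) buffer derived from decimal(2, 1) input.
val sum = BigDecimal("9.1") * BigDecimal(21)   // 191.1
println(s"sum = $sum, digits needed = ${sum.precision}, fits decimal(3,1) = ${sum.precision <= 3}")
```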
Well, I tested as per your suggestion with my PR:
sql("create table if not exists table1(salary decimal(2,1))")
(1 to 22).foreach(_ => sql("insert into table1 values(9.1)"))
sql("select avg(salary) from table1").show(false)
+-----------+
|avg(salary)|
+-----------+
|9.10000 |
+-----------+
which is the expected result, and I don't see an overflow, since the divide will readjust the precision. Can you test with my patch for an overflow specifically in the case of average?
You're right: because we are not checking for overflow in the Add operation, we don't detect the error condition even though we hit one. But it doesn't sound great to me to rely on a currently missing check, and I am not sure whether in special cases this could still cause an issue even with the check missing.
Moreover, as you can see from the link I posted, SQLServer - which is the reference for the way we handle decimals here - uses decimal(38, s) divided by decimal(10, 0). I think this is what we should do eventually, but it implies changing the result type.
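For reference, a rough sketch of what that SQLServer-style typing would compute, assuming the division rule documented in DecimalPrecision (precision = p1 - s1 + s2 + max(6, s1 + p2 + 1), scale = max(6, s1 + p2 + 1)); this is an illustration, not a proposed patch:

```scala
// Sketch only: sum buffer widened to decimal(38, s), count treated as
// decimal(10, 0), then the division typing rule applied to the pair.
def avgResultTypeSqlServerStyle(inputScale: Int): (Int, Int) = {
  val (p1, s1) = (38, inputScale) // widened sum buffer
  val (p2, s2) = (10, 0)          // count as decimal(10, 0)
  val scale = math.max(6, s1 + p2 + 1)
  val precision = p1 - s1 + s2 + scale
  (precision, scale) // would still go through the 38-digit cap afterwards
}

// e.g. decimal(2, 1) input: (38, 1) / (10, 0) => raw (49, 12) before adjustment
```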
@ajithme it would be great if you could update the description following Spark's PR template. Moreover, please add relevant UTs for this change. Thanks.
What SQLServer does is explained here: https://docs.microsoft.com/en-us/sql/t-sql/functions/avg-transact-sql?view=sql-server-2017. The point is that we would need to change the result type, which I am not sure we can do before 3.0. cc @cloud-fan @dongjoon-hyun @gatorsmile for advice on this.
It's not only about ... I don't think the decision was made randomly; IIRC we did check other databases and picked the best option we could.
I agree with the resolution on + vs sum, but I also see that the avg precision and scale cannot be calculated well in advance in a way that satisfies all scenarios. I am just suggesting this as a temporary solution until we decide on changing the result type, so can you suggest how to handle this? Our use case is broken beyond 2.3.1.
@ajithme I don't think this is a good solution. We had to change because of bugs in the previous implementation which could lead to wrong results. Here there is no wrong result; the only difference is a lower precision (which we anyway guarantee to be > 6 digits after the comma in any case/any operation). Do you really need a higher precision? One thing you may try is to set
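The suggestion above is cut off; assuming the setting being referred to is spark.sql.decimalOperations.allowPrecisionLoss (an assumption, since the name is missing from the comment), toggling it would look like this:

```scala
// Assumption: the truncated suggestion refers to this SQL config. Setting it
// to false restores the pre-2.3 decimal typing, which keeps the full scale at
// the cost of possibly returning NULL when a result overflows.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql("select avg(salary) from table1").show(false)
```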
It's difficult for the end user to set this depending on his query (queries similar to SPARK-25413 and SPARK-24957, where the need is on the precision too), but yes, I agree that rounding off the scale is better than getting wrong results. So, out of curiosity, is there documentation or a pointer on why the current sumDataType in average is taken as (p+10, s)?
Not that I know of, sorry. It was introduced a long time ago (#7605) and never changed. Anyway, I think it is a valid question how we manage decimals in aggregates. I think we should revisit all these aggregation operations to match SQLServer's behavior for decimals. Probably we can target this for 3.0. WDYT @cloud-fan?
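For context on the (p+10, s) question, a standalone sketch paraphrasing what Average#sumDataType does today (the real code lives in org.apache.spark.sql.catalyst.expressions.aggregate.Average; this is a simplified restatement, not the source):

```scala
// Simplified restatement, not the actual Spark source: for a decimal(p, s)
// input column, the avg sum buffer today is the bounded type decimal(p + 10, s),
// i.e. ten extra headroom digits, capped at the 38-digit maximum.
def averageSumBufferType(p: Int, s: Int): (Int, Int) =
  (math.min(p + 10, 38), s)

// e.g. decimal(2, 1) input => decimal(12, 1) sum buffer
```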
so to summarize the discussion,
According to the PR that introduced the ... Whatever we want to propose, let's clearly write down the tradeoffs, e.g. to keep a larger precision, we are more likely to hit overflow, etc.
Just to be clear:
I think setting ... Moreover, I don't think the major issue is the ...
Can one of the admins verify this patch?
Closes apache#22567
Closes apache#18457
Closes apache#21517
Closes apache#21858
Closes apache#22383
Closes apache#19219
Closes apache#22401
Closes apache#22811
Closes apache#20405
Closes apache#21933

Closes apache#22819 from srowen/ClosePRs.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?
As per the definition in org.apache.spark.sql.catalyst.analysis.DecimalPrecision, for the sum of two decimal types e1 (with precision p1 and scale s1) and e2 (with precision p2 and scale s2):
Operation : e1 + e2
Result Precision : max(s1, s2) + max(p1 - s1, p2 - s2) + 1
Result Scale : max(s1, s2)
However, org.apache.spark.sql.catalyst.expressions.aggregate.Average#sumDataType ignores this and always increments the precision by 10, leading to a precision adjustment when it is not actually needed (the result precision is < 38 but precision + 10 is > 38).
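To illustrate the description with numbers, a small sketch comparing the quoted + rule with the current p + 10 buffer (plain Scala; the helper name is hypothetical):

```scala
// Hypothetical helper implementing the quoted "+" rule.
def plusRuleType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
  (math.max(s1, s2) + math.max(p1 - s1, p2 - s2) + 1, math.max(s1, s2))

println(plusRuleType(2, 1, 2, 1))     // (3, 1): well under 38; p + 10 = 12 is also fine
println(plusRuleType(30, 10, 30, 10)) // (31, 10): still fits, but p + 10 = 40 > 38,
                                      // which is exactly the unnecessary-adjustment case above
```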
How was this patch tested?
Added a test case as per the submitter's scenario and verified manually.
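A minimal sketch of what such a test might look like, assuming the usual Spark SQL test harness (withTable, checkAnswer, Row and the sql helper available in the suite); the table name and data mirror the scenario from the conversation:

```scala
// Sketch only; assumes a suite mixing in the standard Spark SQL test helpers.
test("SPARK-25413: avg on a decimal column keeps the expected value") {
  withTable("table1") {
    sql("create table table1(salary decimal(2,1)) using parquet")
    (1 to 22).foreach(_ => sql("insert into table1 values(9.1)"))
    checkAnswer(sql("select avg(salary) from table1"), Row(BigDecimal("9.1")))
  }
}
```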