@@ -36,7 +36,13 @@ abstract class AverageLike(child: Expression) extends DeclarativeAggregate {
}

private lazy val sumDataType = child.dataType match {
case _ @ DecimalType.Fixed(p, s) => DecimalType.bounded(p + 10, s)
/*
 * For the sum of two decimals (assuming the other operand is a decimal of the same
 * precision and scale). Refer to org.apache.spark.sql.catalyst.analysis.DecimalPrecision:
 *   Precision: max(s1, s2) + max(p1 - s1, p2 - s2) + 1
 *   Scale: max(s1, s2)
 */
case _ @ DecimalType.Fixed(p, s) => DecimalType.adjustPrecisionScale(s + (p - s) + 1, s)
Contributor:
The point is that here the sum operation is executed many times, not only once, so I am not sure this is the right way to deal with it. It would be great to check what other RDBMSs do in this case.

Contributor:
No, this is what SQLServer does for the + operation, not for the avg result. There is a big difference between the intermediate result of avg and that of +, as here the + operation is executed once per row (and the exact number of rows is not known in advance).
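
A quick back-of-the-envelope sketch of this point, using an illustrative helper that is not part of Spark: summing n values that each fit decimal(p, s) can need up to p + ceil(log10(n)) digits, which is roughly the headroom the current p + 10 sizing buys (about 10^10 rows), while the + rule only covers the two-operand case:

object SumPrecisionSketch {
  // Digits needed to hold the sum of `rows` values that each fit decimal(p, s):
  // every value is below 10^(p - s) in magnitude, so the sum is below rows * 10^(p - s).
  def precisionNeededForSum(p: Int, s: Int, rows: Long): Int =
    p + math.ceil(math.log10(rows.toDouble)).toInt

  def main(args: Array[String]): Unit = {
    println(precisionNeededForSum(2, 1, 10L))          // 3  -- what a decimal(3,1) buffer covers
    println(precisionNeededForSum(2, 1, 1000L))        // 5
    println(precisionNeededForSum(2, 1, 10000000000L)) // 12 -- what the decimal(12,1) buffer covers
  }
}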

Contributor Author (@ajithme, Sep 12, 2018):
Yes, I agree. But the point is that arbitrarily increasing the precision by 10 can cause loss of scale more often, and computing the row count is costly. Even if we knew the count, the calculation may not be precise until we know the exact data. For example, take a column of type decimal(2,1): the actual data matters, since 2.2+2.2 and 9.9+9.9 may need result types of different precision and scale. As avg = sum(data) / count, can we restrict the precision and scale of sum(data) as described by the + operation?
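
To make the two sizing rules concrete, here is a minimal standalone sketch applied to a decimal(2,1) input; it reimplements the formulas rather than calling Spark's internal DecimalType helpers, and the names are illustrative only:

object AvgSumBufferSizing {
  // Addition rule from DecimalPrecision:
  // precision = max(s1, s2) + max(p1 - s1, p2 - s2) + 1, scale = max(s1, s2)
  def addResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
    (math.max(s1, s2) + math.max(p1 - s1, p2 - s2) + 1, math.max(s1, s2))

  // Sizing currently used for the avg sum buffer: precision + 10, same scale
  // (DecimalType.bounded additionally caps the precision at 38).
  def currentSumBufferType(p: Int, s: Int): (Int, Int) = (p + 10, s)

  def main(args: Array[String]): Unit = {
    println(addResultType(2, 1, 2, 1))  // (3,1)  -- what this PR proposes for the buffer
    println(currentSumBufferType(2, 1)) // (12,1) -- what master uses today
  }
}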

Contributor:
No, we can't, because we would risk (well, we would likely hit) an overflow. Indeed, I am not sure whether you ran all the UTs with your change, but I'd expect many failures due to overflow after this.

Contributor Author:
But the division operation will readjust the precision again in average. Can you please give me an example query which can cause an overflow as you explained?

Contributor:
Well, in your example, with input data of decimal(2,1), this "buffer" with your change would be a decimal(3,1). If your input data contains 21 items of 9.1, this would overflow (191.1 doesn't fit a decimal(3,1)).
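
The arithmetic is easy to check with plain java.math.BigDecimal, outside Spark:

import java.math.BigDecimal

object OverflowSketch {
  def main(args: Array[String]): Unit = {
    // 21 values of 9.1, each of them fitting a decimal(2,1)
    val sum = new BigDecimal("9.1").multiply(new BigDecimal(21))
    println(sum)             // 191.1
    println(sum.precision()) // 4 -- too wide for the proposed decimal(3,1) buffer
    println(sum.scale())     // 1
  }
}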

Contributor Author:
Well, I tested as per your suggestion with my PR:

 sql("create table if not exists table1(salary decimal(2,1))")
 (1 to 22).foreach(_ => sql("insert into table1 values(9.1)"))
 sql("select avg(salary) from table1").show(false)

+-----------+
|avg(salary)|
+-----------+
|9.10000    |
+-----------+

which is the expected result, and I don't see an overflow, as the divide will readjust the precision. Can you test with my patch for an overflow specifically in the case of average?

Contributor:
You're right: because we are not checking for overflow in the Add operation, we don't detect the error condition even though we are in one. But it doesn't sound great to me to rely on a currently missing check. I am not sure, though, whether in special cases this could cause an issue anyway, even with the check missing as it is today.

Moreover, as you can see from the link I have posted, SQLServer - which is the reference for the way we handle decimals here - uses decimal(38, s) for the sum, divided by decimal(10, 0) for the count. So I think this is what we should do eventually, but it implies changing the result type.
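
Applying the division sizing documented in DecimalPrecision (precision = p1 - s1 + s2 + max(6, s1 + p2 + 1), scale = max(6, s1 + p2 + 1)) together with a simplified version of the precision-loss adjustment shows what such a SQLServer-style avg would return; the helpers below are illustrative re-implementations, not Spark's internal API, and use the decimal(31,12) column from the test further down as the example:

object SqlServerStyleAvgSketch {
  val MaxPrecision = 38
  val MinAdjustedScale = 6

  // Division rule from DecimalPrecision.
  def divideResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val scale = math.max(MinAdjustedScale, s1 + p2 + 1)
    (p1 - s1 + s2 + scale, scale)
  }

  // Simplified precision-loss adjustment: cap the precision at 38,
  // giving up scale down to a minimum of 6 digits.
  def adjust(precision: Int, scale: Int): (Int, Int) =
    if (precision <= MaxPrecision) (precision, scale)
    else {
      val intDigits = precision - scale
      (MaxPrecision, math.max(MaxPrecision - intDigits, math.min(scale, MinAdjustedScale)))
    }

  def main(args: Array[String]): Unit = {
    // avg over a decimal(31,12) column: sum kept as decimal(38,12), count as decimal(10,0)
    val (p, s) = divideResultType(38, 12, 10, 0)
    println((p, s))       // (49,23) before the adjustment
    println(adjust(p, s)) // (38,12), i.e. decimal(38, max(s, 6)) -- a different result type
  }
}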

case _ => DoubleType
}

@@ -1192,6 +1192,25 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
}
}
}

test("SPARK-25413 Test scale and precision") {
val expected = new java.math.BigDecimal("37800224355780013.7598204253756364")
sql("create table if not exists table1(salary decimal(31,12))")
sql("insert into table1 values(12345678901234510.1234567890123)")
sql("insert into table1 values(12345678901234520.1234567890123)")
sql("insert into table1 values(12345678901234530.1234567890123)")
sql("insert into table1 values(12345678901234560.1234567890123)")
sql("insert into table1 values(22345678901234560.1234567890123)")
sql("insert into table1 values(32345678901234560.1234567890123)")
sql("insert into table1 values(42345678901234560.1234567890123)")
sql("insert into table1 values(52345678901234560.1234567890123)")
sql("insert into table1 values(62345678901234560.1234567890123)")
sql("insert into table1 values(72345678901234560.1234567890123)")
sql("insert into table1 values(82345678901234560.1234567890123)")
assert(sql("select avg(salary)+10 from table1")
.first()
.getAs[java.math.BigDecimal](0).equals(expected))
}
}

// for SPARK-2180 test