-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support covar_samp and covar_pop #216
Conversation
@@ -69,6 +69,7 @@ object Utils { | |||
case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType | |||
case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType | |||
case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType | |||
case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, is this UInt64? Using LongType
to represent it will overflow, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we map to DecimalType
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This UInt64
is for state field count
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have both partial and final aggregation operators in Comet, it should be okay as Java doesn't process the intermediate results (state), but if we have only partial aggregation in Comet, this Uint64 array as LongType will possibly cause overflow in corner case, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we map to DecimalType(20, 0)
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, how does it work? You will treat UInt64 array as decimal array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the max of UInt64 is 18446744073709551615, I guess we can use DecimalType(20, 0)
to represent the number without overflowing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I mean Spark will treat the actual UInt64 array as decimal one. For example, it will call getDecimal
on an UInt64 array, does it work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should enable partial + final Comet aggregation as a whole, i.e., #223.
@@ -205,7 +205,7 @@ class NativeUtil { | |||
case v @ (_: BitVector | _: TinyIntVector | _: SmallIntVector | _: IntVector | | |||
_: BigIntVector | _: Float4Vector | _: Float8Vector | _: VarCharVector | | |||
_: DecimalVector | _: DateDayVector | _: TimeStampMicroTZVector | _: VarBinaryVector | | |||
_: FixedSizeBinaryVector | _: TimeStampMicroVector) => | |||
_: FixedSizeBinaryVector | _: TimeStampMicroVector | _: UInt8Vector) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to handle UInt8Vector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it used in Covariance
/CovariancePop
state types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's for state field count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is UInt64
? But what you add is UInt8Vector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems UInt8Vector
is for UInt64
. The name is confusing. https://github.com/apache/arrow/blob/main/java/vector/src/main/java/org/apache/arrow/vector/UInt8Vector.java#L37
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh...the naming of Java Arrow API...
Seems we can't use covariance from DataFusion, because DataFusion has UInt64 for state_fields count, but Spark has Double for count. I will close this PR and implement Comet's own covariance. |
Which issue does this PR close?
Closes #.
Rationale for this change
This PR adds the support for
covar_samp
andcovar_pop
What changes are included in this PR?
How are these changes tested?
new tests