Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support covar_samp and covar_pop #216

Closed
wants to merge 1 commit into from

Conversation

huaxingao
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

This PR adds the support for covar_samp and covar_pop

What changes are included in this PR?

How are these changes tested?

new tests

@@ -69,6 +69,7 @@ object Utils {
case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is this UInt64? Using LongType to represent it will overflow, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we map to DecimalType instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you use it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This UInt64 is for state field count.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have both partial and final aggregation operators in Comet, it should be okay as Java doesn't process the intermediate results (state), but if we have only partial aggregation in Comet, this Uint64 array as LongType will possibly cause overflow in corner case, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we map to DecimalType(20, 0) instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how does it work? You will treat UInt64 array as decimal array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the max of UInt64 is 18446744073709551615, I guess we can use DecimalType(20, 0) to represent the number without overflowing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I mean Spark will treat the actual UInt64 array as decimal one. For example, it will call getDecimal on an UInt64 array, does it work?

Copy link
Member

@viirya viirya Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should enable partial + final Comet aggregation as a whole, i.e., #223.

@@ -205,7 +205,7 @@ class NativeUtil {
case v @ (_: BitVector | _: TinyIntVector | _: SmallIntVector | _: IntVector |
_: BigIntVector | _: Float4Vector | _: Float8Vector | _: VarCharVector |
_: DecimalVector | _: DateDayVector | _: TimeStampMicroTZVector | _: VarBinaryVector |
_: FixedSizeBinaryVector | _: TimeStampMicroVector) =>
_: FixedSizeBinaryVector | _: TimeStampMicroVector | _: UInt8Vector) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to handle UInt8Vector?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it used in Covariance/CovariancePop state types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's for state field count

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is UInt64? But what you add is UInt8Vector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh...the naming of Java Arrow API...

@huaxingao
Copy link
Contributor Author

Seems we can't use covariance from DataFusion, because DataFusion has UInt64 for state_fields count, but Spark has Double for count. I will close this PR and implement Comet's own covariance.

@huaxingao huaxingao closed this Mar 26, 2024
@huaxingao huaxingao deleted the covariance branch March 26, 2024 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants