-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Port Datafusion Covariance to Comet #234
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #234 +/- ##
============================================
+ Coverage 33.32% 33.36% +0.04%
- Complexity 769 776 +7
============================================
Files 107 108 +1
Lines 37037 37179 +142
Branches 8106 8194 +88
============================================
+ Hits 12341 12404 +63
- Misses 22098 22154 +56
- Partials 2598 2621 +23 ☔ View full report in Codecov by Sentry. |
5b2ba99
to
a80cee2
Compare
5324464
to
47604a3
Compare
expr1: Arc<dyn PhysicalExpr>, | ||
expr2: Arc<dyn PhysicalExpr>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there better names for these two children expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark uses left
and right
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will leave this as is unless you prefer left
and right
.
fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> { | ||
Ok(Box::new(CovarianceAccumulator::try_new( | ||
StatsType::Population, | ||
)?)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the difference between CovariancePop
and Covariance
is only the StatsType
parameter when creating CovarianceAccumulator
?
If so, maybe we only need Covariance
and make StatsType
as parameter when creating Covariance
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/apache/arrow-datafusion-comet/blob/955a0b9b9b026b477b5b527aba133d9e1402bd7e/core/src/execution/datafusion/expressions/covariance.rs#L368
Population Covariance is calculated over the entire dataset(N) whereas Sample Covariance is calculated over a sample (N-1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, but the behavior is decided in CovarianceAccumulator
based on its StatsType
.
I mean this two struct CovariancePop
and Covariance
. They are basically the same, except that name
and StatsType
are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we can simply have one Covariance
and add one more parameter of StatsType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should update EXPRESSIONS.md
too.
804dba9
to
263c8af
Compare
PhysicalExpr, | ||
}; | ||
|
||
/// COVAR_SAMP and COVAR_POP aggregate expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to mention why we need to port it in Comet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment added.
message CovSample { | ||
Expr child1 = 1; | ||
Expr child2 = 2; | ||
bool null_on_divide_by_zero = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use this yet, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't see it is specified in JVM side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this is not used yet. I will have a follow up to address this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments.
Merged. Thanks. |
Thanks @viirya |
* feat: Port Datafusion Covariance to Comet * feat: Port Datafusion Covariance to Comet * fmt * update EXPRESSIONS.md * combine COVAR_SAMP and COVAR_POP * fix fmt * address comment --------- Co-authored-by: Huaxin Gao <huaxin.gao@apple.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Which issue does this PR close?
Closes #.
Port
Covariance
in Comet. The reason we can't use DataFusionCovariance
implementation is that SparkCovariance
aggregate expression's state is different from DataFusion's state.Rationale for this change
What changes are included in this PR?
How are these changes tested?