Implement semi/anti join output statistics estimation #9800
Conversation
/benchmark
Benchmark results: Benchmarks comparing 349c586 (main) and 29f1cca (PR)
/benchmark
Benchmark results: Benchmarks comparing 39f4aaf (main) and 29f1cca (PR)
Queries running slower seem inconsistent again :)
Probably this is the case described here as a concern regarding GH runners.
On the other side, as far as I understand, at this moment this benchmarking is aimed at detecting larger-scale performance regressions like 3x+ (while the default benchmark report has a 5% diff threshold for "no change").
Thank you @korowa -- other than the anti-join calculation this PR looks good to me.
In response to @Dandandan's comments about potential performance degradation: in general I think our join ordering optimization / statistics are fairly simplistic.
Rather than trying to make the built-in models for DataFusion more and more sophisticated, I think a better longer-term strategy is to focus on extensible APIs (with some reasonable default implementation / cost model behind them).
I don't have much time to drive this kind of refactoring for a while (I am pretty tied up with the functions extraction and other things for InfluxData), though I think it would really help DataFusion's long-term story for more complex joins.
///
/// The estimation result is either zero, in cases when the input statistics are non-overlapping,
/// or equal to the number of rows of the outer input.
fn estimate_semi_join_cardinality(
Long term it would be really nice to pull these types of calculations into some trait (aka an extensibility API)
| JoinType::LeftAnti
| JoinType::RightAnti => None,
JoinType::LeftSemi | JoinType::LeftAnti => {
    let cardinality = estimate_semi_join_cardinality(
it doesn't seem correct to me that the same calculation is used for both Semi and Anti joins (shouldn't they be the inverse of each other?)
Indeed, they were not correct. I've changed the estimations a bit -- now disjoint statistics affect only semi joins (filtering the outer table should produce zero rows). For anti joins, disjoint inputs don't seem to make much sense: if the statistics are non-overlapping, the result will be equal to the outer side's num_rows; otherwise (having no info, or overlapping statistics) it will still be estimated as the outer side, since we know nothing about the actual distribution besides min/max, and assuming that all rows will be filtered out is too aggressive (it may significantly affect further planning).
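The rule described above can be sketched in a few lines. This is a hypothetical, simplified illustration (not DataFusion's actual code): `Range`, `estimate_semi_join`, and `estimate_anti_join` are invented names, and statistics are reduced to plain min/max pairs.

```rust
/// Inclusive value range for a join-key column (a stand-in for min/max stats).
#[derive(Clone, Copy)]
struct Range {
    min: i64,
    max: i64,
}

impl Range {
    /// True when the two ranges cannot share any value.
    fn disjoint(&self, other: &Range) -> bool {
        self.max < other.min || other.max < self.min
    }
}

/// Semi join estimate: zero when the join-key statistics are provably
/// non-overlapping, otherwise fall back to the outer side's row count.
fn estimate_semi_join(outer_rows: usize, outer: &Range, inner: &Range) -> usize {
    if outer.disjoint(inner) {
        0
    } else {
        outer_rows
    }
}

/// Anti join estimate: with only min/max statistics we cannot prove any
/// row is filtered out (disjoint inputs mean *nothing* matches, so the
/// anti join keeps everything), so always return the outer row count.
fn estimate_anti_join(outer_rows: usize, _outer: &Range, _inner: &Range) -> usize {
    outer_rows
}

fn main() {
    let a = Range { min: 0, max: 10 };
    let b = Range { min: 20, max: 30 };
    let c = Range { min: 5, max: 25 };
    // Disjoint keys: semi join produces nothing, anti join keeps everything.
    assert_eq!(estimate_semi_join(100, &a, &b), 0);
    assert_eq!(estimate_anti_join(100, &a, &b), 100);
    // Overlapping keys: both fall back to the outer row count.
    assert_eq!(estimate_semi_join(100, &a, &c), 100);
}
```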
Just to aid the discussions on benchmark variance, let's try and run the newer version once more.
/benchmark
8cdb65e to 9bd30ea
I think the SF 10 in-memory bench OOM-ed the run, so we should remove it for now.
Thank you @korowa -- I think this is a good step forward.
In general, I believe that the cardinality estimation / join reordering is too tightly tied into the core logic at the moment. My experience is that these cost models are very hard to get right and different systems will have divergent needs for their models (e.g. accuracy vs robustness)
What do you think about trying to extract the cost models (e.g. cardinality estimation) into some API? For example
pub trait CardinalityEstimator {
    fn estimate_join_cardinality(
        &self,
        join_type: &JoinType,
        left_stats: Statistics,
        right_stats: Statistics,
        on: &JoinOn,
    ) -> Option<PartialJoinStatistics>;

    fn estimate_filter_selectivity(..);
    ..
}
maybe @ozankabak / @metesynnada have some other ideas
To be clear, this is a suggestion for some other potential future PR, not this one.
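To make the suggestion concrete, here is a hypothetical sketch of what implementing such a trait could look like. All types below (`JoinType`, `Statistics`, `PartialJoinStatistics`, `DefaultEstimator`) are simplified stand-ins invented for this example, not DataFusion's real definitions.

```rust
// Simplified stand-ins for the types referenced by the proposed trait.
enum JoinType {
    Inner,
    LeftSemi,
    LeftAnti,
}

struct Statistics {
    num_rows: Option<usize>,
}

struct PartialJoinStatistics {
    num_rows: usize,
}

// The pluggable cost-model API: users could swap in their own estimator.
trait CardinalityEstimator {
    fn estimate_join_cardinality(
        &self,
        join_type: &JoinType,
        left: &Statistics,
        right: &Statistics,
    ) -> Option<PartialJoinStatistics>;
}

/// A crude built-in model: semi/anti joins are bounded by the outer side,
/// inner joins fall back to the cross-product upper bound.
struct DefaultEstimator;

impl CardinalityEstimator for DefaultEstimator {
    fn estimate_join_cardinality(
        &self,
        join_type: &JoinType,
        left: &Statistics,
        right: &Statistics,
    ) -> Option<PartialJoinStatistics> {
        let l = left.num_rows?;
        let r = right.num_rows?;
        let num_rows = match join_type {
            JoinType::LeftSemi | JoinType::LeftAnti => l,
            JoinType::Inner => l * r,
        };
        Some(PartialJoinStatistics { num_rows })
    }
}

fn main() {
    let left = Statistics { num_rows: Some(10) };
    let right = Statistics { num_rows: Some(5) };
    let est = DefaultEstimator
        .estimate_join_cardinality(&JoinType::LeftSemi, &left, &right)
        .unwrap();
    assert_eq!(est.num_rows, 10);
}
```

The point of the trait boundary is that a planner only sees `Option<PartialJoinStatistics>`, so an alternative estimator (e.g. one tuned for robustness over accuracy) could be dropped in without touching the join-ordering code.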
/benchmark
Benchmark results: Benchmarks comparing 179179c (main) and 9bd30ea (PR)
So what seems to have happened here is that the step benchmarking the PR changes uses its version of the … I needed to add a distinction between the results, since previously SF1/SF10 would overwrite the same file. So the resolution is just to rebase on / merge from main and re-run the benches again. 👀
@alamb, I don't have any strong opinion here (probably I'm lacking knowledge of the use cases for this). If I got the idea right: on one side it might help by adding versatility to DF usage (AFAIU there is already an option to customize the physical optimizer, and this API should allow reusing the optimizer with a custom cost model). On the other side, if the end goal is an (extensible/customizable) API providing the data required for physical plan optimization, I'm not sure that a statistics estimation API will be enough, as there are more attributes that significantly affect the physical plan (e.g. partitioning- and ordering-related attributes), and as a result, to provide all the required inputs a random external planner needs, we may end up with roughly the same …
Maybe it'll be better to start with an internal estimator API (maybe not an "API", but just a set of functions, like we have now across multiple utility files, but better organized), and, for now, provide statistics through operators (as it works right now) using this utility functionality?
* semi/anti join output statistics
* fix antijoin cardinality estimation
🤔 In my mind, the way a cost-based optimizer (CBO) typically works is that there are:
I was thinking that if we could decouple the "make some potential plans" and "what would it cost to run this query" parts, we could let people implement their own cost-based optimizer (and we could pull the basic cardinality estimation code into the "built-in" cost model). I don't have time to pursue the idea now, however.
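The decoupling described above can be sketched as two independent pieces: something that enumerates candidate plans, and a pluggable model that prices them. All names below (`Candidate`, `CostModel`, `RowCountModel`, `choose`) are invented for illustration and are not a DataFusion API.

```rust
/// A candidate plan; in a real system this would be a physical plan tree.
struct Candidate {
    name: &'static str,
    estimated_rows: f64,
}

/// The pluggable "what would it cost to run this" side of the CBO.
trait CostModel {
    fn cost(&self, plan: &Candidate) -> f64;
}

/// Trivial built-in model: cost proportional to estimated cardinality.
struct RowCountModel;

impl CostModel for RowCountModel {
    fn cost(&self, plan: &Candidate) -> f64 {
        plan.estimated_rows
    }
}

/// The "pick among potential plans" side: keep the cheapest candidate
/// under whatever model was supplied.
fn choose<'a>(candidates: &'a [Candidate], model: &dyn CostModel) -> &'a Candidate {
    candidates
        .iter()
        .min_by(|a, b| model.cost(a).partial_cmp(&model.cost(b)).unwrap())
        .expect("at least one candidate")
}

fn main() {
    // Two join orders for the same logical join, with different estimates.
    let candidates = [
        Candidate { name: "hash_join(left, right)", estimated_rows: 1_000.0 },
        Candidate { name: "hash_join(right, left)", estimated_rows: 100.0 },
    ];
    let best = choose(&candidates, &RowCountModel);
    assert_eq!(best.name, "hash_join(right, left)");
}
```

Because plan enumeration only talks to `CostModel` through `cost()`, a user-supplied model (or a more sophisticated built-in one) slots in without changing the enumeration logic.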
Which issue does this PR close?
Closes #.
Rationale for this change
While working on #9676, it was found that NLJoin input reordering in benchmarks doesn't happen if one of the inputs is an AntiJoin -- the reason is that output statistics for all Semi/Anti join types are always set to Absent. This PR propagates the "outer" input statistics to the Anti/Semi join output.
What changes are included in this PR?
default_filter_statistics (or use an algorithm similar to regular joins, based on estimated distinct values, which though could be misleading as it assumes that "all values from the smaller side are present in the larger side" -- not sure if it's fine for filter estimation)
Are these changes tested?
Added test coverage for SemiJoin cardinality estimation
Are there any user-facing changes?
Execution plans containing Semi/Anti joins might change