Apply workaround for #5444 to DataFrame::describe
#5468
datafusion/core/src/dataframe.rs
Outdated
.aggregate(
vec![],
fields_iter
vec![RecordBatch::try_new(
I don't fully understand this change, but it seems ok.
This doesn't fix the issue 5444, but it did lead me to discover a better way to speed up the operation of describe.
I don't fully understand these changes either; I will need to take a closer look, as I'm not yet familiar with
Yes, #5444 needs to be fixed some other way. It should not be closed by this PR.
Is the method of speeding up referring to being able to leverage Where can't find the column by name since the name of the Is my understanding correct, @jiangzhx?
Yes, you are right. The problem was caused by this function
OK, I understand now, thanks. My opinion is to not have this workaround unless it's urgently needed (since this would fix a bug in But if we can be diligent about removing this workaround once #5444 is fixed (and confirmed to also fix this without the workaround), then it seems fine.
Thanks for your suggestion @Jefffrey . How about this solution? It might be easier to understand.
@jiangzhx That certainly is a more straightforward workaround. If you decide to go with it, a comment explaining why the workaround is there would be helpful too.
// The optimization of AggregateStatistics will rewrite the physical plan
// for the count function and ignore alias functions,
// as shown in https://github.com/apache/arrow-datafusion/issues/5444.
// This logic should be removed when #5444 is fixed.
👍 -- thank you this comment makes it much clearer what is going on
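For readers skimming the thread, here is a toy model of the behavior the code comment describes: a rewrite answers a count from precomputed statistics but names the result after the bare expression, so a later lookup by alias fails. This is only a sketch using the standard library. `CountExpr`, `rewrite_ignoring_alias`, and `rewrite_keeping_alias` are hypothetical names invented for illustration and are not DataFusion APIs; the real optimizer and plan types are more involved.

```rust
use std::collections::HashMap;

/// Toy stand-in for an aggregate expression: COUNT over one column,
/// optionally aliased. Hypothetical type, not part of DataFusion.
#[derive(Debug, Clone)]
struct CountExpr {
    column: String,
    alias: Option<String>,
}

impl CountExpr {
    /// The output column name a caller expects to find in the result.
    fn output_name(&self) -> String {
        self.alias
            .clone()
            .unwrap_or_else(|| format!("COUNT({})", self.column))
    }
}

/// Models the problem: the count is taken from statistics, but the
/// result is named after the bare expression and the alias is dropped.
fn rewrite_ignoring_alias(expr: &CountExpr, stats: &HashMap<String, u64>) -> (String, u64) {
    (format!("COUNT({})", expr.column), stats[&expr.column])
}

/// Models the desired behavior: same shortcut, but the alias survives.
fn rewrite_keeping_alias(expr: &CountExpr, stats: &HashMap<String, u64>) -> (String, u64) {
    (expr.output_name(), stats[&expr.column])
}

fn main() {
    let mut stats = HashMap::new();
    stats.insert("a".to_string(), 3u64);

    let expr = CountExpr {
        column: "a".to_string(),
        alias: Some("a_count".to_string()),
    };

    let (buggy_name, buggy_count) = rewrite_ignoring_alias(&expr, &stats);
    let (fixed_name, fixed_count) = rewrite_keeping_alias(&expr, &stats);

    // Both rewrites agree on the value; only the column name differs.
    assert_eq!(buggy_count, fixed_count);
    // Looking the result up by alias fails after the alias-dropping rewrite.
    assert_ne!(buggy_name, expr.output_name());
    assert_eq!(fixed_name, expr.output_name());
}
```

In this model the workaround discussed above corresponds to avoiding reliance on the aliased name until the rewrite itself preserves it.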
LGTM -- thank you @jiangzhx
Looks good, just some minor nitpicks
Benchmark runs are scheduled for baseline = d0bd28e and contender = 99ef989. 99ef989 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.