-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a describe method on DataFrame like Polars #5226
Conversation
9d0fafe
to
7a622c6
Compare
c0500d9
to
1309267
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jiangzhx -- this looks like a very nice addition to DataFusion ❤️
I left some comments -- let me know if anything isn't clear.
And thanks again!
if field.data_type().is_numeric() { | ||
Field::new(field.name(), DataType::Float64, true) | ||
} else { | ||
Field::new(field.name(), DataType::Utf8, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that the schema for count
and null_count
were always Int64
and the schema for min/max were always Utf8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the describe method return schema like this.
the each column should have same datatype .
for example :
bool_col
oncount/null_count
return Int64 ; error onmin/max
, so makebool_col
datatypeUTF8
;float_col
oncount/null_count
return Int64 ; onmin/max
return float, so makefloat_col
datatypeFloat64
datafusion/core/src/dataframe.rs
Outdated
vec![], | ||
fields_iter | ||
.clone() | ||
.filter(|f| matches!(f.data_type().is_numeric(), true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why restrict the min/max aggregation to numeric fields?
In order to get the min/max values in all columns to work, you could call cast
to cast them to the same datatype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
boolean and binary not work with min/max.
filter out DataType::Binary , DataType::Boolean will be better.
!matches!(f.data_type(), DataType::Binary | DataType::Boolean)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
date_string_col, string_col ’s datatype also Binary.
called Result::unwrap()
on an Err
value: Internal("Min/Max accumulator not implemented for type Binary")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @jiangzhx
cc @andygrove (as I think this is a neat thing to expose in datafusion-python)
)), | ||
); | ||
|
||
let describe_record_batch = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Benchmark runs are scheduled for baseline = ea3b965 and contender = 96aa2a6. 96aa2a6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #4974 .