Support arbitrary user defined partition column in ListingTable
(rather than assuming they are always Dictionary encoded)
#5545
Let the user decide if they may want to encode partition values for file-based data sources. Dictionary encoding makes sense for string values but is probably pointless or even counterproductive for integer types.
let expected_schema = Schema::new(vec![
    Field::new("id", DataType::Int32, true),
    Field::new("bool_col", DataType::Boolean, true),
    Field::new("tinyint_col", DataType::Int32, true),
That looks like a typo: should tinyint_col be DataType::UInt8?
The schema that I get from infer_schema (see two statements above) contains an Int32. I haven't touched this code, and this is the current state on main. I'm not saying this is correct, though; it might be a pre-existing bug. Note that the schema wasn't tested before, which is why I added this assertion.
BREAKING: Types for partition columns in file-based sources are no longer dictionary encoded by default. The user MUST choose a dictionary type if they want to achieve this.
-> @crepererum could you please provide a code example of how user code should be changed?
Breaking Change

Before

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Utf8,
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Utf8(Some("foo".to_owned())),
        ...
    ],
    ...
};

After (exact)

If you want an exact conversion:

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Dictionary(
                Box::new(DataType::UInt16),
                Box::new(DataType::Utf8),
            ),
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Dictionary(
            Box::new(DataType::UInt16),
            Box::new(ScalarValue::Utf8(Some("foo".to_owned()))),
        ),
        ...
    ],
    ...
};

After (alternative)

You may just decide that you don't want to dictionary-encode at all:

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Utf8,
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Utf8(Some("foo".to_owned())),
        ...
    ],
    ...
};

or that you want a different dictionary key type:

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Dictionary(
                Box::new(DataType::Int8),
                Box::new(DataType::Utf8),
            ),
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Dictionary(
            Box::new(DataType::Int8),
            Box::new(ScalarValue::Utf8(Some("foo".to_owned()))),
        ),
        ...
    ],
    ...
};

Note that in all cases, the types in
I think the original rationale from @rdettai for creating these partition columns as dictionary columns is that the column contains the same value for all rows from a particular file.
I agree the use case is not as helpful for integer types and larger integer columns, so I can see the argument there.
@comphead's point that this might be a silent API change is quite a good one -- for example, if your partition column is a Utf8, this change might slow down your performance significantly without any changes in code.
Maybe we could change the API to force people to change on upgrade:
pub table_partition_cols: Vec<(String, DataType)>,
to something like (to force the API to change)
pub table_partition_cols: Vec<Field>,
And then add a method and docs to help the upgrade
impl ListingOptions {
    /// Adds a dictionary-encoded partitioning column
    pub fn with_partition_column(mut self, name: impl Into<String>, datatype: DataType) -> Self {
        ....
    }
}
🤔
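The builder method proposed above could look roughly like the following. This is only a sketch under stated assumptions: `DataType`, `Field`, and `ListingOptions` here are small stand-in types written for this example (not the real arrow or DataFusion definitions), and the `UInt16` key type is assumed to match the previous default.

```rust
// Stand-in types for illustration only; real code would use arrow's `DataType` and `Field`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    UInt16,
    Dictionary(Box<DataType>, Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

#[derive(Debug, Default)]
struct ListingOptions {
    table_partition_cols: Vec<Field>,
}

impl ListingOptions {
    /// Adds a partition column that is dictionary encoded with `UInt16` keys
    /// (hypothetical method, following the proposal in this comment).
    fn with_partition_column(mut self, name: impl Into<String>, datatype: DataType) -> Self {
        let data_type = DataType::Dictionary(Box::new(DataType::UInt16), Box::new(datatype));
        self.table_partition_cols.push(Field {
            name: name.into(),
            data_type,
        });
        self
    }
}
```

With this shape, `ListingOptions::default().with_partition_column("group", DataType::Utf8)` would register `group` as `Dictionary(UInt16, Utf8)`, so callers who go through the builder keep the old encoding without spelling out the dictionary type.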
In any event while reviewing this PR I noticed the docs are not very clear for this feature, so I will make a PR to improve that.
(BTW I can find time to work on the API in the next day or two if needed)
Added docs here: #5576
Since the user needs to adjust their code anyway, I'm not sure such a random method helps. It's hard to find, and because the uint16 dict type is so arbitrary, I'm not sure it should exist in the long run.
Good point re
What about a type that made the dictionary encoding explicit, something like
struct PartitionedColumn {
    name: String,
    data_type: DataType,
    dictionary_index_type: Option<DataType>
}
And then have a
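To make the `PartitionedColumn` idea concrete, here is a minimal sketch. The `DataType` enum is a stand-in defined only for this example, and `output_type` is a hypothetical helper name that does not come from the discussion; it shows how an optional key type could decide whether the column is dictionary encoded.

```rust
// Stand-in for arrow's `DataType`, reduced to what this example needs.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int64,
    UInt16,
    Dictionary(Box<DataType>, Box<DataType>),
}

struct PartitionedColumn {
    name: String,
    data_type: DataType,
    /// `None` means "do not dictionary-encode this column".
    dictionary_index_type: Option<DataType>,
}

impl PartitionedColumn {
    /// Hypothetical helper: the type the column would have in the scan output.
    fn output_type(&self) -> DataType {
        match &self.dictionary_index_type {
            Some(key) => {
                DataType::Dictionary(Box::new(key.clone()), Box::new(self.data_type.clone()))
            }
            None => self.data_type.clone(),
        }
    }
}
```

Note that this sketch does nothing to stop `data_type` from already being a `Dictionary`, in which case a `Some` key type would nest two dictionary layers.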
You get a double-encoded dictionary? I don't know, but IMHO
Yes, I agree -- that is what probably should have been done in the first place, but it was not. So now we need to figure out a reasonable way to help avoid subtle regressions. Maybe we can handle it by updating the docs / making it easy to do the right thing (use dictionary) 🤔
I've made the following changes:
I think this is the best we are going to do now. I think the docs in this PR, combined with #5576, should help ease the transition.
Thank you @crepererum
I will leave this open for another day before merging so that others have a chance to comment if they desire.
cc @yahoNanJing (do you know if Ballista uses these partition columns?)
@@ -64,6 +65,26 @@ use std::{

use super::{ColumnStatistics, Statistics};

/// Convert logical type of partition column to physical type: `Dictionary(UInt16, val_type)`.
///
/// You CAN use this to specify types for partition columns. However you MAY also choose not to dictionary-encode the
This looks good -- I think it is important to try to help people choose when to use these functions, but I can add that to #5576 as a follow-on.
I also think these functions might be easier to find if they are named something more connected to what they do (dictionary encode), perhaps wrap_partition_type_in_dict, but that is just a preference.
renamed as suggested
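For readers adapting their code, the renamed helper could be pictured as below. This is a sketch, not the code from the PR: `DataType` and `ScalarValue` are stand-in enums defined here so the example is self-contained, and `wrap_partition_value_in_dict` is an assumed name for the matching value-side helper.

```rust
// Stand-ins for arrow's `DataType` and DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    UInt16,
    Dictionary(Box<DataType>, Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Utf8(Option<String>),
    Dictionary(Box<DataType>, Box<ScalarValue>),
}

/// Wraps a logical partition column type as `Dictionary(UInt16, val_type)`.
fn wrap_partition_type_in_dict(val_type: DataType) -> DataType {
    DataType::Dictionary(Box::new(DataType::UInt16), Box::new(val_type))
}

/// Wraps a partition value so that it matches the wrapped column type
/// (assumed name, mirroring the type helper).
fn wrap_partition_value_in_dict(val: ScalarValue) -> ScalarValue {
    ScalarValue::Dictionary(Box::new(DataType::UInt16), Box::new(val))
}
```

Using both helpers together keeps the declared column type and the per-file partition values consistent, which is what the "After (exact)" migration example spells out by hand.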
This PR had a merge conflict, so I took the liberty of fixing that and pushing.
BTW another reason this change is an improvement, in my mind, is that it allows users to trade off between more efficient encoding (e.g. Dict(Int8, Utf8)) and supporting more distinct values (Dict(Int16, Utf8)). I think originally the code always used Dictionary(Int8, ..) and someone had a use case with more than 256 distinct values (files), so we increased the size to Dict(Int16, ...). So this change now allows people to make that tradeoff explicitly.
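The tradeoff can be made concrete with a small sketch that picks the narrowest key type for a known number of distinct partition values. `KeyType` and `smallest_key_type` are hypothetical names for this example, and it assumes unsigned keys where every key value addresses one dictionary entry.

```rust
/// Candidate dictionary key widths (stand-in enum for this example).
#[derive(Debug, Clone, Copy, PartialEq)]
enum KeyType {
    UInt8,
    UInt16,
    UInt32,
}

impl KeyType {
    /// Number of distinct dictionary entries this key width can address.
    fn capacity(self) -> u64 {
        match self {
            KeyType::UInt8 => u8::MAX as u64 + 1,
            KeyType::UInt16 => u16::MAX as u64 + 1,
            KeyType::UInt32 => u32::MAX as u64 + 1,
        }
    }
}

/// Returns the narrowest key type that can hold `distinct_values` entries,
/// or `None` if even the widest candidate is too small.
fn smallest_key_type(distinct_values: u64) -> Option<KeyType> {
    [KeyType::UInt8, KeyType::UInt16, KeyType::UInt32]
        .iter()
        .copied()
        .find(|k| k.capacity() >= distinct_values)
}
```

A table with a few dozen partition values fits in the narrowest key, while one with thousands of files needs a wider key at the cost of more bytes per row.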
Got here as a downstream user who is affected by this change, so posting here in case others are working through the same thing.
Thinking through this, I wouldn't totally write off dictionary encoding integers as useless, since there are still benefits to dictionary arrays besides space savings. They essentially mark columns as having low cardinality and provide the set of unique values. Any scalar compute functions run on these columns can be applied to the dictionary while leaving the indices buffer untouched. That is an easy way to achieve what I would expect out of a "smart" compute engine: when projecting partition columns, project the distinct values rather than the expanded/materialized array. It's possible DataFusion already handles this in a smart way I'm unaware of, though.
I'd also note that the ideal partition column types are probably run-end encoded arrays (
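The point about applying scalar functions to the dictionary only can be sketched as follows. `DictColumn` is a toy stand-in written for this example, not arrow's `DictionaryArray`, and it ignores nulls.

```rust
/// Minimal stand-in for a dictionary-encoded column: `keys[i]` indexes into `values`.
struct DictColumn {
    keys: Vec<u16>,
    values: Vec<i64>,
}

impl DictColumn {
    /// Applies `f` once per dictionary entry; the keys buffer is reused untouched.
    fn map_values(self, f: impl Fn(i64) -> i64) -> DictColumn {
        DictColumn {
            keys: self.keys,
            values: self.values.into_iter().map(f).collect(),
        }
    }

    /// Expands the column into one value per row.
    fn materialize(&self) -> Vec<i64> {
        self.keys.iter().map(|&k| self.values[k as usize]).collect()
    }
}
```

For a column with four rows and two distinct values, `map_values` calls `f` twice rather than four times, and the saving grows with the number of rows per file.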
@wjones127 you can still dictionary-encode integers by specifying the column type in
Noted 👍 I'm not objecting to this change. Just wanted to provide information to any other developers who end up reading this PR and are thinking about how they will adapt their code.
FWIW this isn't how dictionaries are implemented today: there are various situations where the dictionary will contain values not referenced by an index, and/or the same value repeated multiple times. This is to avoid having to recompute dictionaries, which is incredibly expensive. As it currently stands, primitive dictionaries will almost always be less efficient from both a memory usage and a performance standpoint.
Which issue does this PR close?
-
Rationale for this change
Let the user decide if they may want to encode partition values for file-based data sources. Dictionary encoding makes sense for string values but is probably pointless or even counterproductive for integer types.
What changes are included in this PR?
partition_columns
is now the actual output typeDict(u16, utf8)
Are these changes tested?
Adjusted existing tests.
Are there any user-facing changes?
BREAKING: Types for partition columns in file-based sources are no longer dictionary encoded by default. The user MUST choose a dictionary type if they want to achieve this.