Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support arbitrary user defined partition column in ListingTable (rather than assuming they are always Dictionary encoded) #5545

Merged
merged 5 commits into from
Mar 15, 2023

Conversation

crepererum
Copy link
Contributor

Which issue does this PR close?

-

Rationale for this change

Let the user decide if they may want to encode partition values for file-based data sources. Dictionary encoding makes sense for string values but is probably pointless or even counterproductive for integer types.

What changes are included in this PR?

  • the data type specified partition_columns is now the actual output type
  • SQL mapping has been changed to follow that EXCEPT the default for unspecified types which I kept at Dict(u16, utf8)
  • I kept the "buffer reuse" logic that was in there but extended it more dictionary types. I use a type-safe approach here because Arrow buffers may become typed at some point w/ the arrow2 merger

Are these changes tested?

Adjusted existing tests.

Are there any user-facing changes?

BREAKING: Types for partition columns in file-based sources are no longer dictionary encoded by default. The user MUST choose a dictionary type if they want to achieve this.

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Mar 10, 2023
Let the user decide if they may want to encode partition values for
file-based data sources. Dictionary encoding makes sense for string
values but is probably pointless or even counterproductive for integer
types.
@crepererum crepererum force-pushed the crepererum/fix_partition_value_dtype branch from dcfbc1e to 2a5f22f Compare March 10, 2023 12:14
let expected_schema = Schema::new(vec![
Field::new("id", DataType::Int32, true),
Field::new("bool_col", DataType::Boolean, true),
Field::new("tinyint_col", DataType::Int32, true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats looks like typo -> tinyint_col should be DataType::UInt8 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema that I get from infer_schema (see two statements above) contains an Int32. I haven't touched this code and this is the current state on main. Not saying this is correct though, it might be a pre-existing bug. Note that the schema wasn't tested before, that's why I added this assertion.

@comphead
Copy link
Contributor

BREAKING: Types for partition columns in file-based sources are no longer dictionary encoded by default. The user MUST choose a dictionary type if they want to achieve this. -> @crepererum could you please provide a code example of how user code should be changed

@crepererum
Copy link
Contributor Author

Breaking Change

Before

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Utf8,
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Utf8(Some("foo".to_owned())),
        ...
    ],
    ...
 };

After (exact)

If you want an exact conversion:

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Dictionary(
                Box::new(DataType::UInt16),
                Box::new(DataType::Utf8),
            ),
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Dictionary(
            Box::new(DataType::UInt16),
            Box::new(ScalarValue::Utf8(Some("foo".to_owned()))),
        ),
        ...
    ],
    ...
 };

After (alternative)

You may just decide that you don't to dictionary-encode at all:

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Utf8,
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Utf8(Some("foo".to_owned())),
        ...
    ],
    ...
 };

or that you want a different dictionary key type:

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Dictionary(
                Box::new(DataType::Int8),
                Box::new(DataType::Utf8),
            ),
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Dictionary(
            Box::new(DataType::Int8),
            Box::new(ScalarValue::Utf8(Some("foo".to_owned()))),
        ),
        ...
    ],
    ...
 };

Note that in all cases, the types in FileScanConfig::table_partition_cols and PartitionedFile::partition_values MUST be in-sync (strictly speaking that hasn't changed, before this PR both had been converted to UInt16 dictionary types).

@alamb alamb added the api change Changes the API exposed to users of the crate label Mar 13, 2023
@alamb alamb changed the title refactor: user may choose to dict-encode partition values Support arbitrary user defined partition column in ListingTable (rather than assuming they are always Dictionary encoded) Mar 13, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original rationale from @rdettai for creating partition columns these as dictionary columns is that the column contains the same value for all rows from a particular file.

I agree the usecase is not as helpful for integer types and larger integer columns so I can see the argument there.

@comphead 's point that this might be a silent API change is quite a good one -- for example if your partition column is a Utf8 this change might slow down your performance significantly without any changes in code.

Maybe we could change the API to force people to change on upgrade:

    pub table_partition_cols: Vec<(String, DataType)>,

to something like (to force the API to change)

    pub table_partition_cols: Vec<Field>,

And then add some a method and docs to help the upgrade

impl ListingOptions {
  /// adds a dictionary encoded partitioning column
  with_partition_column(mut self, name: impl Into<String>, datatype: DataType) {
....
}
}

🤔

In any event while reviewing this PR I noticed the docs are not very clear for this feature, so I will make a PR to improve that.

@alamb
Copy link
Contributor

alamb commented Mar 13, 2023

(BTW I can find time to work on the API in the next day or two if needed)

@alamb
Copy link
Contributor

alamb commented Mar 13, 2023

Added docs here #5576

@crepererum
Copy link
Contributor Author

to something like (to force the API to change) ... Field

Field contains other information that is irrelevant here like nullability. We could assert on the correct value though (i.e. not nullable). Also this only concerns on of the two API changes, the other one (PartitionedFile) is still silently breaking. To avoid that we would need to rename partition_values to something else. However I'm not sure if a rename would trigger people to not only follow the rename but also change the types correctly.

And then add some a method and docs to help the upgrade

Since the user needs to adjust their code anyways, I'm not sure if such a random method helps. It's hard to find and because the uint16 dict type is so arbitrary, I'm not sure it should exist in the long run.

@alamb
Copy link
Contributor

alamb commented Mar 13, 2023

Field contains other information that is irrelevant here like nullability. We could assert on the correct value though (i.e. not nullable). Also this only concerns on of the two API changes, the other one (PartitionedFile) is still silently breaking. To avoid that we would need to rename partition_values to something else. However I'm not sure if a rename would trigger people to not only follow the rename but also change the types correctly.

Good point re Field and non relevant columns

What about a type that made the dictionary encoding explict, something like

struct PartitionedColumn { 
  name: String,
  data_type: DataType,
  dictionary_index_type: Option<DataType>
}

And then have a new function that defaults to Dictionary 🤔 But that is kind of ugly as well and now what happens if the data_type was a dictionary in the first place ...

@crepererum
Copy link
Contributor Author

now what happens if the data_type was a dictionary in the first place ...

You get a double-encoded dictionary? I don't know, but IMHO data_type+dictionary_index_type screams "just use DataType instead of inventing yet another type".

@alamb
Copy link
Contributor

alamb commented Mar 13, 2023

You get a double-encoded dictionary? I don't know, but IMHO data_type+dictionary_index_type screams "just use DataType instead of inventing yet another type"

Yes I agree -- that is what probably should have been done in the first place, but it was not. So now we need to figure out a reasonable way to help avoid subtle regressions.

Maybe we can handle it by updating the docs / making it easy to do the right thing (use dictionary) 🤔

@crepererum
Copy link
Contributor Author

crepererum commented Mar 14, 2023

I've made the following changes:

  • partition_type_wrap and partition_value_wrap helper methods are now public
  • docs mention these helpers (likely conflicts w/ Minor: Add more documentation about table_partition_columns #5576 but I can rebase after that one is merged)
  • there's a transition auto-fix that adds the dictionary wrapping to the value if the type demands it but spits out a warning. this is also tested

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the best we are going to do now. I think the docs in PR with the combination of #5576 should help ease the transition

Thank you @crepererum

I will leave this open for another day before merging so that others have a chance to comment if they desire.

cc @yahoNanJing (do you know if Ballista uses these partition columns)?

@@ -64,6 +65,26 @@ use std::{

use super::{ColumnStatistics, Statistics};

/// Convert logical type of partition column to physical type: `Dictionary(UInt16, val_type)`.
///
/// You CAN use this to specify types for partition columns. However you MAY also choose not to dictionary-encode the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good -- I think is important to try and help people choose when to use these functions, but I can add that to #5576 as a follow on

I also think these functions might be easier to find if they are named something more connected to what they do (dictionary encode). Perhaps wrap_partition_type_in_dict but that is just a preference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed as suggested

@alamb
Copy link
Contributor

alamb commented Mar 15, 2023

This PR had a merge conflict so I took the liberty of fixing that and pushing

@alamb alamb merged commit 8b5dbd7 into apache:main Mar 15, 2023
@alamb
Copy link
Contributor

alamb commented Mar 15, 2023

BTW another reason this change is an improvement, in my mind, is that it allows users to tradeoff between more efficient encoding (e.g. Dict(Int8, Utf8)) and supporting more distinct values (Dict(Int16, Utf8)).

I think originally the code always used Dictionary(Int8, ..) and someone had the usecase with more than 256 distinct values (files) so we increased the size to Dict(Int16, ...).

So this change now allows people to make that tradeoff explcitly

@wjones127
Copy link
Member

wjones127 commented Apr 28, 2023

Got here as a downstream user who is affected by this change, so posting here in case others are working through the same thing. Thinking through this, I wouldn't totally write off dictionary encoding integers as useless, since there still are benefits to dictionary arrays besides space savings. They essentially mark columns as having low cardinality and provide the set of unique values. Any scalar compute functions run on these columns can be applied to the dictionary while leaving the indices buffer untouched. That is an easy to way to achieve what I would expect out of a "smart" compute engine: when projecting partition columns, project the distinct values rather than the expanded/materialized array. It's possible DataFusion already handles this in a smart way I'm unaware of though.

I'd also note that the ideal partition column types are probably run-end encoded arrays (RunArray), once they are implemented.

@crepererum
Copy link
Contributor Author

@wjones127 you can still dictionary-encode integers by specifying the column type in FileScanConfig as DataType::Dictionary(Box::new(DataType::Int64), Box::new(DataType::UInt8)) (for example), see #5545 (comment)

@wjones127
Copy link
Member

Noted 👍 I'm not objecting to this change. Just wanted to provide information to any other developers who end up reading this PR and are thinking about how they will adapt their code.

@tustvold
Copy link
Contributor

tustvold commented Jun 12, 2023

provide the set of unique values

FWIW this isn't how dictionaries are implemented today, there are various situations where the dictionary will contain values not referenced by an index, and/or the same value repeated multiple times. This is to avoid having to recompute dictionaries which is incredibly expensive. As it currently stands primitive dictionaries will almost always be less efficient both from a memory usage and performance standpoint

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate enhancement New feature or request sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants