Support arbitrary user defined partition column in `ListingTable` (rather than assuming they are always Dictionary encoded) #5545

crepererum · 2023-03-10T11:27:40Z

Which issue does this PR close?

-

Rationale for this change

Let the user decide if they may want to encode partition values for file-based data sources. Dictionary encoding makes sense for string values but is probably pointless or even counterproductive for integer types.

What changes are included in this PR?

the data type specified partition_columns is now the actual output type
SQL mapping has been changed to follow that EXCEPT the default for unspecified types which I kept at Dict(u16, utf8)
I kept the "buffer reuse" logic that was in there but extended it more dictionary types. I use a type-safe approach here because Arrow buffers may become typed at some point w/ the arrow2 merger

Are these changes tested?

Adjusted existing tests.

Are there any user-facing changes?

BREAKING: Types for partition columns in file-based sources are no longer dictionary encoded by default. The user MUST choose a dictionary type if they want to achieve this.

Let the user decide if they may want to encode partition values for file-based data sources. Dictionary encoding makes sense for string values but is probably pointless or even counterproductive for integer types.

comphead · 2023-03-10T17:15:21Z

datafusion/core/src/physical_plan/file_format/parquet.rs

+        let expected_schema = Schema::new(vec![
+            Field::new("id", DataType::Int32, true),
+            Field::new("bool_col", DataType::Boolean, true),
+            Field::new("tinyint_col", DataType::Int32, true),


thats looks like typo -> tinyint_col should be DataType::UInt8 ?

The schema that I get from infer_schema (see two statements above) contains an Int32. I haven't touched this code and this is the current state on main. Not saying this is correct though, it might be a pre-existing bug. Note that the schema wasn't tested before, that's why I added this assertion.

comphead · 2023-03-10T17:25:10Z

BREAKING: Types for partition columns in file-based sources are no longer dictionary encoded by default. The user MUST choose a dictionary type if they want to achieve this. -> @crepererum could you please provide a code example of how user code should be changed

crepererum · 2023-03-13T10:22:09Z

Breaking Change

Before

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Utf8,
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Utf8(Some("foo".to_owned())),
        ...
    ],
    ...
 };

After (exact)

If you want an exact conversion:

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Dictionary(
                Box::new(DataType::UInt16),
                Box::new(DataType::Utf8),
            ),
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Dictionary(
            Box::new(DataType::UInt16),
            Box::new(ScalarValue::Utf8(Some("foo".to_owned()))),
        ),
        ...
    ],
    ...
 };

After (alternative)

You may just decide that you don't to dictionary-encode at all:

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Utf8,
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Utf8(Some("foo".to_owned())),
        ...
    ],
    ...
 };

or that you want a different dictionary key type:

let file_scan_config = FileScanConfig {
    table_partition_cols: vec![
        (
            "group".to_owned(),
            DataType::Dictionary(
                Box::new(DataType::Int8),
                Box::new(DataType::Utf8),
            ),
        ),
        ...
    ],
    ...
};
let partitioned_file = PartitionedFile {
    partition_values: vec![
        ScalarValue::Dictionary(
            Box::new(DataType::Int8),
            Box::new(ScalarValue::Utf8(Some("foo".to_owned()))),
        ),
        ...
    ],
    ...
 };

Note that in all cases, the types in FileScanConfig::table_partition_cols and PartitionedFile::partition_values MUST be in-sync (strictly speaking that hasn't changed, before this PR both had been converted to UInt16 dictionary types).

alamb

I think the original rationale from @rdettai for creating partition columns these as dictionary columns is that the column contains the same value for all rows from a particular file.

I agree the usecase is not as helpful for integer types and larger integer columns so I can see the argument there.

@comphead 's point that this might be a silent API change is quite a good one -- for example if your partition column is a Utf8 this change might slow down your performance significantly without any changes in code.

Maybe we could change the API to force people to change on upgrade:

    pub table_partition_cols: Vec<(String, DataType)>,

to something like (to force the API to change)

    pub table_partition_cols: Vec<Field>,

And then add some a method and docs to help the upgrade

impl ListingOptions {
  /// adds a dictionary encoded partitioning column
  with_partition_column(mut self, name: impl Into<String>, datatype: DataType) {
....
}
}

🤔

In any event while reviewing this PR I noticed the docs are not very clear for this feature, so I will make a PR to improve that.

alamb · 2023-03-13T16:19:13Z

(BTW I can find time to work on the API in the next day or two if needed)

alamb · 2023-03-13T16:34:12Z

Added docs here #5576

crepererum · 2023-03-13T16:44:44Z

to something like (to force the API to change) ... Field

Field contains other information that is irrelevant here like nullability. We could assert on the correct value though (i.e. not nullable). Also this only concerns on of the two API changes, the other one (PartitionedFile) is still silently breaking. To avoid that we would need to rename partition_values to something else. However I'm not sure if a rename would trigger people to not only follow the rename but also change the types correctly.

And then add some a method and docs to help the upgrade

Since the user needs to adjust their code anyways, I'm not sure if such a random method helps. It's hard to find and because the uint16 dict type is so arbitrary, I'm not sure it should exist in the long run.

alamb · 2023-03-13T16:50:47Z

Field contains other information that is irrelevant here like nullability. We could assert on the correct value though (i.e. not nullable). Also this only concerns on of the two API changes, the other one (PartitionedFile) is still silently breaking. To avoid that we would need to rename partition_values to something else. However I'm not sure if a rename would trigger people to not only follow the rename but also change the types correctly.

Good point re Field and non relevant columns

What about a type that made the dictionary encoding explict, something like

struct PartitionedColumn { 
  name: String,
  data_type: DataType,
  dictionary_index_type: Option<DataType>
}

And then have a new function that defaults to Dictionary 🤔 But that is kind of ugly as well and now what happens if the data_type was a dictionary in the first place ...

crepererum · 2023-03-13T16:57:45Z

now what happens if the data_type was a dictionary in the first place ...

You get a double-encoded dictionary? I don't know, but IMHO data_type+dictionary_index_type screams "just use DataType instead of inventing yet another type".

alamb · 2023-03-13T17:48:29Z

You get a double-encoded dictionary? I don't know, but IMHO data_type+dictionary_index_type screams "just use DataType instead of inventing yet another type"

Yes I agree -- that is what probably should have been done in the first place, but it was not. So now we need to figure out a reasonable way to help avoid subtle regressions.

Maybe we can handle it by updating the docs / making it easy to do the right thing (use dictionary) 🤔

crepererum · 2023-03-14T09:38:52Z

I've made the following changes:

partition_type_wrap and partition_value_wrap helper methods are now public
docs mention these helpers (likely conflicts w/ Minor: Add more documentation about table_partition_columns #5576 but I can rebase after that one is merged)
there's a transition auto-fix that adds the dictionary wrapping to the value if the type demands it but spits out a warning. this is also tested

alamb

I think this is the best we are going to do now. I think the docs in PR with the combination of #5576 should help ease the transition

Thank you @crepererum

I will leave this open for another day before merging so that others have a chance to comment if they desire.

cc @yahoNanJing (do you know if Ballista uses these partition columns)?

datafusion/core/src/physical_plan/file_format/parquet.rs

alamb · 2023-03-14T10:41:03Z

datafusion/core/src/physical_plan/file_format/mod.rs

@@ -64,6 +65,26 @@ use std::{

 use super::{ColumnStatistics, Statistics};

+/// Convert logical type of partition column to physical type: `Dictionary(UInt16, val_type)`.
+///
+/// You CAN use this to specify types for partition columns. However you MAY also choose not to dictionary-encode the


This looks good -- I think is important to try and help people choose when to use these functions, but I can add that to #5576 as a follow on

I also think these functions might be easier to find if they are named something more connected to what they do (dictionary encode). Perhaps wrap_partition_type_in_dict but that is just a preference

renamed as suggested

…e}_in_dict`

…ion_value_dtype

alamb · 2023-03-15T11:22:24Z

This PR had a merge conflict so I took the liberty of fixing that and pushing

alamb · 2023-03-15T12:36:05Z

BTW another reason this change is an improvement, in my mind, is that it allows users to tradeoff between more efficient encoding (e.g. Dict(Int8, Utf8)) and supporting more distinct values (Dict(Int16, Utf8)).

I think originally the code always used Dictionary(Int8, ..) and someone had the usecase with more than 256 distinct values (files) so we increased the size to Dict(Int16, ...).

So this change now allows people to make that tradeoff explcitly

wjones127 · 2023-04-28T03:06:32Z

Got here as a downstream user who is affected by this change, so posting here in case others are working through the same thing. Thinking through this, I wouldn't totally write off dictionary encoding integers as useless, since there still are benefits to dictionary arrays besides space savings. They essentially mark columns as having low cardinality and provide the set of unique values. Any scalar compute functions run on these columns can be applied to the dictionary while leaving the indices buffer untouched. That is an easy to way to achieve what I would expect out of a "smart" compute engine: when projecting partition columns, project the distinct values rather than the expanded/materialized array. It's possible DataFusion already handles this in a smart way I'm unaware of though.

I'd also note that the ideal partition column types are probably run-end encoded arrays (RunArray), once they are implemented.

crepererum · 2023-04-28T09:26:13Z

@wjones127 you can still dictionary-encode integers by specifying the column type in FileScanConfig as DataType::Dictionary(Box::new(DataType::Int64), Box::new(DataType::UInt8)) (for example), see #5545 (comment)

wjones127 · 2023-04-28T18:10:32Z

Noted 👍 I'm not objecting to this change. Just wanted to provide information to any other developers who end up reading this PR and are thinking about how they will adapt their code.

tustvold · 2023-06-12T16:19:04Z

provide the set of unique values

FWIW this isn't how dictionaries are implemented today, there are various situations where the dictionary will contain values not referenced by an index, and/or the same value repeated multiple times. This is to avoid having to recompute dictionaries which is incredibly expensive. As it currently stands primitive dictionaries will almost always be less efficient both from a memory usage and performance standpoint

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Mar 10, 2023

refactor: user may choose to dict-encode partition values

2a5f22f

Let the user decide if they may want to encode partition values for file-based data sources. Dictionary encoding makes sense for string values but is probably pointless or even counterproductive for integer types.

crepererum force-pushed the crepererum/fix_partition_value_dtype branch from dcfbc1e to 2a5f22f Compare March 10, 2023 12:14

comphead reviewed Mar 10, 2023

View reviewed changes

alamb added the api change Changes the API exposed to users of the crate label Mar 13, 2023

alamb changed the title ~~refactor: user may choose to dict-encode partition values~~ Support arbitrary user defined partition column in ListingTable (rather than assuming they are always Dictionary encoded) Mar 13, 2023

alamb reviewed Mar 13, 2023

View reviewed changes

alamb mentioned this pull request Mar 13, 2023

Minor: Add more documentation about table_partition_columns #5576

Merged

refactor: improve transition

5cac3a8

alamb approved these changes Mar 14, 2023

View reviewed changes

crepererum and others added 3 commits March 14, 2023 12:03

fix: remove leftover dbg

0ee0395

refactor: partition_{type,value}_wrap -> `wrap_partition_{type,valu…

340dd80

…e}_in_dict`

Merge remote-tracking branch 'apache/main' into crepererum/fix_partit…

5301fc1

…ion_value_dtype

alamb merged commit 8b5dbd7 into apache:main Mar 15, 2023

andygrove added the enhancement New feature or request label Apr 7, 2023

wjones127 mentioned this pull request Apr 28, 2023

fix: scan partitioned tables with datafusion delta-io/delta-rs#1303

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support arbitrary user defined partition column in `ListingTable` (rather than assuming they are always Dictionary encoded) #5545

Support arbitrary user defined partition column in `ListingTable` (rather than assuming they are always Dictionary encoded) #5545

crepererum commented Mar 10, 2023

comphead Mar 10, 2023

crepererum Mar 13, 2023

comphead commented Mar 10, 2023

crepererum commented Mar 13, 2023

alamb left a comment

alamb commented Mar 13, 2023

alamb commented Mar 13, 2023

crepererum commented Mar 13, 2023

alamb commented Mar 13, 2023

crepererum commented Mar 13, 2023

alamb commented Mar 13, 2023

crepererum commented Mar 14, 2023 •

edited

Loading

alamb left a comment

alamb Mar 14, 2023

crepererum Mar 14, 2023

alamb commented Mar 15, 2023

alamb commented Mar 15, 2023

wjones127 commented Apr 28, 2023 •

edited

Loading

crepererum commented Apr 28, 2023

wjones127 commented Apr 28, 2023

tustvold commented Jun 12, 2023 •

edited

Loading

Support arbitrary user defined partition column in ListingTable (rather than assuming they are always Dictionary encoded) #5545

Support arbitrary user defined partition column in ListingTable (rather than assuming they are always Dictionary encoded) #5545

Conversation

crepererum commented Mar 10, 2023

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

comphead Mar 10, 2023

Choose a reason for hiding this comment

crepererum Mar 13, 2023

Choose a reason for hiding this comment

comphead commented Mar 10, 2023

crepererum commented Mar 13, 2023

Breaking Change

Before

After (exact)

After (alternative)

alamb left a comment

Choose a reason for hiding this comment

alamb commented Mar 13, 2023

alamb commented Mar 13, 2023

crepererum commented Mar 13, 2023

alamb commented Mar 13, 2023

crepererum commented Mar 13, 2023

alamb commented Mar 13, 2023

crepererum commented Mar 14, 2023 • edited Loading

alamb left a comment

Choose a reason for hiding this comment

alamb Mar 14, 2023

Choose a reason for hiding this comment

crepererum Mar 14, 2023

Choose a reason for hiding this comment

alamb commented Mar 15, 2023

alamb commented Mar 15, 2023

wjones127 commented Apr 28, 2023 • edited Loading

crepererum commented Apr 28, 2023

wjones127 commented Apr 28, 2023

tustvold commented Jun 12, 2023 • edited Loading

Support arbitrary user defined partition column in `ListingTable` (rather than assuming they are always Dictionary encoded) #5545

Support arbitrary user defined partition column in `ListingTable` (rather than assuming they are always Dictionary encoded) #5545

crepererum commented Mar 14, 2023 •

edited

Loading

wjones127 commented Apr 28, 2023 •

edited

Loading

tustvold commented Jun 12, 2023 •

edited

Loading