Introduce Table Property delta.dataSkippingStatsColumns #1763

Closed

Contributor

kamcheungting-db commented May 15, 2023

Description

This PR introduces a new table property, delta.dataSkippingStatsColumns, that lets users specify the set of columns for which file-skipping statistics are collected.

The syntax of setting this table property is:

CREATE TABLE <Table Name> (<Column Definition>)
TBLPROPERTIES('delta.dataSkippingStatsColumns' = '[<column-identifier-1>, …, <column-identifier-N>]');

ALTER TABLE <Table Name>
SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = '[<column-identifier-1>, …, <column-identifier-N>]');

The CREATE TABLE and ALTER TABLE command handlers validate every column listed in delta.dataSkippingStatsColumns.
Both commands fail if:

  • any listed column does not exist,
  • any listed column has a type that does not support data skipping,
  • any listed column is a partition column.
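
The three failure conditions above can be sketched as a single validation pass. This is an illustrative Python sketch, not Delta's Scala implementation: the function name `validate_stats_columns`, the flat name-to-type schema dict, and the `SKIPPING_ELIGIBLE_TYPES` set are all assumptions made for the example, and case-insensitive matching is left out for brevity.

```python
# Hypothetical set of type names eligible for data skipping; the real list
# is defined by Delta and may differ.
SKIPPING_ELIGIBLE_TYPES = {"byte", "short", "int", "long", "float", "double",
                           "date", "timestamp", "string"}


def validate_stats_columns(stats_columns, schema, partition_columns):
    """Raise ValueError if any requested stats column is invalid.

    stats_columns:     list of column names taken from the table property
    schema:            dict mapping column name -> type name (flat, for brevity)
    partition_columns: set of partition column names
    """
    for name in stats_columns:
        if name not in schema:
            raise ValueError(f"column does not exist: {name}")
        if schema[name] not in SKIPPING_ELIGIBLE_TYPES:
            raise ValueError(f"column type does not support data skipping: {name}")
        if name in partition_columns:
            raise ValueError(f"partition column cannot collect stats: {name}")


schema = {"c0": "long", "c1": "binary", "p": "string"}
validate_stats_columns(["c0"], schema, {"p"})  # passes silently
```

Passing `["missing"]`, `["c1"]`, or `["p"]` instead raises `ValueError`, one case per bullet.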

If a user drops a column, the corresponding entry in delta.dataSkippingStatsColumns is also removed.

If a user renames a column, the corresponding entry in delta.dataSkippingStatsColumns is also renamed.
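
A rough sketch of how the property could track schema changes, written in Python for brevity. The function names and the tuple-based path representation are assumptions for illustration, as is the choice to also drop or rename entries nested under the affected column; the PR's Scala code may handle these cases differently.

```python
def drop_column(stats_paths, dropped):
    """Remove every stats entry equal to, or nested under, the dropped path."""
    n = len(dropped)
    return [p for p in stats_paths if p[:n] != dropped]


def rename_column(stats_paths, old, new):
    """Rewrite the leading part of every stats entry that starts with `old`."""
    n = len(old)
    return [new + p[n:] if p[:n] == old else p for p in stats_paths]


# Each path is a tuple of field names; ("c1", "c11") stands for c1.c11.
paths = [("c0",), ("c1", "c11"), ("c1", "c13", "c131")]

print(drop_column(paths, ("c1", "c13")))
# [('c0',), ('c1', 'c11')]
print(rename_column(paths, ("c1",), ("renamed",)))
# [('c0',), ('renamed', 'c11'), ('renamed', 'c13', 'c131')]
```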

The OPTIMIZE ZORDER command also recognizes delta.dataSkippingStatsColumns.

delta.dataSkippingStatsColumns contains a comma-separated list of column identifiers. Each column identifier is represented as:

column-identifier

An identifier is a string used to identify a database object such as a table, view, schema, or column. Both regular identifiers and delimited identifiers are case-insensitive. A regular column identifier consists of the following characters: { letter | digit | '_' } [ , ... ]. If the column identifier contains special characters, which include !@#$%^&*()_+-={}|[]\:";'<>,.?/, the column name should be enclosed in backticks (`).
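
To illustrate how such a list can be split, here is a minimal Python sketch. It is not the `StatisticsCollection.parseDeltaStatsColumnNames` routine from this PR: the function name `parse_stats_columns` is hypothetical, and the sketch ignores whitespace outside backticks and does not support escaped backticks inside a quoted name.

```python
def parse_stats_columns(value):
    """Split a comma-separated identifier list into column paths.

    Each path is a list of name parts. Dots separate nested fields, and
    backticks quote a part so that ',' and '.' inside it are literal.
    """
    if not value.strip():
        return []  # an empty property value means no columns
    paths, parts, buf = [], [], []
    in_quote = False
    for ch in value:
        if ch == "`":
            in_quote = not in_quote      # toggle quoted mode
        elif in_quote:
            buf.append(ch)               # everything inside backticks is literal
        elif ch.isspace():
            continue                     # skip padding around identifiers
        elif ch == ".":
            parts.append("".join(buf))   # end of one name part
            buf = []
        elif ch == ",":
            parts.append("".join(buf))   # end of one column path
            paths.append(parts)
            parts, buf = [], []
        else:
            buf.append(ch)
    if in_quote:
        raise ValueError("unterminated backtick in column list")
    parts.append("".join(buf))
    paths.append(parts)
    return paths


print(parse_stats_columns("c0, `c-1`, c1.c13.c131"))
# [['c0'], ['c-1'], ['c1', 'c13', 'c131']]
```

Quoting is what lets a name contain the separators themselves: in this sketch, `` `a,b`.`c.d` `` parses to the single path `['a,b', 'c.d']`.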

Examples

CREATE TABLE

CREATE TABLE T1 (c0 long, `c-1` long, `c!@#$` long) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, `c-1`, `c!@#$`');

ALTER TABLE

CREATE TABLE T2 (c0 long, `c-1` long, `c!@#$` long) using delta;

ALTER TABLE T2 SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'c0, `c-1`, `c!@#$`');

NESTED COLUMN

CREATE TABLE T3 (c0 long, c1 STRUCT <c11: String, c12: long, c13 STRUCT <c131: long, c132: long>>) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, c1.c11, c1.c12, c1.c13.c131, c1.c13.c132');

CREATE TABLE T4 (c0 long, c1 STRUCT <c11: String, c12: long, c13 STRUCT <c131: long, c132: long>>) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, c1.c11, c1.c12, c1.c13');

CREATE TABLE T5 (c0 long, c1 STRUCT <c11: String, c12: long, c13 STRUCT <c131: long, c132: long>>) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, c1');
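
According to the property's doc comment in this PR, specifying a nested column collects stats for all leaf fields of that column, so the T3, T4, and T5 definitions should cover the same leaves. The Python sketch below shows that expansion; the nested-dict schema and the `leaf_paths` helper are assumptions made for the example, not part of Delta.

```python
def leaf_paths(schema, path):
    """Return every leaf field path at or below `path` in a nested schema.

    schema: dict mapping field name -> type name (str) or nested dict (struct)
    path:   tuple of field names, e.g. ("c1", "c13")
    """
    node = schema
    for name in path:
        node = node[name]            # walk down to the requested field
    if not isinstance(node, dict):
        return [path]                # already a leaf
    leaves = []
    for child in node:
        leaves.extend(leaf_paths(schema, path + (child,)))
    return leaves


schema = {"c0": "long",
          "c1": {"c11": "string", "c12": "long",
                 "c13": {"c131": "long", "c132": "long"}}}

print(leaf_paths(schema, ("c1",)))
# [('c1', 'c11'), ('c1', 'c12'), ('c1', 'c13', 'c131'), ('c1', 'c13', 'c132')]
```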

How was this patch tested?

Unit Test:

  1. Validates the procedure for setting the table property:
    1.1. detects non-existent columns,
    1.2. supports nested columns,
    1.3. detects unsupported column types,
    1.4. handles columns with special characters,
    1.5. handles columns with invalid data types, both nested and non-nested,
    1.6. handles partition columns, both nested and non-nested,
    1.7. handles duplicated columns, both nested and non-nested.
  2. Handles dropped columns:
    2.1. dropping nested columns from a table,
    2.2. dropping flat columns from a table.
  3. Handles renamed columns:
    3.1. renaming nested columns in a table,
    3.2. renaming flat columns in a table.
  4. OPTIMIZE ZORDER recognizes the Delta statistics columns.
  5. All column mapping modes are supported.

Does this PR introduce any user-facing changes?

Contributor

tdas left a comment

Overall this looks great! This really adds new flexibility for Delta users to customize their stats collection.

null,
v => Option(v),
vOpt => vOpt.forall(v => StatisticsCollection.parseDeltaStatsColumnNames(v).isDefined),
"needs to be a (possibly empty) comma-separated list of column names.")
Contributor

What is the format for specifying nested columns? And what if the column name has a comma in it? Could you provide an example with all these corner cases?

Contributor Author

Nested columns are specified using dot (.) notation.
The following are examples of nested columns:

CREATE TABLE T1 (c0 long, c1 STRUCT <c11: String, c12: long, c13 STRUCT <c131: long, c132: long>>) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, c1.c11, c1.c12, c1.c13.c131, c1.c13.c132');

CREATE TABLE T2 (c0 long, c1 STRUCT <c11: String, c12: long, c13 STRUCT <c131: long, c132: long>>) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, c1.c11, c1.c12, c1.c13');

CREATE TABLE T3 (c0 long, c1 STRUCT <c11: String, c12: long, c13 STRUCT <c131: long, c132: long>>) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, c1');

Contributor Author

Examples of delta.dataSkippingStatsColumns are added to description.

Contributor Author

A few test cases were also added to the code.

Contributor Author

Do you mean I should update the helpMessage of this new table property?
Updated now.

* @param statsColPaths the specific set of columns to collect stats on.
* @param mappingMode the column mapping mode of this statistics collection.
* @param parentPath the parent column path of `schema`.
* @return filtered schema
Contributor

why is it optional?

Contributor Author

removed.

mappingMode: DeltaColumnMappingMode,
parentPath: Seq[String] = Seq.empty): Option[StructType] = {
// Find the unique column names at this nesting depth, each with its path remainders (if any)
val cols = statsColPaths.groupBy(_.head).mapValues(_.map(_.tail))
Contributor

I think this is a fairly complicated block of code to understand easily. My suggestion would be to add more inline comments. For example, I am staring at the cryptic line above and I don't know what the type of cols is. Is it a map? A Seq?

Depending on that, the code below could be a quadratic-time operation, couldn't it?
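
To make the question concrete: in Scala, `groupBy` on a sequence returns a map, so `cols` maps each first path element to the remainders of the paths that start with it. A rough Python equivalent is sketched below; the helper name `group_by_head` and the input paths are made up for illustration.

```python
from collections import defaultdict


def group_by_head(stats_col_paths):
    """Group column paths by first element, keeping each path's remainder."""
    grouped = defaultdict(list)
    for path in stats_col_paths:
        grouped[path[0]].append(path[1:])
    return dict(grouped)


paths = [["c0"], ["c1", "c11"], ["c1", "c13", "c131"]]
print(group_by_head(paths))
# {'c0': [[]], 'c1': [['c11'], ['c13', 'c131']]}
```

With a map, checking whether a schema field was requested is a keyed lookup rather than a scan over all paths, which is presumably what keeps the traversal from becoming quadratic.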

Contributor Author

Done

// If mapping mode is NoMapping or the dataSchemaName already contains the mapped
// column name, the schema mapping can be skipped.
if (mappingMode == NoMapping || schemaNames.contains(fullPath)) return field
val physicalName = field.metadata.getString("delta.columnMapping.physicalName")
Contributor

This constant should be defined somewhere; it's likely being used in multiple places.

Contributor Author

Done

* @param parentPath the parent column path of `schema`.
* @return filtered schema
*/
def filterSchema(
Contributor

This method, a pretty complex one, is used only internally in this class, so I suggest making it private. We don't want accidental misuse of this method in other places where it's not supposed to be used.

I suggest doing the same (marking as private) for all methods here that are currently not accessed from outside.

Contributor Author

Done

Contributor

Tagar commented May 22, 2023

Would Delta collect stats on the first 32 columns and columns specified in deltaStatsColumns?

@kamcheungting-db
Contributor Author

Would Delta collect stats on the first 32 columns and columns specified in deltaStatsColumns?

These two settings are mutually exclusive.
A user can specify either the first N columns or a named column list, not both.
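
The selection rule can be sketched in a few lines of Python. The function name `columns_to_index` and its parameters are hypothetical; the default of 32 is taken from the question above, and the precedence of the named list follows the property's doc comment in this PR.

```python
def columns_to_index(all_columns, stats_columns=None, num_indexed_cols=32):
    """Pick the columns that get file-skipping statistics.

    A named list, when set, takes precedence; otherwise fall back to the
    first `num_indexed_cols` columns of the table.
    """
    if stats_columns is not None:
        return list(stats_columns)
    return all_columns[:num_indexed_cols]


cols = [f"c{i}" for i in range(40)]
print(len(columns_to_index(cols)))        # 32
print(columns_to_index(cols, ["c35"]))    # ['c35']
```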

* The names of specific columns to collect stats on for data skipping. If present, it takes
* precedences over dataSkippingNumIndexedCols config, and the system will only collect stats for
* columns that exactly match those specified. If a nested column is specified, the system will
* collect stats for all leaf fields of that column. If a non-existing column is specified, it
Contributor

Suggested change
- * collect stats for all leaf fields of that column. If a non-existing column is specified, it
+ * collect stats for all leaf fields of that column. If a non-existent column is specified, it

Contributor

tdas left a comment

Looks great! This is a very highly requested feature in the community, so thank you for doing this.
