-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Table Property delta.dataSkippingStatsColumns
#1763
Introduce Table Property delta.dataSkippingStatsColumns
#1763
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall this looks great! this really adds a new flexibility for delta users to customize their stats collection.
null, | ||
v => Option(v), | ||
vOpt => vOpt.forall(v => StatisticsCollection.parseDeltaStatsColumnNames(v).isDefined), | ||
"needs to be a (possibly empty) comma-separated list of column names.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is format format of specifying nested columns? And what if the column name has a comma in it? Could you provide an example with all these corner cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nested column is represented by .
notation.
Following are the examples of the nested columns.
CREATE TABLE T1 (c0 long, c1 STRUCT <c11: String, c12: long, c13 STRUCT <c131: long, c132: long>>) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, c1.c11, c1.c12, c1.c13.c131, c1.c13.c132');
CREATE TABLE T2 (c0 long, c1 STRUCT <c11: String, c12: long, c13 STRUCT <c131: long, c132: long>>) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, c1.c11, c1.c12, c1.c13');
CREATE TABLE T3 (c0 long, c1 STRUCT <c11: String, c12: long, c13 STRUCT <c131: long, c132: long>>) using delta
TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'c0, c1');
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Examples of delta.dataSkippingStatsColumns
are added to description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few test cases are also added to the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean I should update the helpMessage of this new table property?
Updated now.
* @param statsColPaths the specific set of columns to collect stats on. | ||
* @param mappingMode the column mapping mode of this statistics collection. | ||
* @param parentPath the parent column path of `schema`. | ||
* @return filtered schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
mappingMode: DeltaColumnMappingMode, | ||
parentPath: Seq[String] = Seq.empty): Option[StructType] = { | ||
// Find the unique column names at this nesting depth, each with its path remainders (if any) | ||
val cols = statsColPaths.groupBy(_.head).mapValues(_.map(_.tail)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this is a fairly complicated block of code to understand easily. My suggestion would be to add more inline comments. For example, I am staring at this cryptic line above and i dont know what is the type of cols
.. is it a map? a Seq?
because depending on that ... the code below can be a quadratic-time operation.. isnt it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
core/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala
Show resolved
Hide resolved
// If mapping mode is NoMapping or the dataSchemaName already contains the mapped | ||
// column name, the schema mapping can be skipped. | ||
if (mappingMode == NoMapping || schemaNames.contains(fullPath)) return field | ||
val physicalName = field.metadata.getString("delta.columnMapping.physicalName") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this constant should be defined somewhere. its likely this is being used multiple places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* @param parentPath the parent column path of `schema`. | ||
* @return filtered schema | ||
*/ | ||
def filterSchema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method ... a pretty complex one.. is used only internally in this class. so i suggest make this private. we dont want accidental misuse of this method in other places where its not supposed to be use.
i suggest doing the same (mark as private) for all methods here that are currently not accessed from outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
core/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala
Outdated
Show resolved
Hide resolved
Would Delta collect stats on the first 32 columns and columns specified in deltaStatsColumns? |
8722f50
to
468fede
Compare
These two values are mutual exclusive. |
c39a044
to
4250f66
Compare
* The names of specific columns to collect stats on for data skipping. If present, it takes | ||
* precedences over dataSkippingNumIndexedCols config, and the system will only collect stats for | ||
* columns that exactly match those specified. If a nested column is specified, the system will | ||
* collect stats for all leaf fields of that column. If a non-existing column is specified, it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* collect stats for all leaf fields of that column. If a non-existing column is specified, it | |
* collect stats for all leaf fields of that column. If a non-existent column is specified, it |
5b8ba50
to
0d04f47
Compare
e52b0eb
to
61b6c76
Compare
1289575
to
9c328e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great! this is a very highly requested feature in the community, so thank you for doing this.
Description
This PR introduces new table property
dataSkippingStatsColumns
for users to specify the set of columns that collect file skipping statistics.The syntax of setting this table property is:
The CREATE TABLE and Alter command handler would validates that all
delta.dataSkippingStatsColumns
exists.These two commands would error out, when:
delta.dataSkippingStatsColumns
doesn't exist,delta.dataSkippingStatsColumns
doesn't support data skipping,delta.dataSkippingStatsColumns
is partitioned column.If user drops a column, the corresponding entry inside
delta.dataSkippingStatsColumns
would also be removed.If user renames a column, the corresponding entry inside
delta.dataSkippingStatsColumns
would also be renamed.The OPTIMIZE ZORDER command also recognizes the
delta.dataSkippingStatsColumns
.The
delta.dataSkippingStatsColumns
contains a list of column identifiers. Each column identifier is represented as:column-identifier
An identifier is a string used to identify a database object such as a table, view, schema, column. Both regular identifiers and delimited identifiers are case-insensitive. The Regular Column Identifier contains following characters:
{ letter | digit | '_' } [ , ... ]
. If there is special characters, which include!@#$%^&*()_+-={}|[]\:";'<>,.?/
, inside the column identifier ` should be used to escape the column name.Examples
CREATE TABLE
ALTER TABLE
NESTED COLUMN
How was this patch tested?
Unit Test:
1.1. detects non-existing columns,
1.2. support nested columns,
1.3. detects unsupported column types,
1.4. handles columns with special characters,
1.5. handles columns with invalid datatype in both nested and un-nested column,
1.6. handles partition columns in both nested and un-nested column.
1.7. handles duplicated columns in both nested and un-nested columns.
2.1. drop nested columns from a table
2.2. drop flat columns from a table
3.1. rename nested columns from a table
3.2. rename flat columns from a table
Does this PR introduce any user-facing changes?