-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark] Add support for sorting within partitions when Z-ordering #4006
base: master
Are you sure you want to change the base?
[Spark] Add support for sorting within partitions when Z-ordering #4006
Conversation
Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>
…value. Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>
…ording to curve. Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>
Hi @vkorukanti , @tdas , @scottsand-db , could you kindly review this PR when you have a moment? It introduces a configurable enhancement for Z-ordering that significantly improves read performance. I'm tagging you since you've reviewed and/or implemented PRs related to issue #1134. Your feedback would be highly appreciated! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this - @zedtang do you want to also take a look?
} | ||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
else { | |
} else { |
nit
.internal() | ||
.doc("If enabled, partitions are sorted on Z-order values for MDC. " + | ||
"This co-locates records with the same Z-order values in row groups, " + | ||
"which enables data skipping on the Parquet level.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"which enables data skipping on the Parquet level.") | |
"which facilitates data skipping on the Parquet level.") |
Nit: We don't want to imply that row-group skipping is disabled without this conf enabled.
@@ -74,12 +74,13 @@ trait SpaceFillingCurveClustering extends MultiDimClustering { | |||
val conf = df.sparkSession.sessionState.conf | |||
val numRanges = conf.getConf(DeltaSQLConf.MDC_NUM_RANGE_IDS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: we might want to increase the number of ranges when sorting within partitions when the row group size << the file size to increase the cardinality of the evaluated clustering expression
Which Delta project/connector is this regarding?
Description
Resolves #4000 by introducing a new configuration property
spark.databricks.io.skipping.mdc.sortWithinPartitions
that clusters records in row groups, within Parquet files, based on Z-order or Hilbert curve values. This improves data skipping on the Parquet level. Benchmarks included in the issue demonstrate speedups of approximately 8× and 11× on two different datasets. Please refer to the issue for more details.How was this patch tested?
Added test cases in
MultiDimClusteringSuite.scala
for Hilbert and Z-order curves.Does this PR introduce any user-facing changes?
Yes. This PR introduces a new configuration property
spark.databricks.io.skipping.mdc.sortWithinPartitions
. The property defaults tofalse
, ensuring that existing users remain unaffected unless they opt-in by setting it totrue
.Previous Behavior
Z-ordering did not sort data within partitions.
New Behavior
When the property is enabled,
sortWithinPartitions
is applied afterrepartitionByRange
inMultiDimClustering.scala
.