Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark] Add support for sorting within partitions when Z-ordering #4006

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

maltevelin
Copy link

@maltevelin maltevelin commented Dec 29, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Resolves #4000 by introducing a new configuration property spark.databricks.io.skipping.mdc.sortWithinPartitions that clusters records in row groups, within Parquet files, based on Z-order or Hilbert curve values. This improves data skipping on the Parquet level. Benchmarks included in the issue demonstrate speedups of approximately 8× and 11× on two different datasets. Please refer to the issue for more details.

How was this patch tested?

Added test cases in MultiDimClusteringSuite.scala for Hilbert and Z-order curves.

Does this PR introduce any user-facing changes?

Yes. This PR introduces a new configuration property spark.databricks.io.skipping.mdc.sortWithinPartitions. The property defaults to false, ensuring that existing users remain unaffected unless they opt-in by setting it to true.

Previous Behavior
Z-ordering did not sort data within partitions.

New Behavior
When the property is enabled, sortWithinPartitions is applied after repartitionByRange in MultiDimClustering.scala.

Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>
…value.

Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>
…ording to curve.

Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>
@maltevelin
Copy link
Author

Hi @vkorukanti , @tdas , @scottsand-db , could you kindly review this PR when you have a moment? It introduces a configurable enhancement for Z-ordering that significantly improves read performance. I'm tagging you since you've reviewed and/or implemented PRs related to issue #1134. Your feedback would be highly appreciated!

Copy link
Contributor

@chirag-s-db chirag-s-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this - @zedtang do you want to also take a look?

Comment on lines +96 to +97
}
else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
else {
} else {

nit

.internal()
.doc("If enabled, partitions are sorted on Z-order values for MDC. " +
"This co-locates records with the same Z-order values in row groups, " +
"which enables data skipping on the Parquet level.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"which enables data skipping on the Parquet level.")
"which facilitates data skipping on the Parquet level.")

Nit: We don't want to imply that row-group skipping is disabled without this conf enabled.

@@ -74,12 +74,13 @@ trait SpaceFillingCurveClustering extends MultiDimClustering {
val conf = df.sparkSession.sessionState.conf
val numRanges = conf.getConf(DeltaSQLConf.MDC_NUM_RANGE_IDS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: we might want to increase the number of ranges when sorting within partitions when the row group size << the file size to increase the cardinality of the evaluated clustering expression

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] [Spark] Optionally sort within partitions when Z-ordering
2 participants