[Spark] Add support for sorting within partitions when Z-ordering #4006

maltevelin · 2024-12-29T22:38:17Z

Which Delta project/connector is this regarding?

Description

Resolves #4000 by introducing a new configuration property, spark.databricks.io.skipping.mdc.sortWithinPartitions. When enabled, records within each partition are sorted by their Z-order or Hilbert curve value, which clusters similar records into the same row groups within the Parquet files. This improves data skipping at the Parquet row-group level. Benchmarks included in the issue demonstrate speedups of approximately 8× and 11× on two different datasets. Please refer to the issue for more details.
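To make the row-group argument concrete, here is a toy sketch in plain Scala (no Spark or Delta involved). The `interleave` function, the 16-row "row group" size, and the 16 × 16 grid of (x, y) records are illustrative assumptions, not Delta's actual curve implementation. It counts how many row groups a reader could not skip, judging only by per-group min/max values, for a point filter on each column.

```scala
object ZOrderRowGroupSketch {
  type Row = (Int, Int) // (x, y)

  // Interleave the low 16 bits of x and y into one Z-order (Morton) value:
  // x bits go to even positions, y bits to odd positions.
  def interleave(x: Int, y: Int): Long = {
    var z = 0L
    for (i <- 0 until 16) {
      z |= ((x >> i) & 1).toLong << (2 * i)
      z |= ((y >> i) & 1).toLong << (2 * i + 1)
    }
    z
  }

  // Split rows into fixed-size "row groups" and count the groups whose
  // [min, max] range on the chosen column contains `target`, i.e. the
  // groups that min/max statistics cannot rule out.
  def rowGroupsToRead(rows: Seq[Row], rowGroupSize: Int,
                      column: Row => Int, target: Int): Int =
    rows.grouped(rowGroupSize).count { group =>
      val values = group.map(column)
      values.min <= target && target <= values.max
    }

  // One hypothetical partition: a 16 x 16 grid laid out y-major.
  val yMajor: Seq[Row] = for (y <- 0 until 16; x <- 0 until 16) yield (x, y)

  // The same records sorted by their Z-order value.
  val zSorted: Seq[Row] = yMajor.sortBy { case (x, y) => interleave(x, y) }

  def main(args: Array[String]): Unit = {
    println(rowGroupsToRead(yMajor, 16, _._1, 3))  // 16
    println(rowGroupsToRead(yMajor, 16, _._2, 3))  // 1
    println(rowGroupsToRead(zSorted, 16, _._1, 3)) // 4
    println(rowGroupsToRead(zSorted, 16, _._2, 3)) // 4
  }
}
```

In this toy layout, the y-major order skips well for a filter on y but not at all for a filter on x, while sorting on the curve value gives both columns tight per-group ranges.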

How was this patch tested?

Added test cases in MultiDimClusteringSuite.scala for Hilbert and Z-order curves.

Does this PR introduce any user-facing changes?

Yes. This PR introduces a new configuration property, spark.databricks.io.skipping.mdc.sortWithinPartitions. The property defaults to false, so existing users remain unaffected unless they opt in by setting it to true.
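Opting in might look like the following sketch, assuming a build that includes this PR's property. The table name `events` and the columns `user_id` and `event_time` are hypothetical placeholders; only the property name comes from this PR.

```scala
import org.apache.spark.sql.SparkSession

object EnableSortWithinPartitions {
  // Enable the (default-off) property for this session, then run a
  // Z-order OPTIMIZE. Table and column names are hypothetical.
  def optimizeWithSort(spark: SparkSession): Unit = {
    spark.conf.set("spark.databricks.io.skipping.mdc.sortWithinPartitions", "true")
    spark.sql("OPTIMIZE events ZORDER BY (user_id, event_time)")
  }
}
```

Because the setting is applied per session, other sessions keep the default behavior.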

Previous Behavior
Z-ordering did not sort data within partitions.

New Behavior
When the property is enabled, sortWithinPartitions is applied after repartitionByRange in MultiDimClustering.scala.
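The shape of that change can be sketched with the public Dataset API. This is not the code in MultiDimClustering.scala; the helper name `cluster`, its parameters, and the assumption that the curve value already lives in a column named by `curveCol` are all illustrative.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object SortWithinPartitionsSketch {
  // Range-partition rows on the curve value, then optionally sort each
  // resulting partition on the same value before it is written out.
  def cluster(df: DataFrame, curveCol: String, numFiles: Int,
              sortWithin: Boolean): DataFrame = {
    val ranged = df.repartitionByRange(numFiles, col(curveCol))
    if (sortWithin) ranged.sortWithinPartitions(col(curveCol)) else ranged
  }
}
```

`repartitionByRange` only guarantees that each partition covers a contiguous range of curve values; the additional `sortWithinPartitions` step orders the rows inside each partition without another shuffle.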

Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>

maltevelin · 2025-01-09T12:01:53Z

Hi @vkorukanti, @tdas, @scottsand-db, could you kindly review this PR when you have a moment? It introduces a configurable enhancement for Z-ordering that significantly improves read performance. I'm tagging you since you've reviewed and/or implemented PRs related to issue #1134. Your feedback would be highly appreciated!

chirag-s-db

Thanks for doing this - @zedtang do you want to also take a look?

chirag-s-db · 2025-01-17T18:13:22Z

spark/src/main/scala/org/apache/spark/sql/delta/skipping/MultiDimClustering.scala

+    }
+    else {


Suggested change

-    }
-    else {
+    } else {

nit

chirag-s-db · 2025-01-17T18:15:19Z

spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaSQLConf.scala

+      .internal()
+      .doc("If enabled, partitions are sorted on Z-order values for MDC. " +
+         "This co-locates records with the same Z-order values in row groups, " +
+         "which enables data skipping on the Parquet level.")


Suggested change

-         "which enables data skipping on the Parquet level.")
+         "which facilitates data skipping on the Parquet level.")

Nit: We don't want to imply that row-group skipping is disabled without this conf enabled.

chirag-s-db · 2025-01-17T18:17:02Z

spark/src/main/scala/org/apache/spark/sql/delta/skipping/MultiDimClustering.scala

@@ -74,12 +74,13 @@ trait SpaceFillingCurveClustering extends MultiDimClustering {
    val conf = df.sparkSession.sessionState.conf
    val numRanges = conf.getConf(DeltaSQLConf.MDC_NUM_RANGE_IDS)


Aside: when sorting within partitions, we might want to increase the number of ranges if the row group size is much smaller than the file size (row group size << file size), to increase the cardinality of the evaluated clustering expression.

maltevelin added 3 commits December 28, 2024 20:10

Add configuration property to toggle sorting output on Z-order value.

59b0449

Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>

If configuration property is set to true then sort output on Z-order …

3644342

…value. Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>

Add unit tests ensuring that records in each partition are sorted acc…

23881d3

…ording to curve. Signed-off-by: Malte Velin <maltevelin@proprotonmail.ch>

maltevelin mentioned this pull request Dec 29, 2024

[Feature Request] [Spark] Optionally sort within partitions when Z-ordering #4000

Open

8 tasks

Merge branch 'master' into optionally-sort-within-partitions

2378ae5

zachschuermann requested review from vkorukanti, tdas, scottsand-db and rahulsmahadev January 15, 2025 17:09

scottsand-db removed their request for review January 15, 2025 17:29

chirag-s-db reviewed Jan 17, 2025
