
repartition-based fallback for hash aggregate #11116

Open: binmahone wants to merge 11 commits into base: branch-24.10

Conversation

@binmahone (Collaborator) commented Jul 1, 2024

This PR closes #8391.

This PR adds a config, spark.rapids.sql.agg.fallbackAlgorithm, that lets the user choose between a sort-based and a repartition-based algorithm when the aggregation cannot be done in a single pass in memory.

This optimization is orthogonal to #10950.
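
For illustration, a minimal usage sketch of the new config (the option values sort and repartition come from this PR's config documentation; setting it at runtime is an assumption based on its listed Runtime applicability):

```scala
// Hypothetical sketch: choose the repartition-based fallback in a spark-shell /
// SparkSession that already has the RAPIDS plugin enabled.
spark.conf.set("spark.rapids.sql.agg.fallbackAlgorithm", "repartition")

// Equivalent spark-submit / spark-shell form:
//   --conf spark.rapids.sql.agg.fallbackAlgorithm=repartition
```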

Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>
@binmahone changed the title from "workable version without tests" to "[FEA] Do a hash based re-partition instead of a sort based fallback for hash aggregate" on Jul 1, 2024
@binmahone marked this pull request as draft on July 1, 2024 09:16
@binmahone changed the title from "[FEA] Do a hash based re-partition instead of a sort based fallback for hash aggregate" to "repartition-based fallback for hash aggregate" on Jul 1, 2024
@binmahone (Collaborator, Author): build

Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>

@binmahone (Collaborator, Author): build

Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>

@binmahone (Collaborator, Author): build

Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>

@binmahone (Collaborator, Author): build

Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>

@binmahone (Collaborator, Author): build

Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>
Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>
Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>

@binmahone (Collaborator, Author): build

2 similar comments

@binmahone (Collaborator, Author): build

@pxLi (Collaborator) commented Jul 2, 2024: build

Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>

@binmahone (Collaborator, Author): build

@binmahone requested a review from revans2 on July 2, 2024 10:40
@binmahone marked this pull request as ready for review on July 2, 2024 10:40
@revans2 (Collaborator) commented Jul 2, 2024

Can we please get a performance comparison for this change?

@revans2 (Collaborator) left a review comment

I did a pass through this and I have a few concerns, mainly that we don't have any performance numbers to share and it is not clear why/if we need to keep both the sort-based fallback and the hash-based fallback.

batches: mutable.ArrayBuffer[SpillableColumnarBatch],
metrics: GpuHashAggregateMetrics,
concatAndMergeHelper: AggHelper): SpillableColumnarBatch = {
// TODO: concatenateAndMerge (and calling code) could output a sequence
Collaborator:

Are there plans to deal with this TODO comment? I see that this is a copy and paste, so if there aren't, that's fine. I just wanted to check.

Collaborator (Author):

The TODO was there before my PR. In this PR I refactored tryMergeAggregatedBatches and its related functions into object AggregateUtils, so that tryMergeAggregatedBatches can be called with different parameters. (Previously it was a member function of GpuMergeAggregateIterator and was coupled with GpuMergeAggregateIterator's local fields.)

spillableBatch.getColumnarBatch()
}
}
})
} else {
// fallback to sort agg, this is the third pass agg
fallbackIter = Some(buildSortFallbackIterator())
aggFallbackAlgorithm.toLowerCase match {
Collaborator:

nit: Could we use an enum or something like it here? a string comparison feels potentially problematic.
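
For illustration, a minimal sketch of the enum-style alternative this nit suggests; the AggFallbackAlgorithm names below are hypothetical and not part of this PR:

```scala
// Hypothetical sketch: model the fallback algorithm as a sealed ADT so that
// invalid config values fail fast and the compiler checks match exhaustiveness.
sealed trait AggFallbackAlgorithm
object AggFallbackAlgorithm {
  case object Sort extends AggFallbackAlgorithm
  case object Repartition extends AggFallbackAlgorithm

  def fromString(s: String): AggFallbackAlgorithm = s.toLowerCase match {
    case "sort"        => Sort
    case "repartition" => Repartition
    case other         =>
      throw new IllegalArgumentException(s"Unknown agg fallback algorithm: $other")
  }
}
```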


def totalSize(): Long = batches.map(_.sizeInBytes).sum

def isAllBatchesSingleRow: Boolean = {
Collaborator:

nit: areAllBatchesSingleRow


def split(): ListBuffer[AggregatePartition] = {
withResource(new NvtxWithMetrics("agg repartition", NvtxColor.CYAN, repartitionTime)) { _ =>
if (seed >= hashSeed + 100) {
Collaborator:

Why would we ever need to repartition the data more than once?

The current code does a single aggregation pass through the data. Once you have done that pass you know the statistics about the data and should be able to make a very good guess about how to combine the data based on the number of shuffle partitions. Is this because there might be a large number of hash collisions? I think in practice that would never happen, but I would like to understand the reasoning here.

@binmahone (Collaborator, Author) commented Jul 3, 2024:

It is mostly for integration tests. In hash_aggregate_test.py there are some cases where one round of repartitioning cannot meet the termination criteria. By termination criteria I mean that either of the following is met (a sketch of the check follows below):

  1. The new partition is smaller than targetMergeBatchSize (https://github.com/binmahone/spark-rapids/blob/4cf4a4566008321f6bc9f600365563daa11614cf/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala#L1095). However, in integration tests with a very small batch size (250 bytes, as in https://github.com/binmahone/spark-rapids/blob/0b50434faba9ca526cfbfea560fd2e50058e7bcd/integration_tests/src/main/python/hash_aggregate_test.py#L35), the new partition is usually larger than 250 bytes (considering the size overhead of each partition), which leads to:
  2. isAllBatchesSingleRow in https://github.com/binmahone/spark-rapids/blob/4cf4a4566008321f6bc9f600365563daa11614cf/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala#L1094C28-L1094C49. If all batches are single-row, we treat that as a terminal state as well. However, this state is harder to reach, as it requires more rounds of repartitioning.

It's also worth mentioning that this PR tends to be conservative in determining the number of new partitions (https://github.com/binmahone/spark-rapids/blob/4cf4a4566008321f6bc9f600365563daa11614cf/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala#L1070). This is by design, for performance reasons: using a larger partition count for each repartition may make the termination criteria easier to meet, but we still cannot guarantee that one round of repartitioning is sufficient.
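
For illustration, a minimal sketch of the termination check described above (the helper name and parameters are hypothetical stand-ins, not the PR's actual code):

```scala
// Hypothetical sketch: a partition stops being repartitioned when either
//   1) its aggregated batches fit within targetMergeBatchSize, or
//   2) every batch has degenerated to a single row.
def reachedTerminateCriteria(
    batchSizes: Seq[Long],      // size in bytes of each aggregated batch
    batchRowCounts: Seq[Long],  // row count of each aggregated batch
    targetMergeBatchSize: Long): Boolean = {
  val fitsTargetSize = batchSizes.sum <= targetMergeBatchSize                    // criterion 1
  val allSingleRow   = batchRowCounts.nonEmpty && batchRowCounts.forall(_ == 1)  // criterion 2
  fitsTargetSize || allSingleRow
}
```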

Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>
@binmahone (Collaborator, Author): build

@@ -60,6 +60,7 @@ Name | Description | Default Value | Applicable at
<a name="shuffle.ucx.activeMessages.forceRndv"></a>spark.rapids.shuffle.ucx.activeMessages.forceRndv|Set to true to force 'rndv' mode for all UCX Active Messages. This should only be required with UCX 1.10.x. UCX 1.11.x deployments should set to false.|false|Startup
<a name="shuffle.ucx.managementServerHost"></a>spark.rapids.shuffle.ucx.managementServerHost|The host to be used to start the management server|null|Startup
<a name="shuffle.ucx.useWakeup"></a>spark.rapids.shuffle.ucx.useWakeup|When set to true, use UCX's event-based progress (epoll) in order to wake up the progress thread when needed, instead of a hot loop.|true|Startup
<a name="sql.agg.fallbackAlgorithm"></a>spark.rapids.sql.agg.fallbackAlgorithm|When agg cannot be done in a single pass, use sort-based fallback or repartition-based fallback.|sort|Runtime
Collaborator:

Understood that we make sort the default so we can watch for regressions. Can we turn this on by default to give it more exercise? Eventually, we should deprecate the sort-based fallback within aggregation.

Collaborator (Author):

Per our offline discussion, this should be unnecessary now, right?

Collaborator:

For this one in particular, I still feel we should make it default ON, since this is more of an alternative to the sort-based approach.

@@ -43,7 +43,8 @@ object Arm extends ArmScalaSpecificImpl {
}

/** Executes the provided code block and then closes the sequence of resources */
def withResource[T <: AutoCloseable, V](r: Seq[T])(block: Seq[T] => V): V = {
def withResource[T <: AutoCloseable, V](r: Seq[T])
Collaborator:

nit: any idea why the code formatter changed here?
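
For context, a hedged usage sketch of the Seq overload being reformatted here: every element of the Seq is closed once the block finishes, whether it returns normally or throws. The import path is an assumption based on the Arm object shown in this diff:

```scala
import com.nvidia.spark.rapids.Arm.withResource // assumed import path for the Arm object above
import java.io.ByteArrayInputStream

// Sketch only: both streams are closed after the block returns.
val totalAvailable = withResource(Seq(
    new ByteArrayInputStream(Array[Byte](1, 2, 3)),
    new ByteArrayInputStream(Array[Byte](4, 5)))) { streams =>
  streams.map(_.available()).sum
}
```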

@@ -99,6 +98,7 @@ object AggregateUtils {
isReductionOnly: Boolean): Long = {
def typesToSize(types: Seq[DataType]): Long =
types.map(GpuBatchUtils.estimateGpuMemory(_, nullable = false, rowCount = 1)).sum

Collaborator:

nit: unnecessary change.

// this only happens in test cases). Doing more re-partitioning will not help to reduce
// the partition size anymore. In this case we should merge all the batches into one
// regardless of the target size.
logWarning(s"Unable to merge aggregated batches within " +
Collaborator:

Nit: Would it be friendlier to turn this into a debug metric?

Collaborator (Author):

Actually, this line was not added by this PR; I just refactored the functions.

logInfo(s"Falling back to sort-based aggregation with ${aggregatedBatches.size()} batches")
private def buildRepartitionFallbackIterator(): Iterator[ColumnarBatch] = {
logInfo(s"Falling back to repartition-based aggregation with " +
s"${aggregatedBatches.size} batches")
metrics.numTasksFallBacked += 1
Collaborator:

As mentioned offline, remove this one since there is no longer a sort fallback.

Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>
@revans2 (Collaborator) commented Jul 10, 2024

I think there may be something wrong with your metrics for the repartition case. If I run:

spark.conf.set("spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.rapids.sql.agg.singlePassPartialSortEnabled", false)
spark.time(spark.range(0, 3000000000L, 1, 2).selectExpr("CAST(rand(0) * 3000000000 AS LONG) DIV 2 as id", "id % 2 as data").groupBy("id").agg(count(lit(1)), avg(col("data"))).orderBy("id").show())

with repartition, then the metrics for aggregations are all very large compared to running with sort, but the total run time is actually smaller.

@binmahone (Collaborator, Author) commented Jul 15, 2024

Hi @revans2, do you mean the op time metrics? I did some investigation and found that for the sort-based fallback, op time can be very inaccurate because it fails to capture much of the spill time. E.g. if a spill is triggered by https://github.com/NVIDIA/spark-rapids/blob/branch-24.08/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala#L969, the time for the spill is not counted in op time. If we take a look at NSYS, we can see many NVTX ranges named "device memory sync spill". These ranges do not have a parent NVTX range and do not seem to be captured by the op time metrics.

[NSYS screenshot: unparented "device memory sync spill" NVTX ranges]

On the other hand, op time can be inaccurate for the repartition-based fallback as well (though it may not miss as many ranges as the sort-based one does). Actually, the inaccuracy is rooted in the way we measure op time. Do you think we need to refine how op time is measured, so that we can make sure the sum of all operators' op times equals the wall time?

@binmahone (Collaborator, Author) commented Jul 15, 2024

> I think there may be something wrong with your metrics for the repartition case. If I run:
>
> spark.conf.set("spark.sql.shuffle.partitions", 2)
> spark.conf.set("spark.rapids.sql.agg.singlePassPartialSortEnabled", false)
> spark.time(spark.range(0, 3000000000L, 1, 2).selectExpr("CAST(rand(0) * 3000000000 AS LONG) DIV 2 as id", "id % 2 as data").groupBy("id").agg(count(lit(1)), avg(col("data"))).orderBy("id").show())
>
> with repartition, then the metrics for aggregations are all very large compared to running with sort, but the total run time is actually smaller.

I also found that, for this synthetic case, the sort-based fallback beats the repartition-based fallback on my PC (about 6.2 min vs. 6.6 min), with the following configs:

bin/spark-shell \
  --master 'local[10]' \
  --driver-memory 20g \
  --conf spark.rapids.memory.pinnedPool.size=20G \
  --conf spark.sql.files.maxPartitionBytes=2g \
  --conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.metrics.level='DEBUG' \
  --conf spark.rapids.sql.agg.fallbackAlgorithm='repartition' \
  --conf spark.eventLog.enabled=true \
  --jars /home/hongbin/code/spark-rapids2/dist/target/rapids-4-spark_2.12-24.08.0-SNAPSHOT-cuda11.jar

I also compared the repartition-based fallback against the sort-based fallback on NDS, and found that although the total duration improves a little, we CANNOT guarantee that the repartition-based fallback always wins. I haven't found a simple rule/heuristic to decide when to use the repartition-based fallback and when to use the other, so it would be difficult for us to explain to users which is better.

For now, I would suggest users try the repartition-based fallback if a lot of buffer spills are observed. However, it's still not a rule of thumb, because a lot of spilling also occurred in your synthetic case (where the repartition-based fallback is slower).

Any thoughts?

@revans2 (Collaborator) commented Jul 15, 2024

> Any thoughts?

I think we need to do some profiling of cases where the partition-based code is worse than the sort-based code to understand what is happening. Ideally we get it down to something like a micro-benchmark so we can better isolate it when profiling. I have a few ideas about what it could be, but this is just speculation.

  1. Sorting a single numeric field can be very fast. It might be fast enough to beat the partitioning code on the same path.
  2. The partitioning code might have a bug where it ends up doing extra work, or some kernels are not as optimized as in the sort case.
  3. Spilling/repartitioning/sorting has high enough run-to-run variance that we see it lose some of the time, but overall it is a win.

@binmahone If you get some profiling info I am happy to look into it with you.

@binmahone (Collaborator, Author) commented Jul 26, 2024

Per our offline discussion with @revans2 and @jlowe:

Even though the current repartition-based fallback has already shown a significant win over the sort-based one in our customer query, we need to:

  1. further compare repartition-based vs. sort-based on NDS, check in which situations sort-based surpasses repartition-based (i.e. regressions), and determine whether those regressions are acceptable;
  2. try some more radical improvements for the repartition-based approach, e.g. skipping the first pass of aggregation entirely.

With the above done, we may be able to rip out the sort-based code entirely and check in this PR.

I suggest moving this PR from 24.08 to 24.10 to allow the above items to be done. @GaryShen2008

@sameerz added the performance (A performance related task/issue) label on Jul 29, 2024
@sameerz (Collaborator) commented Jul 29, 2024

Please retarget to 24.10

@binmahone (Collaborator, Author)

> Please retarget to 24.10

Got it. Meanwhile I'm still refactoring this PR to see if there's more potential.

@binmahone changed the base branch from branch-24.08 to branch-24.10 on August 6, 2024 05:28
Labels: performance (A performance related task/issue)

Successfully merging this pull request may close these issues: [FEA] Do a hash based re-partition instead of a sort based fallback for hash aggregate