[SPARK-11055][SQL] Use mixing hash-based and sort-based aggregation in TungstenAggregationIterator #9067
Conversation
Test build #43546 has finished for PR 9067 at commit
Test build #43557 has finished for PR 9067 at commit
retest this please.
Test build #43559 has finished for PR 9067 at commit
Can you explain what you mean by "mixing"?
@rxin, my understanding of this patch is that it lets us continue to perform hash-based pre-aggregation on the remainder of the iterator after we've decided to spill and switch to sort-based aggregation. After we've destroyed and freed the original hashMap, we'll now loop back around and continue to use a new hash map to aggregate the remainder of the iterator, spilling that map if it also becomes too large. After this patch, records will have more opportunities to be pre-aggregated before being spilled to disk or sent across the network. I wonder about the case where you're doing map-side pre-aggregation and are experiencing a very low reduction factor: in that case, this patch means that we'll do more work per record, since we'll be hashing and sorting every record, whereas before there was a chance that we'd skip hashing of some records. OTOH, if you have an input consisting of all unique keys then it's best to skip both the hashing AND the sorting and just push the records straight to the reduce side, but that optimization is kind of orthogonal to this one.
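The loop described above can be sketched in standalone Scala. This is a minimal illustration, not Spark's actual TungstenAggregationIterator: the object name, `maxMapSize` parameter, and in-memory "spilled runs" are all stand-ins for illustration. The idea is the same, though: when the map fills up, emit its contents as a sorted run, start a fresh map for the remaining input, and do a sort-based merge of all runs at the end.

```scala
object MixedAggregationSketch {
  import scala.collection.mutable

  // Hash-aggregate (key, value) pairs with a bounded map; "spill" a sorted
  // run whenever the map reaches maxMapSize, then continue hash-based
  // pre-aggregation with a fresh map instead of feeding raw records to the
  // external sorter.
  def aggregate(input: Iterator[(String, Long)], maxMapSize: Int): Map[String, Long] = {
    val spilledRuns = mutable.ArrayBuffer.empty[Seq[(String, Long)]]
    var map = mutable.HashMap.empty[String, Long]

    for ((key, value) <- input) {
      map(key) = map.getOrElse(key, 0L) + value
      if (map.size >= maxMapSize) {
        // Spill the current map as a sorted run and start over with a
        // fresh map for the rest of the input.
        spilledRuns += map.toSeq.sortBy(_._1)
        map = mutable.HashMap.empty[String, Long]
      }
    }
    if (map.nonEmpty) spilledRuns += map.toSeq.sortBy(_._1)

    // Sort-based final merge of all partially aggregated runs.
    spilledRuns.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }
}
```

Note that every record still passes through both the hash map and a sort, which is exactly the extra per-record cost flagged above for low-reduction-factor inputs.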
I think you might actually want to re-assign to the original hashMap by making it into a var. This will let us ensure that UnsafeAggregationIterator.free() is able to free all memory after failures.
Ideally we should be able to turn partial aggregation off when we don't see reduction. We had that in Shark, and a lot of query engines do this.
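That adaptive fallback could be as simple as the following sketch. The function name and the 0.5 threshold are made up for illustration; this is not Spark's or Shark's actual code. The idea: estimate the reduction factor from the rows seen so far and disable partial (map-side) aggregation when it is close to 1.

```scala
object AdaptivePartialAgg {
  // If nearly every input row produces its own distinct key, hashing buys
  // almost no reduction, so partial aggregation is wasted work and should
  // be switched off for the rest of the input.
  def shouldKeepPartialAgg(rowsSeen: Long,
                           distinctKeys: Long,
                           threshold: Double = 0.5): Boolean =
    distinctKeys.toDouble / rowsSeen < threshold
}
```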
@JoshRosen Thanks for explaining this patch. It does exactly what you said. @rxin Do you think it is ok to add a configuration for turning this feature on/off?
It's best to have a feature flag for this, in case it yields worse performance. Eventually we should find a way to make this the default. Can you run some performance measurements to quantify the gain from this change?
Test build #44359 has finished for PR 9067 at commit
@rxin I ran a simple performance measurement as follows. Record count: 1333635318. SQL query looks like: 4 workers (8 cores), executor memory: 512 MB. With pre-aggregation enabled: With pre-aggregation disabled: So it looks like this gains roughly a 5% improvement on average.
@viirya I didn't realize that you had done similar things. I created davies@5707f5b, could you review that?
@davies ok.
JIRA: https://issues.apache.org/jira/browse/SPARK-11055
In TungstenAggregationIterator, we switch to sort-based aggregation when we can't allocate more memory for the hash map.
However, purely sort-based aggregation with the external sorter writes too many key-value pairs to disk. We should mix hash-based and sort-based aggregation to reduce the number of key-value pairs that need to be written to disk.
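A back-of-the-envelope model of why mixing reduces disk writes (hypothetical, not Spark code, and a crude simplification: it batches by row count rather than by when the map actually fills): a pure sort-based fallback spills one entry per remaining input row, while continued hash pre-aggregation spills at most one entry per distinct key each time the map is drained.

```scala
object SpillComparison {
  // Pure sort-based fallback: every remaining row is written to disk.
  def sortOnlySpilled(rows: Seq[String]): Int = rows.size

  // Mixed strategy, crudely modeled: each refill of the hash map collapses
  // a batch of rows down to its distinct keys before spilling.
  def mixedSpilled(rows: Seq[String], batchSize: Int): Int =
    rows.grouped(batchSize).map(_.distinct.size).sum
}
```

On heavily skewed input (few distinct keys, many rows) the mixed strategy spills far fewer entries; on all-unique keys the two converge, matching the discussion above.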