[SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row #7813

yhuai · 2015-07-31T02:00:48Z

This PR adds a base aggregation iterator, AggregationIterator, which is used to create SortBasedAggregationIterator (for sort-based aggregation) and UnsafeHybridAggregationIterator. The latter first tries hash-based aggregation and falls back to sort-based aggregation (using an external sorter) if we cannot allocate memory for the map. With these two iterators, the existing iterators are no longer needed, so I am removing them. We can also use a single physical Aggregate operator, which internally determines which iterator to use.

https://issues.apache.org/jira/browse/SPARK-9240
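To make the hash-then-sort fallback in the description concrete, here is a minimal sketch of the general idea. It is not the code from this PR: `HybridAggregationSketch`, `sumByKey`, and `maxGroups` are hypothetical names, `maxGroups` stands in for "cannot allocate memory for the map", and an in-memory sort stands in for the external sorter.

```scala
import scala.collection.mutable

object HybridAggregationSketch {
  /** Sums values per key. `maxGroups` is a hypothetical stand-in for the
    * point at which the hash map can no longer allocate memory. */
  def sumByKey(input: Iterator[(String, Long)], maxGroups: Int): Seq[(String, Long)] = {
    val buffers = mutable.LinkedHashMap.empty[String, Long]
    var fallbackResult: Option[Seq[(String, Long)]] = None

    // Hash-based phase: update the buffer of each group in place.
    while (fallbackResult.isEmpty && input.hasNext) {
      val (key, value) = input.next()
      if (buffers.contains(key) || buffers.size < maxGroups) {
        buffers(key) = buffers.getOrElse(key, 0L) + value
      } else {
        // Fallback: treat the partial buffers as pre-aggregated rows and
        // combine them with the current row and the unprocessed input.
        val remaining = Iterator.single((key, value)) ++ input
        fallbackResult = Some(sortBasedSum(buffers.iterator ++ remaining))
      }
    }
    // Sorting the hash-path output is only for deterministic results here.
    fallbackResult.getOrElse(buffers.toSeq.sortBy(_._1))
  }

  /** Sort-based phase: sort by key, then merge adjacent rows with equal keys. */
  private def sortBasedSum(rows: Iterator[(String, Long)]): Seq[(String, Long)] = {
    val sorted = rows.toArray.sortBy(_._1) // in-memory stand-in for an external sorter
    val out = mutable.ArrayBuffer.empty[(String, Long)]
    for ((key, value) <- sorted) {
      if (out.nonEmpty && out.last._1 == key) {
        out(out.length - 1) = (key, out.last._2 + value)
      } else {
        out += ((key, value))
      }
    }
    out.toSeq
  }
}
```

Calling `sumByKey` with a small `maxGroups` forces the fallback path, and the result should match the pure hash-based path for the same input.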

SparkQA · 2015-07-31T02:24:01Z

Test build #39133 has finished for PR 7813 at commit 84ceb3a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-31T07:53:39Z

Test build #39171 has finished for PR 7813 at commit 1912097.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-07-31T17:06:49Z

It seems that even if the grouping expression uses only a single element of an array or a struct (e.g. array[0]), we do not get the item until we start the aggregate operator, which caused java.lang.UnsupportedOperationException: Not supported DataType: ArrayType(StringType,true).

yhuai · 2015-07-31T17:08:35Z

The failed query is SELECT value, myCol from (SELECT key, array(value[0]) AS value FROM tmp_pyang_src_rcfile GROUP BY value[0], key) a LATERAL VIEW explode(value) myTable AS myCol;

SparkQA · 2015-07-31T18:44:11Z

Test build #39240 has finished for PR 7813 at commit 6cedb51.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-03T02:16:22Z

Test build #39480 has finished for PR 7813 at commit 964f88b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-03T02:18:16Z

Test build #39481 has finished for PR 7813 at commit fee0eef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-08-03T02:32:21Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala

change this back to 32

rxin · 2015-08-03T02:37:46Z

sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala

why do we want a config flag here?

SparkQA · 2015-08-03T04:21:30Z

Test build #39497 has finished for PR 7813 at commit 21fd15f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-03T04:35:47Z

Test build #39498 has finished for PR 7813 at commit 0f1b06f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-03T05:07:16Z

Test build #39502 has finished for PR 7813 at commit c9cf3b6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-08-03T05:58:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala

this is an error case -- we should throw IllegalStateException, and make it clear that if we hit this path, it's a bug.

Right now it sounds as if this operator just cannot handle a legitimate case.
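A minimal sketch of the pattern suggested here, under stated assumptions: the `Mode` values, `plan`, and the message text are hypothetical and are not taken from AggregationIterator.scala. The point is only that the catch-all branch throws IllegalStateException and says explicitly that reaching it indicates a bug.

```scala
object ModeCheckSketch {
  // Hypothetical stand-ins for the states the iterator distinguishes.
  sealed trait Mode
  case object Partial extends Mode
  case object Final extends Mode

  /** Returns a label for each supported combination of modes. */
  def plan(modes: Seq[Mode]): String = modes match {
    case Seq(Partial) => "update buffers from input rows"
    case Seq(Final)   => "merge partial buffers"
    case other =>
      // Not a legitimate input the operator fails to support: the planner
      // should never produce this combination, so say that it is a bug.
      throw new IllegalStateException(
        s"Unexpected aggregation modes $other. This is a bug in the operator.")
  }
}
```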

SparkQA · 2015-08-03T06:38:44Z

Test build #39509 has finished for PR 7813 at commit 74d93c5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-03T07:07:43Z

Test build #39517 has finished for PR 7813 at commit e317e2b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-08-03T07:22:56Z

I'm going to merge this. I think this needs more refactoring, but we can do those as follow-ups.

rxin · 2015-08-03T08:03:26Z

...rc/main/scala/org/apache/spark/sql/execution/aggregate/UnsafeHybridAggregationIterator.scala

we should just remove this function and inline it. We don't want an extra iterator overhead to process the rows.

Each iterator actually adds a lot of overhead, and here it doesn't buy you any code reduction (on the contrary it increases complexity due to the extra abstraction).
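To illustrate the overhead being discussed, here is a small sketch with hypothetical names (`InlineLoopSketch`, `sumWithWrapper`, `sumInlined`); it is not the function under review. Both methods compute the same result, but the first routes every row through an additional `hasNext`/`next` pair.

```scala
object InlineLoopSketch {
  // Extra iterator layer: each row passes through the wrapper created by
  // `map` before it reaches the loop that consumes it.
  def sumWithWrapper(rows: Iterator[Int]): Long = {
    val doubled: Iterator[Long] = rows.map(r => r * 2L)
    var total = 0L
    while (doubled.hasNext) total += doubled.next()
    total
  }

  // Inlined version: the per-row work happens directly in the consuming loop.
  def sumInlined(rows: Iterator[Int]): Long = {
    var total = 0L
    while (rows.hasNext) total += rows.next() * 2L
    total
  }
}
```

Whether the difference is measurable depends on the workload and on JIT inlining, so it is worth benchmarking before drawing conclusions.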

…w up)

This is the follow-up of #7813. It renames `UnsafeHybridAggregationIterator` to `TungstenAggregationIterator` and makes it work only with `UnsafeRow`. It also adds a `TungstenAggregate` operator that uses `TungstenAggregationIterator`, and makes `SortBasedAggregate` (renamed from `SortBasedAggregate`) work only with `SafeRow`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7954 from yhuai/agg-followUp and squashes the following commits:

4d2f4fc [Yin Huai] Add comments and free map.
0d7ddb9 [Yin Huai] Add TungstenAggregationQueryWithControlledFallbackSuite to test fall back process.
91d69c2 [Yin Huai] Rename UnsafeHybridAggregationIterator to TungstenAggregationIterator and make it only work with UnsafeRow.

(cherry picked from commit 3504bf3)
Signed-off-by: Reynold Xin <rxin@databricks.com>

yhuai added 13 commits August 2, 2015 13:11

Create a base iterator class for aggregation iterators and add the initial version of the hybrid iterator.

3915bac

First round cleanup.

299008c

Check iter.hasNext before we create an iterator because the constructor of the iterator will read at least one row from a non-empty input iter.

af32210

Also check input schema.

f60cc83

wip

d2c45a0

wip

3171f44

wip

f52ee53

UDAFs now support UnsafeRow.

bd9282b

wip

33b7022

Prepare for fallback!

533d5b2

Add a flag to control what iterator to use.

7fcbd87

wip

b1ea5cf

Implement fallback strategy.

964f88b

yhuai changed the title ~~[SPARK-9240] [SQL] [WIP] Hybrid aggregate operator using unsafe row~~ [SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row Aug 3, 2015

Remove unnecessary change.

21fd15f

Remove unnecessary code.

0f1b06f

yhuai added 2 commits August 2, 2015 21:57

Add a little bit more comments.

ba6afbc

Merge remote-tracking branch 'upstream/master' into AggregateOperator

74d93c5

Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

Remove unnecessary change.

e317e2b

asfgit closed this in 1ebd41b Aug 3, 2015

JoshRosen mentioned this pull request Aug 3, 2015

[SPARK-8160][SPARK-9240][SQL]Support hybrid aggregate using external sorting when memory is not enough #7827

Closed

yhuai mentioned this pull request Aug 5, 2015

[SPARK-9630] [SQL] Clean up new aggregate operators (SPARK-9240 follow up) #7954

Closed
