
Conversation

@ooq (Contributor) commented on Jul 25, 2016

What changes were proposed in this pull request?

This PR is the first step for the following feature:

For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" to boost aggregation performance. Previously, the hashmap was backed by a ColumnarBatch. This has performance issues when the aggregation table has a wide schema (a large number of key or value fields).
In this JIRA, we add a second implementation of the fast hashmap, backed by a RowBasedKeyValueBatch, and automatically pick between the two implementations based on certain knobs.

In this first-step PR, implementations for RowBasedKeyValueBatch and RowBasedHashMapGenerator are added.
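
For context, here is a minimal, hypothetical sketch of the fixed-length flavor of this idea. The class and method names are illustrative, and java.nio.ByteBuffer stands in for Spark's off-heap Platform API; this is not the PR's actual implementation:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: fixed-length key/value records appended into one
// contiguous buffer, so an aggregate can update the value bytes in place.
public class FixedLengthKvBatchSketch {
    private final int klen;          // fixed key length in bytes
    private final int vlen;          // fixed value length in bytes
    private final ByteBuffer buffer; // backing storage for all records
    private int numRows = 0;

    public FixedLengthKvBatchSketch(int klen, int vlen, int capacityBytes) {
        this.klen = klen;
        this.vlen = vlen;
        this.buffer = ByteBuffer.allocate(capacityBytes);
    }

    // Appends a record and returns the offset of its value bytes,
    // or -1 if the batch is full.
    public int appendRow(byte[] key, byte[] value) {
        int recordLength = klen + vlen;
        int offset = numRows * recordLength;
        if (offset + recordLength > buffer.capacity()) return -1;
        buffer.position(offset);
        buffer.put(key);
        buffer.put(value);
        numRows++;
        return offset + klen; // value starts right after the key
    }

    // Fixed-length records can be addressed arithmetically:
    // row i starts at i * (klen + vlen).
    public byte[] getKey(int rowId) {
        byte[] key = new byte[klen];
        buffer.position(rowId * (klen + vlen));
        buffer.get(key);
        return key;
    }
}
```

Because every record has the same length, row i can be located by arithmetic alone; the variable-length case discussed below needs extra bookkeeping per record.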

How was this patch tested?

Unit tests: RowBasedKeyValueBatchSuite

@ooq (Contributor, Author) commented on Jul 25, 2016

This PR is a cleaned-up version of #14174.

@SparkQA commented on Jul 25, 2016

Test build #62839 has finished for PR 14349 at commit 33978cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal (Member) commented

LGTM (from #14174)

```java
 * [8 bytes pointer to next]
 * Thus, record length = 4 + 4 + klen + vlen + 8
 */
public final class VariableLengthRowBasedKeyValueBatch extends RowBasedKeyValueBatch {
```
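
To make the arithmetic above concrete, here is a hedged sketch that writes one record in this layout, using java.nio.ByteBuffer rather than Spark's Platform memory API; names are illustrative:

```java
import java.nio.ByteBuffer;

// Illustrative only: two 4-byte length prefixes, the key and value
// bytes, then an 8-byte "next" pointer, matching the comment above.
public class VariableLengthRecordSketch {
    // Writes one record at the buffer's current position and returns
    // its total length: 4 + 4 + klen + vlen + 8.
    public static int writeRecord(ByteBuffer buf, byte[] key, byte[] value) {
        buf.putInt(key.length);    // 4 bytes: klen
        buf.putInt(value.length);  // 4 bytes: vlen
        buf.put(key);              // klen bytes: UnsafeRow for the key
        buf.put(value);            // vlen bytes: UnsafeRow for the value
        buf.putLong(0L);           // 8 bytes: pointer to next record (none yet)
        return 4 + 4 + key.length + value.length + 8;
    }
}
```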
A Contributor commented on this diff:

You can write the test suites in Scala -- it tends to simplify the code.

@rxin (Contributor) commented on Jul 27, 2016

Merging in master.

@asfgit closed this in 738b4cc on Jul 27, 2016
```java
 *
 * The format for each record looks like this:
 * [UnsafeRow for key of length klen] [UnsafeRow for Value of length vlen]
 * [8 bytes pointer to next]
```
A Contributor commented on this diff:

Why do we need a pointer? Since the key-value size is fixed, can we just use (klen + vlen) * n to address the n-th entry?

The Author replied:

The 8 bytes are left intentionally for compatibility reasons. AFAIK, this is due to the fact that a key can sometimes be followed by multiple values. @davies can probably explain it better.
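
As an illustration of that reply, here is a hypothetical sketch of how an 8-byte "next" pointer lets one key chain to multiple value records. List indices stand in for raw memory addresses, and none of this is Spark's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: records chained by a "next" offset so a single
// key can reach several value records; -1 marks the end of the chain.
public class RecordChainSketch {
    static class Record {
        byte[] value;
        long next = -1; // index of the next record for the same key, or -1
    }

    private final List<Record> records = new ArrayList<>();

    // Appends a value record and links it after `prev` (or starts a chain).
    public long append(byte[] value, long prev) {
        Record r = new Record();
        r.value = value;
        long id = records.size();
        records.add(r);
        if (prev >= 0) records.get((int) prev).next = id;
        return id;
    }

    // Walks the chain starting at `head`, visiting every value for the key.
    public List<byte[]> valuesFor(long head) {
        List<byte[]> out = new ArrayList<>();
        for (long i = head; i >= 0; i = records.get((int) i).next) {
            out.add(records.get((int) i).value);
        }
        return out;
    }
}
```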

HyukjinKwon pushed a commit that referenced this pull request on Dec 10, 2024: lower `RowBasedKeyValueBatch.spill` warning message to debug level

### What changes were proposed in this pull request?

This PR aims to lower `RowBasedKeyValueBatch.spill` warning message to debug level.

```java
   public final long spill(long size, MemoryConsumer trigger) throws IOException {
-    logger.warn("Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.");
+    logger.debug("Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.");
     return 0;
   }
```

### Why are the changes needed?

Although Apache Spark has shown this warning message since 2.1.0, there is nothing further a user can do about it. This is more of a dev-side debug message, so we had better lower the level from `WARN` to `DEBUG`.
- #14349
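
For a user who still wants to see this message after the change, one option is to enable debug logging for the class -- assuming Spark's standard Log4j 2 setup, and noting that the fully qualified logger name below is an assumption that may differ across Spark versions:

```properties
# In conf/log4j2.properties: turn on DEBUG output for just this class
# (logger name assumed; verify the package in your Spark version).
logger.kvbatch.name = org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch
logger.kvbatch.level = debug
```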

### Does this PR introduce _any_ user-facing change?

No behavior change. This is a log message.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49116 from dongjoon-hyun/SPARK-50524.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
