
Conversation

@ooq (Contributor) commented on Jul 25, 2016

What changes were proposed in this pull request?

This PR is the first step for the following feature:

For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" to boost aggregation performance. Previously, the hashmap was backed by a ColumnarBatch. This has performance issues when the aggregation table has a wide schema (a large number of key or value fields).
In this JIRA, we add a second implementation of the fast hashmap, backed by a RowBasedKeyValueBatch, and automatically pick between the two implementations based on certain knobs.

In this first-step PR, implementations for RowBasedKeyValueBatch and RowBasedHashMapGenerator are added.
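
For context, here is a minimal, hypothetical sketch of the fixed-length flavor of this idea. The class and method names are illustrative, and java.nio.ByteBuffer stands in for Spark's off-heap Platform API; this is not the PR's actual implementation:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: fixed-length key/value records appended into one
// contiguous buffer, so an aggregate can update the value bytes in place.
public class FixedLengthKvBatchSketch {
    private final int klen;          // fixed key length in bytes
    private final int vlen;          // fixed value length in bytes
    private final ByteBuffer buffer; // backing storage for all records
    private int numRows = 0;

    public FixedLengthKvBatchSketch(int klen, int vlen, int capacityBytes) {
        this.klen = klen;
        this.vlen = vlen;
        this.buffer = ByteBuffer.allocate(capacityBytes);
    }

    // Appends a record and returns the offset of its value bytes,
    // or -1 if the batch is full.
    public int appendRow(byte[] key, byte[] value) {
        int recordLength = klen + vlen;
        int offset = numRows * recordLength;
        if (offset + recordLength > buffer.capacity()) return -1;
        buffer.position(offset);
        buffer.put(key);
        buffer.put(value);
        numRows++;
        return offset + klen; // value starts right after the key
    }

    // Fixed-length records can be addressed arithmetically:
    // row i starts at i * (klen + vlen).
    public byte[] getKey(int rowId) {
        byte[] key = new byte[klen];
        buffer.position(rowId * (klen + vlen));
        buffer.get(key);
        return key;
    }
}
```

Because every record has the same length, row i can be located by arithmetic alone; the variable-length case discussed below needs extra bookkeeping per record.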

How was this patch tested?

Unit tests: RowBasedKeyValueBatchSuite

@ooq (Contributor, Author) commented on Jul 25, 2016

This PR is a cleaned-up version of #14174.

@SparkQA commented on Jul 25, 2016

Test build #62839 has finished for PR 14349 at commit 33978cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal (Member) commented

LGTM (from #14174)

```java
 * [8 bytes pointer to next]
 * Thus, record length = 4 + 4 + klen + vlen + 8
 */
public final class VariableLengthRowBasedKeyValueBatch extends RowBasedKeyValueBatch {
```
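
To make the arithmetic above concrete, here is a hedged sketch that writes one record in this layout, using java.nio.ByteBuffer rather than Spark's Platform memory API; names are illustrative:

```java
import java.nio.ByteBuffer;

// Illustrative only: two 4-byte length prefixes, the key and value
// bytes, then an 8-byte "next" pointer, matching the comment above.
public class VariableLengthRecordSketch {
    // Writes one record at the buffer's current position and returns
    // its total length: 4 + 4 + klen + vlen + 8.
    public static int writeRecord(ByteBuffer buf, byte[] key, byte[] value) {
        buf.putInt(key.length);    // 4 bytes: klen
        buf.putInt(value.length);  // 4 bytes: vlen
        buf.put(key);              // klen bytes: UnsafeRow for the key
        buf.put(value);            // vlen bytes: UnsafeRow for the value
        buf.putLong(0L);           // 8 bytes: pointer to next record (none yet)
        return 4 + 4 + key.length + value.length + 8;
    }
}
```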
A Contributor commented on this diff:

You can write the test suites in Scala -- it tends to simplify the code.

@rxin (Contributor) commented on Jul 27, 2016

Merging in master.

@asfgit closed this in 738b4cc on Jul 27, 2016
```java
 *
 * The format for each record looks like this:
 * [UnsafeRow for key of length klen] [UnsafeRow for Value of length vlen]
 * [8 bytes pointer to next]
```
A Contributor commented on this diff:

Why do we need a pointer? Since the key-value size is fixed, can we just use (klen + vlen) * n to address the n-th entry?

The Author replied:

The 8 bytes are left intentionally for compatibility reasons. AFAIK, this is due to the fact that a key can sometimes be followed by multiple values. @davies can probably explain it better.
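
As an illustration of that reply, here is a hypothetical sketch of how an 8-byte "next" pointer lets one key chain to multiple value records. List indices stand in for raw memory addresses, and none of this is Spark's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: records chained by a "next" offset so a single
// key can reach several value records; -1 marks the end of the chain.
public class RecordChainSketch {
    static class Record {
        byte[] value;
        long next = -1; // index of the next record for the same key, or -1
    }

    private final List<Record> records = new ArrayList<>();

    // Appends a value record and links it after `prev` (or starts a chain).
    public long append(byte[] value, long prev) {
        Record r = new Record();
        r.value = value;
        long id = records.size();
        records.add(r);
        if (prev >= 0) records.get((int) prev).next = id;
        return id;
    }

    // Walks the chain starting at `head`, visiting every value for the key.
    public List<byte[]> valuesFor(long head) {
        List<byte[]> out = new ArrayList<>();
        for (long i = head; i >= 0; i = records.get((int) i).next) {
            out.add(records.get((int) i).value);
        }
        return out;
    }
}
```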

HyukjinKwon pushed a commit that referenced this pull request on Dec 10, 2024: lower `RowBasedKeyValueBatch.spill` warning message to debug level

### What changes were proposed in this pull request?

This PR aims to lower `RowBasedKeyValueBatch.spill` warning message to debug level.

```java
   public final long spill(long size, MemoryConsumer trigger) throws IOException {
-    logger.warn("Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.");
+    logger.debug("Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.");
     return 0;
   }
```

### Why are the changes needed?

Although Apache Spark has shown this warning message since 2.1.0, there is nothing further a user can do about it. This is more of a dev-side debug message, so we had better lower the level from `WARN` to `DEBUG`.
- #14349
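
For a user who still wants to see this message after the change, one option is to enable debug logging for the class -- assuming Spark's standard Log4j 2 setup, and noting that the fully qualified logger name below is an assumption that may differ across Spark versions:

```properties
# In conf/log4j2.properties: turn on DEBUG output for just this class
# (logger name assumed; verify the package in your Spark version).
logger.kvbatch.name = org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch
logger.kvbatch.level = debug
```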

### Does this PR introduce _any_ user-facing change?

No behavior change. This is a log message.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49116 from dongjoon-hyun/SPARK-50524.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
