-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator #14349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR is a cleaned version for #14174. |
|
Test build #62839 has finished for PR 14349 at commit
|
|
LGTM (from #14174) |
| * [8 bytes pointer to next] | ||
| * Thus, record length = 4 + 4 + klen + vlen + 8 | ||
| */ | ||
| public final class VariableLengthRowBasedKeyValueBatch extends RowBasedKeyValueBatch { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can write the test suites in scala -- it tends to simplify the code.
|
Merging in master. |
| * | ||
| * The format for each record looks like this: | ||
| * [UnsafeRow for key of length klen] [UnsafeRow for Value of length vlen] | ||
| * [8 bytes pointer to next] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need a pointer? Since the key-value size is fixed, can we just use (klen + vlen) * n to address the n-th entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 8 byte is left intentionally for some compatibility reason. AFAIK, this is due the fact that sometimes a key can be followed by multiple values. @davies can probably explain it better.
…e to debug level
### What changes were proposed in this pull request?
This PR aims to lower `RowBasedKeyValueBatch.spill` warning message to debug level.
```java
public final long spill(long size, MemoryConsumer trigger) throws IOException {
- logger.warn("Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.");
+ logger.debug("Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.");
return 0;
}
```
### Why are the changes needed?
Although Apache Spark has been showing a warning message since 2.1.0, there is nothing for a user to be able to do further. This is more like a dev-side debug message. So, we had better lower the level to `DEBUG` from `WARN`.
- #14349
### Does this PR introduce _any_ user-facing change?
No behavior change. This is a log message.
### How was this patch tested?
Manual review.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #49116 from dongjoon-hyun/SPARK-50524.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR is the first step for the following feature:
For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap is backed by a
ColumnarBatch. This has performance issues when we have wide schema for the aggregation table (large number of key fields or value fields).In this JIRA, we support another implementation of fast hashmap, which is backed by a
RowBasedKeyValueBatch. We then automatically pick between the two implementations based on certain knobs.In this first-step PR, implementations for
RowBasedKeyValueBatchandRowBasedHashMapGeneratorare added.How was this patch tested?
Unit tests:
RowBasedKeyValueBatchSuite