[SPARK-16412][SQL][WIP] Generate Java code that gets an array in each column of CachedBatch when DataFrame.cache() is called #14091
What changes were proposed in this pull request?
Waiting for #11956 to be merged.
This PR generates Java code that directly gets an array for each column from `CachedBatch` when `DataFrame.cache()` is called. This is done in whole-stage code generation.
When `DataFrame.cache()` is called, data is stored in column-oriented storage (columnar cache) in `CachedBatch`. This PR avoids the conversion from column-oriented storage to row-oriented storage. This PR handles an array type that is stored in a column.
This PR generates code for both row-oriented storage and column-oriented storage only if `InMemoryColumnarTableScan` exists in a plan sub-tree. The choice between the two is made at runtime by checking whether a given iterator is a `ColumnarIterator`.
This PR generates Java code for the columnar cache only if the types of all columns that are accessed in operations are primitive or an array.
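The runtime decision described above can be illustrated with a minimal, self-contained Java sketch. It is not the code this PR generates, and it does not use Spark's classes: the `ColumnarIterator` interface below is a hypothetical stand-in, and `IntBatchIterator`, `getIntColumn`, and `sumColumn` are names assumed purely for illustration. The sketch only shows the general shape of the idea: test the iterator type once, then either read a column as a primitive array or fall back to row-at-a-time access.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ColumnarDispatchSketch {

    /** Hypothetical stand-in: an iterator that can also expose a whole column as an array. */
    public interface ColumnarIterator extends Iterator<Object[]> {
        int[] getIntColumn(int ordinal);
    }

    /** Hypothetical columnar batch of int columns; also iterable row by row. */
    public static final class IntBatchIterator implements ColumnarIterator {
        private final int[][] columns;
        private int row = 0;

        public IntBatchIterator(int[][] columns) {
            this.columns = columns;
        }

        @Override
        public boolean hasNext() {
            return columns.length > 0 && row < columns[0].length;
        }

        @Override
        public Object[] next() {
            // Materialize one row from the column arrays (the conversion the PR aims to avoid).
            Object[] r = new Object[columns.length];
            for (int c = 0; c < columns.length; c++) {
                r[c] = columns[c][row];
            }
            row++;
            return r;
        }

        @Override
        public int[] getIntColumn(int ordinal) {
            return columns[ordinal];
        }
    }

    /** Sums one int column, choosing the columnar or the row path at runtime. */
    public static long sumColumn(Iterator<Object[]> input, int ordinal) {
        long sum = 0;
        if (input instanceof ColumnarIterator) {
            // Columnar path: read the primitive array directly, no per-row objects.
            int[] col = ((ColumnarIterator) input).getIntColumn(ordinal);
            for (int v : col) {
                sum += v;
            }
        } else {
            // Row path: consume one row at a time.
            while (input.hasNext()) {
                sum += (Integer) input.next()[ordinal];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] cols = { {1, 2, 3}, {10, 20, 30} };
        List<Object[]> rows = Arrays.asList(
            new Object[] {1, 10}, new Object[] {2, 20}, new Object[] {3, 30});
        System.out.println(sumColumn(new IntBatchIterator(cols), 1));
        System.out.println(sumColumn(rows.iterator(), 1));
    }
}
```

Both calls in `main` compute the same sum over the second column. The first goes through the array-based branch and the second through the row-based branch, which mirrors why code for both storage layouts has to be generated when the input kind is only known at runtime.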
I will add benchmark suites here.
Motivating example
Generated code
Before applying this PR
After applying this PR
How was this patch tested?
Added new tests to `DataFrameCacheSuite.scala`.