[SPARK-15380][SQL][WIP] Generate code that stores a float/double value in each column from ColumnarBatch when DataFrame.cache() is used #13171
What changes were proposed in this pull request?
This PR generates Java code to store a computed float/double value of each column into `ColumnarBatch` when DataFrame.cache() is called. This is done in whole-stage code generation.
Even when data is read from ParquetReader (data is kept in `ColumnarBatch`), the computed value is stored into `UnsafeRow` for now. Then, the data is stored into `CachedBatch` for DataFrame.cache(). This leads to two data format conversions: from columnar storage to row-oriented storage, and then from row-oriented storage back to columnar storage. This PR avoids these conversions by storing the computed value directly into columnar storage.
This PR handles only float and double values that are stored in a column without compression. Another PR will handle other primitive types that may be stored in a column in a compressed format. The work is split this way to keep the PR small and easier to review.
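The two conversions described above can be illustrated with a small, self-contained sketch. It does not use Spark's classes: plain `double[][]` arrays stand in for `ColumnarBatch`, `UnsafeRow` and `CachedBatch`, and the class name, method names, and the `* 2.0` projection are all hypothetical, chosen only to show the difference between the row-oriented detour and a direct columnar write.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Spark or from this PR.
public class ColumnarCacheSketch {

    // Current path: columnar input -> row-oriented buffer -> columnar cache.
    static double[][] viaRows(double[][] inputColumns) {
        int numCols = inputColumns.length;
        int numRows = numCols == 0 ? 0 : inputColumns[0].length;

        // Conversion 1: each computed value is written into a row
        // (a stand-in for storing results into UnsafeRow).
        double[][] rows = new double[numRows][numCols];
        for (int r = 0; r < numRows; r++) {
            for (int c = 0; c < numCols; c++) {
                rows[r][c] = inputColumns[c][r] * 2.0; // assumed projection
            }
        }

        // Conversion 2: rows are transposed back into columns
        // (a stand-in for building CachedBatch).
        double[][] cached = new double[numCols][numRows];
        for (int r = 0; r < numRows; r++) {
            for (int c = 0; c < numCols; c++) {
                cached[c][r] = rows[r][c];
            }
        }
        return cached;
    }

    // Proposed path: each computed value is written straight into its column.
    static double[][] direct(double[][] inputColumns) {
        int numCols = inputColumns.length;
        int numRows = numCols == 0 ? 0 : inputColumns[0].length;
        double[][] cached = new double[numCols][numRows];
        for (int c = 0; c < numCols; c++) {
            for (int r = 0; r < numRows; r++) {
                cached[c][r] = inputColumns[c][r] * 2.0; // same assumed projection
            }
        }
        return cached;
    }

    public static void main(String[] args) {
        double[][] cols = {{1.0, 2.0, 3.0}, {0.5, 1.5, 2.5}};
        System.out.println(Arrays.deepToString(direct(cols)));
        // Both paths produce the same cached columns; only the intermediate copies differ.
        System.out.println(Arrays.deepEquals(viaRows(cols), direct(cols)));
    }
}
```

Both methods yield the same cached columns; the direct version simply skips the intermediate row buffer and the second pass over the data.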
This PR will consist of three parts.
`CachedBatch` when the original value is read from `ColumnarStorage`. `CachedBatch` for `df.cache()` from the `ColumnarStorage` `df.cache()`, 1. will not occur.
This PR generates Java code for the columnar cache only if the types of all columns accessed in operations are primitive.
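The eligibility rule stated above (all accessed columns must be primitive) could be checked along the following lines. This is a minimal sketch under stated assumptions: the `ColType` enum, the `PRIMITIVES` set, and `canUseColumnarCache` are hypothetical names and are not taken from Spark or from this PR's implementation.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; Spark's real type system is not used here.
public class ColumnarCacheEligibility {

    // Simplified stand-in for column data types.
    enum ColType { BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, STRING, ARRAY, STRUCT }

    // Types treated as primitive in this sketch (an assumption).
    static final Set<ColType> PRIMITIVES = EnumSet.of(
            ColType.BOOLEAN, ColType.BYTE, ColType.SHORT, ColType.INT,
            ColType.LONG, ColType.FLOAT, ColType.DOUBLE);

    // True only when every accessed column has a primitive type.
    // An empty list is treated as eligible in this sketch.
    static boolean canUseColumnarCache(List<ColType> accessedColumnTypes) {
        return PRIMITIVES.containsAll(accessedColumnTypes);
    }

    public static void main(String[] args) {
        System.out.println(canUseColumnarCache(Arrays.asList(ColType.FLOAT, ColType.DOUBLE)));
        System.out.println(canUseColumnarCache(Arrays.asList(ColType.DOUBLE, ColType.STRING)));
    }
}
```

A single non-primitive accessed column makes the check fail, in which case the existing row-based path would be used instead.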
Motivating example:
Generated code
How was this patch tested?
Not tested yet