[SPARK-26509][SQL] Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader #23988
Conversation
… 2.x's Vectorized Reader

ok to test

Test build #103094 has finished for PR 23988 at commit

cc @rdblue since this is a new Parquet feature.

Test build #103100 has finished for PR 23988 at commit
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.spark.sql.execution.datasources.parquet;
Nit: missing empty line.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.spark.sql.execution.datasources.parquet;
Nit: missing empty line.
assert(columnChunkMetadataList.length === 3)
assert(columnChunkMetadataList(0).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
assert(columnChunkMetadataList(1).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
assert(columnChunkMetadataList(2).getEncodings.contains(Encoding.DELTA_BYTE_ARRAY))
As far as I can see, DELTA_BYTE_ARRAY will be used for the types BINARY and FIXED_LEN_BYTE_ARRAY, and they are also handled differently in VectorizedDeltaByteArrayReader#readBinary.
Is it possible to test both cases?
Done in 947c6f7
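For illustration only, this is the kind of per-column-chunk footer check such a test can make. The class and helper names, and the assumption that the file contains both a BINARY (string) column and a FIXED_LEN_BYTE_ARRAY (e.g. high-precision decimal) column, are mine and not taken from commit 947c6f7.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

final class DeltaEncodingFooterCheck {
  // Checks that the given column chunks were written with DELTA_BYTE_ARRAY.
  // The encoding is recorded per column chunk, so the same check covers both a
  // BINARY column and a FIXED_LEN_BYTE_ARRAY column.
  static void assertDeltaByteArray(Configuration conf, Path file, int... columnIndexes)
      throws IOException {
    ParquetMetadata footer =
        ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER);
    List<ColumnChunkMetaData> columns = footer.getBlocks().get(0).getColumns();
    for (int index : columnIndexes) {
      if (!columns.get(index).getEncodings().contains(Encoding.DELTA_BYTE_ARRAY)) {
        throw new AssertionError("Column " + index + " is not DELTA_BYTE_ARRAY encoded");
      }
    }
  }
}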
for (int i = 0; i < total; i++) {
  Binary binary = deltaByteArrayReader.readBytes();
  ByteBuffer buffer = binary.toByteBuffer();
  if (buffer.hasArray()) {
I am a bit uncertain here, but I have tried binary.getBytes() and it worked.
I know the buffer is currently backed by a byte array.
So my questions:
- Can we use binary.getBytes() for both cases?
- What would be its disadvantages?
Find the discussion here: #21070 (comment).
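For context, the trade-off being discussed is roughly the following. This is a sketch of the two options, not the exact code in the PR; the helper class and the WritableColumnVector parameter are only for illustration.

import java.nio.ByteBuffer;

import org.apache.parquet.io.api.Binary;
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;

final class BinaryCopySketch {
  // binary.getBytes() always yields a byte[], but it may copy the backing buffer.
  // Branching on hasArray() lets the heap-backed case write directly from the
  // buffer's array without an extra copy; only the direct-buffer case copies.
  static void putBinary(WritableColumnVector column, int rowId, Binary binary) {
    ByteBuffer buffer = binary.toByteBuffer();
    if (buffer.hasArray()) {
      column.putByteArray(rowId, buffer.array(),
          buffer.arrayOffset() + buffer.position(), buffer.remaining());
    } else {
      byte[] bytes = new byte[buffer.remaining()];
      buffer.get(bytes);
      column.putByteArray(rowId, bytes, 0, bytes.length);
    }
  }
}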
ping @nandorKollar

Can one of the admins verify this patch?

ping @nandorKollar

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Implement Parquet delta encoding for the vectorized interface, which is needed for V2 pages. The implementation simply delegates the decoding to the Parquet implementation.
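A minimal sketch of what that delegation can look like, assuming a simplified wrapper; only DeltaByteArrayReader, ByteBufferInputStream and Binary come from parquet-mr, and the wrapper class itself is illustrative rather than the actual Spark class.

import java.io.IOException;

import org.apache.parquet.bytes.ByteBufferInputStream;
import org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader;
import org.apache.parquet.io.api.Binary;

// The wrapper keeps no decoding logic of its own: page initialization and
// per-value decoding are both forwarded to parquet-mr's DeltaByteArrayReader.
final class DeltaByteArrayDelegationSketch {
  private final DeltaByteArrayReader delegate = new DeltaByteArrayReader();

  void initFromPage(int valueCount, ByteBufferInputStream in) throws IOException {
    delegate.initFromPage(valueCount, in);
  }

  Binary readBinary() {
    // Prefix/suffix reconstruction for DELTA_BYTE_ARRAY happens inside parquet-mr.
    return delegate.readBytes();
  }
}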
How was this patch tested?
Added a new test case for delta encoding and ran the unit tests.