[SPARK-26509][SQL] Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader #23988
Conversation
… 2.x's Vectorized Reader

ok to test

Test build #103094 has finished for PR 23988 at commit

cc @rdblue since this is a new Parquet feature.

Test build #103100 has finished for PR 23988 at commit
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.spark.sql.execution.datasources.parquet;
Nit: missing empty line.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.spark.sql.execution.datasources.parquet;
Nit: missing empty line.
assert(columnChunkMetadataList.length === 3)
assert(columnChunkMetadataList(0).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
assert(columnChunkMetadataList(1).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
assert(columnChunkMetadataList(2).getEncodings.contains(Encoding.DELTA_BYTE_ARRAY))
As far as I can see, DELTA_BYTE_ARRAY will be used for the types BINARY and FIXED_LEN_BYTE_ARRAY, and they are also handled differently in VectorizedDeltaByteArrayReader#readBinary.
Is it possible to test both cases?
Done in 947c6f7
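For illustration only, this is the kind of per-column-chunk footer check such a test can make. The class and helper names, and the assumption that the file contains both a BINARY (string) column and a FIXED_LEN_BYTE_ARRAY (e.g. high-precision decimal) column, are mine and not taken from commit 947c6f7.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

final class DeltaEncodingFooterCheck {
  // Checks that the given column chunks were written with DELTA_BYTE_ARRAY.
  // The encoding is recorded per column chunk, so the same check covers both a
  // BINARY column and a FIXED_LEN_BYTE_ARRAY column.
  static void assertDeltaByteArray(Configuration conf, Path file, int... columnIndexes)
      throws IOException {
    ParquetMetadata footer =
        ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER);
    List<ColumnChunkMetaData> columns = footer.getBlocks().get(0).getColumns();
    for (int index : columnIndexes) {
      if (!columns.get(index).getEncodings().contains(Encoding.DELTA_BYTE_ARRAY)) {
        throw new AssertionError("Column " + index + " is not DELTA_BYTE_ARRAY encoded");
      }
    }
  }
}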
for (int i = 0; i < total; i++) {
  Binary binary = deltaByteArrayReader.readBytes();
  ByteBuffer buffer = binary.toByteBuffer();
  if (buffer.hasArray()) {
I am a bit uncertain here, but I have tried binary.getBytes() and it worked.
I know the buffer is currently backed by a byte array.
So my questions:
- Can we use binary.getBytes() for both cases?
- What would be its disadvantages?
Find the discussion here: #21070 (comment).
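For context, the trade-off being discussed is roughly the following. This is a sketch of the two options, not the exact code in the PR; the helper class and the WritableColumnVector parameter are only for illustration.

import java.nio.ByteBuffer;

import org.apache.parquet.io.api.Binary;
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;

final class BinaryCopySketch {
  // binary.getBytes() always yields a byte[], but it may copy the backing buffer.
  // Branching on hasArray() lets the heap-backed case write directly from the
  // buffer's array without an extra copy; only the direct-buffer case copies.
  static void putBinary(WritableColumnVector column, int rowId, Binary binary) {
    ByteBuffer buffer = binary.toByteBuffer();
    if (buffer.hasArray()) {
      column.putByteArray(rowId, buffer.array(),
          buffer.arrayOffset() + buffer.position(), buffer.remaining());
    } else {
      byte[] bytes = new byte[buffer.remaining()];
      buffer.get(bytes);
      column.putByteArray(rowId, bytes, 0, bytes.length);
    }
  }
}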
ping @nandorKollar

Can one of the admins verify this patch?

ping @nandorKollar

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Implement Parquet delta encoding for the vectorized interface, which is needed for V2 pages. The implementation simply delegates the decoding to the Parquet implementation.
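A minimal sketch of what that delegation can look like, assuming a simplified wrapper; only DeltaByteArrayReader, ByteBufferInputStream and Binary come from parquet-mr, and the wrapper class itself is illustrative rather than the actual Spark class.

import java.io.IOException;

import org.apache.parquet.bytes.ByteBufferInputStream;
import org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader;
import org.apache.parquet.io.api.Binary;

// The wrapper keeps no decoding logic of its own: page initialization and
// per-value decoding are both forwarded to parquet-mr's DeltaByteArrayReader.
final class DeltaByteArrayDelegationSketch {
  private final DeltaByteArrayReader delegate = new DeltaByteArrayReader();

  void initFromPage(int valueCount, ByteBufferInputStream in) throws IOException {
    delegate.initFromPage(valueCount, in);
  }

  Binary readBinary() {
    // Prefix/suffix reconstruction for DELTA_BYTE_ARRAY happens inside parquet-mr.
    return delegate.readBytes();
  }
}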
How was this patch tested?
Added a new test case for delta encoding and ran the unit tests.