Skip to content

Conversation

@nandorKollar
Copy link
Contributor

What changes were proposed in this pull request?

Implement Parquet delta encoding for vectorized interface, which is needed for V2 pages. The implementation simply delegates the decoding to the Parquet implementation.

How was this patch tested?

Added new test case for delta encoding, ran unit tests

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Mar 6, 2019

Test build #103094 has finished for PR 23988 at commit 5a5e382.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class VectorizedDeltaBinaryPackedReader extends ValuesReader implements VectorizedValuesReader
  • public class VectorizedDeltaByteArrayReader extends ValuesReader implements VectorizedValuesReader

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Mar 6, 2019

cc @rdblue since this is a new Parquet feature.

@SparkQA
Copy link

SparkQA commented Mar 6, 2019

Test build #103100 has finished for PR 23988 at commit 35fc302.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.sql.execution.datasources.parquet;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing empty line.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.sql.execution.datasources.parquet;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing empty line.

assert(columnChunkMetadataList.length === 3)
assert(columnChunkMetadataList(0).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
assert(columnChunkMetadataList(1).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
assert(columnChunkMetadataList(2).getEncodings.contains(Encoding.DELTA_BYTE_ARRAY))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I have see DELTA_BYTE_ARRAY will used for the types BINARY and FIXED_LEN_BYTE_ARRAY and they are also handled differently in VectorizedDeltaByteArrayReader#readBinary.

Is it possible to test both cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 947c6f7

for (int i = 0; i < total; i++) {
Binary binary = deltaByteArrayReader.readBytes();
ByteBuffer buffer = binary.toByteBuffer();
if (buffer.hasArray()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit uncertain here but I have tried binary.getBytes() and it worked.
I know currently the buffer is backed by a byte array.

So my questions:

  • Can we use binary.getBytes() for both cases?
  • What would be its disadvantages?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Find the discussion here: #21070 (comment).

@HyukjinKwon
Copy link
Member

ping @nandorKollar

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@kiszk
Copy link
Member

kiszk commented Jan 16, 2020

ping @nandorKollar

@github-actions
Copy link

github-actions bot commented May 2, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 2, 2020
@github-actions github-actions bot closed this May 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants