Conversation

@HyukjinKwon
Member

https://issues.apache.org/jira/browse/SPARK-11044

Spark writes Parquet files only with writer version 1, ignoring the writer version given by the user.

So, in this PR, the user-specified writer version is kept if given, and version 1 is set as the default otherwise.
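For illustration, a minimal spark-shell sketch of the intended behavior; the output path is hypothetical, and "parquet.writer.version" is parquet-mr's ParquetOutputFormat.WRITER_VERSION key:

// Request writer version 2; before this patch the setting was silently
// ignored and files were always written with version 1.
sqlContext.sparkContext.hadoopConfiguration
  .set("parquet.writer.version", "PARQUET_2_0")
sqlContext.range(10).write.mode("overwrite").parquet("file:///tmp/parquet/v2")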

Member

Can you just use setIfUnset here?
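For reference, a small sketch of what setIfUnset buys here, using Hadoop's Configuration API: the default is applied only when no value has been set yet, whereas set would overwrite a user-supplied choice.

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.set("parquet.writer.version", "PARQUET_2_0")        // user-supplied value
conf.setIfUnset("parquet.writer.version", "PARQUET_1_0") // default; ignored here
assert(conf.get("parquet.writer.version") == "PARQUET_2_0")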

Member Author

Yep, I just updated. Thanks.

@JoshRosen
Contributor

/cc @liancheng

@HyukjinKwon
Member Author

@liancheng I assume you missed this.

@liancheng
Contributor

@HyukjinKwon Oh yeah, sorry. Finally got some time to clean up my review queue :)

I wonder whether there is an easy way to add a test case for this. At first I thought WriterVersion corresponded to the version field of the Thrift struct FileMetaData described in parquet-format, but it doesn't. I only found that when WriterVersion is set to v2, the Thrift field PageHeader.type is set to DATA_PAGE_V2.

@liancheng
Contributor

ok to test

@HyukjinKwon
Member Author

I will try to find and test them first tomorrow before adding a commit!

@SparkQA

SparkQA commented Nov 11, 2015

Test build #45626 has finished for PR 9060 at commit 2eee7e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@liancheng I gave it a few tries to figure out the version, but as you said, it is pretty tricky to check the writer version, since it only changes the version of the data pages, which is visible only inside Parquet's internals. The writer version changes the encoding types of each data page, but the encoding type is recorded only in the data page header, which is not part of the footer.

Would it be too inappropriate if we wrote Parquet files with both version 1 and version 2 and then checked whether the two file sizes are equal?

Since the encoding types are different, the sizes should also be different.
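A rough spark-shell sketch of this size-comparison idea, with hypothetical output paths and assuming parquet-mr's "parquet.writer.version" key (a more direct footer-based check is suggested below):

import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
val fs = FileSystem.get(hadoopConf)

// Write identical data with the given writer version and return the
// total size of the resulting Parquet part files.
def writtenSize(version: String, path: String): Long = {
  hadoopConf.set("parquet.writer.version", version)
  sqlContext.range(1 << 16).coalesce(1).write.mode("overwrite").parquet(path)
  fs.listStatus(new Path(path))
    .filter(_.getPath.getName.endsWith(".parquet"))
    .map(_.getLen).sum
}

// Different page encodings should yield different file sizes.
assert(writtenSize("PARQUET_1_0", "/tmp/parquet/v1") !=
  writtenSize("PARQUET_2_0", "/tmp/parquet/v2"))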

@liancheng
Contributor

I think we can check the column encoding information, which is accessible from the Parquet footers. For example, PARQUET_2_0 uses RLE_DICTIONARY while PARQUET_1_0 uses PLAIN_DICTIONARY (see here).

The parquet-meta CLI tool can be a reference for how to inspect related metadata.
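For reference, a hedged sketch of the same check done programmatically, assuming the parquet-mr 1.x footer API (ParquetFileReader.readFooter) and a hypothetical file name:

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/tmp/parquet/dict/part-r-00000.gz.parquet"))

// Each column chunk records its set of encodings in the footer:
// PARQUET_2_0 output should include RLE_DICTIONARY, while PARQUET_1_0
// output should include PLAIN_DICTIONARY.
footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach(column => println(column.getEncodings))
}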

@HyukjinKwon
Member Author

Thank you very much. I will try it that way.

@liancheng
Contributor

You can construct a Parquet file consisting of a single column with dictionary encoding using:

val path = "file:///tmp/parquet/dict"
sqlContext.range(1 << 16).selectExpr("(id % 4) AS i").coalesce(1).write.mode("overwrite").parquet(path)

And here are instructions for building and installing the parquet-tools CLI tool. Then you can inspect the Parquet metadata using:

$ parquet-meta /tmp/parquet/dict

file:        file:/private/tmp/parquet/dict/part-r-00000-88498608-9eed-4728-b96a-b60bc5ebc2a8.gz.parquet
creator:     parquet-mr version 1.6.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"i","type":"long","nullable":true,"metadata":{}}]}

file schema: root
----------------------------------------------------------------------------------------------------------------------------------------------
i:           OPTIONAL INT64 R:0 D:1

row group 1: RC:65536 TS:16615 OFFSET:4
----------------------------------------------------------------------------------------------------------------------------------------------
i:            INT64 GZIP DO:0 FPO:4 SZ:198/16615/83.91 VC:65536 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY

The ENC:... part of the last line is the column encoding information.

@HyukjinKwon
Member Author

Thanks! I will follow that approach.

@HyukjinKwon
Member Author

Fortunately, I have worked with parquet-tools before and have looked through the Parquet code several times :).

Thank you very much for your help; this was much easier than I thought it would be.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45810 has finished for PR 9060 at commit 2d1d343.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45811 has finished for PR 9060 at commit 7e80ad6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45831 has finished for PR 9060 at commit 78449ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TypedColumn[-T, U](
    • class JavaTrackStateDStream[KeyType, ValueType, StateType, EmittedType](

Contributor

Nit: Remove this empty line.

@liancheng
Contributor

LGTM except for a few minor styling issues. I can merge it right after you fix them.

@HyukjinKwon
Member Author

I happened to see the TODO Adds test case for reading dictionary encoded decimals written as 'FIXED_LEN_BYTE_ARRAY'.

I will also add that test in the follow-up PR that uses the overloaded writeMetaFile.

@SparkQA

SparkQA commented Nov 16, 2015

Test build #45964 has finished for PR 9060 at commit cea5034.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@marmbrus Is this one OK for branch-1.6?

@asfgit asfgit closed this in 7f8eb3b Nov 16, 2015
@liancheng
Contributor

@HyukjinKwon Thanks! I've merged this one to master. And yes, please feel free to add the decimal test case(s).

@marmbrus
Contributor

Sure

@liancheng
Contributor

Merging to branch-1.6.

asfgit pushed a commit that referenced this pull request Nov 16, 2015
https://issues.apache.org/jira/browse/SPARK-11044

Spark writes Parquet files only with writer version 1, ignoring the writer version given by the user.

So, in this PR, the user-specified writer version is kept if given, and version 1 is set as the default otherwise.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>

Closes #9060 from HyukjinKwon/SPARK-11044.

(cherry picked from commit 7f8eb3b)
Signed-off-by: Cheng Lian <lian@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 17, 2015
…metadata and add a test for FIXED_LEN_BYTE_ARRAY

As discussed in #9660 and #9060, I cleaned up unused imports, added a test for fixed-length byte arrays, and used a common function for writing the Parquet metadata.

For the fixed-length byte array test, I verified the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools).

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9754 from HyukjinKwon/SPARK-11694-followup.
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015
…metadata and add a test for FIXED_LEN_BYTE_ARRAY

As discussed in apache/spark#9660 and apache/spark#9060, I cleaned up unused imports, added a test for fixed-length byte arrays, and used a common function for writing the Parquet metadata.

For the fixed-length byte array test, I verified the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools).

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9754 from HyukjinKwon/SPARK-11694-followup.
@HyukjinKwon HyukjinKwon deleted the SPARK-11044 branch September 23, 2016 18:28