Conversation

@HyukjinKwon
Member

https://issues.apache.org/jira/browse/SPARK-11044

Spark writes Parquet files only with writer version 1, ignoring the writer version given by the user.

So, in this PR, the user-specified writer version is kept if given, and version 1 is set as the default otherwise.
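For illustration, a minimal spark-shell sketch of the intended behavior; the output path is hypothetical, and "parquet.writer.version" is parquet-mr's ParquetOutputFormat.WRITER_VERSION key:

// Request writer version 2; before this patch the setting was silently
// ignored and files were always written with version 1.
sqlContext.sparkContext.hadoopConfiguration
  .set("parquet.writer.version", "PARQUET_2_0")
sqlContext.range(10).write.mode("overwrite").parquet("file:///tmp/parquet/v2")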

Member

Can you just use setIfUnset here?
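For reference, a small sketch of what setIfUnset buys here, using Hadoop's Configuration API: the default is applied only when no value has been set yet, whereas set would overwrite a user-supplied choice.

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.set("parquet.writer.version", "PARQUET_2_0")        // user-supplied value
conf.setIfUnset("parquet.writer.version", "PARQUET_1_0") // default; ignored here
assert(conf.get("parquet.writer.version") == "PARQUET_2_0")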

Member Author

Yep, I just updated. Thanks.

@JoshRosen
Contributor

/cc @liancheng

@HyukjinKwon
Member Author

@liancheng I assume you missed this.

@liancheng
Contributor

@HyukjinKwon Oh yeah, sorry. Finally got some time to clean up my review queue :)

I wonder whether there is an easy way to add a test case for this. At first I thought WriterVersion corresponded to the version field of the Thrift struct FileMetaData described in parquet-format, but it doesn't. I only found that when WriterVersion is set to v2, the Thrift field PageHeader.type is set to DATA_PAGE_V2.

@liancheng
Contributor

ok to test

@HyukjinKwon
Member Author

I will try to find and test them first tomorrow before adding a commit!

@SparkQA

SparkQA commented Nov 11, 2015

Test build #45626 has finished for PR 9060 at commit 2eee7e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@liancheng I gave it a few tries to figure out the version, but as you said, it is pretty tricky to check the writer version, since it only changes the version of the data pages, which is visible only inside Parquet's internals. The writer version changes the encoding types of each data page, but the encoding type is recorded only in the data page header, which is not part of the footer.

Would it be too inappropriate if we wrote Parquet files with both version 1 and version 2 and then checked whether the two file sizes are equal?

Since the encoding types are different, the sizes should also be different.
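A rough spark-shell sketch of this size-comparison idea, with hypothetical output paths and assuming parquet-mr's "parquet.writer.version" key (a more direct footer-based check is suggested below):

import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
val fs = FileSystem.get(hadoopConf)

// Write identical data with the given writer version and return the
// total size of the resulting Parquet part files.
def writtenSize(version: String, path: String): Long = {
  hadoopConf.set("parquet.writer.version", version)
  sqlContext.range(1 << 16).coalesce(1).write.mode("overwrite").parquet(path)
  fs.listStatus(new Path(path))
    .filter(_.getPath.getName.endsWith(".parquet"))
    .map(_.getLen).sum
}

// Different page encodings should yield different file sizes.
assert(writtenSize("PARQUET_1_0", "/tmp/parquet/v1") !=
  writtenSize("PARQUET_2_0", "/tmp/parquet/v2"))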

@liancheng
Contributor

I think we can check the column encoding information, which is accessible from the Parquet footers. For example, PARQUET_2_0 uses RLE_DICTIONARY while PARQUET_1_0 uses PLAIN_DICTIONARY (see here).

The parquet-meta CLI tool can be a reference for how to inspect related metadata.
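For reference, a hedged sketch of the same check done programmatically, assuming the parquet-mr 1.x footer API (ParquetFileReader.readFooter) and a hypothetical file name:

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/tmp/parquet/dict/part-r-00000.gz.parquet"))

// Each column chunk records its set of encodings in the footer:
// PARQUET_2_0 output should include RLE_DICTIONARY, while PARQUET_1_0
// output should include PLAIN_DICTIONARY.
footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach(column => println(column.getEncodings))
}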

@HyukjinKwon
Member Author

Thank you very much. I will try it that way.

@liancheng
Contributor

You can construct a Parquet file consisting of a single column with dictionary encoding using:

val path = "file:///tmp/parquet/dict"
sqlContext.range(1 << 16).selectExpr("(id % 4) AS i").coalesce(1).write.mode("overwrite").parquet(path)

And here are instructions for building and installing the parquet-tools CLI tool. Then you can inspect the Parquet metadata using:

$ parquet-meta /tmp/parquet/dict

file:        file:/private/tmp/parquet/dict/part-r-00000-88498608-9eed-4728-b96a-b60bc5ebc2a8.gz.parquet
creator:     parquet-mr version 1.6.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"i","type":"long","nullable":true,"metadata":{}}]}

file schema: root
----------------------------------------------------------------------------------------------------------------------------------------------
i:           OPTIONAL INT64 R:0 D:1

row group 1: RC:65536 TS:16615 OFFSET:4
----------------------------------------------------------------------------------------------------------------------------------------------
i:            INT64 GZIP DO:0 FPO:4 SZ:198/16615/83.91 VC:65536 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY

The ENC:... part of the last line is the column encoding information.

@HyukjinKwon
Member Author

Thanks! I will follow that approach.

@HyukjinKwon
Member Author

Fortunately, I have worked with parquet-tools before and have looked through the Parquet code several times :).

Thank you very much for your help; this was much easier than I thought it would be.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45810 has finished for PR 9060 at commit 2d1d343.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45811 has finished for PR 9060 at commit 7e80ad6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45831 has finished for PR 9060 at commit 78449ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TypedColumn[-T, U](
    • class JavaTrackStateDStream[KeyType, ValueType, StateType, EmittedType](

Contributor

Nit: Remove this empty line.

@liancheng
Contributor

LGTM except for a few minor styling issues. I can merge it right after you fix them.

@HyukjinKwon
Member Author

I happened to see the TODO Adds test case for reading dictionary encoded decimals written as 'FIXED_LEN_BYTE_ARRAY'.

I will also add that test in the follow-up PR that uses the overloaded writeMetaFile.

@SparkQA

SparkQA commented Nov 16, 2015

Test build #45964 has finished for PR 9060 at commit cea5034.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@marmbrus Is this one OK for branch-1.6?

@asfgit asfgit closed this in 7f8eb3b Nov 16, 2015
@liancheng
Contributor

@HyukjinKwon Thanks! I've merged this one to master. And yes, please feel free to add the decimal test case(s).

@marmbrus
Contributor

Sure

@liancheng
Contributor

Merging to branch-1.6.

asfgit pushed a commit that referenced this pull request Nov 16, 2015
https://issues.apache.org/jira/browse/SPARK-11044

Spark writes Parquet files only with writer version 1, ignoring the writer version given by the user.

So, in this PR, the user-specified writer version is kept if given, and version 1 is set as the default otherwise.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>

Closes #9060 from HyukjinKwon/SPARK-11044.

(cherry picked from commit 7f8eb3b)
Signed-off-by: Cheng Lian <lian@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 17, 2015
…metadata and add a test for FIXED_LEN_BYTE_ARRAY

As discussed in #9660 and #9060, I cleaned up unused imports, added a test for fixed-length byte arrays, and used a common function for writing the Parquet metadata.

For the fixed-length byte array test, I verified the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools).

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9754 from HyukjinKwon/SPARK-11694-followup.
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015
…metadata and add a test for FIXED_LEN_BYTE_ARRAY

As discussed in apache/spark#9660 and apache/spark#9060, I cleaned up unused imports, added a test for fixed-length byte arrays, and used a common function for writing the Parquet metadata.

For the fixed-length byte array test, I verified the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools).

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9754 from HyukjinKwon/SPARK-11694-followup.
@HyukjinKwon HyukjinKwon deleted the SPARK-11044 branch September 23, 2016 18:28