[SPARK-11044][SQL] Parquet writer version fixed as version1 #9060
Conversation
Can you just use setIfUnset here?
Yep, I just updated. Thanks.
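For context, a minimal sketch of what the `setIfUnset` suggestion amounts to, assuming Hadoop's `Configuration` and the parquet-mr constants `ParquetOutputFormat.WRITER_VERSION` and `ParquetProperties.WriterVersion.PARQUET_1_0`; this is my reading of the suggestion, not the exact diff:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.column.ParquetProperties
import org.apache.parquet.hadoop.ParquetOutputFormat

val conf = new Configuration()
// setIfUnset writes the default only when the key is absent, so a
// user-supplied writer version (e.g. PARQUET_2_0) is preserved.
conf.setIfUnset(
  ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_1_0.toString)
```

This is why `setIfUnset` is preferable to `set` here: `set` would clobber a writer version the user had already configured.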
/cc @liancheng
@liancheng I assume you missed this.
@HyukjinKwon Oh yeah, sorry. I finally got some time to clean up my review queue :) I wonder if there is an easy way to add a test case for this? At first I thought
ok to test
I will try to find and test them first tomorrow before adding a commit!
Test build #45626 has finished for PR 9060 at commit
@liancheng I gave it a few tries to figure out the version, but as you said, it is pretty tricky to check the writer version, since it only changes the version of the data pages, which we can see only within Parquet's internals. This is because the writer version changes the encoding types of each data page, but the encoding type is recorded only in the data page header, which is not part of the footer. Would it be too inappropriate if we wrote Parquet files with both version1 and version2 and then checked whether their sizes are equal? Since the encoding types are different, the sizes should also be different.
I think we can check the column encoding information, which is accessible from Parquet footers. For example, the parquet-meta CLI tool can be a reference for how to inspect the related metadata.
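A rough sketch of how such a footer-based check might look, assuming parquet-mr's `ParquetFileReader.readFooter` API as it existed around this time; the `partFile` path is a placeholder, not something from this thread:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// "partFile" is a placeholder for the path to a single part-*.parquet file.
val partFile = "/tmp/parquet/dict/part-r-00000.gz.parquet"
// Read only the footer; column chunk metadata lives there, so no data
// pages need to be decoded.
val footer = ParquetFileReader.readFooter(new Configuration(), new Path(partFile))
// The footer records the set of encodings used by each column chunk,
// which is enough to tell a version1 file from a version2 one.
val encodings = footer.getBlocks.get(0).getColumns.get(0).getEncodings
println(encodings)
```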
Thank you very much. I will try it that way.
You may construct a Parquet file consisting of a single column with dictionary encoding using:

```scala
val path = "file:///tmp/parquet/dict"
sqlContext.range(1 << 16).selectExpr("(id % 4) AS i").coalesce(1).write.mode("overwrite").parquet(path)
```

And here are instructions for building and installing the parquet-tools CLI tool. Then you can inspect Parquet metadata using: The
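Building on the snippet above, a hedged sketch of the assertion a test might make, assuming that a version1 writer records PLAIN_DICTIONARY for dictionary-encoded columns while a version2 writer would record RLE_DICTIONARY (my understanding of the Parquet format, not something stated in this thread):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Directory written by the snippet above.
val path = "file:///tmp/parquet/dict"
val conf = new Configuration()
// Pick one part-file; the glob handling here is illustrative only.
val partFile = new Path(path).getFileSystem(conf)
  .globStatus(new Path(path, "part-*")).head.getPath
val encodings = ParquetFileReader.readFooter(conf, partFile)
  .getBlocks.get(0).getColumns.get(0).getEncodings.asScala.map(_.toString)
// Under writer version1, dictionary-encoded pages should report
// PLAIN_DICTIONARY rather than version2's RLE_DICTIONARY.
assert(encodings.contains("PLAIN_DICTIONARY"))
```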
Thanks! I will follow that approach.
Fortunately, I have worked with parquet-tools once and have looked through the Parquet code several times before :). Thank you very much for your help. This could be done much more easily than I thought because of your help.
Test build #45810 has finished for PR 9060 at commit
Test build #45811 has finished for PR 9060 at commit
Test build #45831 has finished for PR 9060 at commit
Nit: Remove this empty line.
LGTM except for a few minor styling issues. I can merge it right after you fix them.
I accidentally saw I will also add this test in the following PR for using the overloaded
Test build #45964 has finished for PR 9060 at commit
@marmbrus Is this one OK for branch-1.6?
@HyukjinKwon Thanks! I've merged this one to master. And yes, please feel free to add the decimal test case(s).
Sure
Merging to branch-1.6. |
https://issues.apache.org/jira/browse/SPARK-11044

Spark writes Parquet files only with writer version1, ignoring the writer version given by the user. So, in this PR, it keeps the writer version if given, or sets version1 as the default.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>

Closes #9060 from HyukjinKwon/SPARK-11044.

(cherry picked from commit 7f8eb3b)
Signed-off-by: Cheng Lian <lian@databricks.com>
…metadata and add a test for FIXED_LEN_BYTE_ARRAY

As discussed in #9660 and #9060, I cleaned up unused imports, added a test for fixed-length byte arrays, and used a common function for writing Parquet metadata. For the fixed-length byte array test, I tested and checked the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools).

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9754 from HyukjinKwon/SPARK-11694-followup.