[SPARK-25366][SQL]Zstd and brotli CompressionCodec are not supported for parquet files #22358
Conversation
But if the codecs are found, we support those compressions, no?
docs/sql-programming-guide.md
Outdated
I prefer none, uncompressed, snappy, gzip, lzo, brotli(need install ...), lz4, zstd(need install ...).
Installation alone may not solve it.
none, uncompressed, snappy, gzip, lzo, brotli(need install brotli-codec), lz4, zstd(since Hadoop 2.9.0)
https://jira.apache.org/jira/browse/HADOOP-13578
https://github.com/rdblue/brotli-codec
https://jira.apache.org/jira/browse/HADOOP-13126
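As an illustration of how this config is used in practice, here is a minimal sketch (not from the PR itself), assuming a local SparkSession and that the selected codec's Hadoop class is on the classpath; the object name and output path are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ParquetCompressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parquet-compression-sketch")
      .getOrCreate()
    import spark.implicits._

    // Session-wide default codec for Parquet writes; snappy is Spark's built-in default.
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // The per-write "compression" option takes precedence over the session config.
    // Values like "zstd" or "brotli" only work when the matching Hadoop codec class
    // (ZStandardCodec / BrotliCodec) is actually on the classpath.
    df.write.option("compression", "gzip").parquet("/tmp/parquet-compression-sketch")

    spark.stop()
  }
}
```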
Got it, thanks @wangyum
Is hadoop-2.9.x officially supported in Spark?
ah, ok.
It is using reflection to acquire the Hadoop compression codec classes, which are not in the jars (hadoop-common-2.6.5.jar, hadoop-common-2.7.0.jar, hadoop-common-3.1.0.jar).
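As a rough sketch of that reflection-based lookup (the helper below is hypothetical, not part of Spark or Parquet), one can check whether the Hadoop codec class that Parquet will look up by name is actually on the classpath:

```scala
// Hypothetical helper: checks whether the Hadoop codec class Parquet looks up
// by name is present on the classpath, mirroring the reflection lookup
// described above.
object CodecAvailability {
  private val codecClasses = Map(
    "zstd"   -> "org.apache.hadoop.io.compress.ZStandardCodec",
    "brotli" -> "org.apache.hadoop.io.compress.BrotliCodec"
  )

  def isAvailable(codec: String): Boolean =
    codecClasses.get(codec.toLowerCase).exists { className =>
      try {
        Class.forName(className) // same lookup-by-name idea as Parquet's CodecFactory
        true
      } catch {
        case _: ClassNotFoundException => false
      }
    }
}

// Example: CodecAvailability.isAvailable("zstd") returns false on a Hadoop 2.6/2.7
// classpath that does not bundle ZStandardCodec.
```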
Thanks. If the codecs are found, we support those compressions, but how do I find them? @HyukjinKwon
That's probably something we should document, or improve the error message. Ideally, we should fix the error message from Parquet. Don't you think?
Yeah, the error message is output from an external jar (parquet-common-1.10.0.jar).
Test build #95785 has finished for PR 22358 at commit
If the codecs are found, then we support them. One thing we might do is document that the codec has to be provided explicitly, but I am not sure how many users are confused about it.
Just FYI on related discussion: #21070 (comment)
I thought if you remove it from here, the user would not be able to use zstd or brotli even if they are installed/enabled/available?
I agree with you, removing is not a good idea.
Thanks.
Force-pushed from 1db036a to 5c478b9.
Test build #95852 has finished for PR 22358 at commit
I am 0 on this since it is worth documenting.
docs/sql-programming-guide.md
Outdated
I would just add a few lines for brotli and zstd below and leave the original text as is.
Force-pushed from 5c478b9 to dd86d3f.
Test build #95930 has finished for PR 22358 at commit
docs/sql-programming-guide.md
Outdated
needs install -> needs to install
I'm okay with it, but I would close this if no committer agrees with (approves) this for a long time.
Force-pushed from dd86d3f to 64aef6b.
Test build #95969 has finished for PR 22358 at commit
docs/sql-programming-guide.md
Outdated
@HyukjinKwon How about adding a link? Users may not know where to download it.
`brotliCodec` -> [`brotli-codec`](https://github.com/rdblue/brotli-codec)
If the link is expected to be rather permanent, it's fine.
It is clearer to say "zstd requires ZStandardCodec to be installed".
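For completeness, a sketch (under stated assumptions, not from the PR) of one way to make such an extra codec jar visible to Spark via the `spark.jars` setting; the jar path is a placeholder, and for zstd a Hadoop 2.9.0+ distribution providing ZStandardCodec would be the more usual route.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the jar path below is a placeholder; point it at a jar that provides
// org.apache.hadoop.io.compress.BrotliCodec (e.g. a build of brotli-codec).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("extra-codec-sketch")
  .config("spark.jars", "/path/to/brotli-codec.jar")
  .getOrCreate()

// With the codec class on the classpath, it can be selected per write:
// df.write.option("compression", "brotli").parquet("/tmp/out")
```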
Force-pushed from 64aef6b to 39eaf1d.
docs/sql-programming-guide.md
Outdated
brotliCodec -> BrotliCodec
Test build #96312 has finished for PR 22358 at commit
Force-pushed from 39eaf1d to 0e5d0bc.
Test build #96314 has finished for PR 22358 at commit
srowen left a comment:
I think a bit of documentation is OK.

What changes were proposed in this pull request?
Hadoop 2.6 and Hadoop 2.7 do not contain the zstd and brotli compression codecs, and Hadoop 3.1 contains only the zstd compression codec. So I think we should remove zstd and brotli for the time being.
Set spark.sql.parquet.compression.codec=brotli:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
Set spark.sql.parquet.compression.codec=zstd:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
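A minimal reproduction sketch of the failure above (not part of the PR), assuming a classpath without org.apache.hadoop.io.compress.ZStandardCodec, e.g. Hadoop 2.6/2.7; the output path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

// On a classpath without ZStandardCodec, the write below fails with the
// BadConfigurationException shown above.
val spark = SparkSession.builder().master("local[*]").appName("zstd-repro").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

// Throws: org.apache.parquet.hadoop.BadConfigurationException:
//   Class org.apache.hadoop.io.compress.ZStandardCodec was not found
Seq((1, "a")).toDF("id", "value").write.parquet("/tmp/zstd-repro")
```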
How was this patch tested?
Existing unit tests.