The parquet files written with `polars::prelude::ParquetWriter` are malformed #3929
Comments
I tried writing your example and I could read it with pyarrow. We write parquet version v2; I don't know if that is problematic? P.S. I think we must let the user decide which parquet version is used.
Yeap, this is related to the

```rust
use polars::frame::DataFrame;
use polars::prelude::NamedFrom;
use polars::prelude::ParquetCompression;
use polars::prelude::{df, ParquetWriter};
use polars::series::Series;

fn main() {
    ParquetWriter::new(std::fs::File::create("parquet_writer_test.parquet").unwrap())
        .with_statistics(true)
        .with_compression(ParquetCompression::Snappy)
        .finish(
            &mut df!(
                "a" => &[1, 2, 3],
                "b" => &["a", "b", "c"],
            )
            .unwrap(),
        )
        .unwrap();
}
```

Read from pyarrow and pyspark:

```python
import pyarrow.parquet as pq
import pyspark.sql

m = pq.read_metadata("parquet_writer_test.parquet")
print(m.to_dict())

spark = pyspark.sql.SparkSession.builder.config(
    # see https://stackoverflow.com/a/62024670/931303
    "spark.sql.parquet.enableVectorizedReader",
    "false",
).getOrCreate()

result = spark.read.parquet("parquet_writer_test.parquet").collect()
```

Read from parquet-tools (parquet-mr):

```shell
docker run --rm -it -v $(pwd)/parquet_writer_test.parquet:/tmp/file.parquet nathanhowell/parquet-tools rowcount /tmp/file.parquet
```
@jorgecarleitao, @ritchie46: It does work when the Snappy compression is used, so it is good that there is at least this way of writing compatible parquet files. But this is still a bug. Maybe the default should be uncompressed or gzip.
imo this is not a bug on our side -
imo there are a couple of things on the parquet-mr side:
with that said, the reason we use LZ4 is that it offers great performance. We use LZ4 raw because the parquet format itself strongly suggests it. So we are left with a tradeoff: either we default to something less performant that parquet-mr supports, or to something more performant that parquet-mr does not support. I am fine either way - defaults are always hard to choose.
@jorgecarleitao, @ritchie46: So, yes, it is not a bug in the code; it is more of a usability and interoperability issue on the Polars side. I suggest two possible things to do:
Sorry, but I don't want to do that. If all tools do that we never make progress. It is a reasonable default that saves a lot of time in the most common cases. I agree with you that it should be documented properly and that we can give portability advice.
@ritchie46 Ok. I understand that the first and most important thing for Polars right now is speed; it needs to beat everybody. I've been doing distributed data storage and processing for a good while and never used the LZ4 raw format. To set things straight, I'm just stating my point of view and trying to support it with arguments. I'm glad that there is a workaround for this. Also, having it written in the documentation would be of tremendous help.
I have updated the documentation in #3926
@ritchie46: Will the text added in #3926 in regards to
@andrei-ionescu: that is the polars-book, which lives in a separate repo: https://github.com/pola-rs/polars-book. You can file an issue there, or even better, raise a pull request with the change. The text in #3926 has been added to the Python doc string, and so it is visible in the API reference at https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html

I am closing this issue, on the assumption that this is sufficient information. Feel free to re-open or create a new issue if anything is missing.
What language are you using?
Rust
Which feature gates did you use?
"polars-io", "parquet", "lazy", "dtype-struct"
Have you tried latest version of polars?
What version of polars are you using?
0.22.8
What operating system are you using polars on?
macOS Monterey 12.3.1
What language version are you using?
Describe your bug.
The parquet files written with `polars::prelude::ParquetWriter` are malformed.

What are the steps to reproduce the behavior?
Use the following code to write a parquet file:
Then use `parquet-tools` to try to get the row count or show the content of the file:
What is the actual behavior?
The written parquet files are malformed and cannot be read by other readers. The `parquet-tools` utility could not read the file, nor could Apache Spark.

What is the expected behavior?
Parquet files produced by `polars::prelude::ParquetWriter` should be readable.