
The parquet files written with polars::prelude::ParquetWriter are malformed #3929

Closed
andrei-ionescu opened this issue Jul 7, 2022 · 11 comments
Labels
invalid: A bug report that is not actually a bug

Comments

@andrei-ionescu
Contributor

What language are you using?

Rust

Which feature gates did you use?

"polars-io", "parquet", "lazy", "dtype-struct"

Have you tried latest version of polars?

  • [yes]

What version of polars are you using?

0.22.8

What operating system are you using polars on?

macOS Monterey 12.3.1

What language version are you using?

$ rustc --version
rustc 1.64.0-nightly (7b46aa594 2022-07-05)

$ cargo --version
cargo 1.64.0-nightly (c0bbd42ce 2022-07-03)

Describe your bug.

The parquet files written with polars::prelude::ParquetWriter are malformed.

What are the steps to reproduce the behavior?

Use the following code to write a parquet file:

use polars::prelude::*;

fn main() {
    ParquetWriter::new(
        std::fs::File::create("parquet_writer_test.parquet").unwrap()
    )
        .with_statistics(true)
        .finish(
            &mut df!(
                "a" => &[1, 2, 3],
                "b" => &["a", "b", "c"],
            ).unwrap()
        )
        .unwrap();
}

Then use parquet-tools to try to get the row count:

$ parquet-tools rowcount parquet_writer_test.parquet
java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file DeprecatedRawLocalFileStatus{path=file:./parquet_writer_test.parquet; isDirectory=false; length=713; replication=1; blocksize=33554432; modification_time=1657196476000; access_time=1657196476000; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

or show the content of the file:

$ parquet-tools cat parquet_writer_test.parquet
java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT32, encodings:[PLAIN, RLE], path_in_schema:[a], codec:null, num_values:3, total_uncompressed_size:52, total_compressed_size:53, data_page_offset:4, statistics:Statistics(null_count:0, max_value:03 00 00 00, min_value:01 00 00 00))

What is the actual behavior?

The written parquet files are malformed and cannot be read by other readers. Neither the parquet-tools utility nor Apache Spark could read the file.

What is the expected behavior?

Parquet files produced by polars::prelude::ParquetWriter should be readable by other parquet readers.

@andrei-ionescu added the "bug: Something isn't working" label Jul 7, 2022
@ritchie46
Member

ritchie46 commented Jul 7, 2022

I tried writing your example and could read it with pyarrow. We write parquet version v2; I don't know if that is problematic?

P.S. I think we must let the user decide which parquet version is used.

@jorgecarleitao
Collaborator

Yeap, this is related to the LZ4 codec mess that parquet is in. @andrei-ionescu, could you use .with_compression(ParquetCompression::Snappy) instead?

use polars::frame::DataFrame;
use polars::prelude::NamedFrom;
use polars::prelude::ParquetCompression;
use polars::prelude::{df, ParquetWriter};
use polars::series::Series;

fn main() {
    ParquetWriter::new(std::fs::File::create("parquet_writer_test.parquet").unwrap())
        .with_statistics(true)
        .with_compression(ParquetCompression::Snappy)
        .finish(
            &mut df!(
                "a" => &[1, 2, 3],
                "b" => &["a", "b", "c"],
            )
            .unwrap(),
        )
        .unwrap();
}

Read from pyarrow and pyspark

import pyarrow.parquet as pq
import pyspark.sql


m = pq.read_metadata("parquet_writer_test.parquet")
print(m.to_dict())

spark = pyspark.sql.SparkSession.builder.config(
    # see https://stackoverflow.com/a/62024670/931303
    "spark.sql.parquet.enableVectorizedReader",
    "false",
).getOrCreate()

result = spark.read.parquet("parquet_writer_test.parquet").collect()

read from parquet-tools (parquet-mr):

docker run --rm -it -v $(pwd)/parquet_writer_test.parquet:/tmp/file.parquet nathanhowell/parquet-tools rowcount /tmp/file.parquet

Removing .with_compression(ParquetCompression::Snappy) causes us to use Lz4Raw, which, despite being in parquet's thrift definition, some readers still do not support.

@jorgecarleitao
Collaborator

which despite being in parquet's thrift definition

*and strongly recommend switching to lz4raw

@andrei-ionescu
Contributor Author

@jorgecarleitao, @ritchie46: It does work when Snappy compression is used. So it is good that there is at least this way of writing compatible parquet files. But this is still a bug. Maybe the default should be uncompressed or gzip compressed.

@jorgecarleitao
Collaborator

imo this is not a bug on our side -

  1. we declare the file as v2 and a valid codec (Lz4raw)
  2. pyarrow correctly read the file
  3. parquet-mr identifies this codec as null
  4. parquet-mr bails because codec must be defined

imo there are a couple of things on the parquet-mr:

  1. LZ4raw is interpreted as null by parquet-mr, imo this is a bug in parquet-mr
  2. LZ4raw is a valid codec and parquet-mr does not support it, imo this is a feature request to parquet-mr

With that said, the reason we use lz4 is that it offers great performance. We use LZ4raw because the parquet format itself strongly suggests it. So we are left with a tradeoff - either we default to something less performant that parquet-mr supports, or to something more performant that parquet-mr does not support.

I am fine either way - defaults are always hard to choose from.
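
A minimal sketch of how the file can be read back with polars' own ParquetReader, assuming the parquet_writer_test.parquet file from the snippets above and the same feature gates; a genuinely malformed file would fail here as well:

use polars::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the file written by the ParquetWriter example above and read it
    // back with polars' own ParquetReader.
    let file = std::fs::File::open("parquet_writer_test.parquet")?;
    let df = ParquetReader::new(file).finish()?;
    println!("{:?}", df);
    Ok(())
}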

@andrei-ionescu
Contributor Author

@jorgecarleitao, @ritchie46: It is not only parquet-tools (aka parquet-mr) that does not work. I tried Apache Spark and hit the same issue, and the same likely goes for any other well-established tool that is slow to upgrade to the latest versions.

I've seen that lz4raw is the newest addition to parquet-format, added 16 months ago (apache/parquet-format#168) and released in version 2.9.0. Although lz4raw is part of parquet-format, the frameworks and tools working with parquet run on older, reliable parquet versions (for example, Apache Spark currently uses 1.12.3, as seen here: https://github.com/apache/spark/blob/master/pom.xml#L134).

So, yes, it is not a bug in the code; it is more of a usability and interoperability issue on the Polars side. I mean, even if parquet-mr is fixed, there are multiple other parquet-based tools that would also need to move to 2.9.0 to be able to read those files, and that switch will eventually come, but very slowly.

I suggest two possible things to do:

  • Switch the default from lz4raw to something more common like uncompressed, snappy, etc
  • Properly state in the documentation the default parquet format/codec that Polars uses at write time

@ritchie46
Member

Switch the default from lz4raw to something more common like uncompressed, snappy, etc

Sorry, but I don't want to do that. If all tools did that, we would never make progress. It is a reasonable default that saves a lot of time in the most common cases.

I agree with you that it should be documented properly and that we can give portability advice.
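
Until that lands in the docs, a sketch of what such portability advice could look like in user code (the write_portable_parquet helper and the output path are made-up names here; Snappy is just one widely supported choice, as shown earlier in this thread):

use polars::prelude::*;
use std::fs::File;

// Hypothetical helper: always set an explicit, widely supported codec so the
// Lz4Raw default never ends up in files that parquet-mr based tools must read.
fn write_portable_parquet(df: &mut DataFrame, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create(path)?;
    ParquetWriter::new(file)
        .with_statistics(true)
        .with_compression(ParquetCompression::Snappy)
        .finish(df)?;
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut df = df!(
        "a" => &[1, 2, 3],
        "b" => &["a", "b", "c"],
    )?;
    write_portable_parquet(&mut df, "parquet_writer_portable.parquet")
}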

@andrei-ionescu
Contributor Author

andrei-ionescu commented Jul 7, 2022

@ritchie46 Ok. I understand that the first and most important thing for Polars right now is "speed"; it needs to beat everybody. I've been doing distributed data storage and processing for a good while and have never used the lz4 codec. All the use cases I've encountered used snappy, and its ratio of feature set, interoperability, and compatibility versus performance is very good. As far as I know, snappy is the next best after lz4 in terms of performance. And I've done processing and storage over multiple PB of data.

To set things straight, I'm just stating my point of view and trying to support it with arguments. I'm glad that there is a workaround for this.

Also, having it written in the documentation would be of tremendous help.

@ritchie46 added the "invalid: A bug report that is not actually a bug" label and removed the "bug: Something isn't working" label Jul 8, 2022
@ritchie46
Member

I have updated the documentation in #3926

@andrei-ionescu
Contributor Author

@ritchie46: Will the text added in #3926 regarding lz4raw vs snappy be present on this doc page: parquet.html#read--write?

@zundertj
Collaborator

@andrei-ionescu: that is the polars-book, which lives in a separate repo: https://github.com/pola-rs/polars-book. You can file an issue there, or even better, raise a pull request with the change. The text in #3926 has been added to the Python docstring, so it is visible in the API reference at https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html

I am closing this issue, on the assumption that this is sufficient information. Feel free to re-open or create a new issue if anything is missing.
