
The parquet files written with polars::prelude::ParquetWriter are malformed #3929

Closed
andrei-ionescu opened this issue Jul 7, 2022 · 11 comments
Labels
invalid: A bug report that is not actually a bug

Comments

@andrei-ionescu
Contributor

What language are you using?

Rust

Which feature gates did you use?

"polars-io", "parquet", "lazy", "dtype-struct"

Have you tried latest version of polars?

  • [yes]

What version of polars are you using?

0.22.8

What operating system are you using polars on?

macOS Monterey 12.3.1

What language version are you using?

$ rustc --version
rustc 1.64.0-nightly (7b46aa594 2022-07-05)

$ cargo --version
cargo 1.64.0-nightly (c0bbd42ce 2022-07-03)

Describe your bug.

The parquet files written with polars::prelude::ParquetWriter are malformed.

What are the steps to reproduce the behavior?

Use the following code to write a parquet file:

use polars::prelude::*;

fn main() {
    ParquetWriter::new(
        std::fs::File::create("parquet_writer_test.parquet").unwrap()
    )
        .with_statistics(true)
        .finish(
            &mut df!(
                "a" => &[1, 2, 3],
                "b" => &["a", "b", "c"],
            ).unwrap()
        )
        .unwrap();
}

Then use parquet-tools to try to get the row count:

$ parquet-tools rowcount parquet_writer_test.parquet
java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file DeprecatedRawLocalFileStatus{path=file:./parquet_writer_test.parquet; isDirectory=false; length=713; replication=1; blocksize=33554432; modification_time=1657196476000; access_time=1657196476000; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

or show the content of the file:

$ parquet-tools cat parquet_writer_test.parquet
java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT32, encodings:[PLAIN, RLE], path_in_schema:[a], codec:null, num_values:3, total_uncompressed_size:52, total_compressed_size:53, data_page_offset:4, statistics:Statistics(null_count:0, max_value:03 00 00 00, min_value:01 00 00 00))

What is the actual behavior?

The written parquet files are malformed and cannot be read by other readers. Neither the parquet-tools utility nor Apache Spark could read the file.

What is the expected behavior?

Parquet files produced by polars::prelude::ParquetWriter should be readable by other parquet readers.

@andrei-ionescu added the "bug: Something isn't working" label Jul 7, 2022
@ritchie46
Member

ritchie46 commented Jul 7, 2022

I tried writing your example and could read it with pyarrow. We write parquet version v2; I don't know if that is problematic?

P.S. I think we must let the user decide which parquet version is used.

@jorgecarleitao
Collaborator

Yeap, this is related to the LZ4 codec mess that parquet is in. @andrei-ionescu, could you use .with_compression(ParquetCompression::Snappy) instead?

use polars::frame::DataFrame;
use polars::prelude::NamedFrom;
use polars::prelude::ParquetCompression;
use polars::prelude::{df, ParquetWriter};
use polars::series::Series;

fn main() {
    ParquetWriter::new(std::fs::File::create("parquet_writer_test.parquet").unwrap())
        .with_statistics(true)
        .with_compression(ParquetCompression::Snappy)
        .finish(
            &mut df!(
                "a" => &[1, 2, 3],
                "b" => &["a", "b", "c"],
            )
            .unwrap(),
        )
        .unwrap();
}

Read from pyarrow and pyspark

import pyarrow.parquet as pq
import pyspark.sql


m = pq.read_metadata("parquet_writer_test.parquet")
print(m.to_dict())

spark = pyspark.sql.SparkSession.builder.config(
    # see https://stackoverflow.com/a/62024670/931303
    "spark.sql.parquet.enableVectorizedReader",
    "false",
).getOrCreate()

result = spark.read.parquet("parquet_writer_test.parquet").collect()

read from parquet-tools (parquet-mr):

docker run --rm -it -v $(pwd)/parquet_writer_test.parquet:/tmp/file.parquet nathanhowell/parquet-tools rowcount /tmp/file.parquet

Removing .with_compression(ParquetCompression::Snappy) causes us to use Lz4Raw, which, despite being in parquet's thrift definition, some readers still do not support.

@jorgecarleitao
Collaborator

which despite being in parquet's thrift definition

*and strongly recommend switching to lz4raw

@andrei-ionescu
Contributor Author

@jorgecarleitao, @ritchie46: It does work when Snappy compression is used. So it is good that there is at least this way of writing compatible parquet files. But this is still a bug. Maybe the default should be uncompressed or gzip compressed.

@jorgecarleitao
Collaborator

imo this is not a bug on our side -

  1. we declare the file as v2 and a valid codec (Lz4raw)
  2. pyarrow correctly read the file
  3. parquet-mr identifies this codec as null
  4. parquet-mr bails because codec must be defined

imo there are a couple of things on the parquet-mr:

  1. LZ4raw is interpreted as null by parquet-mr, imo this is a bug in parquet-mr
  2. LZ4raw is a valid codec and parquet-mr does not support it, imo this is a feature request to parquet-mr

With that said, the reason we use lz4 is that it offers great performance. We use LZ4raw because the parquet format itself strongly suggests it. So we are left with a tradeoff - either we default to something less performant that parquet-mr supports, or to something more performant that parquet-mr does not support.

I am fine either way - defaults are always hard to choose from.
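
A minimal sketch of how the file can be read back with polars' own ParquetReader, assuming the parquet_writer_test.parquet file from the snippets above and the same feature gates; a genuinely malformed file would fail here as well:

use polars::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the file written by the ParquetWriter example above and read it
    // back with polars' own ParquetReader.
    let file = std::fs::File::open("parquet_writer_test.parquet")?;
    let df = ParquetReader::new(file).finish()?;
    println!("{:?}", df);
    Ok(())
}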

@andrei-ionescu
Contributor Author

@jorgecarleitao, @ritchie46: It is not only parquet-tools (aka parquet-mr) that does not work. I tried Apache Spark and hit the same issue, and the same likely goes for any other well-established tool that is slow to upgrade to the latest versions.

I've seen that lz4raw is the newest addition to parquet-format, added 16 months ago (apache/parquet-format#168) and released in version 2.9.0. Although lz4raw is part of parquet-format, the frameworks and tools working with parquet run on older, reliable parquet versions (for example, Apache Spark currently uses 1.12.3, as seen here: https://github.com/apache/spark/blob/master/pom.xml#L134).

So, yes, it is not a bug in the code; it is more of a usability and interoperability issue on the Polars side. I mean, even if parquet-mr is fixed, there are multiple other parquet-based tools that would also need to move to 2.9.0 to be able to read those files, and that switch will eventually come, but very slowly.

I suggest two possible things to do:

  • Switch the default from lz4raw to something more common like uncompressed, snappy, etc
  • Properly state in the documentation the default parquet format/codec that Polars uses at write time

@ritchie46
Member

Switch the default from lz4raw to something more common like uncompressed, snappy, etc

Sorry, but I don't want to do that. If all tools did that, we would never make progress. It is a reasonable default that saves a lot of time in the most common cases.

I agree with you that it should be documented properly and that we can give portability advice.
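
Until that lands in the docs, a sketch of what such portability advice could look like in user code (the write_portable_parquet helper and the output path are made-up names here; Snappy is just one widely supported choice, as shown earlier in this thread):

use polars::prelude::*;
use std::fs::File;

// Hypothetical helper: always set an explicit, widely supported codec so the
// Lz4Raw default never ends up in files that parquet-mr based tools must read.
fn write_portable_parquet(df: &mut DataFrame, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create(path)?;
    ParquetWriter::new(file)
        .with_statistics(true)
        .with_compression(ParquetCompression::Snappy)
        .finish(df)?;
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut df = df!(
        "a" => &[1, 2, 3],
        "b" => &["a", "b", "c"],
    )?;
    write_portable_parquet(&mut df, "parquet_writer_portable.parquet")
}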

@andrei-ionescu
Contributor Author

andrei-ionescu commented Jul 7, 2022

@ritchie46 Ok. I understand that the first and most important thing for Polars right now is "speed"; it needs to beat everybody. I've been doing distributed data storage and processing for a good while and have never used the lz4 codec. All the use cases I've encountered used snappy, and its ratio of feature set, interoperability, and compatibility versus performance is very good. As far as I know, snappy is the next best after lz4 in terms of performance. And I've done processing and storage over multiple PB of data.

To set things straight, I'm just stating my point of view and trying to support it with arguments. I'm glad that there is a workaround for this.

Also, having it written in the documentation would be of tremendous help.

@ritchie46 added the "invalid: A bug report that is not actually a bug" label and removed the "bug: Something isn't working" label Jul 8, 2022
@ritchie46
Member

I have updated the documentation in #3926

@andrei-ionescu
Contributor Author

@ritchie46: Will the text added in #3926 regarding lz4raw vs snappy be present on this doc page: parquet.html#read--write?

@zundertj
Collaborator

@andrei-ionescu: that is the polars-book, which lives in a separate repo: https://github.com/pola-rs/polars-book. You can file an issue there, or even better, raise a pull request with the change. The text in #3926 has been added to the Python docstring, so it is visible in the API reference at https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html

I am closing this issue, on the assumption that this is sufficient information. Feel free to re-open or create a new issue if anything is missing.
