
Existing _common_metadata should be deleted when ParquetOutputCommitter fails to write summary files #1875


Description


ParquetOutputCommitter only deletes _metadata when it fails to write the summary files. This can leave an inconsistent, pre-existing _common_metadata file behind.
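
For context, here is a minimal sketch of the cleanup the committer should perform on such a failure (assuming the Hadoop FileSystem API; cleanUpSummaryFiles is a hypothetical helper for illustration, not the actual ParquetOutputCommitter code):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: on a summary-write failure, remove BOTH summary
// files, not just _metadata, so no stale metadata is left in the directory.
def cleanUpSummaryFiles(outputPath: Path, conf: Configuration): Unit = {
  val fs = outputPath.getFileSystem(conf)
  Seq("_metadata", "_common_metadata").foreach { name =>
    val summaryFile = new Path(outputPath, name)
    if (fs.exists(summaryFile)) {
      fs.delete(summaryFile, false)
    }
  }
}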

This issue can be reproduced via the following Spark shell snippet:

import sqlContext.implicits._

val path = "file:///tmp/foo"
// First write: each row is a nested struct with 2 string fields.
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
// Second write: appends rows whose nested struct has 3 string fields, so the
// two part files end up with conflicting user-defined metadata.
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)

The second write job fails to write the summary files because the two written Parquet files contain different user-defined metadata (Spark SQL schemas). Afterwards, a stale _common_metadata file is left behind:

$ tree /tmp/foo
/tmp/foo
├── _SUCCESS
├── _common_metadata
├── part-r-00000-1c8bcb7f-84cf-43e3-9cd6-04d371322d95.gz.parquet
└── part-r-00000-d759c53f-d12f-4555-9b27-8b03a8343b17.gz.parquet

Checking its schema shows that the nested group contains only 2 fields, which is wrong; it still reflects the schema of the first write:

$ parquet-schema /tmp/foo/_common_metadata
message root {
  optional group _1 {
    optional binary _1 (UTF8);
    optional binary _2 (UTF8);
  }
}
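
Until this is fixed, a possible workaround sketch (run in the same Spark shell, assuming the Hadoop FileSystem API) is to delete the stale summary file by hand so that readers consult the per-file footers instead:

import org.apache.hadoop.fs.Path

// Remove the stale _common_metadata left over from the failed summary write.
val stale = new Path("file:///tmp/foo/_common_metadata")
val fs = stale.getFileSystem(sc.hadoopConfiguration)
if (fs.exists(stale)) fs.delete(stale, false)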

Reporter: Cheng Lian / @liancheng
Assignee: Cheng Lian / @liancheng


Note: This issue was originally created as PARQUET-359. Please see the migration documentation for further details.
