Description
When ParquetOutputCommitter fails to write the summary files, it only deletes _metadata. This can leave an inconsistent, pre-existing _common_metadata file behind.
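The intended behavior would be to clean up both summary files on failure. Below is a minimal sketch of that cleanup using the standard Hadoop FileSystem API; the helper name cleanUpSummaries is hypothetical, and the literal file names correspond to the _metadata and _common_metadata summary files written by parquet-hadoop.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper sketching the desired cleanup: when writing the
// summary files fails, remove BOTH of them rather than _metadata alone,
// so no inconsistent _common_metadata is left behind.
def cleanUpSummaries(outputPath: Path, conf: Configuration): Unit = {
  val fs: FileSystem = outputPath.getFileSystem(conf)
  // Current behavior: only _metadata is deleted on failure.
  fs.delete(new Path(outputPath, "_metadata"), false)
  // Missing piece: _common_metadata should be deleted as well.
  fs.delete(new Path(outputPath, "_common_metadata"), false)
}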
This issue can be reproduced with the following Spark shell snippet:
import sqlContext.implicits._

val path = "file:///tmp/foo"

// First write: rows whose single column is a struct of two string fields.
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)

// Second write: appends rows whose struct has a third field, so the two
// Parquet files carry different user-defined metadata (Spark SQL schemas).
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
The second write job fails to write the summary files because the two written Parquet files contain different user-defined metadata (Spark SQL schemas). Afterwards, a stale _common_metadata file is left in the directory:
$ tree /tmp/foo
/tmp/foo
├── _SUCCESS
├── _common_metadata
├── part-r-00000-1c8bcb7f-84cf-43e3-9cd6-04d371322d95.gz.parquet
└── part-r-00000-d759c53f-d12f-4555-9b27-8b03a8343b17.gz.parquet
Checking its schema shows that the nested group contains only two fields, which is wrong since the appended data has three:
$ parquet-schema /tmp/foo/_common_metadata
message root {
  optional group _1 {
    optional binary _1 (UTF8);
    optional binary _2 (UTF8);
  }
}
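To see that this file is stale rather than merely incomplete, one can read the directory back with Spark's schema merging enabled. This is a sketch assuming the same Spark shell session; mergeSchema is Spark SQL's Parquet option for reconciling differing per-file schemas.

// Reading with schema merging reconciles the two file schemas and reports
// all three nested fields, unlike the leftover _common_metadata.
sqlContext.read.option("mergeSchema", "true").parquet(path).printSchema()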
Reporter: Cheng Lian / @liancheng
Assignee: Cheng Lian / @liancheng
Note: This issue was originally created as PARQUET-359. Please see the migration documentation for further details.