Missing parquet-file due to wrong state in writer #13508

@fpetersen-gl

Apache Iceberg version

1.9.1 (latest release)

Query engine

None

Please describe the bug 🐞

Description of the problem

If the connection to the underlying storage (S3 in our case) is cut at the wrong moment, a Parquet file is not written correctly, but its writer assumes that it was.
In our case this writer is wrapped by a FanoutWriter. The next call that closes all writers belonging to the FanoutWriter skips the failed writer, because it is already marked as closed.
The FanoutWriter then returns this file as part of its result, which makes it appear in the metadata.

Result: A snapshot is being written which references the missing parquet-file.

Code analysis

ParquetWriter.close() sets the internal state to closed before it executes any code that could fail. writer.end(metadata) can throw an IOException, but flushRowGroup(true), which runs before it, can throw an UncheckedIOException. That leaves the writer in the wrong state: marked as closed although the file was not written completely.
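
To illustrate the state problem described above, here is a minimal sketch. It is not Iceberg's code: `EarlyFlagWriter`, `EarlyFlagDemo`, and the simulated failure are hypothetical stand-ins, and only the ordering (flag set first, risky flush afterwards) is taken from the analysis.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a file writer; NOT Iceberg's ParquetWriter.
class EarlyFlagWriter implements Closeable {
  private boolean closed = false;
  private boolean failNextFlush;
  int flushAttempts = 0;

  EarlyFlagWriter(boolean failNextFlush) {
    this.failNextFlush = failNextFlush;
  }

  @Override
  public void close() {
    if (!closed) {
      this.closed = true; // state is flipped before any risky work runs
      flushRowGroup();    // may throw UncheckedIOException
    }
  }

  // Simulates a flush that fails once, e.g. because the connection was cut.
  private void flushRowGroup() {
    flushAttempts++;
    if (failNextFlush) {
      failNextFlush = false;
      throw new UncheckedIOException(new IOException("simulated connection loss"));
    }
  }

  boolean isClosed() {
    return closed;
  }
}

class EarlyFlagDemo {
  // Returns true if the second close() silently skipped the failed writer.
  static boolean secondCloseIsSkipped() {
    EarlyFlagWriter writer = new EarlyFlagWriter(true);
    try {
      writer.close(); // first close fails during the flush
    } catch (UncheckedIOException e) {
      // a wrapper might handle or log this and later retry the close
    }
    writer.close(); // no-op: the writer already reports itself as closed
    return writer.isClosed() && writer.flushAttempts == 1;
  }
}
```

In this sketch the flush is attempted only once, yet the writer reports itself as closed, so a caller that closes it a second time has no way to notice that the data never made it out.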

Possible solution

A first, naive idea would be to move this.closed = true to the very end of the method, so that the state only changes if every step of closing the writer has completed successfully.
I'll try to come up with a test that reproduces this issue and will update the ticket afterwards.
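
The proposed reordering could look roughly like the following sketch. Again, `LateFlagWriter` and `LateFlagDemo` are hypothetical names, not the real ParquetWriter, and whether it is safe to retry the flush in the real writer after a partial failure is an assumption that would still need to be checked.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a file writer; NOT Iceberg's ParquetWriter.
class LateFlagWriter implements Closeable {
  private boolean closed = false;
  private int failuresLeft;
  int flushAttempts = 0;

  LateFlagWriter(int failuresLeft) {
    this.failuresLeft = failuresLeft;
  }

  @Override
  public void close() {
    if (closed) {
      return;
    }
    flushRowGroup();    // may throw; the flag below is then never set
    this.closed = true; // only reached if all closing steps succeeded
  }

  // Simulates a flush that fails a configurable number of times.
  private void flushRowGroup() {
    flushAttempts++;
    if (failuresLeft > 0) {
      failuresLeft--;
      throw new UncheckedIOException(new IOException("simulated connection loss"));
    }
  }

  boolean isClosed() {
    return closed;
  }
}

class LateFlagDemo {
  // Returns true if a failed close() leaves the writer open and a retry closes it.
  static boolean failedCloseStaysOpen() {
    LateFlagWriter writer = new LateFlagWriter(1);
    boolean threw = false;
    try {
      writer.close();
    } catch (UncheckedIOException e) {
      threw = true;
    }
    boolean openAfterFailure = !writer.isClosed();
    writer.close(); // the retry runs the flush again and succeeds
    return threw && openAfterFailure && writer.isClosed() && writer.flushAttempts == 2;
  }
}
```

With the flag set last, a failed close() leaves the writer visibly open, so a wrapping writer that closes all of its writers again would no longer skip it.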

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
