Add a serializer for FileScanTask 

For batch/bounded mode, Java `Serializable` works well because schema evolution is not a concern. If we are going to support streaming reads with long-running jobs, we need to consider schema evolution for checkpoint state. Otherwise, a change in the code might break Java serialization and, with it, the ability to restore from a checkpoint.

Here are some high-level thoughts.

* Move most of the schema defined in `DataFile` to parent interface `ContentFile`. Extend the schema with the additional fields in `DataFile` and `DeleteFile`.
* Add a schema to `FileScanTask` where `ResidualEvaluator` and `PartitionSpec` fields will be defined as string type.
* The `CombinedScanTask` schema is straightforward. It should be just a collection of `FileScanTask`.
* Add `ScanTasks` util class in iceberg-core that handles the serialization and deserialization of `FileScanTask` and `CombinedScanTask`
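To make the shape of the last bullet concrete, here is a rough sketch of what a versioned serializer could look like. It is not the Iceberg API: `TaskRecord` and `ScanTaskSerde` are hypothetical names, the task is reduced to a handful of fields, and it uses a hand-written binary layout with a version prefix rather than the schema-based approach proposed above. The partition spec and residual are carried as plain strings, as suggested for the `FileScanTask` schema.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for FileScanTask; not the Iceberg interface.
final class TaskRecord {
  final String filePath;
  final long start;
  final long length;
  final String partitionSpec; // spec carried as a string (assumption)
  final String residual;      // residual expression carried as a string (assumption)

  TaskRecord(String filePath, long start, long length, String partitionSpec, String residual) {
    this.filePath = filePath;
    this.start = start;
    this.length = length;
    this.partitionSpec = partitionSpec;
    this.residual = residual;
  }
}

// Hypothetical utility in the spirit of the proposed ScanTasks class.
final class ScanTaskSerde {
  // Written first so that a future reader can branch on older layouts.
  private static final int VERSION = 1;

  private ScanTaskSerde() {}

  // Serializes a combined task, modeled here as a plain list of file tasks.
  static byte[] serialize(List<TaskRecord> tasks) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(VERSION);
      out.writeInt(tasks.size());
      for (TaskRecord task : tasks) {
        // writeUTF is limited to 65535 encoded bytes per string; fine for a sketch.
        out.writeUTF(task.filePath);
        out.writeLong(task.start);
        out.writeLong(task.length);
        out.writeUTF(task.partitionSpec);
        out.writeUTF(task.residual);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  static List<TaskRecord> deserialize(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int version = in.readInt();
      if (version != VERSION) {
        throw new IllegalStateException("Unsupported serialization version: " + version);
      }
      int count = in.readInt();
      List<TaskRecord> tasks = new ArrayList<>(count);
      for (int i = 0; i < count; i++) {
        tasks.add(new TaskRecord(
            in.readUTF(), in.readLong(), in.readLong(), in.readUTF(), in.readUTF()));
      }
      return tasks;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The version prefix is the part that matters for checkpoints: when the layout changes, the reader can keep a branch for version 1 instead of failing the way default Java serialization would after an incompatible class change.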

One challenge is how to plug in custom field serializers for `ResidualEvaluator` and `PartitionSpec`.
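One possible direction, sketched under assumptions: a small registry that maps a field type to a string converter, so the task serializer does not need to know how each field is encoded. Everything here is hypothetical (`FieldSerializer`, `FieldSerializers`, and the `ToySpec` stand-in are made up for illustration and are not existing Iceberg classes); a real converter for `PartitionSpec` or `ResidualEvaluator` would replace the toy one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plug-in point: converts one field type to and from its string form.
interface FieldSerializer<T> {
  String serialize(T value);

  T deserialize(String text);
}

// Hypothetical registry that a ScanTasks-style utility could consult per field type.
final class FieldSerializers {
  private final Map<Class<?>, FieldSerializer<?>> byType = new HashMap<>();

  <T> FieldSerializers register(Class<T> type, FieldSerializer<T> serializer) {
    byType.put(type, serializer);
    return this;
  }

  @SuppressWarnings("unchecked")
  <T> FieldSerializer<T> forType(Class<T> type) {
    // The cast is safe because register() only stores matching type/serializer pairs.
    FieldSerializer<T> serializer = (FieldSerializer<T>) byType.get(type);
    if (serializer == null) {
      throw new IllegalArgumentException("No serializer registered for " + type.getName());
    }
    return serializer;
  }
}

// Toy stand-in for a field such as PartitionSpec; it exists only to exercise the registry.
final class ToySpec {
  final int specId;

  ToySpec(int specId) {
    this.specId = specId;
  }
}

// Toy converter: encodes the spec as its id. A real one would use a richer string form.
final class ToySpecSerializer implements FieldSerializer<ToySpec> {
  @Override
  public String serialize(ToySpec value) {
    return Integer.toString(value.specId);
  }

  @Override
  public ToySpec deserialize(String text) {
    return new ToySpec(Integer.parseInt(text));
  }
}
```

With this shape, a caller would register converters once, for example `new FieldSerializers().register(ToySpec.class, new ToySpecSerializer())`, and the task serializer would call `forType(...)` when it reaches a string-typed field. Whether a registry like this is better than hard-coding the two conversions is an open question.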

Overall, this seems like a large change. I am not sure whether there is a simpler way.





Add a serializer for FileScanTask #1698
