
Add a serializer for FileScanTask  #1698

@stevenzwu


For batch/bounded mode, Java Serializable works well because there is no concern about schema evolution. If we are going to support streaming reads with long-running jobs, we need to consider schema evolution for checkpoint state. Otherwise, a code change might break Java serialization and, with it, the ability to restore from a checkpoint.

Here are some high-level thoughts.

  • Move most of the schema defined in DataFile to the parent interface ContentFile. Extend the schema with the additional fields in DataFile and DeleteFile.
  • Add a schema to FileScanTask where the ResidualEvaluator and PartitionSpec fields are defined as string type.
  • The CombinedScanTask schema is straightforward. It should be just a collection of FileScanTask.
  • Add a ScanTasks util class in iceberg-core that handles the serialization and deserialization of FileScanTask and CombinedScanTask.
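
To make the idea above more concrete, here is a rough sketch of a versioned, explicit binary format that does not depend on Java serialization. Everything in it is hypothetical: `TaskRecord` is a simplified stand-in for FileScanTask (not the Iceberg interface), `ScanTaskSerde` is only a placeholder for what a ScanTasks util might do, and the field set is made up for illustration. The residual and partition spec are carried as plain strings, as suggested in the list.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for FileScanTask (not the Iceberg interface).
final class TaskRecord {
  final String filePath;
  final long start;
  final long length;
  final String residual;       // residual expression kept as a string
  final String partitionSpec;  // partition spec kept as a string, e.g. JSON

  TaskRecord(String filePath, long start, long length, String residual, String partitionSpec) {
    this.filePath = filePath;
    this.start = start;
    this.length = length;
    this.residual = residual;
    this.partitionSpec = partitionSpec;
  }
}

// Hypothetical helper: writes a format version first so that a later reader
// can detect an unknown layout instead of failing in an undefined way.
final class ScanTaskSerde {
  private static final int VERSION = 1;

  private ScanTaskSerde() {}

  // Serializes a list of tasks, which plays the role of a CombinedScanTask here.
  static byte[] serialize(List<TaskRecord> tasks) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(VERSION);
      out.writeInt(tasks.size());
      for (TaskRecord task : tasks) {
        // writeUTF is limited to 65535 encoded bytes per string; a real format
        // would likely use length-prefixed byte arrays instead.
        out.writeUTF(task.filePath);
        out.writeLong(task.start);
        out.writeLong(task.length);
        out.writeUTF(task.residual);
        out.writeUTF(task.partitionSpec);
      }
    }
    return bytes.toByteArray();
  }

  static List<TaskRecord> deserialize(byte[] data) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int version = in.readInt();
      if (version != VERSION) {
        throw new IOException("Unsupported format version: " + version);
      }
      int count = in.readInt();
      List<TaskRecord> tasks = new ArrayList<>(count);
      for (int i = 0; i < count; i++) {
        // Fields are read in exactly the order they were written.
        String filePath = in.readUTF();
        long start = in.readLong();
        long length = in.readLong();
        String residual = in.readUTF();
        String partitionSpec = in.readUTF();
        tasks.add(new TaskRecord(filePath, start, length, residual, partitionSpec));
      }
      return tasks;
    }
  }
}
```

With a leading version number, a future reader could branch on older versions when fields are added, which is the part Java serialization does not give us for checkpoint state.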

One challenge is how to plug in custom field serializers for ResidualEvaluator and PartitionSpec.
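
One possible shape for this, purely as a sketch: a small registry that maps a field type to a string-based serializer, so that types like ResidualEvaluator and PartitionSpec could supply their own conversion. `FieldSerializer` and `FieldSerializers` are hypothetical names invented here and do not exist in Iceberg.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical plug-in point: converts one field type to and from a string.
interface FieldSerializer<T> {
  String write(T value);

  T read(String text);

  // Convenience factory that builds a serializer from two functions.
  static <T> FieldSerializer<T> of(Function<T, String> writer, Function<String, T> reader) {
    return new FieldSerializer<T>() {
      @Override
      public String write(T value) {
        return writer.apply(value);
      }

      @Override
      public T read(String text) {
        return reader.apply(text);
      }
    };
  }
}

// Hypothetical registry keyed by the field's class.
final class FieldSerializers {
  private final Map<Class<?>, FieldSerializer<?>> byType = new HashMap<>();

  <T> void register(Class<T> type, FieldSerializer<T> serializer) {
    byType.put(type, serializer);
  }

  @SuppressWarnings("unchecked")
  <T> FieldSerializer<T> forType(Class<T> type) {
    // The cast is safe because register() only stores matching pairs.
    FieldSerializer<T> serializer = (FieldSerializer<T>) byType.get(type);
    if (serializer == null) {
      throw new IllegalArgumentException("No serializer registered for " + type.getName());
    }
    return serializer;
  }
}
```

The schema-based reader and writer would then look up the serializer by field type rather than hard-coding the conversion, which keeps the string encoding of each custom field replaceable.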

Overall, this seems like a large change. I am not sure if there is a simpler way.
