Closed
For batch/bounded mode, Java `Serializable` works well because schema evolution is not a concern. If we are going to support streaming reads with long-running jobs, we need to consider schema evolution for checkpoint state. Otherwise, a change in the code might break Java serialization and the ability to restore from a checkpoint.
Here are some high-level thoughts.
- Move most of the schema defined in `DataFile` to the parent interface `ContentFile`. Extend the schema with the additional fields in `DataFile` and `DeleteFile`.
- Add a schema to `FileScanTask` where the `ResidualEvaluator` and `PartitionSpec` fields will be defined as string type.
- The `CombinedScanTask` schema is straightforward. It should be just a collection of `FileScanTask`.
- Add a `ScanTasks` util class in iceberg-core that handles the serialization and deserialization of `FileScanTask` and `CombinedScanTask`.
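To make the idea concrete, here is a rough sketch of what a versioned, schema-based format could look like, as opposed to Java serialization. Everything in it is hypothetical: `ScanTaskStateSketch`, `FileTask`, and the field layout are stand-ins and not Iceberg classes, and it assumes the partition spec and residual are already available as strings (for example JSON and an expression string). A combined task is modeled as just a list of file tasks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: none of these types exist in Iceberg.
public class ScanTaskStateSketch {
  // Explicit format version written first, so a future reader can branch on it.
  public static final int VERSION = 1;

  // Simplified stand-in for FileScanTask. The partition spec and residual
  // are carried as strings, as suggested in the list above.
  public static final class FileTask {
    public final String path;
    public final long start;
    public final long length;
    public final String partitionSpec;
    public final String residual;

    public FileTask(String path, long start, long length, String partitionSpec, String residual) {
      this.path = path;
      this.start = start;
      this.length = length;
      this.partitionSpec = partitionSpec;
      this.residual = residual;
    }
  }

  // A "combined" task is modeled as a plain list of file tasks.
  public static byte[] serialize(List<FileTask> tasks) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(VERSION);
      out.writeInt(tasks.size());
      for (FileTask task : tasks) {
        // Note: writeUTF is limited to 65535 encoded bytes per string.
        out.writeUTF(task.path);
        out.writeLong(task.start);
        out.writeLong(task.length);
        out.writeUTF(task.partitionSpec);
        out.writeUTF(task.residual);
      }
    }
    return bytes.toByteArray();
  }

  public static List<FileTask> deserialize(byte[] data) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int version = in.readInt();
      if (version != VERSION) {
        // A real implementation would dispatch to a reader for the older version.
        throw new IOException("Unsupported state version: " + version);
      }
      int count = in.readInt();
      List<FileTask> tasks = new ArrayList<>(count);
      for (int i = 0; i < count; i++) {
        tasks.add(new FileTask(in.readUTF(), in.readLong(), in.readLong(), in.readUTF(), in.readUTF()));
      }
      return tasks;
    }
  }

  public static void main(String[] args) throws IOException {
    List<FileTask> tasks = Arrays.asList(
        new FileTask("s3://bucket/a.parquet", 0L, 128L, "{\"spec-id\":0}", "id > 5"));
    List<FileTask> restored = deserialize(serialize(tasks));
    System.out.println(restored.size() + " task(s) restored from " + restored.get(0).path);
  }
}
```

The point of the sketch is that the bytes depend only on the written field order and the version number, not on the Java class layout, so renaming or refactoring the classes would not by itself invalidate existing checkpoint state.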
One challenge is how to plug in custom field serializers for `ResidualEvaluator` and `PartitionSpec`.
Overall, this seems like a large change. I'm not sure if there is a simpler way.