Provide different read interfaces for reader #1047

Open
3 tasks
ZENOTME opened this issue Mar 7, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@ZENOTME
Contributor

ZENOTME commented Mar 7, 2025

Is your feature request related to a problem or challenge?

Currently, our Arrow reader accepts a FileScanTask and returns a RecordBatchStream to the user. After #630, the reader can process delete files and merge them with the data file, which makes it ready to use out of the box. However, some compute engines want to process delete files themselves so that they can reuse their existing join executor and spill data to storage. This requires reading the delete files directly rather than processing them internally.

Based on this, I suggest providing different read interfaces to satisfy different requirements:

  • read: process the data file and delete files of a FileScanTask internally
  • read_data: read only the data file of a FileScanTask
  • read_pos_delete: read the position delete files of a FileScanTask and return the result directly
  • read_eq_delete: read the equality delete files of a FileScanTask and return the result directly
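
As a rough sketch of how these four entry points differ, the snippet below models which files each one would open for a task. The `FileScanTask` struct, the `ReadMode` enum, and `files_to_open` are hypothetical stand-ins written for illustration; they are not the iceberg-rust API, and a real reader would return a record batch stream rather than file paths.

```rust
/// Hypothetical, simplified stand-in for a scan task (not the iceberg-rust type).
#[derive(Debug, Clone, PartialEq)]
pub struct FileScanTask {
    pub data_file: String,
    pub pos_delete_files: Vec<String>,
    pub eq_delete_files: Vec<String>,
}

/// One variant per proposed interface (names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReadMode {
    Merged,     // read: data file plus all delete files, merged internally
    DataOnly,   // read_data
    PosDeletes, // read_pos_delete
    EqDeletes,  // read_eq_delete
}

/// Returns the files a reader would have to open for `task` under `mode`.
pub fn files_to_open(task: &FileScanTask, mode: ReadMode) -> Vec<&str> {
    match mode {
        ReadMode::Merged => std::iter::once(task.data_file.as_str())
            .chain(task.pos_delete_files.iter().map(String::as_str))
            .chain(task.eq_delete_files.iter().map(String::as_str))
            .collect(),
        ReadMode::DataOnly => vec![task.data_file.as_str()],
        ReadMode::PosDeletes => task.pos_delete_files.iter().map(String::as_str).collect(),
        ReadMode::EqDeletes => task.eq_delete_files.iter().map(String::as_str).collect(),
    }
}
```

An engine that wants to run its own join would call the three narrower modes separately and combine the results in its own operators.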

Describe the solution you'd like

No response

Willingness to contribute

  • I can contribute to this feature independently
  • I would be willing to contribute to this feature with guidance from the Iceberg Rust community
  • I cannot contribute to this feature at this time
@ZENOTME ZENOTME added the enhancement New feature or request label Mar 7, 2025
@ZENOTME
Contributor Author

ZENOTME commented Mar 7, 2025

What do you think? cc @liurenjie1024 @Xuanwo @Fokko @sdd

@Xuanwo
Member

Xuanwo commented Mar 7, 2025

Hi, I believe that's related to #1036

@sdd
Contributor

sdd commented Mar 7, 2025

Seems like a reasonable idea to me. If my 5 open PRs for delete file read support get reviewed and merged then implementing what you need would be pretty trivial on top of them :-)

@liurenjie1024
Contributor

Thanks @ZENOTME for raising this. I think what's missing is a FileReader which accepts the following arguments:

  1. File path
  2. File range
  3. Expected schema
  4. Arrow batch size

This reader needs to convert files (Parquet, ORC, Avro) into Arrow record batches, handling things like missing columns and type promotion that are caused by schema evolution.
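
A minimal sketch of what such a request and its schema-evolution handling might look like follows. Everything here (`FileReadRequest`, `Field`, `Type`, `Value`, `project_row`) is a hypothetical illustration using plain Rust values instead of Arrow arrays. It assumes a missing column is read as null and that the only promotion is int to long.

```rust
use std::collections::HashMap;
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Type { Int, Long }

#[derive(Debug, Clone, PartialEq)]
pub enum Value { Null, Int(i32), Long(i64) }

/// A column of the expected schema, identified by field id.
#[derive(Debug, Clone)]
pub struct Field { pub id: i32, pub ty: Type }

/// The four arguments listed above, bundled into one hypothetical request.
pub struct FileReadRequest {
    pub path: String,
    pub range: Range<u64>,
    pub expected_schema: Vec<Field>,
    pub batch_size: usize,
}

/// Projects one row read from a file (keyed by field id) onto the expected schema.
pub fn project_row(row: &HashMap<i32, Value>, expected: &[Field]) -> Result<Vec<Value>, String> {
    expected
        .iter()
        .map(|f| match (row.get(&f.id), f.ty) {
            // Missing column (or stored null): assumed to read as null.
            (None, _) | (Some(Value::Null), _) => Ok(Value::Null),
            (Some(Value::Int(v)), Type::Int) => Ok(Value::Int(*v)),
            // Type promotion: the file stores int, the table now expects long.
            (Some(Value::Int(v)), Type::Long) => Ok(Value::Long(i64::from(*v))),
            (Some(Value::Long(v)), Type::Long) => Ok(Value::Long(*v)),
            // Anything else (e.g. long narrowed to int) is rejected.
            (Some(other), ty) => Err(format!("field {}: cannot read {:?} as {:?}", f.id, other, ty)),
        })
        .collect()
}
```

A real implementation would apply the same decisions per column on Arrow arrays rather than per row.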

With this API, it would be easy to implement the read_data, read_pos_delete, and read_eq_delete you mentioned. But I'm not sure we actually need to provide these APIs. I think FileReader + FileScanTask provides enough flexibility for compute engines. For example, an engine can choose to join the data file with position deletes and equality deletes in its logical plan, or it can implement its own file scan operator.
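
To illustrate the engine-side option, here is a small sketch of applying position deletes as an anti-join on row ordinals. `PosDelete` and `apply_pos_deletes` are hypothetical names. The sketch assumes rows arrive in file order starting at position 0 and that all deletes fit in memory, whereas a real engine could spill this join.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory row of a position delete file: (file_path, pos).
pub struct PosDelete {
    pub file_path: String,
    pub pos: u64,
}

/// Keeps the rows of `data_file` whose ordinal position is not deleted.
pub fn apply_pos_deletes<T: Clone>(data_file: &str, rows: &[T], deletes: &[PosDelete]) -> Vec<T> {
    // Build side of the anti-join: deleted positions for this data file only.
    let deleted: HashSet<u64> = deletes
        .iter()
        .filter(|d| d.file_path == data_file)
        .map(|d| d.pos)
        .collect();

    // Probe side: stream the rows and drop those with a matching position.
    rows.iter()
        .enumerate()
        .filter(|(i, _)| !deleted.contains(&(*i as u64)))
        .map(|(_, r)| r.clone())
        .collect()
}
```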

@ZENOTME
Contributor Author

ZENOTME commented Mar 14, 2025

In this design, does ArrowReader reuse FileReader?

  • If so, I think we may need to refactor some of ArrowReader's logic.
  • Otherwise, FileReader is an independent component, which may be more convenient to maintain.

And for delete files (position deletes, equality deletes), do we need to handle things like missing columns and type promotion? 🤔 It seems that for position deletes and equality deletes without a value, we can't fill in the value if it is missing. So here we may still need read_data, read_pos_delete, and read_eq_delete to handle each case separately.
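
One way to picture the different handling is a per-file-kind policy for columns that are absent from the file. The sketch below is only an assumption about how that could look: `FileKind` and `resolve_columns` are hypothetical. It treats a missing column as fillable for data files but as an error for delete files.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileKind {
    Data,
    PositionDelete,
    EqualityDelete,
}

/// Maps each expected field id to its column index in the file.
/// `None` means "absent, fill with null", which is allowed only for data files here.
pub fn resolve_columns(
    kind: FileKind,
    file_field_ids: &[i32],
    expected_field_ids: &[i32],
) -> Result<Vec<Option<usize>>, String> {
    expected_field_ids
        .iter()
        .map(|want| match file_field_ids.iter().position(|id| id == want) {
            Some(idx) => Ok(Some(idx)),
            // Data files: schema evolution may have added the column later.
            None if kind == FileKind::Data => Ok(None),
            // Delete files: a missing column cannot be filled in meaningfully.
            None => Err(format!("{:?} file is missing field id {}", kind, want)),
        })
        .collect()
}
```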

4 participants