
refactor: Store DataFile in FileScanTask instead #607

Closed
wants to merge 1 commit

Conversation

Xuanwo
Member

@Xuanwo Xuanwo commented Sep 6, 2024

This PR stores DataFile in FileScanTask directly, so users can more easily access information from DataFile such as the format, content type, size, and record count.

Signed-off-by: Xuanwo <github@xuanwo.io>
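In simplified form, the change looks roughly like the sketch below. These are illustrative, hypothetical struct definitions, not the actual iceberg-rust types, which carry many more fields:

```rust
// Hypothetical, simplified structs to illustrate the proposed change;
// the real iceberg-rust types carry many more fields.

#[derive(Debug, Clone, PartialEq)]
enum DataFileFormat {
    Parquet,
    Avro,
    Orc,
}

#[derive(Debug, Clone)]
struct DataFile {
    file_path: String,
    file_format: DataFileFormat,
    file_size_in_bytes: u64,
    record_count: u64,
}

// Before: FileScanTask stored only the data file path.
// After (this PR): it embeds the whole DataFile.
#[derive(Debug, Clone)]
struct FileScanTask {
    data_file: DataFile,
    start: u64,
    length: u64,
}

impl FileScanTask {
    // Callers can now reach format, size, and record count directly.
    fn record_count(&self) -> u64 {
        self.data_file.record_count
    }
}

fn main() {
    let task = FileScanTask {
        data_file: DataFile {
            file_path: "s3://bucket/data/00000.parquet".to_string(),
            file_format: DataFileFormat::Parquet,
            file_size_in_bytes: 4096,
            record_count: 100,
        },
        start: 0,
        length: 4096,
    };
    assert_eq!(task.record_count(), 100);
    println!("{}", task.data_file.file_path);
}
```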
@Xuanwo
Member Author

Xuanwo commented Sep 6, 2024

cc @liurenjie1024 and @ZENOTME, I'm unsure why FileScanTask requires Serialize and Deserialize. Is it safe to eliminate this requirement?


I have seen #377 (comment). I encountered a situation where I need to estimate the data format, data size, and data count, which requires most of the content of the data file. It doesn't seem ideal to add the same fields to FileScanTask and re-expose them. Any thoughts?

@Xuanwo
Member Author

Xuanwo commented Sep 6, 2024

Also inviting @sdd to join the discussion since you recently worked on the file scan.

@ZENOTME
Contributor

ZENOTME commented Sep 6, 2024

I'm unsure why FileScanTask requires Serialize and Deserialize. Is it safe to eliminate this requirement?

As in #377 (comment), I think FileScanTask needs to be serializable and deserializable.

I have seen #377 (comment). I encountered a situation where I need to estimate the data format, data size, and data count, which requires most of the content of the data file. It doesn't seem ideal to add the same fields to FileScanTask and re-expose them. Any thoughts?

The reason we only store the data file path is that the reader only needs the file path from the ManifestEntry. Much of the information in DataFile is statistics, like partition, lower_bounds, and upper_bounds, and this information is mainly used in the planning phase to prune files.
Can you elaborate on the information needed? If all we need is the data format, data size, and data count, I think it's fine to store these three extra fields in FileScanTask for now.
But if more information is needed, I think we can consider embedding DataFile in FileScanTask and making it serializable and deserializable.

@sdd
Contributor

sdd commented Sep 6, 2024

Thanks for the mention, @Xuanwo 👍🏼

I have a few points:

Firstly, FileScanTask definitely needs to implement Serialize and Deserialize. I'm already relying on this, and I don't think I'll be the only one. In cases such as mine, where iceberg-rust forms the core of an Apache Arrow Flight service, it is crucial to be able to distribute tasks from the planner/orchestrator (where I run scan.plan_files()) to multiple executors on other machines that can together stream results (using reader.read()) back to the API consumer.

Because FileScanTask is shared over the wire to other machines, I'm also quite sensitive to its size. In situations where the file plan could contain over a thousand tasks, I don't want the size of these tasks massively bloated. DataFile can be pretty big, and its size is unbounded, since it grows unavoidably with both the number of columns in the table and the number of row groups in the file.

Thirdly, even though we don't use it yet, DataFile contains key_metadata, which is certainly not something we'd want to stream back to consumers.

You could argue that if we did switch to having DataFile in FileScanTask, I could simply extract just the file path, projected field ids, predicate, and schema, and send them in a custom serializable struct instead. The problem with that is that reader::read() expects a stream of FileScanTasks, and I'd be unable to reconstruct them exactly on the executor nodes.

So, on the whole, I'm against having DataFile in the FileScanTask.

@sdd
Contributor

sdd commented Sep 6, 2024

One option would be to have an alternative method to scan.plan_files that returns a stream of more detailed objects that themselves embed FileScanTask. This new method could be aimed towards use cases where the consumer is in the same process or at least on the same machine, and so bandwidth and privacy are less of a concern.

If we have both FileScanTask and, for want of a better name, RichFileScanTask, implement a ScanTask trait, then perhaps reader.read() could take a stream of impl ScanTask so that it would work with both streams?
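Sketched with hypothetical, heavily simplified types (the trait name and methods are illustrative, and a plain iterator stands in for the async stream the real API uses), the trait idea could look like:

```rust
// Hypothetical sketch of the ScanTask trait idea; all names are
// illustrative and do not match the real iceberg-rust API.

trait ScanTask {
    fn data_file_path(&self) -> &str;
}

// The lean, serializable task sent over the wire.
struct FileScanTask {
    data_file_path: String,
}

// The "rich" variant embedding the full DataFile (simplified here).
struct DataFile {
    file_path: String,
    record_count: u64,
}

struct RichFileScanTask {
    data_file: DataFile,
}

impl ScanTask for FileScanTask {
    fn data_file_path(&self) -> &str {
        &self.data_file_path
    }
}

impl ScanTask for RichFileScanTask {
    fn data_file_path(&self) -> &str {
        &self.data_file.file_path
    }
}

// reader.read() could then accept any sequence of impl ScanTask,
// working with both the lean and the rich variants.
fn read(tasks: impl IntoIterator<Item = impl ScanTask>) -> Vec<String> {
    tasks
        .into_iter()
        .map(|t| t.data_file_path().to_string())
        .collect()
}

fn main() {
    let lean = vec![FileScanTask { data_file_path: "a.parquet".into() }];
    let rich = vec![RichFileScanTask {
        data_file: DataFile { file_path: "b.parquet".into(), record_count: 10 },
    }];
    assert_eq!(read(lean), vec!["a.parquet"]);
    assert_eq!(read(rich), vec!["b.parquet"]);
}
```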

@sdd
Contributor

sdd commented Sep 6, 2024

Or we can have a scan.plan_files_rich method that returns richer objects, with a scan.to_file_scan_task_stream method that takes a stream of fat objects and transforms it into a stream of FileScanTasks. reader.read() can then stay unchanged, and scan.plan_files would be refactored into a thin wrapper that calls scan.plan_files_rich and pipes it through scan.to_file_scan_task_stream. That way the changes would be non-breaking for existing users.
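A rough sketch of that layering, using hypothetical simplified types and plain Vecs in place of the async streams the real API would use (only the method names plan_files_rich and to_file_scan_task_stream come from the comment above; everything else is assumed):

```rust
// Sketch of the proposed non-breaking layering; all types are
// simplified stand-ins for the real iceberg-rust ones.

struct DataFile {
    file_path: String,
    file_size_in_bytes: u64,
}

struct FileScanTask {
    data_file_path: String,
}

// The "rich" object embedding the FileScanTask plus the full DataFile.
struct RichFileScanTask {
    task: FileScanTask,
    data_file: DataFile,
}

struct TableScan;

impl TableScan {
    // New method: returns the richer objects (stubbed here).
    fn plan_files_rich(&self) -> Vec<RichFileScanTask> {
        vec![RichFileScanTask {
            task: FileScanTask { data_file_path: "a.parquet".into() },
            data_file: DataFile {
                file_path: "a.parquet".into(),
                file_size_in_bytes: 1024,
            },
        }]
    }

    // Strips the rich objects down to plain FileScanTasks.
    fn to_file_scan_task_stream(
        &self,
        rich: Vec<RichFileScanTask>,
    ) -> Vec<FileScanTask> {
        rich.into_iter().map(|r| r.task).collect()
    }

    // The existing entry point becomes a thin, non-breaking wrapper.
    fn plan_files(&self) -> Vec<FileScanTask> {
        let rich = self.plan_files_rich();
        self.to_file_scan_task_stream(rich)
    }
}

fn main() {
    let scan = TableScan;
    let tasks = scan.plan_files();
    assert_eq!(tasks.len(), 1);
    assert_eq!(tasks[0].data_file_path, "a.parquet");
}
```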

@Xuanwo
Member Author

Xuanwo commented Sep 7, 2024

Thank you @ZENOTME and @sdd for the prompt response! I now understand the context and background. I realize that FileScanTask must be Serializable and Deserializable.

Given this context, we have two options:

  • Add fields such as data_file_content, data_file_path, data_file_format, data_file_size_in_bytes, and data_file_record_count to FileScanTask.
  • Propose a new API that returns something similar to FileScanTask (perhaps FileScanPlan?) with more information, but not serializable.

What do you think?
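Option 1, sketched with illustrative types (the field names come from the list above; the enums are simplified stand-ins, and the serde derives the real type would need are omitted to keep the sketch dependency-free):

```rust
// Option 1 sketch: copy the needed DataFile fields onto FileScanTask
// so it stays small and serializable. In the real crate this struct
// would also derive Serialize and Deserialize via serde.

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataContentType {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataFileFormat {
    Parquet,
    Avro,
    Orc,
}

#[derive(Debug, Clone)]
struct FileScanTask {
    data_file_content: DataContentType,
    data_file_path: String,
    data_file_format: DataFileFormat,
    data_file_size_in_bytes: u64,
    data_file_record_count: u64,
}

fn main() {
    let task = FileScanTask {
        data_file_content: DataContentType::Data,
        data_file_path: "s3://bucket/data/00000.parquet".to_string(),
        data_file_format: DataFileFormat::Parquet,
        data_file_size_in_bytes: 4096,
        data_file_record_count: 100,
    };
    // The estimates mentioned above (format, size, record count) are
    // now available without carrying the full DataFile.
    assert_eq!(task.data_file_record_count, 100);
    assert_eq!(task.data_file_format, DataFileFormat::Parquet);
}
```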

@ZENOTME
Contributor

ZENOTME commented Sep 7, 2024

Add fields such as data_file_content, data_file_path, data_file_format, data_file_size_in_bytes, and data_file_record_count to FileScanTask.

I prefer this option if it works for your scenario.

Propose a new API that can return something similar (perhaps FileScanPlan?) to FileScanTask with more information but not serializable.

I think we should hold off on proposing this until there is a real requirement.

@Xuanwo
Member Author

Xuanwo commented Sep 7, 2024

Great, let me follow option 1 first. We can discuss option 2 when needed.

@Xuanwo Xuanwo closed this Sep 7, 2024
@Xuanwo Xuanwo deleted the file_scan_task branch September 7, 2024 11:36
@amitgilad3

amitgilad3 commented Oct 4, 2024

Hi @sdd, @Xuanwo & @ZENOTME. For the past few days I have been looking into compaction, and I saw this discussion. I have been looking at the Java implementation of the FileScanTask class, which holds a DataFile object that is used during rewrite to retrieve information about the partition and file.specId. I know @sdd raised a legitimate reason not to have it in FileScanTask, but I am wondering what the preferred approach would be, so I can try to add the data needed to implement compaction:

  1. Add fields such as partition and specId to FileScanTask?
  2. Propose a new API that returns something similar to FileScanTask (perhaps FileScanPlan?) with more information, but not serializable?
  3. Any other solution?

I also know that many other features needed for this are still missing, but I would like to assist where possible.

@Xuanwo
Member Author

Xuanwo commented Oct 5, 2024

Hi @amitgilad3, would you like to start a new issue for this discussion? Comments on a closed PR can easily be missed.

Thanks in advance.
