refactor: Store DataFile in FileScanTask instead #607
Signed-off-by: Xuanwo <github@xuanwo.io>
cc @liurenjie1024 and @ZENOTME, I'm unsure why I have seen #377 (comment). I encountered a situation where I need to estimate the data format, data size, and data count, which requires most of the content of the data file. It doesn't seem ideal to add the same fields to FileScanTask and re-expose them. Any thoughts?
Also inviting @sdd to join the discussion since you recently worked on the file scan.
As in #377 (comment), I think FileScanTask needs to be serializable and deserializable.
The reason we only store the data file path is that the reader only needs the file path from the ManifestEntry. Much of the information in DataFile is statistics, such as partition, lower_bounds, and upper_bounds, and this information is mainly used in the planning phase to prune files.
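To illustrate the split described above, here is a minimal sketch of statistics being used only at planning time, with the resulting task carrying just the path. The types `DataFileStats` and `SlimScanTask`, the function `plan_eq`, and the single-column integer bounds are all hypothetical simplifications for this example, not the crate's actual definitions.

```rust
/// Hypothetical, simplified stand-in for the per-file metadata found in a
/// manifest entry. Field names are illustrative only.
#[derive(Debug, Clone)]
pub struct DataFileStats {
    pub file_path: String,
    pub record_count: u64,
    /// Bounds of a single i64 column, standing in for lower_bounds/upper_bounds.
    pub lower_bound: i64,
    pub upper_bound: i64,
}

/// Hypothetical slim task: only what a reader needs to open the file.
#[derive(Debug, Clone, PartialEq)]
pub struct SlimScanTask {
    pub data_file_path: String,
}

/// Planning phase: drop files whose [lower, upper] range cannot contain
/// `value`, then emit tasks that keep only the path.
pub fn plan_eq(files: &[DataFileStats], value: i64) -> Vec<SlimScanTask> {
    files
        .iter()
        .filter(|f| f.lower_bound <= value && value <= f.upper_bound)
        .map(|f| SlimScanTask {
            data_file_path: f.file_path.clone(),
        })
        .collect()
}
```

Once pruning is done, the bounds are no longer needed by the reader, which is why a path-only task stays small and cheap to serialize.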
Thanks for the mention, @Xuanwo 👍🏼 I have a couple of points: Firstly, Because of this sharing of Thirdly, even though we don't use it yet, You could argue that if we did switch to having So, on the whole, I'm against having
One option would be to have an alternative method to If we have both
Or we can have a
Thank you @ZENOTME and @sdd for the prompt response! I now understand the context and background. I realize that Given this context, we have two options:
What do you think?
I prefer this if this solution works for your scenario.
I think we should postpone this until there is a real requirement.
Great, let me follow option 1 first. We can discuss option 2 when needed.
Hi @sdd, @Xuanwo, and @ZENOTME. For the past few days I have been looking into compaction, and I saw this discussion. I have been reading the Java implementation of the FileScanTask class, which holds a DataFile object that is used during rewrites to retrieve information about the partition and file.specId. I know @sdd raised a legitimate reason not to have it in FileScanTask, but I am wondering what the preferred approach would be, so I can try to add the data needed to implement compaction:
I also know that many other features needed to implement this are missing, but I would like to assist where possible.
Hi @amitgilad3, would you like to start a new issue for this discussion? Comments on a closed PR can easily be missed. Thanks in advance.
This PR will store DataFile in FileScanTask directly so users can more easily use information from DataFile, such as format, type, size, and record_counts.
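As a rough sketch of what this proposal could look like from a user's point of view, the snippet below embeds a data file in the task and uses it to estimate the row count and byte size of a scan. The structs here are simplified, hypothetical stand-ins rather than the crate's real definitions, and wrapping the data file in an `Arc` is an assumption made for this example so that tasks can share it cheaply; `estimate` is likewise a hypothetical helper.

```rust
use std::sync::Arc;

/// Hypothetical, simplified file format enum for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FileFormat {
    Parquet,
    Avro,
    Orc,
}

/// Hypothetical, simplified data file; only the fields this sketch needs.
#[derive(Debug)]
pub struct DataFile {
    pub file_path: String,
    pub file_format: FileFormat,
    pub record_count: u64,
    pub file_size_in_bytes: u64,
}

/// Hypothetical task that holds the whole data file instead of copying
/// individual fields. The Arc is an assumption of this sketch.
#[derive(Debug, Clone)]
pub struct FileScanTask {
    data_file: Arc<DataFile>,
}

impl FileScanTask {
    pub fn new(data_file: Arc<DataFile>) -> Self {
        Self { data_file }
    }

    /// Expose the embedded data file so callers can read any of its fields.
    pub fn data_file(&self) -> &DataFile {
        &self.data_file
    }
}

/// Sum record counts and file sizes across tasks: (total_rows, total_bytes).
pub fn estimate(tasks: &[FileScanTask]) -> (u64, u64) {
    tasks.iter().fold((0, 0), |(rows, bytes), task| {
        let file = task.data_file();
        (rows + file.record_count, bytes + file.file_size_in_bytes)
    })
}
```

The trade-off raised in the discussion still applies: a task shaped like this carries the file's statistics with it, so it is larger to serialize than a path-only task.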