Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove get_scan_files and ExecutionPlan::file_scan_config (#7357) #7487

Merged
merged 2 commits into from
Sep 10, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Sep 6, 2023

Which issue does this PR close?

Closes #7357

Rationale for this change

This was added by @yahoNanJing in #5572. It was then evolved in #7175 to try to avoid strongly coupling to AvroExec. However, this is causing issues trying to split apart the crates (#7357) and it is unclear how best to adapt the design #7425 (comment).

Given this API does not appear to be being used, I wonder if we can do the simplest possible thing, and just remove it

What changes are included in this PR?

Removes the API

Are these changes tested?

Are there any user-facing changes?

@tustvold tustvold added the api change Changes the API exposed to users of the crate label Sep 6, 2023
@github-actions github-actions bot added the core Core DataFusion crate label Sep 6, 2023
Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Let's get @not-my-profile's view on that, i.e. if he has a good reason to keep this API.

@not-my-profile
Copy link
Contributor

get_scan_files is public API. So just because it's not used internally doesn't mean that it's not used at all.

The question is if @yahoNanJing's motivation still applies:

Currently the TreeNodeRewriter is as a visitor to transform a node to another. However, sometimes we don't need to do the transformation and what we want is only to collect some info from the node. To achieve this, it's better to introduce another visitor for collecting info and keep the node unchanged.

As I said in #7425 (comment) I'm not familiar enough with datafusion to determine that.

it is unclear how best to adapt the design

I think that #7485 is a decent solution if we decide to keep the function.

alamb
alamb previously approved these changes Sep 6, 2023
@alamb alamb dismissed their stale review September 6, 2023 16:26

It is used in ballista

@alamb
Copy link
Contributor

alamb commented Sep 6, 2023

It appears that get_scan_files is used in ballista: https://github.com/search?q=repo%3Aapache%2Farrow-ballista%20get_scan_files&type=code

@tustvold
Copy link
Contributor Author

tustvold commented Sep 6, 2023

It appears to only be used as an argument for https://github.com/apache/arrow-ballista/blob/948d0777f972144cc242d0398fd61fadf34cec73/ballista/scheduler/src/cluster/mod.rs#L679 which seems like it could definitely be achieved in a different way

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked more into the ballista code and I think the files are used for more than just their file count. Specifically the URL is used here:

https://github.com/apache/arrow-ballista/blob/948d0777f972144cc242d0398fd61fadf34cec73/ballista/scheduler/src/cluster/mod.rs#L625C1-L626C1

I am thinking, however, that we could effectively port get_files_for_scan into Ballista... Let me give that a try

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this API into Ballista apache/datafusion-ballista#877 so I think it is ok to remove from DataFusion. @yahoNanJing please advise if there are reasons to avoid this

@tustvold tustvold merged commit 63ccd4a into apache:main Sep 10, 2023
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Detach ExecutionPlan and DataSource
4 participants