Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

split datafusion-physical-plan sub-module #1754

Closed
Tracked by #4181 ...
jimexist opened this issue Feb 5, 2022 · 5 comments · Fixed by #7432
Closed
Tracked by #4181 ...

split datafusion-physical-plan sub-module #1754

jimexist opened this issue Feb 5, 2022 · 5 comments · Fixed by #7432
Assignees
Labels
enhancement New feature or request

Comments

@jimexist
Copy link
Member

jimexist commented Feb 5, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@alamb
Copy link
Contributor

alamb commented Mar 2, 2023

My eventual plan is to have a dependency hierarchy like:

core --> (physical_plan) --> (datasource) --> (execution)    

I am trying to extract execution first, then I will extract datasource, and then physical plan

@alamb
Copy link
Contributor

alamb commented Mar 2, 2023

End state:

  • SessionContext and SessionState stay in datafusion-core
  • RuntimeEnv and TaskContext are moved into datafusion-execution

Specific dependencies I need to do this:

  • move object registries and table_factories off RuntimeEnv and into SessionState directly (this is so they can be populated from the SessionConfig constructor instead) and RuntimeEnv can move into datafusion-execution
  • move TaskContext into datafusion_execution
  • remove SessionState reference from file_format (replace with TaskContext)
  • Move ExecutionProps on to TaskContext (needed for planning)
  • remove SessionState reference from TableProvider (replace with TaskContext)
  • Move datasource into its own crate
  • move physical plan into its own crate

@alamb
Copy link
Contributor

alamb commented Mar 27, 2023

Update 2023-03-27:

I have moved TaskContext into datafusion-execution 🎉 and updated the list in #1754 (comment)

To move datasource into its own crate I need to do the following two non trivial API changes (added to the list above)

  • remove SessionState reference from file_format (replace with TaskContext)
  • remove SessionState reference from TableProvider (replace with TaskContext)

Here are the file_format uses (I don't think this will be major)

    async fn infer_stats(
        &self,
        _state: &SessionState,
        _store: &Arc<dyn ObjectStore>,
        _table_schema: SchemaRef,
        _object: &ObjectMeta,
    ) -> Result<Statistics> {
        Ok(Statistics::default())
    }

The TableProvider will be major I think:

    /// Create an ExecutionPlan that will scan the table.
    /// The table provider will be usually responsible of grouping
    /// the source data into partitions that can be efficiently
    /// parallelized or distributed.
    async fn scan(
        &self,
        state: &SessionState,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        // limit can be used to reduce the amount scanned
        // from the datasource as a performance optimization.
        // If set, it contains the amount of rows needed by the `LogicalPlan`,
        // The datasource should return *at least* this number of rows if available.
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>>;

@alamb
Copy link
Contributor

alamb commented Jun 5, 2023

I am actively working on this -- after a few tries my current plan is to to end up with this dependency diagram (aka datasource and listing table need to depend on physical_plan I think)

More PR to come

(core) --> (datasource) --> (physical_plan)
    \---------(physical_optimizer) --/

@alamb
Copy link
Contributor

alamb commented Jun 6, 2023

I tried extracting physical_plan again after #6516 and here is the list of things I needed to do (which can be done as smaller PRs):

Also I need to figure out how to remove all uses of SessionContext in tests (maybe by moving tests to sql integration), but that in theory could come as a follow on PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants