Break datafusion crate into smaller crates #1750

Closed
23 of 34 tasks
jimexist opened this issue Feb 5, 2022 · 15 comments
Labels
enhancement New feature or request

Comments

@jimexist
Member

jimexist commented Feb 5, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

A new feature: break the datafusion crate into separate, smaller crates.

This helps with code management and with reasoning about dependencies.

Describe the solution you'd like



@jimexist jimexist added the enhancement New feature or request label Feb 5, 2022
@jimexist
Member Author

jimexist commented Feb 5, 2022

cc @alamb @houqp @Dandandan what do you think?

@alamb
Contributor

alamb commented Feb 5, 2022

I like this idea @jimexist 👍 There is some related commentary / information here: #348 as well

Some other crates that might be useful to consider:

  • datafusion_core (DataFusionError, DFSchema, etc.)
  • datafusion_datasource (the built-in parquet, avro, csv and json readers and supporting logic)

@andygrove
Member

This will be helpful for use cases where users are looking to use DataFusion in a similar fashion to Calcite, for query parsing and planning, but not for execution. I like this idea.

@houqp
Member

houqp commented Feb 6, 2022

Thank you @jimexist for taking this on! I think this is the right path forward.

cc @jorgecarleitao and @yjshen since they have proposed similar ideas before.

@yahoNanJing
Contributor

yahoNanJing commented Feb 7, 2022

Hi @jimexist, is it possible to make a dedicated crate for the data source, so that it is easy to integrate different kinds of remote object stores?

Suppose I want to introduce HDFS as one of the remote object stores. To be less intrusive, it would be better to make an independent datafusion-objectstore-hdfs crate that depends on the datasource crate of datafusion. Then it would be much easier for other crates to decide whether or not to depend on the datafusion-hdfs crate, without cyclic dependencies.

@jimexist
Member Author

jimexist commented Feb 7, 2022

Hi @jimexist, is it possible to make a dedicated crate for the data source, so that it is easy to integrate different kinds of remote object stores?

Suppose I want to introduce HDFS as one of the remote object stores. To be less intrusive, it would be better to make an independent datafusion-objectstore-hdfs crate that depends on the datasource crate of datafusion. Then it would be much easier for other crates to decide whether or not to depend on the datafusion-hdfs crate, without cyclic dependencies.

Yes, it's a good idea, and I believe it's included in the list already.

@yahoNanJing
Contributor

Thanks @jimexist

@jimexist
Member Author

jimexist commented Feb 8, 2022

My current plan is to finish the items for:

  • datafusion-common
  • datafusion-expr

before release 7, and to finish the rest after the release, to cap the amount of change in a single release.

@Igosuki
Contributor

Igosuki commented Feb 9, 2022

@jimexist could this lead to a smaller ballista client crate as well? Potentially, this could greatly speed up compilation of programs that just want to be datafusion clients and not run the entire stack.

@jimexist
Member Author

jimexist commented Feb 9, 2022

I'm not sure as of now. Probably, if we split up logical/physical planning further, but that won't happen soon.

@alamb
Contributor

alamb commented Feb 9, 2022

Related comment: #1762 (comment)

@jimexist
Member Author

I wanted to continue iterating on this split, but I didn't have time to keep up. A lot has happened in the last few months. I wonder whether this is still relevant or whether it should be closed now. @alamb and @andygrove, any suggestions?

@alamb
Contributor

alamb commented Jul 20, 2022

@jimexist welcome back!

I would say that the main "break datafusion into smaller crates" work has been completed. There are still some items in the description of this ticket that might not be done. I would personally recommend filing a new issue with any items that you think would still be good to work on, and then closing this one.

@alamb
Contributor

alamb commented Oct 24, 2022

Closing this ticket; we can track further splitting in follow-on issues. Again 👏

@alamb
Contributor

alamb commented Nov 11, 2022

I am thinking of trying to push us to the next level here: #4181


6 participants