Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Is Ballista a standalone system or framework #1916

Closed
alamb opened this issue Mar 3, 2022 · 21 comments
Closed

Discussion: Is Ballista a standalone system or framework #1916

alamb opened this issue Mar 3, 2022 · 21 comments

Comments

@alamb
Copy link
Contributor

alamb commented Mar 3, 2022

This ticket is to try and pull some of the great discussion from #1881 into a ticket to try and get consensus on a topic that is larger than user defined functions (the topic of #1881 )

The main question is about how Ballista is/should be used. Is it a framework like DataFusion or a standalone system (like Spark)? The answer to this question would help drive technical choices such as #1881

It would also be great to update the Ballista docs when we have consensus to this question

Originally posted by @gaojun2048 in #1881 (comment)

I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the compute core wouldn't it make sense to have udf support in DF?

I don’t know if my understanding is wrong. I always think that DF is just a computing library, which cannot be directly deployed in production. Those who use DF will use DF as a dependency of the project and then develop their computing engine based on DF. For example, Ballista is a distributed computing engine developed based on DF. Ballista is a mature computing engine just like Presto/spark. People who use Ballista only need to download and deploy Ballista to their machines to start the ballista service. They rarely care about how Ballista is implemented, so a A udf plugin that supports dynamic loading allows these people to define their own udf functions without modifying Ballista's source code.

@realno says #1881 (comment)

Currently DF does more than just a computing library, it has full support for SQL parsing, dataframe API and planners - it is a complete single node compute engine. Ballista still uses most of the fundamental implementation from DF. The current effort splitting it to separate creates is a step towards making it just a library, I am curious what the direction the community is thinking here @alamb . If this is where we are going I think we can think about merging the SQL, DataFrame, ExecutionContext and Planner logic between DF and Ballista.

@alamb
Copy link
Contributor Author

alamb commented Mar 3, 2022

For what it is worth, my mental model is similar to @gaojun2048 -- that DataFusion is mostly designed to be a library to build other systems -- for example the datafusion-cli and datafusion python bindings are in my mind "systems" built on datafusion. I realize the line is a little blurry here as pointed out by @realno and that DataFusion itself has some things I would normally expect to live outside such a library (such as CREATE TABLE support)

Ballista I have always thought of as planning to be a standalone system (that people would deploy without having to build it from source).

I am curious to hear what others think

cc @edrevo @andygrove @thinkharderdev @matthewmturner @liukun4515 @Ted-Jiang @yjshen

@thinkharderdev
Copy link
Contributor

I agree that the goal should be for Ballista to be a standalone system (or at least something that CAN be deployed as a standalone system), but one of the unique and valuable aspects of DataFusion is the extensibility. It is a really important differentiator to other solutions in this space and I think we should ensure that the extensibility of DataFusion extends to Ballista.

@andygrove
Copy link
Member

I also agree that Ballista should be a standalone system, with a client API that can be used as a library.

@Igosuki
Copy link
Contributor

Igosuki commented Mar 4, 2022

Extensibility is a major selling point, the only thing I'd be worried about is having stable interfaces.
As for Ballista my only problem right now is that I have to embark the entirety of datafusion as a library to be able to send even the simples sql query using the client.

@yahoNanJing
Copy link
Contributor

Actually I think Ballista should act as a distributed computing framework like Spark Core. Like Spark is based on the RDD, Ballista is based on the ExecutionPlan for the DAG.

Based on this framework, Ballista should also include several kinds of deployments. Currently, only standalone mode is provided. In the future, it's possible to introduce more resource managers, like Yarn, Mesos, etc.

For the SQL part, I think it should be an independent part. The core of Ballista should not depend on the SQL.
Picture1

@yjshen
Copy link
Member

yjshen commented Mar 4, 2022

I agree ballista should be a standalone system like Spark SQL or Presto.

I think Ballista should act as a distributed computing framework like Spark Core. Like Spark is based on the RDD, Ballista is based on the ExecutionPlan for the DAG.

So it's not a Spark Core on RDD but Spark Core on SQL physical plan, right?

@EricJoy2048
Copy link
Member

I also agree that Ballista should be a standalone system, with a client API that can be used as a library.

I agree with you.

@realno
Copy link
Contributor

realno commented Mar 4, 2022

+1 Ballista should be a standalone system.

I think there is another side of this question- do we consider DataFusion as a pure library? Would this change how things are organized?
If so, what is the API to interface with other clients, logical plan? physical plan? And where would SQL parsing, optimizer and planner go?

First thought came to mind is to have physical plan be the interface, different clients have the option to extend SQL parser, optimizer and planner as needed. Most can reuse the existing implementation, an example for Ballista is to extend the planner to support udf.

@matthewmturner
Copy link
Contributor

i think i agree with @realno. for example the datafusion python bindings are effectively reusing the rust implementation which thus far for my (limited) use cases has worked well. the python bindings also have udfs but i havent had chance to look into how theyre implemented and how that may contrast to ballistas needs.

@yjshen
Copy link
Member

yjshen commented Mar 4, 2022

I think there is another side of this question- do we consider DataFusion as a pure library? Would this change how things are organized? If so, what is the API to interface with other clients, logical plan? physical plan? And where would SQL parsing, optimizer and planner go?

Yes, I think so. And actually, we are already heading in that direction. The first step to make each module substitutable is to make each part, SQL parsing, optimizing, and executing in its own crate. @jimexist is leading the effort in #1750. Once this separating of functionalities is done, the next step would be introducing interfaces between adjacent functionalities, depending on how would we substitute modules.

One of the possibilities I'm sure is, once we have #1887 or even https://github.com/andygrove/substrait-rs, we can link Apache Calcite to DataFusion's physical plan execution layer.

@realno
Copy link
Contributor

realno commented Mar 4, 2022

Yes, I think so. And actually, we are already heading in that direction. The first step to make each module substitutable is to make each part, SQL parsing, optimizing, and executing in its own crate. @jimexist is leading the effort in #1750. Once this separating of functionalities is done, the next step would be introducing interfaces between adjacent functionalities, depending on how would we substitute modules.

I think this is a good direction.

@liukun4515
Copy link
Contributor

One of the possibilities I'm sure is, once we have #1887 or even https://github.com/andygrove/substrait-rs, we can link Apache Calcite to DataFusion's physical plan execution layer.

good point.

@liukun4515
Copy link
Contributor

I also agree that Ballista should be a standalone system, with a client API that can be used as a library.

agree with that and same opinion in this #1881 (comment)

@Ted-Jiang
Copy link
Member

standalone system +1
Look forward to it going into production one day.

@andygrove
Copy link
Member

I am now back working on Ballista after a bit of a break and can now contribute a bit more to this discussion.

Although I see Ballista as a standalone system that users will likely install as Docker images or from a tarball, I think it is still important to publish the crates to crates.io so that cargo install is another installation option.

It is also important that we publish the ballista crate, which provides the client API, so that we can call it from other projects such as from Ballista Python bindings (which I am starting work on now, based on the work in datafusion-python).

I noticed that we cannot publish the Ballista crates from the recent 7.0.0 release tarball and have filed #1980 to fix that for the next release.

@rdettai
Copy link
Contributor

rdettai commented Mar 11, 2022

I understand that Ballista is currently heading toward being standalone system, but I am wondering that is what the ecosystem needs.

I feel that being a plugable library is a big part of Datafusion's success. But the projects that are embedding Datafusion today as a single node compute engine, are they not going to need to be distributed tomorrow? If Ballista is really designed as a standalone system, those growing projects might use it as an example on how to distribute the Datafusion query plan, but they might not be able to reuse much code.

Also, as a standalone system, Ballista will compete with the heavy weights in the category (Spark, Presto..). That is an interesting but very ambitious goal 😄

@alamb
Copy link
Contributor Author

alamb commented Mar 12, 2022

Also, as a standalone system, Ballista will compete with the heavy weights in the category (Spark, Presto..). That is an interesting but very ambitious goal 😄

DataFusion is not JVM based, which could be an interesting differentiator.

I think making a generic embedded distributing framework will be challenging as there are so many differing dimensions to consider (catalog structure, local caching, etc) that may be different

Comparatively I think a singe node column oriented analytic query engine is a fairly well understood pattern (though I do think the DataFusion implementation is very good :bowtie: )

One thing I personally hope is that Ballista drives features into DataFusion so that making a new distributed engine using DataFusion becomes easier over time.

Some examples of this technical flow I think are:

  1. The extraction of datafusion-proto struct serialization by @carols10cents
  2. The object store abstraction from @yjshen
  3. The listing table provider from @rdettai
  4. Making planning async
  5. The work that @mingmwang is doing to enable intra-processc concurrency

@rdettai
Copy link
Contributor

rdettai commented Mar 13, 2022

I think making a generic embedded distributing framework will be challenging as there are so many differing dimensions to consider (catalog structure, local caching, etc) that may be different

Agreed! I was more thinking about designing the engine in modules that can be re-used more easily by other distributed system. For instance using [https://github.com/substrait-io/substrait] to serialize the tasks for the workers instead of the current custom protos could help re-use ballista workers with minimal effort from other systems.

@realno
Copy link
Contributor

realno commented Mar 13, 2022

Also, as a standalone system, Ballista will compete with the heavy weights in the category (Spark, Presto..). That is an interesting but very ambitious goal 😄

I feel some opportunities/differentiators for Ballista are the following:

  1. non-JVM - this brings a lot of benefit such as lower footprint, memory efficiency, and no GC cost (Rust specific)
  2. A chance for more modern design principles - for example Spark was originally architected to best deployed to bare metal, it is hard to make some changes to be more cloud friendly
  3. Utilize modern resource management and orchestration technologies - reusing mature tools like k8s will simplify Ballista's implementation (it probably doesn't need a very complex resource management system anymore) and integrate easily with modern systems (cloud native and simpler scaling model and multi-tenancy)
  4. Using Arrow as the backbone opens doors for more advanced use case such as ML - it may be efficiently integrated with Pandas or Tensorflow through Arrow.

We heavily use systems like Spark for Analytics and ML, the above points are pain points that worth consider switching. I feel the pivot point for having a new reference distributed compute platform is getting closer. :)

@Igosuki
Copy link
Contributor

Igosuki commented Mar 13, 2022 via email

@alamb
Copy link
Contributor Author

alamb commented Apr 25, 2022

I think the discussion on this issue is closed (Ballista is a system seems to be the prevailing consensus). Closing ticket

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests