Discussion: Is Ballista a standalone system or framework #1916

alamb · 2022-03-03T19:17:05Z

This ticket is to try and pull some of the great discussion from #1881 into a ticket to try and get consensus on a topic that is larger than user defined functions (the topic of #1881 )

The main question is about how Ballista is/should be used. Is it a framework like DataFusion or a standalone system (like Spark)? The answer to this question would help drive technical choices such as #1881

It would also be great to update the Ballista docs when we have consensus to this question

Originally posted by @gaojun2048 in #1881 (comment)

I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the compute core wouldn't it make sense to have udf support in DF?

I don’t know if my understanding is wrong. I always think that DF is just a computing library, which cannot be directly deployed in production. Those who use DF will use DF as a dependency of the project and then develop their computing engine based on DF. For example, Ballista is a distributed computing engine developed based on DF. Ballista is a mature computing engine just like Presto/spark. People who use Ballista only need to download and deploy Ballista to their machines to start the ballista service. They rarely care about how Ballista is implemented, so a A udf plugin that supports dynamic loading allows these people to define their own udf functions without modifying Ballista's source code.

@realno says #1881 (comment)

Currently DF does more than just a computing library, it has full support for SQL parsing, dataframe API and planners - it is a complete single node compute engine. Ballista still uses most of the fundamental implementation from DF. The current effort splitting it to separate creates is a step towards making it just a library, I am curious what the direction the community is thinking here @alamb . If this is where we are going I think we can think about merging the SQL, DataFrame, ExecutionContext and Planner logic between DF and Ballista.

alamb · 2022-03-03T19:29:49Z

For what it is worth, my mental model is similar to @gaojun2048 -- that DataFusion is mostly designed to be a library to build other systems -- for example the datafusion-cli and datafusion python bindings are in my mind "systems" built on datafusion. I realize the line is a little blurry here as pointed out by @realno and that DataFusion itself has some things I would normally expect to live outside such a library (such as CREATE TABLE support)

Ballista I have always thought of as planning to be a standalone system (that people would deploy without having to build it from source).

I am curious to hear what others think

cc @edrevo @andygrove @thinkharderdev @matthewmturner @liukun4515 @Ted-Jiang @yjshen

thinkharderdev · 2022-03-03T19:50:04Z

I agree that the goal should be for Ballista to be a standalone system (or at least something that CAN be deployed as a standalone system), but one of the unique and valuable aspects of DataFusion is the extensibility. It is a really important differentiator to other solutions in this space and I think we should ensure that the extensibility of DataFusion extends to Ballista.

andygrove · 2022-03-03T20:56:38Z

I also agree that Ballista should be a standalone system, with a client API that can be used as a library.

Igosuki · 2022-03-04T00:42:17Z

Extensibility is a major selling point, the only thing I'd be worried about is having stable interfaces.
As for Ballista my only problem right now is that I have to embark the entirety of datafusion as a library to be able to send even the simples sql query using the client.

yahoNanJing · 2022-03-04T02:39:51Z

Actually I think Ballista should act as a distributed computing framework like Spark Core. Like Spark is based on the RDD, Ballista is based on the ExecutionPlan for the DAG.

Based on this framework, Ballista should also include several kinds of deployments. Currently, only standalone mode is provided. In the future, it's possible to introduce more resource managers, like Yarn, Mesos, etc.

For the SQL part, I think it should be an independent part. The core of Ballista should not depend on the SQL.

yjshen · 2022-03-04T04:26:41Z

I agree ballista should be a standalone system like Spark SQL or Presto.

I think Ballista should act as a distributed computing framework like Spark Core. Like Spark is based on the RDD, Ballista is based on the ExecutionPlan for the DAG.

So it's not a Spark Core on RDD but Spark Core on SQL physical plan, right?

EricJoy2048 · 2022-03-04T05:43:15Z

I also agree that Ballista should be a standalone system, with a client API that can be used as a library.

I agree with you.

realno · 2022-03-04T06:22:17Z

+1 Ballista should be a standalone system.

I think there is another side of this question- do we consider DataFusion as a pure library? Would this change how things are organized?
If so, what is the API to interface with other clients, logical plan? physical plan? And where would SQL parsing, optimizer and planner go?

First thought came to mind is to have physical plan be the interface, different clients have the option to extend SQL parser, optimizer and planner as needed. Most can reuse the existing implementation, an example for Ballista is to extend the planner to support udf.

matthewmturner · 2022-03-04T06:56:22Z

i think i agree with @realno. for example the datafusion python bindings are effectively reusing the rust implementation which thus far for my (limited) use cases has worked well. the python bindings also have udfs but i havent had chance to look into how theyre implemented and how that may contrast to ballistas needs.

yjshen · 2022-03-04T06:58:56Z

I think there is another side of this question- do we consider DataFusion as a pure library? Would this change how things are organized? If so, what is the API to interface with other clients, logical plan? physical plan? And where would SQL parsing, optimizer and planner go?

Yes, I think so. And actually, we are already heading in that direction. The first step to make each module substitutable is to make each part, SQL parsing, optimizing, and executing in its own crate. @jimexist is leading the effort in #1750. Once this separating of functionalities is done, the next step would be introducing interfaces between adjacent functionalities, depending on how would we substitute modules.

One of the possibilities I'm sure is, once we have #1887 or even https://github.com/andygrove/substrait-rs, we can link Apache Calcite to DataFusion's physical plan execution layer.

realno · 2022-03-04T07:26:59Z

Yes, I think so. And actually, we are already heading in that direction. The first step to make each module substitutable is to make each part, SQL parsing, optimizing, and executing in its own crate. @jimexist is leading the effort in #1750. Once this separating of functionalities is done, the next step would be introducing interfaces between adjacent functionalities, depending on how would we substitute modules.

I think this is a good direction.

liukun4515 · 2022-03-04T07:45:31Z

One of the possibilities I'm sure is, once we have #1887 or even https://github.com/andygrove/substrait-rs, we can link Apache Calcite to DataFusion's physical plan execution layer.

good point.

liukun4515 · 2022-03-04T07:48:14Z

I also agree that Ballista should be a standalone system, with a client API that can be used as a library.

agree with that and same opinion in this #1881 (comment)

Ted-Jiang · 2022-03-06T14:00:23Z

standalone system +1
Look forward to it going into production one day.

andygrove · 2022-03-10T14:10:00Z

I am now back working on Ballista after a bit of a break and can now contribute a bit more to this discussion.

Although I see Ballista as a standalone system that users will likely install as Docker images or from a tarball, I think it is still important to publish the crates to crates.io so that cargo install is another installation option.

It is also important that we publish the ballista crate, which provides the client API, so that we can call it from other projects such as from Ballista Python bindings (which I am starting work on now, based on the work in datafusion-python).

I noticed that we cannot publish the Ballista crates from the recent 7.0.0 release tarball and have filed #1980 to fix that for the next release.

rdettai · 2022-03-11T08:15:53Z

I understand that Ballista is currently heading toward being standalone system, but I am wondering that is what the ecosystem needs.

I feel that being a plugable library is a big part of Datafusion's success. But the projects that are embedding Datafusion today as a single node compute engine, are they not going to need to be distributed tomorrow? If Ballista is really designed as a standalone system, those growing projects might use it as an example on how to distribute the Datafusion query plan, but they might not be able to reuse much code.

Also, as a standalone system, Ballista will compete with the heavy weights in the category (Spark, Presto..). That is an interesting but very ambitious goal 😄

alamb · 2022-03-12T18:58:32Z

Also, as a standalone system, Ballista will compete with the heavy weights in the category (Spark, Presto..). That is an interesting but very ambitious goal 😄

DataFusion is not JVM based, which could be an interesting differentiator.

I think making a generic embedded distributing framework will be challenging as there are so many differing dimensions to consider (catalog structure, local caching, etc) that may be different

Comparatively I think a singe node column oriented analytic query engine is a fairly well understood pattern (though I do think the DataFusion implementation is very good )

One thing I personally hope is that Ballista drives features into DataFusion so that making a new distributed engine using DataFusion becomes easier over time.

Some examples of this technical flow I think are:

The extraction of datafusion-proto struct serialization by @carols10cents
The object store abstraction from @yjshen
The listing table provider from @rdettai
Making planning async
The work that @mingmwang is doing to enable intra-processc concurrency

rdettai · 2022-03-13T19:04:58Z

I think making a generic embedded distributing framework will be challenging as there are so many differing dimensions to consider (catalog structure, local caching, etc) that may be different

Agreed! I was more thinking about designing the engine in modules that can be re-used more easily by other distributed system. For instance using [https://github.com/substrait-io/substrait] to serialize the tasks for the workers instead of the current custom protos could help re-use ballista workers with minimal effort from other systems.

realno · 2022-03-13T21:50:55Z

Also, as a standalone system, Ballista will compete with the heavy weights in the category (Spark, Presto..). That is an interesting but very ambitious goal 😄

I feel some opportunities/differentiators for Ballista are the following:

non-JVM - this brings a lot of benefit such as lower footprint, memory efficiency, and no GC cost (Rust specific)
A chance for more modern design principles - for example Spark was originally architected to best deployed to bare metal, it is hard to make some changes to be more cloud friendly
Utilize modern resource management and orchestration technologies - reusing mature tools like k8s will simplify Ballista's implementation (it probably doesn't need a very complex resource management system anymore) and integrate easily with modern systems (cloud native and simpler scaling model and multi-tenancy)
Using Arrow as the backbone opens doors for more advanced use case such as ML - it may be efficiently integrated with Pandas or Tensorflow through Arrow.

We heavily use systems like Spark for Analytics and ML, the above points are pain points that worth consider switching. I feel the pivot point for having a new reference distributed compute platform is getting closer. :)

Igosuki · 2022-03-13T23:08:42Z

Datafusion has a huge advantage, but the user story needs to be improved. Also, horizontal scalability and using it for ML in distributed is still an issue because it needs to be able to map partitions and have multi-stage jobs. Le dim. 13 mars 2022 à 22:51, Lin Ma ***@***.***> a écrit :

…

Also, as a standalone system, Ballista will compete with the heavy weights in the category (Spark, Presto..). That is an interesting but very ambitious goal 😄 I feel some opportunities/differentiators for Ballista are the following: 1. non-JVM - this brings a lot of benefit such as lower footprint, memory efficiency, and no GC cost (Rust specific) 2. A chance for more modern design principles - for example Spark was originally architected to best deployed to bare metal, it is hard to make some changes to be more cloud friendly 3. Utilize modern resource management and orchestration technologies - reusing mature tools like k8s will simplify Ballista's implementation and integrate easily with modern systems 4. Using Arrow as the backbone opens doors for more advanced use case such as ML - it may be efficiently integrated with Pandas or Tensorflow through Arrow. We heavily use systems like Spark for Analytics and ML, the above points are pain points that worth consider switching. — Reply to this email directly, view it on GitHub <#1916 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADDFBWMEEONETGGK73ASOTU7ZWNXANCNFSM5P3MCIBQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you commented.Message ID: ***@***.***>

alamb · 2022-04-25T17:51:08Z

I think the discussion on this issue is closed (Ballista is a system seems to be the prevailing consensus). Closing ticket

alamb added the ballista label Mar 3, 2022

alamb mentioned this issue Mar 3, 2022

add udf/udaf plugin #1881

Closed

alamb closed this as completed Apr 25, 2022

thinkharderdev mentioned this issue May 20, 2022

[Discuss] Ballista Future Direction apache/datafusion-ballista#30

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: Is Ballista a standalone system or framework #1916

Discussion: Is Ballista a standalone system or framework #1916

alamb commented Mar 3, 2022 •

edited

Loading

alamb commented Mar 3, 2022

thinkharderdev commented Mar 3, 2022

andygrove commented Mar 3, 2022

Igosuki commented Mar 4, 2022

yahoNanJing commented Mar 4, 2022

yjshen commented Mar 4, 2022

EricJoy2048 commented Mar 4, 2022

realno commented Mar 4, 2022 •

edited

Loading

matthewmturner commented Mar 4, 2022

yjshen commented Mar 4, 2022 •

edited

Loading

realno commented Mar 4, 2022

liukun4515 commented Mar 4, 2022

liukun4515 commented Mar 4, 2022

Ted-Jiang commented Mar 6, 2022

andygrove commented Mar 10, 2022

rdettai commented Mar 11, 2022 •

edited

Loading

alamb commented Mar 12, 2022

rdettai commented Mar 13, 2022

realno commented Mar 13, 2022 •

edited

Loading

Igosuki commented Mar 13, 2022 via email

alamb commented Apr 25, 2022

Discussion: Is Ballista a standalone system or framework #1916

Discussion: Is Ballista a standalone system or framework #1916

Comments

alamb commented Mar 3, 2022 • edited Loading

alamb commented Mar 3, 2022

thinkharderdev commented Mar 3, 2022

andygrove commented Mar 3, 2022

Igosuki commented Mar 4, 2022

yahoNanJing commented Mar 4, 2022

yjshen commented Mar 4, 2022

EricJoy2048 commented Mar 4, 2022

realno commented Mar 4, 2022 • edited Loading

matthewmturner commented Mar 4, 2022

yjshen commented Mar 4, 2022 • edited Loading

realno commented Mar 4, 2022

liukun4515 commented Mar 4, 2022

liukun4515 commented Mar 4, 2022

Ted-Jiang commented Mar 6, 2022

andygrove commented Mar 10, 2022

rdettai commented Mar 11, 2022 • edited Loading

alamb commented Mar 12, 2022

rdettai commented Mar 13, 2022

realno commented Mar 13, 2022 • edited Loading

Igosuki commented Mar 13, 2022 via email

alamb commented Apr 25, 2022

alamb commented Mar 3, 2022 •

edited

Loading

realno commented Mar 4, 2022 •

edited

Loading

yjshen commented Mar 4, 2022 •

edited

Loading

rdettai commented Mar 11, 2022 •

edited

Loading

realno commented Mar 13, 2022 •

edited

Loading