How can I make ballista distributed compute work? #327

Jeeesie · 2021-05-12T02:59:03Z

I want to run benchmark q1.sql in distributed mode. I noticed that from_proto.rs has PhysicalPlanType::ParquetScan, where ParquetExec::try_from_files() can be used to create several partitions.
However, the benchmark code does not call this method; it uses read_csv() directly. Why is that? And how can I use ParquetScan?

Also, I attempted to call DataFusion's repartition() function in register_table():

   
    let rr_repartition = Partitioning::RoundRobinBatch(3);

    let roundtrip_plan = LogicalPlan::Repartition {
        input: Arc::from(table.to_logical_plan()),
        partitioning_scheme: rr_repartition,
    };

    @state
        .tables
        .insert(name.to_owned(), roundtrip_plan);

but I get this error:
General("Invalid LogicalPlan::TableScan")
Can you help me resolve this? My goal is to run benchmark q1.sql in distributed mode. I have several data files with the lineitem schema.
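For readers unfamiliar with what Partitioning::RoundRobinBatch(3) is meant to do, here is a minimal conceptual sketch in plain Rust. It assumes round-robin simply means that batch i goes to output partition i % n. The function round_robin and the integer "batch ids" are hypothetical and only illustrate the idea; this is not DataFusion's implementation and does not use its API.

```rust
/// Conceptual model of round-robin batch repartitioning (an assumption for
/// illustration, not DataFusion's code): batch `i` is sent to output
/// partition `i % num_partitions`.
fn round_robin<T>(batches: Vec<T>, num_partitions: usize) -> Vec<Vec<T>> {
    assert!(num_partitions > 0, "need at least one output partition");
    // Start with one empty bucket per output partition.
    let mut partitions: Vec<Vec<T>> = (0..num_partitions).map(|_| Vec::new()).collect();
    for (i, batch) in batches.into_iter().enumerate() {
        partitions[i % num_partitions].push(batch);
    }
    partitions
}

fn main() {
    // Seven hypothetical batch ids spread over three output partitions.
    let parts = round_robin((0..7).collect::<Vec<u32>>(), 3);
    for (p, batches) in parts.iter().enumerate() {
        println!("partition {}: {:?}", p, batches);
    }
}
```

With three output partitions, the batches from a single input are spread as evenly as possible, which is why repartitioning looks attractive when there is only one input file.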


andygrove · 2021-05-12T13:14:22Z

The User Guide source is here: https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide

The previously published version is here: https://ballistacompute.org/docs/

I am in the process of updating the user guide, and it will be published to the Arrow web site on the next release.

Jeeesie · 2021-05-13T08:00:48Z

Thanks.
Under the data path, each table has only one data file. As you said, one file becomes one partition, and one partition is executed on only one executor.
So are there any distributed examples already?


andygrove · 2021-08-28T16:30:03Z

The benchmark crate in the repo can be used for executing fully distributed queries against partitioned data and the README in there explains how to do this.
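To see why partitioned data matters for the point raised earlier in the thread (one file is one partition, and one partition runs on one executor), here is a small sketch in plain Rust. The function assign_partitions, the file names, and the executor names are all hypothetical, and the round-robin assignment is an assumption made for illustration rather than Ballista's documented scheduling policy.

```rust
use std::collections::BTreeMap;

/// Hypothetical model: every input file is one partition, and each partition
/// is handled by exactly one executor. Partitions are assigned round-robin,
/// which is an assumption and not Ballista's actual scheduler.
fn assign_partitions<'a>(
    files: &[&'a str],
    executors: &[&'a str],
) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut plan: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    if executors.is_empty() {
        return plan;
    }
    for (i, file) in files.iter().enumerate() {
        // Only executors that receive at least one partition appear in the map.
        plan.entry(executors[i % executors.len()])
            .or_default()
            .push(*file);
    }
    plan
}

fn main() {
    let executors = ["executor-1", "executor-2", "executor-3"];

    // A single data file gives a single partition, so only one executor works.
    let single = assign_partitions(&["lineitem.tbl"], &executors);
    println!("one file: {} executor(s) busy", single.len());

    // Splitting the table into three files lets all three executors work.
    let split = assign_partitions(
        &["lineitem-0.tbl", "lineitem-1.tbl", "lineitem-2.tbl"],
        &executors,
    );
    println!("three files: {} executor(s) busy", split.len());
}
```

Under these assumptions, the number of busy executors never exceeds the number of input files, so a table stored as one file cannot be processed in parallel without first being split or repartitioned.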

Jeeesie added the bug label May 12, 2021

alamb added the ballista label May 12, 2021

andygrove closed this as completed Aug 28, 2021
