How can I make ballista distributed compute work? #327

Closed
Jeeesie opened this issue May 12, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@Jeeesie

Jeeesie commented May 12, 2021

I want to run benchmark q1.sql in distributed mode. I noticed that from_proto.rs handles PhysicalPlanType::ParquetScan, in which ParquetExec::try_from_files() can be used to create several partitions.
However, the benchmark code does not call this method; it uses read_csv() directly. Why is that, and how can I use ParquetScan?

Also, I attempted to call DataFusion's repartition() function in register_table():

   
    let rr_repartition = Partitioning::RoundRobinBatch(3);

    let roundtrip_plan = LogicalPlan::Repartition {
        input: Arc::from(table.to_logical_plan()),
        partitioning_scheme: rr_repartition,
    };

    @state
        .tables
        .insert(name.to_owned(), roundtrip_plan);

but I get this error:
General("Invalid LogicalPlan::TableScan")
Can you help me resolve this? My goal is to run benchmark q1.sql in distributed mode. I have several data files with the lineitem schema.
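
For readers unfamiliar with round-robin repartitioning, here is a minimal standalone sketch of the idea behind `Partitioning::RoundRobinBatch(3)`: incoming batches are dealt out to a fixed number of output partitions in turn. It uses only the Rust standard library; the function name `round_robin` is hypothetical and this is an illustration of the concept, not DataFusion's implementation.

```rust
/// Deal `batches` out to `n` output partitions in round-robin order.
/// Illustration only: `round_robin` is a hypothetical helper, not a
/// DataFusion or Ballista API.
fn round_robin<T>(batches: Vec<T>, n: usize) -> Vec<Vec<T>> {
    assert!(n > 0, "need at least one output partition");
    // Start with `n` empty partitions.
    let mut partitions: Vec<Vec<T>> = (0..n).map(|_| Vec::new()).collect();
    // Batch i goes to partition i mod n.
    for (i, batch) in batches.into_iter().enumerate() {
        partitions[i % n].push(batch);
    }
    partitions
}
```

With seven batches and three partitions, the first partition receives batches 0, 3, and 6, the second receives 1 and 4, and the third receives 2 and 5.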

@Jeeesie Jeeesie added the bug Something isn't working label May 12, 2021
@andygrove
Member

The User Guide source is here: https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide

The previously published version is here: https://ballistacompute.org/docs/

I am in the process of updating the user guide, and it will be published to the Arrow web site on the next release.

@alamb alamb added the ballista label May 12, 2021
@Jeeesie
Author

Jeeesie commented May 13, 2021

Thanks.
Under the data path, each schema has only one data file. As you said, one file becomes one partition, and one partition is executed by only one executor.
So are there any distributed examples already?
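
The concern above can be sketched in a few lines, assuming one partition per file and a simple cyclic assignment of partitions to executors. The assignment policy, the function names, and the file names in the usage note are all hypothetical; Ballista's actual scheduler may behave differently.

```rust
use std::collections::BTreeSet;

/// Map each file (treated as one partition) to an executor index.
/// Assumption: partitions are handed out cyclically; this is not
/// Ballista's real scheduling logic.
fn assign_partitions(files: &[&str], executors: usize) -> Vec<(String, usize)> {
    assert!(executors > 0, "need at least one executor");
    files
        .iter()
        .enumerate()
        .map(|(partition, file)| (file.to_string(), partition % executors))
        .collect()
}

/// Count how many distinct executors received at least one partition.
fn busy_executors(assignment: &[(String, usize)]) -> usize {
    assignment
        .iter()
        .map(|(_, executor)| *executor)
        .collect::<BTreeSet<_>>()
        .len()
}
```

Under these assumptions, a single file such as `lineitem.tbl` keeps only one of three executors busy, while four files spread work across all three.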

@andygrove
Member

The benchmark crate in the repo can be used to execute fully distributed queries against partitioned data, and its README explains how to do this.
