Add experimental support for executing SQL with Polars and Pandas #190

Merged
andygrove merged 21 commits into apache:main from the sql-on-polars branch on Feb 20, 2023

andygrove (Member) commented Feb 18, 2023

Which issue does this PR close?

Part of #191

Rationale for this change

Demonstrate how to use DataFusion as a SQL query planner and optimizer in Python and then translate the logical plan to another execution engine.

The examples added in this PR run trivial SQL queries against Polars and Pandas.

To support a wider range of queries we would need to expose all of the DataFusion operators and expressions (see #191).
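
To illustrate the idea of translating a logical plan to another engine, here is a minimal sketch that walks a plan tree and runs each node against plain Python rows. The node classes (`TableScan`, `Projection`, `Aggregate`) and the `execute` function are hypothetical stand-ins written for this illustration; they are not DataFusion's Python classes and not the code in `examples/`. The sample rows are made up as well.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Union


# Hypothetical plan nodes for illustration only (NOT DataFusion's API).
@dataclass
class TableScan:
    table: str


@dataclass
class Projection:
    input: "Plan"
    columns: list


@dataclass
class Aggregate:
    input: "Plan"
    group_by: str  # one grouping column; the only aggregate is COUNT(*)


Plan = Union[TableScan, Projection, Aggregate]


def execute(plan, tables):
    """Recursively translate a plan node into operations on lists of dict rows."""
    if isinstance(plan, TableScan):
        return tables[plan.table]
    if isinstance(plan, Projection):
        rows = execute(plan.input, tables)
        return [{c: row[c] for c in plan.columns} for row in rows]
    if isinstance(plan, Aggregate):
        rows = execute(plan.input, tables)
        counts = Counter(row[plan.group_by] for row in rows)
        return [{plan.group_by: key, "count": n} for key, n in counts.items()]
    # Operators that have not been mapped to the target engine are rejected.
    raise NotImplementedError(f"unsupported plan node: {type(plan).__name__}")


# Made-up sample data standing in for the taxi table.
tables = {
    "taxi": [
        {"passenger_count": 1.0, "fare": 7.5},
        {"passenger_count": 2.0, "fare": 12.0},
        {"passenger_count": 1.0, "fare": 5.0},
    ]
}

# select passenger_count, count(*) from taxi group by passenger_count
grouped = execute(Aggregate(TableScan("taxi"), "passenger_count"), tables)

# select passenger_count from taxi
projected = execute(Projection(TableScan("taxi"), ["passenger_count"]), tables)
```

The `NotImplementedError` branch mirrors the limitation noted above: any operator or expression that is not exposed and mapped cannot be translated, so only trivial queries run.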

Polars Example

select passenger_count, count(*) from taxi group by passenger_count

Results:

$ python examples/sql-on-polars.py 
shape: (10, 2)
┌─────────────────┬─────────────────┐
│ passenger_count ┆ COUNT(UInt8(1)) │
│ ---             ┆ ---             │
│ f64             ┆ u32             │
╞═════════════════╪═════════════════╡
│ 1.0             ┆ 966236          │
│ 0.0             ┆ 26726           │
│ 2.0             ┆ 161671          │
│ 3.0             ┆ 43935           │
│ ...             ┆ ...             │
│ 6.0             ┆ 25362           │
│ 8.0             ┆ 2               │
│ 7.0             ┆ 5               │
│ null            ┆ 98352           │
└─────────────────┴─────────────────┘

Pandas Example

select passenger_count from taxi

Results:

$ python examples/sql-on-pandas.py 
         passenger_count
0                    1.0
1                    1.0
2                    1.0
3                    0.0
4                    1.0
...                  ...
1369764              NaN
1369765              NaN
1369766              NaN
1369767              NaN
1369768              NaN

[1369769 rows x 1 columns]

What changes are included in this PR?

  • New examples

Are there any user-facing changes?

@andygrove changed the title from "Add example showing how execute SQL with Polars" to "Add example showing how to execute SQL with Polars" on Feb 18, 2023

andygrove (Member, Author) commented

@jdye64 fyi

@andygrove changed the title from "Add example showing how to execute SQL with Polars" to "Add examples showing how to execute SQL with Polars and Pandas" on Feb 18, 2023

jdye64 (Contributor) commented Feb 20, 2023

LGTM!

@andygrove changed the title from "Add examples showing how to execute SQL with Polars and Pandas" to "Add experimental support for executing SQL with Polars and Pandas" on Feb 20, 2023
@andygrove marked this pull request as ready for review February 20, 2023 16:03
@andygrove merged commit ca8b055 into apache:main Feb 20, 2023
@andygrove deleted the sql-on-polars branch February 20, 2023 17:03
andygrove added a commit that referenced this pull request Feb 22, 2023
* changelog (#188)

* Add Python wrapper for LogicalPlan::Sort (#196)

* Add Python wrapper for LogicalPlan::Aggregate (#195)

* Add Python wrapper for LogicalPlan::Limit (#193)

* Add Python wrapper for LogicalPlan::Filter (#192)

* Add Python wrapper for LogicalPlan::Filter

* clippy

* clippy

* Update src/expr/filter.rs

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

* Add tests for recently added functionality (#199)

* Add experimental support for executing SQL with Polars and Pandas (#190)

* Run `maturin develop` instead of `cargo build` in verification script (#200)

* Implement `to_pandas()` (#197)

* Implement to_pandas()

* Update documentation

* Write unit test

* Add support for cudf as a physical execution engine (#205)

* Update README in preparation for 0.8 release (#206)

* Analyze table bindings (#204)

* method for getting the internal LogicalPlan instance

* Add explain plan method

* Add bindings for analyze table

* Add to_variant

* cargo fmt

* blake and flake formatting

* changelog (#209)

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Dejan Simic <10134699+simicd@users.noreply.github.com>
Co-authored-by: Jeremy Dyer <jdye64@gmail.com>