[FEA] Support "dataframe.query-planning" config in `dask.dataframe` #15027

rjzamora · 2024-02-12T17:33:04Z

PSA

To unblock CI failures related to the dask-expr migration, downstream RAPIDS libraries can set the following environment variable in CI (before dask.dataframe/dask_cudf is ever imported):

export DASK_DATAFRAME__QUERY_PLANNING=False

If you do this, please comment on the change and link it to this meta issue, so that I can make the necessary changes/fixes and turn query-planning back on.
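For readers unsure why `DASK_DATAFRAME__QUERY_PLANNING` corresponds to the `"dataframe.query-planning"` option, here is a minimal standard-library sketch of the naming convention Dask documents for environment variables: the `DASK_` prefix is dropped, the name is lowercased, a double underscore marks nesting, and `_` and `-` are treated as interchangeable. The function `dask_env_to_config` is hypothetical and written only for illustration; it is not Dask's implementation.

```python
import ast


def dask_env_to_config(environ):
    """Approximate the DASK_* environment-variable naming convention.

    Hypothetical helper for illustration only; not part of Dask.
    """
    config = {}
    for name, raw in environ.items():
        if not name.startswith("DASK_"):
            continue
        # Drop the prefix, lowercase, and treat "__" as a nesting separator.
        key = name[len("DASK_"):].lower().replace("__", ".")
        # Assumption for display: normalize "_" to "-" (Dask accepts either).
        key = key.replace("_", "-")
        try:
            # Values such as "False" or "10" are parsed as Python literals.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # anything unparseable stays a plain string
        config[key] = value
    return config


print(dask_env_to_config({"DASK_DATAFRAME__QUERY_PLANNING": "False"}))
```

The printed mapping should pair `dataframe.query-planning` with the boolean `False`, which is the same option the deprecation message asks users to set through `dask.config.set`.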

Background

The 2024.2.0 release of Dask has deprecated the "legacy" dask.dataframe API. Given that dask-cudf (and much of RAPIDS) is tightly integrated with dask.dataframe, it is critical that dask_cudf be updated to use the new dask_expr backend smoothly.

Most of the heavy lifting is already being done in #14805. However, there will also be some follow-up work to expand coverage/examples/documentation/benchmarks. We will also need to update dask-cuda/explicit-comms.

Action Items

Basics (to be covered by #14805):

Add dask-expr DataFrameBackendEntrypoint entrypoint for "cudf"
Align top-level dask_cudf imports with dask.dataframe for "dataframe.query-planning" support

Expected Follow-up:

cuDF / Dask cuDF doc build:

Revert Disable dask-expr in docs builds. #15343 (see: Patch dask-expr var logic in dask-cudf #15347)

cuML support:

General debugging PR

cuxfilter support:

(Prevent path conflict in builds cuxfilter#583, Avoid importing from dask_cudf.core cuxfilter#593)

cugraph support:

General debugging PR

Dask CUDA:

Explicit comms support (Support "dataframe.query-planning" config in dask.dataframe dask-cuda#1311)

Dask SQL:

Migrate predicate pushdown to dask-expr

NeMo Curator:

Migrate custom-graph code and test against latest dask/cudf

Merlin:

Port Merlin/NVTabular (Heavy lift - Aiming for 24.08)


Dask CUDA must use the deprecated `dask.dataframe` API until #1311 and rapidsai/cudf#15027 are both closed. This means that we must explicitly filter the following deprecation warning to avoid nightly CI failures:

```
DeprecationWarning: The current Dask DataFrame implementation is deprecated.
In a future release, Dask DataFrame will use new implementation that
contains several improvements including a logical query planning.
The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by
installing the dask-expr library:

    $ pip install dask-expr

and turning the query planning option on:

    >>> import dask
    >>> dask.config.set({'dataframe.query-planning': True})
    >>> import dask.dataframe as dd

API documentation for the new implementation is available at
https://docs.dask.org/en/stable/dask-expr-api.html

Any feedback can be reported on the Dask issue tracker
https://github.com/dask/dask/issues

  import dask.dataframe as dd
```

This PR adds the (temporarily) necessary warning filter.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #1312
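As a rough illustration of what such a warning filter can look like, the sketch below uses only the standard-library `warnings` module. It is an assumption about the general technique, not a copy of the change in #1312, which may configure the filter elsewhere (for example in the test-runner configuration). The names `DASK_DD_DEPRECATION`, `install_filter`, and `count_unfiltered` are hypothetical.

```python
import warnings

# Regex matched against the start of the warning message (taken from the text above).
DASK_DD_DEPRECATION = r"The current Dask DataFrame implementation is deprecated"


def install_filter():
    """Ignore only the Dask DataFrame deprecation, leaving other warnings alone."""
    warnings.filterwarnings(
        "ignore",
        message=DASK_DD_DEPRECATION,
        category=DeprecationWarning,
    )


def count_unfiltered(messages):
    """Emit each message as a DeprecationWarning and count those still reported."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # report everything by default
        install_filter()  # inserted ahead of "always", so it takes precedence
        for text in messages:
            warnings.warn(text, DeprecationWarning)
    return len(caught)


remaining = count_unfiltered(
    [
        "The current Dask DataFrame implementation is deprecated. ...",
        "some_other_function is deprecated",
    ]
)
print(remaining)
```

Matching on the message prefix as well as the category keeps the filter narrow, so unrelated deprecation warnings still surface in CI.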

Mostly addresses #15027

dask/dask-expr#728 exposed the necessary mechanisms for us to define a custom dask-expr backend for `cudf`. The new dispatching mechanisms are effectively the same as those in `dask.dataframe`. The only difference is that we are now registering/implementing "expression-based" collections.

This PR does the following:
- Defines a basic `DataFrameBackendEntrypoint` class for collection creation, and registers new collections using `get_collection_type`.
- Refactors the `dask_cudf` import structure to properly support the `"dataframe.query-planning"` configuration.
- Modifies CI to test dask-expr support for some of the `dask_cudf` tests. This coverage can be expanded in follow-up work.

~**Experimental Change**: This PR patches `dask_expr._expr.Expr.__new__` to enable type-based dispatching. This effectively allows us to surgically replace problematic `Expr` subclasses that do not work for cudf-backed data. For example, this PR replaces the upstream `TakeLast` expression to avoid using `squeeze` (since this method is not supported by cudf). This particular fix can be moved upstream relatively easily. However, having this kind of "patching" mechanism may be valuable for more complicated pandas/cudf discrepancies.~

## Usage example

```python
from dask import config
config.set({"dataframe.query-planning": True})

import dask_cudf

df = dask_cudf.DataFrame.from_dict(
    {"x": range(100), "y": [1, 2, 3, 4] * 25, "z": ["1", "2"] * 50},
    npartitions=10,
)
df["y2"] = df["x"] + df["y"]
agg = df.groupby("y").agg({"y2": "mean"})["y2"]
agg.simplify().pprint()
```

Dask cuDF should now be using dask-expr for "query planning":

```
Projection: columns='y2'
  GroupbyAggregation: arg={'y2': 'mean'} observed=True split_out=1'y'
    Assign: y2=
      Projection: columns=['y']
        FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
      Add:
        Projection: columns='x'
          FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
        Projection: columns='y'
          FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
```

## TODO

- [x] Add basic tests
- [x] Confirm that general design makes sense

**Follow Up Work**:
- Expand dask-expr test coverage
- Fix local and upstream bugs
- Add documentation once "critical mass" is reached

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)
- Lawrence Mitchell (https://github.com/wence-)
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Ray Douglass (https://github.com/raydouglass)

URL: #14805

Addresses parts of #15027 (json and s3 testing).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15408

Related to orc and text support in #15027

Follow-up work can enable predicate pushdown and column projection with ORC, but the goal of this PR is basic functionality (and parity with the legacy API).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15439


Related to #15027

Adds a minor tokenization fix, and adjusts testing for categorical-accessor support.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Matthew Roeschke (https://github.com/mroeschke)
- Bradley Dice (https://github.com/bdice)

URL: #15591

Related to #15027

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #15639

rjzamora added the "feature request" (New feature or request) and "dask" (Dask issue) labels Feb 12, 2024

This was referenced Feb 12, 2024

Support "dataframe.query-planning" config in dask.dataframe rapidsai/dask-cuda#1311

Open

Filter dd deprecation rapidsai/dask-cuda#1312

Merged

Introduce basic "cudf" backend for Dask Expressions #14805

Merged

jorisvandenbossche mentioned this issue Mar 11, 2024

Support latest dask.dataframe with query planning (dask-expr) geopandas/dask-geopandas#284

Open

rjzamora self-assigned this Mar 19, 2024

charlesbluca mentioned this issue Mar 19, 2024

Start building images for python 3.11 rapidsai/dask-build-environment#91

Merged

bdice mentioned this issue Mar 20, 2024

Disable dask query planning. rapidsai/cuml#5815

Closed

charlesbluca mentioned this issue Mar 21, 2024

Update gpuCI RAPIDS_VER to 24.06 dask/distributed#8588

Merged

This was referenced Mar 26, 2024

Dependencies for Dask-expr and Rapids AI dask/dask-expr#1003

Closed

Enable dask_cudf json and s3 tests with query-planning on #15408

Merged

rjzamora mentioned this issue Apr 2, 2024

Support orc and text IO with dask-expr using legacy conversion #15439

Merged

3 tasks

This was referenced Apr 24, 2024

Fix categorical-accessor support and testing in dask-cudf #15591

Merged

Prevent path conflict in builds rapidsai/cuxfilter#583

Merged

rjzamora mentioned this issue May 2, 2024

Enable sorting on column with nulls using query-planning #15639

Merged

3 tasks
