
Add experimental filesystem="arrow" support in dask_cudf.read_parquet #16684

Merged — 49 commits merged into rapidsai:branch-24.10 on Sep 25, 2024

Conversation

rjzamora
Member

Description

This PR piggybacks on the existing CPU/Arrow Parquet infrastructure in dask-expr. With this PR,

df = dask_cudf.read_parquet(path, filesystem="arrow")

will produce a cudf-backed collection that uses PyArrow for IO (i.e. disk -> pa.Table -> cudf.DataFrame). Before this PR, passing filesystem="arrow" would simply result in an error.

Although this code path is not ideal for fast/local storage, it can be very efficient for remote storage (e.g. S3).
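The behavior change can be sketched with a toy dispatcher (a sketch only; `read_parquet` below is a hypothetical stand-in for `dask_cudf.read_parquet`, not the real code path):

```python
# Toy sketch of the dispatch this PR introduces. Before the PR the
# "arrow" branch did not exist, so filesystem="arrow" raised; now it
# routes IO through PyArrow (disk -> pa.Table) before converting to a
# cudf-backed frame. Names and return values here are illustrative only.
def read_parquet(path, filesystem="fsspec"):
    if filesystem == "fsspec":
        return ("fsspec", path)  # default: fsspec-based IO
    if filesystem == "arrow":
        return ("arrow", path)   # new: PyArrow IO, then pa.Table -> cudf
    raise ValueError(f"unsupported filesystem={filesystem!r}")

print(read_parquet("s3://bucket/data.parquet", filesystem="arrow"))
# ('arrow', 's3://bucket/data.parquet')
```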

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added feature request New feature or request 2 - In Progress Currently a work in progress dask Dask issue non-breaking Non-breaking change labels Aug 28, 2024
@rjzamora rjzamora self-assigned this Aug 28, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 28, 2024
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Aug 30, 2024
Adds a new benchmark for Parquet read performance using a `LocalCUDACluster`. The user can pass `--key` and `--secret` options to specify S3 credentials.

E.g.
```
$ python ./local_read_parquet.py --devs 0,1,2,3,4,5,6,7 --filesystem fsspec --type gpu --file-count 48 --aggregate-files

Parquet read benchmark
--------------------------------------------------------------------------------
Path                      | s3://dask-cudf-parquet-testing/dedup_parquet
Columns                   | None
Backend                   | cudf
Filesystem                | fsspec
Blocksize                 | 244.14 MiB
Aggregate files           | True
Row count                 | 372066
Size on disk              | 1.03 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
36.75 s                   | 28.78 MiB/s
21.29 s                   | 49.67 MiB/s
17.91 s                   | 59.05 MiB/s
================================================================================
Throughput                | 41.77 MiB/s +/- 7.81 MiB/s
Bandwidth                 | 0 B/s +/- 0 B/s
Wall clock                | 25.32 s +/- 8.20 s
================================================================================
...
```

**Notes**:
- S3 performance generally scales with the number of workers (multiplied by the number of threads per worker)
- The example shown above was not executed from an EC2 instance
- The example shown above *should* perform better after rapidsai/cudf#16657
- Using `--filesystem arrow` together with `--type gpu` performs well, but depends on rapidsai/cudf#16684

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1371
@GregoryKimball
Contributor

@rjzamora , I just heard from @ayushdg that NeMo Curator would benefit from this feature in 24.10. Would you please let me know what steps are needed to complete this work? Who should the reviewers be?

@rjzamora
Member Author

Would you please let me know what steps are needed to complete this work? Who should the reviewers be?

Thanks for the nudge @GregoryKimball - I spent some time today isolating the "experimental" logic used for this feature, so the PR should now be relatively "low risk". I'd like to grab an EC2 instance later today to test that the performance is still good.

The usual dask-cudf and python/IO reviewers include @pentschev @charlesbluca @galipremsagar @wence- @madsbk @quasiben @vyasr - I welcome a review from anyone available :)

Member

@madsbk madsbk left a comment


Thanks @rjzamora, I only have some minor suggestions

python/dask_cudf/dask_cudf/backends.py (4 resolved review threads)
Contributor

@galipremsagar galipremsagar left a comment


Thanks @rjzamora !

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Sep 25, 2024
Contributor

@wence- wence- left a comment


One question about how noisy we want the "experimental" warning to be

Comment on lines 752 to 756
```python
warnings.warn(
    f"Support for `filesystem={filesystem}` is experimental. "
    "Using PyArrow to perform IO on multiple CPU threads. "
    "Behavior may change in the future (without deprecation). "
)
```
Contributor


question: Do we really want a user-visible (?) warning here? If curator uses this, it would presumably mean end users would see this warning but not know what to do about it (or have any facility to).

Member Author


@ayushdg @VibhuJawa - What are your feelings on this? I agree that this is probably a bit annoying :)

Member Author


Okay - I added a note to the best-practices docs about this feature, and removed the explicit warning.

python/dask_cudf/dask_cudf/expr/_expr.py (resolved review thread)
rjzamora and others added 7 commits September 25, 2024 12:36
Before, when `columns=` was a `cudf.Series`/`Index`, we would execute `return array.unique.to_pandas()`, but `.unique` is a method, not a property, so this would have raised an error.

Also took the time to refactor the helper methods here and push down the `errors=` keyword to `Frame._drop_column`
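The method-versus-property distinction behind the bug can be reproduced with a minimal stand-in class (a sketch; `Series` here is a hypothetical toy, not cudf's):

```python
# Minimal illustration of the bug class fixed here: accessing `.unique`
# without calling it yields the bound method object, not the unique values.
class Series:
    def __init__(self, data):
        self._data = data

    def unique(self):
        # preserve first-seen order, like pandas/cudf unique()
        return list(dict.fromkeys(self._data))

s = Series([2, 2, 1])
assert callable(s.unique)    # `s.unique` is a method object...
assert s.unique() == [2, 1]  # ...it must be called to get the values
```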

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#16712
Fixes rapidsai#16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation.  Previously it was doing a linear march over the individual tdigests per group and merging them one by one.  This led to terrible performance for large numbers of groups.  In principle though, all this really was doing was a segmented sort of centroid values, so that's what this PR changes it to.  Speedup for 1,000,000 input tdigests with 1,000,000 individual groups is ~1000x.
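The segmented-sort idea can be sketched in plain Python (a toy model of the concept, not the CUDA implementation): each group's centroids occupy a contiguous segment of one flat array, and every segment is sorted independently.

```python
# Toy sketch of a segmented sort: sort each slice
# values[offsets[i]:offsets[i+1]] on its own, in one pass over the data.
def segmented_sort(values, offsets):
    out = list(values)
    for start, end in zip(offsets, offsets[1:]):
        out[start:end] = sorted(out[start:end])
    return out

centroids = [5.0, 1.0, 3.0, 9.0, 2.0, 7.0, 4.0]
offsets = [0, 3, 7]  # two groups: [5, 1, 3] and [9, 2, 7, 4]
print(segmented_sort(centroids, offsets))
# [1.0, 3.0, 5.0, 2.0, 4.0, 7.0, 9.0]
```

On the GPU this maps to a single segmented sort over all groups at once, replacing the per-group linear merge.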

```
Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```


```
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```

Authors:
  - https://github.com/nvdbaranec
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#16780
This PR displays deltas for the CPU and GPU usage metrics that are extracted from `cudf.pandas` pytests.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)

URL: rapidsai#16864
@galipremsagar
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 0425963 into rapidsai:branch-24.10 Sep 25, 2024
96 checks passed
@rjzamora rjzamora deleted the dask-cudf-arrow-filesystem branch September 26, 2024 01:23
rapids-bot bot pushed a commit that referenced this pull request Oct 23, 2024
Follow-up to #16684

There is currently a bug in `dask_cudf.read_parquet(..., filesystem="arrow")` when the files are larger than the `"dataframe.parquet.minimum-partition-size"` config. More specifically, when the files are not aggregated together, the output will be `pd.DataFrame` instead of `cudf.DataFrame`.
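The symptom can be modeled with stand-in classes (hypothetical names; CPU/GPU frames replace pandas/cudf so the sketch runs anywhere):

```python
# Toy model of the reported bug: when files are not aggregated, the
# arrow read path skipped the CPU -> GPU conversion, leaking a
# CPU-backed frame. CPUFrame/GPUFrame are hypothetical stand-ins for
# pd.DataFrame and cudf.DataFrame.
class CPUFrame:
    pass

class GPUFrame:
    pass

def read_partition(aggregate_files):
    table = CPUFrame()  # PyArrow IO always lands on the CPU first
    if aggregate_files:
        return GPUFrame()  # aggregated path converted correctly
    return table           # bug sketch: non-aggregated path forgot to convert

assert isinstance(read_partition(True), GPUFrame)
assert isinstance(read_partition(False), CPUFrame)  # the leak this follow-up fixes
```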

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #17099