Add experimental filesystem="arrow" support in dask_cudf.read_parquet #16684

Merged (49 commits) on Sep 25, 2024

Commits on Aug 27, 2024

  1. 469bc5e
  2. re-org

    rjzamora committed Aug 27, 2024
    f20cc25

Commits on Aug 28, 2024

  1. 8f0f598
  2. adjust for upstream bug

    rjzamora committed Aug 28, 2024
    64fd701
  3. remove stale comment

    rjzamora committed Aug 28, 2024
    8e0c902
  4. add file aggregation

    rjzamora committed Aug 28, 2024
    18e1c08

Commits on Aug 29, 2024

  1. 5215a05
  2. test coverage

    rjzamora committed Aug 29, 2024
    c51a7bb
  3. b7a90c1

Commits on Aug 30, 2024

  1. allow aggregate_files=True

    rjzamora committed Aug 30, 2024
    43274e2
  2. 63c3f04
  3. a1bd43c

Commits on Sep 3, 2024

  1. e3ca47f
  2. fix test

    rjzamora committed Sep 3, 2024
    12c09a5

Commits on Sep 4, 2024

  1. daee7ec
  2. d068103

Commits on Sep 5, 2024

  1. 257eb26

Commits on Sep 6, 2024

  1. ec38b1e
  2. Performance improvement for strings::slice for wide strings (rapidsai#16574)
    
    Improves performance of wide strings (avg > 64 bytes) when using `cudf::strings::slice_strings`.
    Addresses some concerns from issue rapidsai#15924
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Muhammad Haseeb (https://github.com/mhaseeb123)
    
    URL: rapidsai#16574
    davidwendt authored and rjzamora committed Sep 6, 2024
    853c76b
  3. skip for pyarrow<15

    rjzamora committed Sep 6, 2024
    bdd2bab
  4. d943d8d

Commits on Sep 10, 2024

  1. eb9eee0
  2. b9c5147

Commits on Sep 18, 2024

  1. ec04e78

Commits on Sep 19, 2024

  1. e391789
  2. e154d01

Commits on Sep 24, 2024

  1. Intentionally leak thread_local CUDA resources to avoid crash (part 1) (rapidsai#16787)
    
    The NVbench application `PARQUET_READER_NVBENCH` in libcudf currently crashes with a segmentation fault. To reproduce:
    
    ```
    ./PARQUET_READER_NVBENCH -d 0 -b 1 --run-once -a io_type=FILEPATH -a compression_type=SNAPPY -a cardinality=0 -a run_length=1
    ```
     
    The root cause is that (1) some `thread_local` objects on the main thread in `libcudf` and (2) some `static` objects in `kvikio` are destroyed after `cudaDeviceReset()` in NVbench, upon program termination. These objects should simply be leaked, since destructors that make CUDA calls upon program termination constitute UB in CUDA.
    
    This simple PR is the cuDF side of the fix. The other part is done in rapidsai/kvikio#462.
    
    closes rapidsai#13229
    
    Authors:
      - Tianyu Liu (https://github.com/kingcrimsontianyu)
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - Vukasin Milovanovic (https://github.com/vuule)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#16787
    kingcrimsontianyu authored and rjzamora committed Sep 24, 2024
    3246d67
  2. Access Frame attributes instead of ColumnAccessor attributes when available (rapidsai#16652)
    
    There are some places where a public object like `DataFrame` or `Index` accesses a `ColumnAccessor` attribute when the same information is available through an attribute on a shared parent class (like `Frame`).
    
    In an effort to access the `ColumnAccessor` less, this PR replaces usages of `._data.attribute` with a `Frame`-specific attribute.
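
A minimal sketch of the pattern described above, using toy classes rather than the real cudf ones. `ToyColumnAccessor`, `ToyFrame`, `ToyDataFrame`, and the `_column_names` property are hypothetical names chosen for illustration; the attributes actually touched by this change may differ.

```python
class ToyColumnAccessor:
    """Hypothetical stand-in for a column container; not the cudf class."""

    def __init__(self, columns):
        self._columns = dict(columns)

    @property
    def names(self):
        return tuple(self._columns)


class ToyFrame:
    """Hypothetical shared parent that wraps the accessor."""

    def __init__(self, columns):
        self._data = ToyColumnAccessor(columns)

    @property
    def _column_names(self):
        # Single place where the parent class reaches into the accessor.
        return self._data.names


class ToyDataFrame(ToyFrame):
    def describe_before(self):
        # Before: the public object reaches through `._data` directly.
        return f"columns: {self._data.names}"

    def describe_after(self):
        # After: the public object uses the attribute the parent exposes.
        return f"columns: {self._column_names}"


df = ToyDataFrame({"a": [1, 2], "b": [3, 4]})
```

Both methods return the same string; the second simply keeps knowledge of the accessor's layout inside the parent class.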
    
    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: rapidsai#16652
    mroeschke authored and rjzamora committed Sep 24, 2024
    2f424f2
  3. 362195d
  4. 4ce83d4
  5. remove unnecessary logic

    rjzamora committed Sep 24, 2024
    4d87013
  6. e5b272a
  7. add warning

    rjzamora committed Sep 24, 2024
    8d87c54
  8. 8cfe71e
  9. badf359
  10. 4c1c5ae
  11. 3f1d925

Commits on Sep 25, 2024

  1. 8c267c7
  2. 91d2d77
  3. 239639f
  4. c944a52
  5. more cleanup

    rjzamora committed Sep 25, 2024
    791a4fd
  6. 4c5ee6d
  7. Build cudf-polars with build.sh (rapidsai#16898)

    This PR adds `cudf-polars` to the top level build script.
    
    Authors:
      - https://github.com/brandon-b-miller
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
      - Jake Awe (https://github.com/AyodeAwe)
    
    URL: rapidsai#16898
    brandon-b-miller authored and rjzamora committed Sep 25, 2024
    4d28db7
  8. Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) (rapidsai#16712)

    Previously, when `columns=` was a `cudf.Series/Index`, we would call `return array.unique.to_pandas()`, but `.unique` is a method, not a property, so this would have raised an error.
    
    Also took the time to refactor the helper methods here and push down the `errors=` keyword to `Frame._drop_column`.
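
The method-versus-property mistake can be reproduced without a GPU. The sketch below uses a hypothetical `ToyColumn` class as a stand-in for a `cudf.Series/Index`; only the shape of the bug (`array.unique.to_pandas()` versus `array.unique().to_pandas()`) comes from the description above, and the helper names are made up for illustration.

```python
class ToyColumn:
    """Hypothetical stand-in for a cudf.Series/Index; not the real class."""

    def __init__(self, values):
        self._values = list(values)

    def unique(self):
        # A method: it must be called to produce a result.
        return ToyColumn(dict.fromkeys(self._values))

    def to_pandas(self):
        # Stand-in for a host conversion; returns a plain list here.
        return list(self._values)


def labels_buggy(array):
    # `array.unique` is a bound method, which has no `to_pandas` attribute.
    return array.unique.to_pandas()


def labels_fixed(array):
    # Calling the method first yields an object that does have `to_pandas`.
    return array.unique().to_pandas()


col = ToyColumn(["a", "b", "a"])
try:
    labels_buggy(col)
    buggy_error = None
except AttributeError as exc:
    buggy_error = exc
```

In this toy version the buggy helper fails with an `AttributeError`, while the fixed helper returns the de-duplicated labels.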
    
    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
    
    URL: rapidsai#16712
    mroeschke authored and rjzamora committed Sep 25, 2024
    9aa5aca
  9. [DOC] Update Pylibcudf doc strings (rapidsai#16810)

    This PR is a first pass at rapidsai#15937. We will close rapidsai#15937 after rapidsai#15162 is closed.
    
    Authors:
      - Matthew Murray (https://github.com/Matt711)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: rapidsai#16810
    Matt711 authored and rjzamora committed Sep 25, 2024
    42a15ee
  10. Optimization of tdigest merge aggregation. (rapidsai#16780)

    Fixes rapidsai#16625
    
    This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation. Previously it did a linear march over the individual tdigests per group and merged them one by one. This led to terrible performance for large numbers of groups. In principle, though, all this really did was a segmented sort of centroid values, so that's what this PR changes it to. Speedup for 1,000,000 input tdigests with 1,000,000 individual groups is ~1000x:
    
    ```
    Old
    ---------------------------------------------------------------------------------------------------------------
    Benchmark                                                                     Time             CPU   Iterations
    ---------------------------------------------------------------------------------------------------------------
    TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
    TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
    ```
    
    
    ```
    New
    ---------------------------------------------------------------------------------------------------------------
    Benchmark                                                                     Time             CPU   Iterations
    ---------------------------------------------------------------------------------------------------------------
    TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
    TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
    ```
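
The equivalence the description relies on (merging already-sorted digests one by one gives the same ordering as a single sort segmented by group) can be illustrated on the CPU. This is only a sketch under simplifying assumptions: centroids are modeled as `(mean, weight)` tuples, group ids are integers, and the function names are hypothetical. It says nothing about how the libcudf kernels are written or how they perform.

```python
from heapq import merge


def merge_one_by_one(digests_by_group):
    """Old shape: walk each group's digests and merge them one at a time."""
    merged_by_group = {}
    for group, digests in digests_by_group.items():
        merged = []
        for centroids in digests:  # each digest is already sorted by mean
            merged = list(merge(merged, centroids))
        merged_by_group[group] = merged
    return merged_by_group


def merge_by_segmented_sort(digests_by_group):
    """New shape: one sort keyed on (group, centroid), then split by group."""
    keyed = [
        (group, centroid)
        for group, digests in digests_by_group.items()
        for centroids in digests
        for centroid in centroids
    ]
    keyed.sort()  # orders centroids within each group; groups stay contiguous
    merged_by_group = {group: [] for group in digests_by_group}
    for group, centroid in keyed:
        merged_by_group[group].append(centroid)
    return merged_by_group


# Two groups, each holding two small digests of (mean, weight) centroids.
example = {
    0: [[(1.0, 2), (4.0, 1)], [(2.0, 1), (3.0, 5)]],
    1: [[(10.0, 1)], [(5.0, 3), (7.0, 2)]],
}
old_result = merge_one_by_one(example)
new_result = merge_by_segmented_sort(example)
```

Both functions produce the same per-group centroid order; the second does it with a single sort over all groups instead of a sequential loop per group.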
    
    Authors:
      - https://github.com/nvdbaranec
      - Muhammad Haseeb (https://github.com/mhaseeb123)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - Muhammad Haseeb (https://github.com/mhaseeb123)
      - Nghia Truong (https://github.com/ttnghia)
      - Mike Wilson (https://github.com/hyperbolic2346)
    
    URL: rapidsai#16780
    nvdbaranec authored and rjzamora committed Sep 25, 2024
    2c5bb57
  11. Display deltas for cudf.pandas test summary (rapidsai#16864)

    This PR displays deltas for CPU and GPU usage metrics that are extracted from `cudf.pandas` pytests.
    
    Authors:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - Jake Awe (https://github.com/AyodeAwe)
    
    URL: rapidsai#16864
    galipremsagar authored and rjzamora committed Sep 25, 2024
    ed19b2e
  12. aa492f5