Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RELEASE] cudf v24.10 #16943

Open
wants to merge 336 commits into
base: main
Choose a base branch
from
Open

[RELEASE] cudf v24.10 #16943

wants to merge 336 commits into from

Conversation

raydouglass
Copy link
Member

❄️ Code freeze for branch-24.10 and v24.10 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.10 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-24.10 into main for the release

PointKernel and others added 30 commits August 5, 2024 21:48
Closes #16395

This PR resolves two types of compilation errors, allowing for successful builds with GCC 13:

- replacing the `cuco_allocator` strong type with an alias to fix a new build time check with GCC 13
- removing `std::move` when returning a temporary

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Mark Harris (https://github.com/harrism)

URL: #16488
Fixes call to CUB `DeviceSegmentedSort::SortPairs` where the input and output indices pointed to the same temp memory. The documentation from https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#id8 indicates the `d_values_in` and `d_values_out` memory must not overlap so using the same pointer for both created invalid output in certain conditions. The internal function was implemented to expect the input values to be updated in-place. The fix uses separate device memory for the input and output indices.

Closes #16455

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #16463
…6454)

`cudf.Series` is a public constructor that happens to accept a private `ColumnBase` object. Many ops return Columns and is natural to want to reconstruct a `Series`.

This PR adds a `SingleColumnFrame._from_column` classmethod for instances where we need to wrap a new column in an `Index` or `Series`. This constructor also passes some unneeded validation in `ColumnAccessor` and `Series`

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16454
Forward-merge branch-24.08 into branch-24.10
…rsion (#16503)

Contributes to rapidsai/build-planning#58.

`scikit-build-core==0.10.0` was released today (https://github.com/scikit-build/scikit-build-core/releases/tag/v0.10.0), and wheel-building configurations across RAPIDS are incompatible with it.

This proposes upgrading to that version and fixing configuration here in a way that:

* is compatible with that new `scikit-build-core` version
* takes advantage of the forward-compatibility mechanism (`minimum-version`) that `scikit-build-core` provides, to reduce the risk of needing to do this again in the future

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham

URL: #16503
Exposes the `stream` param in transform APIs

Authors:
  - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #16452
…16498)

Demonstrates the conversion from an `arrow:StringViewArray` to a `cudf::column`

Authors:
  - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)

URL: #16498
Changes the integer type for `cudf::strings::ipv4_to_integers` and `cudf::strings::integers_to_ipv4` to use UINT32 types instead of INT64. The INT64 type was originally chosen because libcudf did not support unsigned types at the time.
This is a breaking change since the basic input/output type is changed.

Closes #16324

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - https://github.com/brandon-b-miller
  - Karthikeyan (https://github.com/karthikeyann)

URL: #16489
A few small tweaks to `update-version.sh` for alignment across RAPIDS.

The `UCX_PY` curl call is unused.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #16506
This PR updates pre-commit hooks to the latest versions that are supported without causing style check errors.

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #16510
This PR adopts some work from @srinivasyadav18 with additional modifications. This is meant to complement #16484.

Authors:
  - Bradley Dice (https://github.com/bdice)
  - Srinivas Yadav (https://github.com/srinivasyadav18)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Srinivas Yadav (https://github.com/srinivasyadav18)

URL: #16497
closes #15278

This PR allows list type also forced as string when mixed type as string is enabled and a user given schema specifies a column as string, in JSON reader.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Shruti Shivakumar (https://github.com/shrshi)

URL: #16472
Removes overloaded `cudf::io::text::multibyte_split` API deprecated in 24.08 and is no longer needed.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #16501
This change updates json normalization calls (quote and whitespace normalization) to take owning buffer of device_buffer as input rather than device_uvector. It makes it easy to hand over a string_column's char buffer to normalization calls.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Shruti Shivakumar (https://github.com/shrshi)

URL: #16520
#16516)

xref #16507

`date_range` generates its dates via `range`, and the end of this range was calculated via `math.ceil((end - start) / freq)`. If `(end - start) / freq` did not produce a remainder, `math.ceil` would not correctly increment this value by `1` to capture the last date.

Instead, this PR uses `math.floor((end - start) / freq) + 1` to always ensure the last date is captured

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #16516
xref #16507

I would say this was a bug before because we would silently return a new DataFrame with just `len(set(column_labels))` when selecting by column. Now this operation raises since duplicate column labels are generally not supported.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - https://github.com/brandon-b-miller

URL: #16514
Removing some more deprecated public libcudf APIs.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #16524
The JSON reader set the batch size to `INT_MAX` bytes since the motivation for implementing a batched JSON reader was to parse source files whose total size is larger than `INT_MAX` (#16138, #16162). However, we can use a much smaller batch size to evaluate the correctness of the reader and speed up tests significantly.
This PR focuses on reducing runtime of the batched reader test by setting the batch size to be used by the reader as an environment variable. 
The runtime of `JsonLargeReaderTest.MultiBatch` in `LARGE_STRINGS_TEST` gtest  drops from ~52s to ~3s.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #16502
…rings (#16536)

Recently some JSON parsing was updated so lists could be returned as strings. This updates the java code so that when cleaning up the results to match the desired schema that it can handle corner cases associated with lists and structs properly.

Tests are covered in the Spark plugin, but I am happy to add some here if we really want to validate that part of this.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)

URL: #16536
Adds `const` declarations to appropriate member functions in class `cudf::io::text::byte_range_info` and moves the ctor implementation to .cpp file.
This helps with using the `byte_range_info` objects in `const` variables and inside of `const` functions.

Found while working on #15983

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Bradley Dice (https://github.com/bdice)

URL: #16518
Fixes specialized behavior for all empty input column on the strings split APIs.
Verifying behavior with Pandas `str.split( pat, expand, regex )`
`pat=None     -- whitespace`
`expand=False -- record APIs`
`regex=True   -- re APIs`

- [x] `split`
- [x] `split` - whitespace
- [x] `rsplit`
- [x] `rsplit` - whitespace
- [x] `split_record`
- [x] `split_record` - whitespace
- [x] `rsplit_record`
- [x] `rsplit_record` - whitespace
- [x] `split_re`
- [x] `rsplit_re`
- [x] `split_record_re`
- [x] `rsplit_record_re`

Closes #16453

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #16466
Removes the pair-iterator benchmark logic. The remaining benchmarks use the null-replacement-iterator which uses the libcudf pair-iterator internally. There is no need for benchmarking this unique iterator pattern that is not used by libcudf.

The `cpp/benchmarks/iterator/iterator.cu` failed to compile with gcc 12 because the sum-reduce function cannot resolve adding `thrust::pair` objects together likely due to some recent changes in CCCL. Regardless, adding `thrust::pair` objects is not something we need to benchmark. The existing benchmark benchmarks libcudf's usage of the internal pair-iterator correctly.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #16511
This PR removes hardcoded Python versions from CI workflows. It is a prerequisite for dropping Python 3.9. See rapidsai/build-planning#88.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #16540
After dask/dask-expr#1114, Dask cuDF must register specific `read_parquet` and `read_csv` functions to be used when query-planning is enabled (the default).

**This PR is required for CI to pass with dask>2024.8.0**

**NOTE**: It probably doesn't make sense to add specific tests for this change. Once the 2014.7.1 dask pin is removed, all `dask_cudf` tests using `read_parquet` and  `read_csv` will fail without this change...

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #16535
)

When Python integers are compared to a series of integers, the result can always be correctly defined no matter the values of the Python integer.

This was always a very mild issue.  But with NumPy 2 behavior not upcasting the computation result type based on the value anymore, even things like:
```
cudf.Series([1, 2, 3], dtype="int8") < 1000
```
would fail.
(Similar paths could be taken for other integer scalars, but there would be mostly nice for performance.)

N.B. NumPy/pandas also support exact comparisons when mixing e.g. uint64 and int64.  This is another rare exception that cudf currently does not support.

Closes gh-16282

Authors:
  - Sebastian Berg (https://github.com/seberg)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #16532
…mns (#16529)

Fixes `cudf::empty_like` to only create empty child columns for nested types. The empty child columns are needed to store the types for consistency with `cudf::make_empty_column`.

Closes #16490

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Mark Harris (https://github.com/harrism)

URL: #16529
rjzamora and others added 6 commits September 25, 2024 22:43
…uet` (#16684)

This PR piggybacks on the existing CPU/Arrow Parquet infrastructure in dask-expr. With this PR,

```python
df = dask_cudf.read_parquet(path, filesystem="arrow")
```
will produce a `cudf`-backed collection using PyArrow for IO (i.e. disk->`pa.Table`->`cudf.DataFrame`). Before this PR, passing `filesystem="arrow"` will simply result in an error.

Although this code path is not ideal for fast/local storage, it can be **very** efficient for remote storage (e.g. S3).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Matthew Murray (https://github.com/Matt711)
  - David Wendt (https://github.com/davidwendt)
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - https://github.com/brandon-b-miller
  - https://github.com/nvdbaranec

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #16684
Depends on #16836
This change adds a new host tree building algorithms for JSON reader and utf8 field name support.

This constructs the device_column_tree using an adjacency list created from parent information.
This adjacency list is pruned based on input schema, and also types are enforced as per schema. `mark_is_pruned`
Tree is constructed from pruned adjacency list, (with mixed types handling). `construct_tree`

utf8 field name support added: (spark requested)
utf8 decoding of field names during hashing of field nodes so that utf8 encoded field names also match to same column.

All unit tests passes, 1 unit test added where old algorithm fails.
This code is kept under experimental flag.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #16545
More follow-up fixes to the recent Dask-cuDF documentation additions.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #16929
Copy link

copy-pr-bot bot commented Sep 27, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas cudf.polars Issues specific to cudf.polars pylibcudf Issues specific to the pylibcudf package labels Sep 27, 2024
…ith non-newline delimiter (#16950)

Backporting PR #16923: : Parse newline as whitespace character while
tokenizing JSONL inputs

Addresses #16915
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue cudf.pandas Issues specific to cudf.pandas cudf.polars Issues specific to cudf.polars Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.