Support CPU object for `train_test_split` #5873

isVoid · 2024-04-29T02:22:49Z

This PR adds support for CPU objects in `train_test_split`, leveraging the
input conversion tools defined in `input_utils.py`. It also adds the
`output_to_df_obj_like` API, which converts a `CumlArray` back to a
Series/DataFrame, matching the metadata of the input.
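
To make the "convert back, matching metadata from input" idea concrete, here is a minimal CPU-only sketch using pandas and NumPy. It is not cuML's `output_to_df_obj_like`; the helper name `to_df_obj_like` is hypothetical, and the assumption that "metadata" means column names and the series name is ours, not something the PR text states.

```python
import numpy as np
import pandas as pd


def to_df_obj_like(values, like):
    """Wrap `values` in the same pandas type as `like`, copying its labels.

    Hypothetical helper for illustration; not cuML's `output_to_df_obj_like`.
    """
    values = np.asarray(values)
    if isinstance(like, pd.DataFrame):
        # Assumed metadata for a DataFrame: the column names.
        return pd.DataFrame(values, columns=like.columns)
    if isinstance(like, pd.Series):
        # Assumed metadata for a Series: its name.
        return pd.Series(values, name=like.name)
    raise TypeError(f"unsupported input type: {type(like).__name__}")


X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
y = pd.Series([0, 1, 0], name="label")

# Stand-in for a split: take the first two rows as raw arrays, then restore
# the pandas type and labels from the original inputs.
X_part = to_df_obj_like(X.to_numpy()[:2], X)
y_part = to_df_obj_like(y.to_numpy()[:2], y)
```

The point of the sketch is only that the split itself can work on plain arrays, with the caller-facing type restored at the very end.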

In addition, this PR reimplements most of `train_test_split` by
centralizing the index computation and the gather. This reduces the number of
kernels launched, especially when stratify keys are provided.
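
As a rough illustration of what centralizing the index computation and the gather can look like, the plain-Python sketch below computes the train/test row indices once (per stratum when keys are given) and then gathers every input with that single pair of index lists. It is a CPU stand-in under our own assumptions, not the cuML implementation; `split_indices` and `gather` are hypothetical names.

```python
import random
from collections import defaultdict


def split_indices(n_rows, test_size, stratify=None, seed=0):
    """Return (train_idx, test_idx), computed once and reused for every input.

    Hypothetical helper for illustration; not part of cuML.
    """
    rng = random.Random(seed)
    if stratify is None:
        groups = [list(range(n_rows))]
    else:
        # Bucket row positions by stratification key.
        buckets = defaultdict(list)
        for i, key in enumerate(stratify):
            buckets[key].append(i)
        groups = list(buckets.values())

    train_idx, test_idx = [], []
    for rows in groups:
        rng.shuffle(rows)
        # Each group contributes the same fraction of rows to the test set.
        n_test = round(len(rows) * test_size)
        test_idx.extend(rows[:n_test])
        train_idx.extend(rows[n_test:])
    return train_idx, test_idx


def gather(values, idx):
    """Select the rows of `values` at positions `idx`."""
    return [values[i] for i in idx]


X = [[i, i * 10] for i in range(8)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Indices are computed a single time, then applied to X and y alike.
train_idx, test_idx = split_indices(len(X), test_size=0.25, stratify=y)
X_train, X_test = gather(X, train_idx), gather(X, test_idx)
y_train, y_test = gather(y, train_idx), gather(y, test_idx)
```

Because both `X` and `y` are gathered with the same indices, rows stay aligned without recomputing the split for each input.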

Closes #5619

python/cuml/internals/input_utils.py

python/cuml/model_selection/_split.py

dantegd

Overall things look great! Love the code cleaning included in the PR as well! Just had one question

dantegd · 2024-04-30T05:08:41Z

python/cuml/tests/test_train_test_split.py


 cuda = gpu_only_import_from("numba", "cuda")

-test_array_input_types = ["numba", "cupy"]
+test_array_input_types = ["numba", "cupy", "cudf", "pandas"]


Code looks great! But in the tests we're only testing DataFrames; should we add tests for Series as well?

The existing tests have hardcoded 2-d array inputs that can't be naturally cast to Series. I've parametrized them with a 1-d array as well, plus a fixture that converts the data to a DataFrame or Series based on its shape.

dantegd · 2024-04-30T14:41:06Z

/merge

Patch train_test_split non-stratify case

219d153

github-actions bot added the Cython / Python label Apr 29, 2024

isVoid added 4 commits April 29, 2024 03:30

Enforce stratify split with internal array structure

0053e3a

Centralize indices compute

f97bf4c

clean up and docs

4337571

address my own reviews

329ac7e

isVoid commented Apr 29, 2024

isVoid added 3 commits April 29, 2024 21:54

Address doc changes

f00c0b5

Merge branch 'branch-24.06' of https://github.com/rapidsai/cuml into …

69ff80c

…branch-24.06

Merge branch 'branch-24.06' of github.com:isVoid/cuml into branch-24.06

ff33420

isVoid marked this pull request as ready for review April 29, 2024 14:00

isVoid requested a review from a team as a code owner April 29, 2024 14:00

Merge branch 'branch-24.06' of https://github.com/rapidsai/cuml into …

d2274cf

…branch-24.06

dantegd requested changes Apr 30, 2024

add 1-d data

406691d

isVoid requested a review from dantegd April 30, 2024 06:55

dantegd added the improvement and non-breaking labels Apr 30, 2024

dantegd approved these changes Apr 30, 2024

rapids-bot bot merged commit 1609fcd into rapidsai:branch-24.06 Apr 30, 2024
59 checks passed

dantegd mentioned this pull request May 10, 2024

[FEA] Feature request: train_test_split to accept numpy inputs #1619

Closed
