Disallow indexing by selecting duplicate labels #16514

mroeschke · 2024-08-08T21:22:03Z

Description

I would say this was a bug before, because selecting columns with duplicate labels would silently return a new DataFrame with just len(set(column_labels)) columns. Now this operation raises, since duplicate column labels are generally not supported.
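To make the change in behavior concrete, here is a minimal pure-Python sketch of the kind of check described above. The helper name `validate_unique_selection`, the `ValueError` type, and the message text are assumptions for illustration only; this is not cuDF's actual implementation or error.

```python
from collections import Counter


def validate_unique_selection(labels):
    """Hypothetical helper: reject a column selection that repeats a label."""
    duplicates = [label for label, count in Counter(labels).items() if count > 1]
    if duplicates:
        # Error type and wording are assumptions, not cuDF's real ones.
        raise ValueError(f"Duplicate column labels are not supported: {duplicates}")
    return list(labels)


requested = ["a", "b", "a"]

# Old behavior as described: the result silently had this many columns.
silent_width = len(set(requested))

# New behavior as described: the selection is rejected instead.
try:
    validate_unique_selection(requested)
    raised = False
except ValueError:
    raised = True
```

Here three labels were requested but only two are distinct, which is the silent mismatch the description calls a bug.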

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

brandon-b-miller

Thanks @mroeschke. I noticed pandas supports this. Do you feel this is a common pattern that's worth supporting in cuDF over the longer term?

bdice · 2024-08-12T14:18:08Z

Or conversely, does pandas plan to remove this functionality?

mroeschke · 2024-08-12T17:56:32Z

I would say pandas does not plan on ever removing duplicate label functionality, but also that duplicate labels are not that commonplace in pandas.

It would be nice if cudf would eventually support duplicate column labels (I'll open an issue), but I don't think it's high priority.

mroeschke · 2024-08-12T17:56:44Z

/merge

@rlratzel

cc: @rlratzel @ChuckHastings

This PR addresses failures seen in certain PRs (like [here](https://github.com/rapidsai/cugraph/actions/runs/10372270389/job/28718471674?pr=4606#step:7:5269)) due to a [recent change](rapidsai/cudf#16514) to `cudf` that disallows selecting duplicate column labels.

---

In `hypergraph.py`, this PR modifies `_create_hyper_edges` and `_create_direct_edges` to ensure that DataFrames are indexed by non-duplicate column values. This is done by taking a list that includes duplicates (`fs`) and removing the duplicates:

```python
fs = list(set(fs))
```

_This part requires some attention from the author of the unit test @jnke2016_

In `test_hypergraph.py`, this PR adds the `check_like=True` arg to the `assert_frame_equal` function because the ordering of the columns is different for the two DFs.

Authors:
- Ralph Liu (https://github.com/nv-rliu)

Approvers:
- Rick Ratzel (https://github.com/rlratzel)
- Chuck Hastings (https://github.com/ChuckHastings)
- Paul Taylor (https://github.com/trxcllnt)
- Joseph Nke (https://github.com/jnke2016)

URL: #4610
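The `list(set(fs))` deduplication does not preserve the original order of the labels, which is consistent with the test needing `check_like=True`. As a hedged alternative (not what the cugraph PR did), an order-preserving deduplication can be sketched with `dict.fromkeys`; the helper name `dedupe_keep_order` and the sample labels are hypothetical.

```python
def dedupe_keep_order(labels):
    """Hypothetical helper: drop repeated labels, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(labels))


fs = ["x", "y", "x", "z", "y"]  # sample labels, made up for illustration
unique_fs = dedupe_keep_order(fs)
```

With this approach the column order follows the first occurrence of each label, so a comparison against a frame built in that same order would not depend on ignoring column order.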

Disallow indexing by selecting duplicate labels

3e25394

mroeschke added the bug (Something isn't working), Python (Affects Python cuDF API), and non-breaking (Non-breaking change) labels Aug 8, 2024

mroeschke requested a review from a team as a code owner August 8, 2024 21:22

mroeschke requested review from bdice and brandon-b-miller August 8, 2024 21:22

Matt711 mentioned this pull request Aug 9, 2024

[BUG] cuDF and Pandas return different results for ... #16507

Open

brandon-b-miller approved these changes Aug 12, 2024

rapids-bot bot merged commit a3dc14f into rapidsai:branch-24.10 Aug 12, 2024
88 checks passed

mroeschke deleted the bug/indexing/duplicates branch August 12, 2024 17:57

This was referenced Aug 12, 2024

[BUG] DataFrame loc indexing is incorrect with repeated column labels. #13269

Closed

[FEA] Support duplicate column labels in cudf.DataFrame #16533

Open

nv-rliu mentioned this pull request Aug 14, 2024

Updates to cugraph.hypergraph (Duplicate Col Labels Bug) rapidsai/cugraph#4610

Merged
