[Data] Dataset.unique() raises error in case of any null values #42142

bdewilde · 2024-01-02T17:48:33Z

What happened + What you expected to happen

I wanted to get the unique values in a given column of my dataset, but some of the values are null for unavoidable reasons. Calling Dataset.unique(colname) on such data raises a TypeError, with differing specifics depending on how the column dtype is specified. This behavior was surprising since the equivalent operation on a pandas.Series works just fine, as does getting unique values via Python built-ins.

Here are the two versions of the TypeError I got, seemingly raised from the same line of code:

File ~/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

TypeError: '<' not supported between instances of 'NoneType' and 'int'

and

File ~/.pyenv/versions/3.9.18/envs/test-env/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

File missing.pyx:419, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous
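Both tracebacks end inside a sort step (`SortTaskSpec.sample_boundaries`), which suggests the failure comes from ordering the values rather than from deduplicating them. The following is a minimal standard-library sketch of that distinction, not Ray's actual code path: hash-based deduplication never needs `<`, while anything that sorts does.

```python
items = [1, 2, 3, 2, 3, None]

# Hash-based deduplication never compares elements with "<",
# so None is handled like any other hashable value.
unique_by_hash = set(items)
print(unique_by_hash)

# Sort-based processing has to order the elements, and Python 3
# defines no ordering between None and int.
try:
    sorted(items)
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)
print(sort_error)
```

This is only an illustration of why `set(items)` succeeds where a sort-backed implementation raises; the exact comparison performed inside `np.lexsort` may differ.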

Versions / Dependencies

macOS 14.1
Python 3.9
ray == 2.9.0
pandas == 2.1.0

Reproduction script

import pandas as pd
import ray.data

items = [1, 2, 3, 2, 3, None]
# set(items) works fine, as expected
ds1 = ray.data.from_items(items)
ds1.unique("item")
# raises TypeError: '<' not supported between instances of 'NoneType' and 'int'

df = pd.DataFrame({"col": [1, 2, 3, None]}, dtype="Int64")
# df["col"].unique() works fine, as expected
ds2 = ray.data.from_pandas(df)
ds2.unique("col")
# raises TypeError: boolean value of NA is ambiguous

Issue Severity

Medium: It is a significant difficulty but I can work around it.


Akshi22 · 2024-02-20T00:13:59Z

Hello burton, I'd like to work on this issue! TIA.

bdewilde · 2024-02-20T15:15:27Z

hi @Akshi22, don't let me get in your way! though it looks like @ujjawal-khare-27 has already submitted a PR to fix this issue. maybe you can help there?

bdewilde · 2024-03-09T21:57:10Z

For what it's worth, I just ran into this issue again, only this time in the context of Dataset.groupby(col). It's the same error message, and presumably the same code under the hood. Just a bummer.
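If grouping goes through the same sort step, a null-aware sort key would sidestep the comparison. Here is a rough sketch in plain Python, assuming rows are dictionaries with a nullable `col` field; `null_last_key` and the sample rows are hypothetical and are not part of Ray's API.

```python
from itertools import groupby

rows = [
    {"col": 2, "value": "a"},
    {"col": None, "value": "b"},
    {"col": 1, "value": "c"},
    {"col": 2, "value": "d"},
    {"col": None, "value": "e"},
]


def null_last_key(row):
    """Hypothetical helper: order rows by "col", placing None after everything else."""
    col = row["col"]
    # The boolean decides the order whenever exactly one side is None,
    # so None is never compared with an int using "<".
    return (col is None, col)


# Sort first so that equal keys are adjacent, then group consecutive rows.
sorted_rows = sorted(rows, key=null_last_key)
groups = {
    key: [row["value"] for row in group]
    for key, group in groupby(sorted_rows, key=lambda row: row["col"])
}
print(groups)
```

Placing nulls last is an arbitrary choice for this sketch; placing them first would work just as well as long as it is applied consistently.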

csking101 · 2024-08-13T09:44:43Z

Hi, is this issue still open?
If so, I'd like to get started contributing to Ray.io!

richardliaw · 2024-11-12T04:11:27Z

I believe the right fix will require the underlying merge operations to be PyArrow-based instead of Python-based (we currently use a heapq iterator, which doesn't compare NaNs well).
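To illustrate the kind of failure a heap-based merge runs into, here is a small standard-library sketch. It assumes the problematic values are `None`, as in the original report, and it is not Ray's merge code; `null_last` is a hypothetical key function.

```python
import heapq

# Two runs that are each already ordered, with nulls placed last.
run_a = [1, 3, None]
run_b = [2, None]

# A plain merge compares raw values, so it fails once None meets an int.
try:
    list(heapq.merge(run_a, run_b))
    plain_merge_error = None
except TypeError as exc:
    plain_merge_error = str(exc)
print(plain_merge_error)


def null_last(value):
    """Hypothetical key: sort None after every non-null value."""
    return (value is None, value)


# With a null-aware key, the heap only ever compares the booleans
# when exactly one of the two values is None.
merged = list(heapq.merge(run_a, run_b, key=null_last))
print(merged)
```

A key function like this is one way to keep a Python-level merge working; moving the merge to PyArrow, as suggested above, would instead rely on the library's own null handling.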

…48697) ## Why are these changes needed? This makes SortAggregate more consistent by unifying the API on the SortKey object, similar to how SortTaskSpec is implemented. ## Related issue number This is related to #42776 and #42142 Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

bdewilde added the bug (something that is supposed to be working, but isn't) and triage (needs triage, e.g. priority, bug/not-bug, and owning component) labels Jan 2, 2024

anyscalesam added the data (Ray Data-related issues) label Jan 3, 2024

scottjlee added the good first issue (great starter issue for someone just starting to contribute to Ray) and P1 (issue that should be fixed within a few weeks) labels and removed the triage label Jan 4, 2024

ujjawal-khare-27 mentioned this issue Jan 28, 2024

Ujjawal/fix/ds unique #42776

Closed

8 tasks

richardliaw mentioned this issue Nov 12, 2024

[data] cleanup: use SortKey instead of mixed typing in aggregation #48697

Merged

8 tasks

richardliaw mentioned this issue Nov 15, 2024

[data] Sort with None #48750

Merged

8 tasks

richardliaw closed this as completed in 134e5ec Nov 15, 2024

richardliaw closed this as completed in #48750 Nov 15, 2024
