Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Dataset.unique() raises error in case of any null values #42142

Closed
bdewilde opened this issue Jan 2, 2024 · 5 comments · Fixed by #48750
Closed

[Data] Dataset.unique() raises error in case of any null values #42142

bdewilde opened this issue Jan 2, 2024 · 5 comments · Fixed by #48750
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues good first issue Great starter issue for someone just starting to contribute to Ray P1 Issue that should be fixed within a few weeks

Comments

@bdewilde
Copy link

bdewilde commented Jan 2, 2024

What happened + What you expected to happen

I wanted to get the unique values in a given column of my dataset, but some of the values are null for unavoidable reasons. Calling Dataset.unique(colname) on such data raises a TypeError, with differing specifics depending on how the column dtype is specified. This behavior was surprising since the equivalent operation on a pandas.Series works just fine, as does getting unique values via Python built-ins.

Here are two versions of type error I got, seemingly from the same line of code:

File ~/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

TypeError: '<' not supported between instances of 'NoneType' and 'int'

and

File ~/.pyenv/versions/3.9.18/envs/test-env/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

File missing.pyx:419, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

Versions / Dependencies

macOS 14.1
PY 3.9
ray == 2.9.0
pandas == 2.1.0

Reproduction script

import pandas as pd
import ray.data

items = [1, 2, 3, 2, 3, None]
# set(items) works fine, as expected
ds1 = ray.data.from_items(items)
ds1.unique("item")
# raises TypeError: '<' not supported between instances of 'NoneType' and 'int'

df = pd.DataFrame({"col": [1, 2, 3, None]}, dtype="Int64")
# df["col"].unique() works fine, as expected
ds2 = ray.data.from_pandas(df)
ds2.unique("col")
# raises TypeError: boolean value of NA is ambiguous

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@bdewilde bdewilde added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 2, 2024
@anyscalesam anyscalesam added the data Ray Data-related issues label Jan 3, 2024
@scottjlee scottjlee added good first issue Great starter issue for someone just starting to contribute to Ray P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 4, 2024
@Akshi22
Copy link

Akshi22 commented Feb 20, 2024

Hello burton, I'd like to work on this issue! TIA.

@bdewilde
Copy link
Author

hi @Akshi22 , don't let me get in your way! though it looks like @ujjawal-khare-27 has already submitted a pr to fix this issue. maybe you can help there?

@bdewilde
Copy link
Author

bdewilde commented Mar 9, 2024

For what it's worth, I just ran into this issue again, only this time in the context of Dataset.groupby(col). It's the same error message, and presumably the same code under the hood. Just a bummer.

@csking101
Copy link

Hi, is this issue still open?
If so, I'd like to get started contributing to Ray.io!

@richardliaw
Copy link
Contributor

I believe the right way to fix this is going to require the underlying Merge operations to be Pyarrow based, instead of Python based (where we currently use a heapq iterator, which doesn't compare NaNs well)

richardliaw added a commit that referenced this issue Nov 14, 2024
…48697)

## Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the
SortKey object, similar to how SortTaskSpec is implemented.


## Related issue number

This is related to #42776 and
#42142


Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this issue Nov 14, 2024
…ay-project#48697)

## Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the
SortKey object, similar to how SortTaskSpec is implemented.


## Related issue number

This is related to ray-project#42776 and
ray-project#42142


Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this issue Nov 15, 2024
…ay-project#48697)

## Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the
SortKey object, similar to how SortTaskSpec is implemented.

## Related issue number

This is related to ray-project#42776 and
ray-project#42142

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues good first issue Great starter issue for someone just starting to contribute to Ray P1 Issue that should be fixed within a few weeks
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants