[Data] Dataset.unique() raises error in case of any null values #42142
Comments
Hello burton, I'd like to work on this issue! TIA.
hi @Akshi22, don't let me get in your way! though it looks like @ujjawal-khare-27 has already submitted a pr to fix this issue. maybe you can help there?
For what it's worth, I just ran into this issue again, only this time in the context of
Hi, is this issue still open?
I believe the right way to fix this will require the underlying Merge operations to be PyArrow-based instead of Python-based (we currently use a heapq iterator, which doesn't compare NaNs well).
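To illustrate the comparison problem described in this comment, here is a minimal stdlib-only sketch. It is not Ray's actual merge code; the input lists and the `try_merge` helper are made up for illustration. It only shows that a heapq-based merge depends on `<` between elements, which fails for None and is not meaningful for NaN.

```python
import heapq


def try_merge(a, b):
    """Merge two sorted runs; return the result or the TypeError message."""
    try:
        return list(heapq.merge(a, b))
    except TypeError as exc:
        return f"TypeError: {exc}"


# Hypothetical sorted runs; the second one contains a null.
clean_result = try_merge([1, 3, 5], [2, 4, 6])
null_result = try_merge([1, 3, 5], [None, 2, 4])

# NaN does not raise, but it is neither less than, greater than,
# nor equal to anything (including itself), so ordering is undefined.
nan = float("nan")
nan_unordered = not (nan < 1.0) and not (nan > 1.0) and nan != nan

print(clean_result)
print(null_result)
print(nan_unordered)
```

The merge of the clean runs succeeds, while the run containing None fails as soon as heapq has to order None against an int. That matches the kind of TypeError reported in this issue, though the exact code path inside Ray may differ.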
…ay-project#48697) ## Why are these changes needed? This makes SortAggregate more consistent by unifying the API on the SortKey object, similar to how SortTaskSpec is implemented. ## Related issue number This is related to ray-project#42776 and ray-project#42142 Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
…ay-project#48697) ## Why are these changes needed? This makes SortAggregate more consistent by unifying the API on the SortKey object, similar to how SortTaskSpec is implemented. ## Related issue number This is related to ray-project#42776 and ray-project#42142 Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
What happened + What you expected to happen
I wanted to get the unique values in a given column of my dataset, but some of the values are null for unavoidable reasons. Calling Dataset.unique(colname) on such data raises a TypeError, with differing specifics depending on how the column dtype is specified. This behavior was surprising since the equivalent operation on a pandas.Series works just fine, as does getting unique values via Python built-ins.
Here are two versions of the type error I got, seemingly from the same line of code:
and
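As a point of comparison for the Python built-ins mentioned above, here is a minimal sketch of null-tolerant deduplication. The column values are made up for illustration and are not taken from the original reproduction script.

```python
# Hypothetical column values, including nulls.
values = ["a", None, "b", None, "a"]

# set() relies only on hashing and equality, never on ordering,
# so None can sit alongside strings without any comparison error.
unique_unordered = set(values)

# dict.fromkeys deduplicates while keeping first-seen order.
unique_ordered = list(dict.fromkeys(values))

print(unique_ordered)
```

Neither approach sorts the values, which is presumably why they avoid the comparison failure that a sort-based unique runs into.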
Versions / Dependencies
macOS 14.1
Python 3.9
ray == 2.9.0
pandas == 2.1.0
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.