[FEA] Support drop_duplicates on Series containing list objects #6784
Comments
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. |
I think this feature request is still relevant. |
Hey guys! Can you provide an example for your feature request, please? We just have |
I personally like @kkraus14 's suggestion |
Hi, in the code above, we have a series where the column datatype is a list. Each row in that series may contain different values, i.e.:
After applying
My only concern is when we have a list that contains the same number of elements but in a different order. For instance:
Should the last row be removed? My personal opinion is that, if we may have available Hope the above helps! Regards, |
This is an entirely different request than #7414. This issue is asking for removing duplicate elements in a column where the column is list dtype so the elements are lists. i.e.
#7414 is about removing duplicate values from within list elements. i.e.
|
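To make the distinction between the two requests concrete, here is a minimal sketch in plain Python (not cuDF, and not code from this thread). The function names `drop_duplicate_rows` and `drop_duplicates_within_rows` are hypothetical and only illustrate the intended semantics on host-side lists.

```python
def drop_duplicate_rows(rows):
    """Row-level dedup (this issue): keep the first occurrence of each list."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def drop_duplicates_within_rows(rows):
    """Element-level dedup (#7414): remove repeated values inside each list."""
    # dict.fromkeys keeps the first occurrence of each value, in order
    return [list(dict.fromkeys(row)) for row in rows]


rows = [[1, 2], [1, 2], [3, 3, 4]]
print(drop_duplicate_rows(rows))          # [[1, 2], [3, 3, 4]]
print(drop_duplicates_within_rows(rows))  # [[1, 2], [1, 2], [3, 4]]
```

The first function changes the number of rows; the second keeps every row and only shortens the lists.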
Okay, I see. Removing duplicate lists requires list comparison operator. That operator is also useful for list sorting, as requested here: #5890. So, once the list comparison is supported, we can address both issues at the same time. |
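As a rough illustration of why one comparison operator would serve both requests, the sketch below defines a lexicographic list comparison in plain Python and reuses it for sorting and for dropping duplicates. This is an assumption about what such an operator could look like, not libcudf's row comparator; `compare_lists`, `sort_rows`, and `drop_duplicate_rows_sorted` are hypothetical names.

```python
from functools import cmp_to_key


def compare_lists(a, b):
    """Lexicographic comparison: returns -1, 0, or 1. A strict prefix sorts first."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    return (len(a) > len(b)) - (len(a) < len(b))


def sort_rows(rows):
    """Sort list rows using the comparison above (the #5890 use case)."""
    return sorted(rows, key=cmp_to_key(compare_lists))


def drop_duplicate_rows_sorted(rows):
    """Sort, then keep a row only if it differs from the previously kept row."""
    kept = []
    for row in sort_rows(rows):
        if not kept or compare_lists(kept[-1], row) != 0:
            kept.append(row)
    return kept


rows = [[2, 1], [1, 2], [1], [1, 2]]
print(sort_rows(rows))                   # [[1], [1, 2], [1, 2], [2, 1]]
print(drop_duplicate_rows_sorted(rows))  # [[1], [1, 2], [2, 1]]
```

Note that this sort-based variant returns rows in sorted order rather than in order of first appearance.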
Hi, congratulations on the recent announcement of list support in cuDF. I hope this feature request can be considered for the next release (0.20). Thanks! |
@miguelusque it likely will not be. Operations that require comparing lists require changes to our row comparator and have the potential to slow down all of libcudf if not done carefully. We're trying to figure out a path forward here but it's likely going to take some time. |
@miguelusque a potential workaround (once #7929 lands) would be concatting list elements w/ a token into a string and doing drop duplicates against the string column. |
Hi @randerzander, I think that your proposal is similar to the workaround we are currently using, detailed at the top of this issue. That would imply first sorting the elements in each list, then concatenating them, and then dropping the duplicates, right? We wanted to avoid having to move the data to host memory, which is why we opened this feature request. (We need to sort the data in host memory, if I am not wrong.) Please let me know if I have misunderstood your workaround. Thanks! |
Sorting within lists is supported on GPU as well, so I think you'll be good once list->string concat is merged. |
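The logic of the workaround discussed above (sort within each list, join the elements with a token, drop duplicates on the resulting strings) can be sketched in plain Python. This is a host-side illustration of the idea only, not the cuDF calls and not the code posted in this thread; the function name and the `token` and `sort_elements` parameters are hypothetical.

```python
def drop_duplicate_lists_via_strings(rows, token="|", sort_elements=True):
    """Dedup list rows by comparing a joined-string key for each row.

    Assumes the token never appears inside an element's string form.
    With sort_elements=True, [1, 2, 3] and [3, 2, 1] count as duplicates.
    """
    seen = set()
    kept = []
    for row in rows:
        elements = sorted(row) if sort_elements else row
        key = token.join(str(e) for e in elements)
        if key not in seen:
            seen.add(key)
            kept.append(row)  # keep the original, unsorted row
    return kept


rows = [[1, 2, 3], [3, 2, 1], [1, 2, 3], [12, 3]]
print(drop_duplicate_lists_via_strings(rows))
# [[1, 2, 3], [12, 3]]
print(drop_duplicate_lists_via_strings(rows, sort_elements=False))
# [[1, 2, 3], [3, 2, 1], [12, 3]]
```

The token matters: without a separator, `[1, 23]` and `[12, 3]` would both produce the key `"123"` and be treated as duplicates.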
Thank you! I was not aware of it! :-) |
@miguelusque - a related issue to watch for the workaround |
Hi, it looks like the following workaround works 100% in GPU memory. Could someone please confirm it? Thanks!
If the lists within the series already contain string elements, it might be more straightforward:
Considering that we already have all the pieces available in cuDF, it might be well worth adding support for lists to Thanks! |
With the addition of list column support for |
In 22.06
|
In 22.10 we are closer, but there is a lingering issue. The list column is returned sorted, but the duplicates remain.
|
May be fixed by: which is implemented in either: |
This feature is now available in 23.06.
|
@GregoryKimball Can we open a follow-up issue to add explicit Python tests for this? I would have done so in #11656 if I’d realized it could close this issue. Thanks! |
Hi!
Is your feature request related to a problem? Please describe.
It would be useful to be able to drop duplicates over a series which contains list objects.
Describe the solution you'd like
To be able to drop duplicates on series containing list objects.
Describe alternatives you've considered
We are using the code below, which is very CPU intensive.
Additional context
Thanks!!!!