[FEA] Support drop_duplicates on Series containing list objects #6784

Closed
miguelusque opened this issue Nov 17, 2020 · 21 comments
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

@miguelusque
Member

Hi!

Is your feature request related to a problem? Please describe.
It would be useful to be able to drop duplicates over a series which contains list objects.

Describe the solution you'd like
To be able to drop duplicates on series containing list objects.

Describe alternatives you've considered
We are using the code below, which is very CPU intensive.

# Remove duplicate rows, ignoring the order of the values within each row.
# Build a temporary key by sorting each row's values and joining them with "_".
compliant_indices["tmp"] = ["_".join(str(z) for z in sorted(row)) for row in compliant_indices.values.tolist()]
compliant_indices.drop_duplicates(subset="tmp", inplace=True)
compliant_indices.drop(columns=["tmp"], inplace=True)

Additional context
Thanks!!!!

miguelusque added the Needs Triage and feature request labels Nov 17, 2020
harrism added the libcudf label Nov 17, 2020
kkraus14 added the Python label and removed the Needs Triage label Nov 17, 2020
@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@miguelusque
Member Author

I think this feature request is still relevant.

@ttnghia
Contributor

ttnghia commented Mar 12, 2021

Hey guys! Can you provide an example for your feature request, please? We currently have a drop_list_duplicates API on the C++ side and will have a Python binding soon. I'm not sure if this is what you want:

@randerzander
Contributor

I personally like @kkraus14 's suggestion

@miguelusque
Member Author

Hi,

In the code above, we have a series where the column datatype is a list.

Each row in that series may contain different values, i.e.:

[1, 3, 5, 7]
[2, 8, 4]
[1, 3, 5, 7]

After applying drop_duplicates on that series, the output should be:

[1, 3, 5, 7]
[2, 8, 4]

My only concern is the case where two lists contain the same elements but in a different order. For instance:

[1, 3, 5, 7]
[2, 8, 4]
[3, 1, 5, 7]

Should the last row be removed? My personal opinion is that, if a sorted() or sort() method that works on device memory is available, I would leave it to the user to invoke that method explicitly before invoking drop_duplicates(). Alternatively, I would add an ignore_sort or similar parameter, so that the operation can be performed without first copying the data to host memory.

Hope the above helps!

Regards,
Miguel
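
To make the two possible behaviors concrete, here is a minimal host-side sketch in plain Python (not cuDF, and not how cuDF implements anything). The function name and the ignore_order flag are hypothetical, modeled on the ignore_sort idea above.

```python
def drop_duplicate_lists(rows, ignore_order=False):
    """Keep the first occurrence of each list, preserving input order.

    ignore_order is a hypothetical flag for illustration only.
    """
    seen = set()
    kept = []
    for row in rows:
        # Tuples are hashable; sorting first makes the key order-insensitive.
        key = tuple(sorted(row)) if ignore_order else tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [[1, 3, 5, 7], [2, 8, 4], [3, 1, 5, 7]]
strict = drop_duplicate_lists(rows)                      # order matters
relaxed = drop_duplicate_lists(rows, ignore_order=True)  # order ignored
```

With order-sensitive comparison all three rows survive; with the flag set, the third row is treated as a duplicate of the first and dropped.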

@kkraus14
Collaborator

This is an entirely different request than #7414.

This issue asks for removing duplicate elements from a column of list dtype, where the elements are themselves lists, i.e.

input = [
    [1, 1, 2],
    [3, 4, 4],
    [1, 1, 2]
]

output = [
    [1, 1, 2],
    [3, 4, 4]
]

#7414 is about removing duplicate values from within list elements. i.e.

input = [
    [1, 1, 2],
    [3, 4, 4],
    [1, 1, 2]
]

output = [
    [1, 2],
    [3, 4],
    [1, 2]
]
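
The two requests can be contrasted with a small plain-Python sketch. Both function names are hypothetical and this runs on the host, so it only illustrates the semantics, not the libcudf or cuDF APIs.

```python
def drop_duplicate_rows(column):
    """This issue: drop rows whose whole list has already been seen."""
    seen = set()
    out = []
    for row in column:
        key = tuple(row)  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def drop_duplicates_within_rows(column):
    """#7414: drop repeated values inside each list, keeping every row."""
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return [list(dict.fromkeys(row)) for row in column]


column = [[1, 1, 2], [3, 4, 4], [1, 1, 2]]
by_row = drop_duplicate_rows(column)
within_row = drop_duplicates_within_rows(column)
```

The first call returns two rows, the second returns three shorter rows, matching the two outputs shown above.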

@ttnghia
Contributor

ttnghia commented Mar 14, 2021

Okay, I see. Removing duplicate lists requires a list comparison operator. That operator is also useful for list sorting, as requested here: #5890. So, once list comparison is supported, we can address both issues at the same time.
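
As a rough illustration of why one comparison operator serves both sorting and duplicate detection, here is a plain-Python sketch of a lexicographic list comparator. The name compare_lists is hypothetical; this is not libcudf's row comparator, and nulls and nested lists are left out.

```python
from functools import cmp_to_key


def compare_lists(a, b):
    """Return -1, 0, or 1: compare element by element, then by length."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared positions are equal, so the shorter list sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


rows = [[2, 8, 4], [1, 3, 5, 7], [1, 3], [1, 3, 5, 7]]

# The same comparator drives sorting ...
sorted_rows = sorted(rows, key=cmp_to_key(compare_lists))

# ... and equality checks, which is all duplicate removal needs.
is_duplicate = compare_lists(rows[1], rows[3]) == 0
```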

@miguelusque
Member Author

Hi,

Congratulations on the recent announcement of lists support on cuDF.

I hope this feature request can be considered for the next release (0.20).

Thanks!

@kkraus14
Collaborator

@miguelusque it likely will not be. Operations that compare lists require changes to our row comparator and have the potential to slow down all of libcudf if not done carefully. We're trying to figure out a path forward here, but it's likely going to take some time.

@randerzander
Contributor

@miguelusque a potential workaround (once #7929 lands) would be concatting list elements w/ a token into a string and doing drop duplicates against the string column.

@miguelusque
Member Author

miguelusque commented Apr 21, 2021

Hi @randerzander, I think that your proposal is similar to the workaround we are currently using, detailed at the top of this issue.

That would imply first sorting the elements in each list, then concatenating them, and then dropping the duplicates, right?

We wanted to avoid having to move the data to host memory, which is the reason for this feature request (we need to sort the data in host memory, if I am not wrong).

Please, let me know if I have misunderstood your workaround.

Thanks!

@randerzander
Contributor

Sorting within lists is supported on GPU as well, so I think you'll be good once list->string concat is merged.

@miguelusque
Member Author

miguelusque commented Apr 21, 2021

Thank you! I was not aware of it! :-)

@randerzander
Contributor

@miguelusque - a related issue to watch for the workaround

@miguelusque
Member Author

miguelusque commented Jun 25, 2022

Hi, it looks like the following workaround works 100% in GPU memory. Could someone please confirm it? Thanks!

import cudf

df = cudf.DataFrame({"a": [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
b = df["a"].list.sort_values().list.astype(str).str.join("-").drop_duplicates()
b
0    1-3-5-7
1      2-4-8
Name: a, dtype: object

If the lists within the series already contain string elements, it might be more straightforward:

import cudf
df = cudf.DataFrame({"a": [["1", "3", "5", "7"], ["2", "8", "4"], ["1", "3", "5", "7"]]})

b = df["a"].list.sort_values().str.join("-").drop_duplicates()
b
0    1-3-5-7
1      2-4-8
Name: a, dtype: object

Considering that we already have all the pieces available in cuDF, it might be well worth adding list support to the drop_duplicates method.

Thanks!
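
One caveat with any join-based key, offered as an observation rather than something raised in the thread: if the list elements are strings that can contain the separator, two different lists may map to the same key. The plain-Python sketch below shows such a collision; join_key is a hypothetical helper that mirrors the sort-then-join steps on the host.

```python
def join_key(row, sep="-"):
    """Sort the elements, then join them into a single string key."""
    return sep.join(str(x) for x in sorted(row))


a = ["1-3", "5"]
b = ["1", "3-5"]

# Different lists, same joined key, so one would be dropped as a "duplicate".
keys_collide = join_key(a) == join_key(b)

# Comparing the sorted lists directly keeps them distinct.
lists_equal = sorted(a) == sorted(b)
```

For integer lists like the ones in this thread the collision above does not arise, but it is one reason a native list comparison is preferable to a string key.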

GregoryKimball removed the libcudf label Jun 30, 2022
@GregoryKimball
Contributor

With the addition of list column support for distinct in libcudf (#10641), this issue just needs Python bindings.

@GregoryKimball
Contributor

GregoryKimball commented Aug 1, 2022

In 22.06, drop_duplicates uses a sort-based algorithm and relies on the lexicographic comparator, which does not yet support list columns. We expect this will be closed by #11129.

import cudf
df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
df['a'].drop_duplicates()
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/table/row_operators.cu:267: Cannot lexicographic compare a table with a LIST column
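
For readers unfamiliar with the approach, a sort-based drop_duplicates can be sketched in plain Python as below. This is only an illustration of the general technique under simple assumptions (no nulls, Python's built-in list ordering standing in for the lexicographic comparator); it is not the libcudf implementation, and the function name is hypothetical.

```python
def sort_based_drop_duplicates(rows):
    """Return (original_index, row) pairs for the first copy of each row."""
    # Stable argsort: equal rows end up adjacent, earliest index first.
    order = sorted(range(len(rows)), key=lambda i: rows[i])

    keep = []
    for pos, i in enumerate(order):
        # A row starts a new run if it differs from its sorted predecessor.
        if pos == 0 or rows[i] != rows[order[pos - 1]]:
            keep.append(i)

    keep.sort()  # restore the original row order
    return [(i, rows[i]) for i in keep]


rows = [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]
result = sort_based_drop_duplicates(rows)
```

Both the sort and the adjacent-row check depend on being able to compare two lists, which is the step that raises the error shown above.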

@GregoryKimball
Contributor

GregoryKimball commented Oct 5, 2022

In 22.10 we are closer, but there is a lingering issue. The list column is returned sorted, but the duplicates remain.

>>> df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
>>> df['a'].drop_duplicates()
0    [1, 3, 5, 7]
2    [1, 3, 5, 7]
1       [2, 8, 4]
Name: a, dtype: list

@ttnghia
Contributor

ttnghia commented Apr 6, 2023

@GregoryKimball
Contributor

This feature is now available in 23.06.

>>> df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
>>> df['a'].drop_duplicates()
0    [1, 3, 5, 7]
1       [2, 8, 4]
Name: a, dtype: list

@bdice
Contributor

bdice commented May 31, 2023

@GregoryKimball Can we open a follow-up issue to add explicit Python tests for this? I would have done so in #11656 if I’d realized it could close this issue. Thanks!
