[REVIEW] Optimize has_duplicate_edges #2409
Codecov Report
@@              Coverage Diff               @@
##           branch-22.08   #2409    +/-    ##
=============================================
+ Coverage       60.11%   60.13%   +0.02%
=============================================
  Files             102      102
  Lines            5155     5153      -2
=============================================
  Hits             3099     3099
+ Misses           2056     2054      -2
unique_pair_len = len(
    df[[cls.src_col_name, cls.dst_col_name]].drop_duplicates()
)
Just curious, since the runtime improvement seems worth it: is there a tradeoff here, in that the temporary DataFrame could use more memory than the groupby object returned from groupby()?
Thanks for pointing out the memory usage concern. We do seem to use more memory here, since we now store the unique (src, dst) combinations, while previously we were only saving the unique src column.
That said, with Dask I am not too worried: the temporary result is released for each partition as it is created, so the overhead should not be too large.
Also, before this review we were saving the index too. With ignore_index=True we no longer keep it, which should save some memory. Thanks for that.
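To make the index point concrete, here is a minimal sketch using pandas as a stand-in for the cuDF / Dask DataFrames in the PR (that substitution, the toy data, and the column names are assumptions for illustration, not the PR's code). It shows that a plain drop_duplicates() carries the surviving rows' original index labels along, while ignore_index=True replaces them with a fresh range index. Actual memory savings depend on the library and the index type, so nothing is measured here.

```python
import pandas as pd

# Hypothetical edge list with a non-default index, for illustration only.
edges = pd.DataFrame(
    {"src": [0, 0, 1, 2, 2], "dst": [1, 1, 2, 0, 0]},
    index=[10, 11, 12, 13, 14],
)

# Default behaviour: the first occurrence of each (src, dst) pair is kept,
# together with its original index label.
kept = edges[["src", "dst"]].drop_duplicates()

# With ignore_index=True the original labels are discarded and the result
# is relabelled 0..n-1.
reset = edges[["src", "dst"]].drop_duplicates(ignore_index=True)

print(list(kept.index))
print(reset.index)
```

Both results contain the same unique pairs; only the index handling differs.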
Just FYI: I am not sure this is relevant here, but cuGraph C++ graph_view_t has count_multi_edges() (https://github.com/rapidsai/cugraph/blob/branch-22.08/cpp/include/cugraph/graph_view.hpp#L680).
@gpucibot merge
@seunghwak, thanks for the input. I raised an issue to track it: #2417
This PR fixes the scalability of the drop-duplicates check in has_duplicate_edges by removing apply, which processes the data serially.
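The idea can be sketched as follows, again using pandas as a stand-in for the cuDF / Dask DataFrames (an assumption). The function names, column names, and the apply-based variant are hypothetical and only approximate the kind of per-group code the PR replaces. The drop_duplicates variant mirrors the snippet under review: the edge list has duplicates exactly when the number of unique (src, dst) pairs is smaller than the number of rows.

```python
import pandas as pd


def has_duplicate_edges_apply(df, src="src", dst="dst"):
    # Hypothetical "before" style: a Python callable runs once per src group,
    # so the groups are processed one after another.
    per_group = df.groupby(src)[dst].apply(lambda s: s.duplicated().any())
    return bool(per_group.any())


def has_duplicate_edges(df, src="src", dst="dst"):
    # "After" style, following the reviewed snippet: one vectorized
    # drop_duplicates over both columns, then a length comparison.
    unique_pair_len = len(df[[src, dst]].drop_duplicates())
    return unique_pair_len != len(df)


# Toy data for illustration only.
with_dups = pd.DataFrame({"src": [0, 0, 1, 2], "dst": [1, 1, 2, 0]})
without_dups = with_dups.drop_duplicates()

print(has_duplicate_edges(with_dups), has_duplicate_edges(without_dups))
```

Both variants give the same answer on this data. The difference is that the second one avoids calling a Python function per group.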
Benchmark Data
### After PR:
### Before PR: