[REVIEW] Optimize has_duplicate_edges #2409

Merged
merged 10 commits into from
Jul 15, 2022


VibhuJawa
Member

@VibhuJawa VibhuJawa commented Jul 14, 2022

This PR fixes the scalability of the duplicate check by removing apply, which does its processing serially.

Benchmark Data

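import cudf
import cupy as cp

n_nodes = 100_000
n_rows = 1_500_000
# Random edge list with src/dst drawn uniformly from [0, n_nodes)
df = cudf.DataFrame({'src': cp.random.randint(0, n_nodes, n_rows),
                     'dst': cp.random.randint(0, n_nodes, n_rows)})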

### After PR:

17.8 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

### Before PR:

26.3 s ± 78.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
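
For readers following along, here is a minimal sketch of the kind of change described above. It is not the PR's actual diff: pandas is used as a CPU stand-in for cudf (assuming the same `groupby`/`drop_duplicates` calls apply), the column names `SRC`/`DST` and both function names are hypothetical, and the groupby-apply baseline is only a guess at what a serial version could look like.

```python
import pandas as pd  # CPU stand-in; the PR itself operates on cudf DataFrames

SRC, DST = "src", "dst"  # hypothetical column names for this sketch


def has_duplicate_edges_apply(df: pd.DataFrame) -> bool:
    """Hypothetical baseline: one Python-level call per source vertex."""
    def has_duplicate_dst(dst: pd.Series) -> bool:
        return dst.nunique() != dst.size

    return bool(df.groupby(SRC)[DST].apply(has_duplicate_dst).any())


def has_duplicate_edges_vectorized(df: pd.DataFrame) -> bool:
    """Compare the number of unique (src, dst) pairs with the row count."""
    unique_pair_len = len(df[[SRC, DST]].drop_duplicates(ignore_index=True))
    return unique_pair_len != len(df)


edges_with_dup = pd.DataFrame({SRC: [0, 0, 1, 2], DST: [1, 1, 2, 0]})
edges_unique = pd.DataFrame({SRC: [0, 0, 1, 2], DST: [1, 2, 2, 0]})
```

Both functions should agree on any input; the vectorized one avoids calling a Python function once per group, which is where a serial cost like the one in the "Before PR" timing would plausibly come from.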

@VibhuJawa VibhuJawa requested a review from a team as a code owner July 14, 2022 04:22
@VibhuJawa VibhuJawa added python improvement Improvement / enhancement to an existing function labels Jul 14, 2022
@codecov-commenter

codecov-commenter commented Jul 14, 2022

Codecov Report

Merging #2409 (39d56ae) into branch-22.08 (2aad5f2) will increase coverage by 0.02%.
The diff coverage is 50.00%.

@@               Coverage Diff                @@
##           branch-22.08    #2409      +/-   ##
================================================
+ Coverage         60.11%   60.13%   +0.02%     
================================================
  Files               102      102              
  Lines              5155     5153       -2     
================================================
  Hits               3099     3099              
+ Misses             2056     2054       -2     
| Impacted Files | Coverage Δ |
|---|---|
| ...ugraph/cugraph/dask/structure/mg_property_graph.py | 18.56% <0.00%> (+0.07%) ⬆️ |
| python/cugraph/cugraph/structure/property_graph.py | 96.41% <100.00%> (-0.02%) ⬇️ |
| ...ython/cugraph/cugraph/community/ktruss_subgraph.py | 88.23% <0.00%> (+2.94%) ⬆️ |


@VibhuJawa VibhuJawa added the non-breaking Non-breaking change label Jul 14, 2022
Comment on lines 659 to 661
unique_pair_len = len(
df[[cls.src_col_name, cls.dst_col_name]].drop_duplicates()
)
Contributor

Just curious, since I think the runtime performance seems worth it, but is there a tradeoff here in the temporary DataFrame using potentially more memory than the groupby object returned from groupby()?

Member Author

Thanks for pointing out the memory usage concern. We do seem to use more memory here, since we now store the unique (src, dst) combinations, whereas previously we only stored the unique src values.

That said, with Dask I am not too worried: we release the temporary for each partition as we create it, so the overhead should not be large.

Also, before this review we were keeping the index as well. With ignore_index=True we no longer keep it, which should save some memory. Thanks for that.
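
As a small illustration of the ignore_index point, the sketch below shows what the flag changes about the result. pandas stands in for cudf here, on the assumption that cudf's `drop_duplicates` accepts the same `ignore_index` parameter; the data and variable names are made up for the example.

```python
import pandas as pd  # CPU stand-in for cudf in this sketch

edges = pd.DataFrame({"src": [0, 0, 1, 1, 0], "dst": [1, 1, 2, 3, 1]})
pairs = edges[["src", "dst"]]

# Default: surviving rows keep their original row labels.
kept_index = pairs.drop_duplicates()
# ignore_index=True: the result is relabeled 0..n-1 instead.
fresh_index = pairs.drop_duplicates(ignore_index=True)

print(kept_index.index.tolist())   # [0, 2, 3]
print(fresh_index.index.tolist())  # [0, 1, 2]
```

The row count, which is all the duplicate check needs, is the same either way; only the index carried along with the temporary differs.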

@seunghwak
Contributor

Just FYI: I am not sure this is relevant here, but cuGraph C++ graph_view_t has count_multi_edges() (https://github.com/rapidsai/cugraph/blob/branch-22.08/cpp/include/cugraph/graph_view.hpp#L680).
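
To make the idea of counting multi-edges concrete, here is a plain-Python sketch. The thread does not describe the exact semantics of the C++ `count_multi_edges()`, so this assumes one plausible definition (every repeat of an already-seen (src, dst) pair counts once); the function names are hypothetical and this is not the cuGraph API.

```python
from collections import Counter


def count_repeated_pairs(edges):
    """Count edges that repeat an already-seen (src, dst) pair."""
    pair_counts = Counter(edges)
    return sum(count - 1 for count in pair_counts.values())


def has_duplicate_edges(edges):
    """A graph has duplicate edges if at least one pair repeats."""
    return count_repeated_pairs(edges) > 0


example_edges = [(0, 1), (0, 1), (1, 2), (0, 1)]
print(count_repeated_pairs(example_edges))  # 2
```

Under this definition, a duplicate check is just a count compared against zero, which is why a count-based primitive could serve the same purpose.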

@rlratzel
Contributor

@gpucibot merge

@VibhuJawa
Member Author

> Just FYI: I am not sure this is relevant here, but cuGraph C++ graph_view_t has count_multi_edges() (https://github.com/rapidsai/cugraph/blob/branch-22.08/cpp/include/cugraph/graph_view.hpp#L680).

@seunghwak, thanks for the input. I raised an issue to track it: #2417

Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

5 participants