[REVIEW] Optimize has_duplicate_edges #2409

VibhuJawa · 2022-07-14T04:21:59Z

This PR fixes the scalability of the duplicate-edge check in has_duplicate_edges by removing the apply call, which processes the data serially.

Benchmark Data

import cudf
import cupy as cp

n_nodes = 100_000
n_rows = 1_500_000
df = cudf.DataFrame({'src': cp.random.randint(0, n_nodes, n_rows),
                     'dst': cp.random.randint(0, n_nodes, n_rows)})

### After PR:

17.8 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

### Before PR:

26.3 s ± 78.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
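
As a rough illustration of the idea behind this change, the sketch below uses plain Python rather than cuDF so that it runs without a GPU. It contrasts a per-group check, where each group is inspected in turn, with a single comparison of the number of unique (src, dst) pairs against the number of rows. The function names and the sample edge lists are hypothetical and are not taken from the PR's code; the per-group version is only an assumption about the general shape of a groupby-plus-apply approach.

```python
from collections import defaultdict


def has_duplicate_edges_per_group(edges):
    """Serial style: group by src, then inspect each group one at a time."""
    groups = defaultdict(list)
    for src, dst in edges:
        groups[src].append(dst)
    # Each group is checked in turn, loosely analogous to a serial apply.
    return any(len(dsts) != len(set(dsts)) for dsts in groups.values())


def has_duplicate_edges_unique_pairs(edges):
    """Single comparison: duplicates exist if unique pairs < total rows."""
    return len(set(edges)) < len(edges)


# Hypothetical sample data: (0, 1) appears twice in the first list.
edges_with_dupes = [(0, 1), (0, 2), (1, 2), (0, 1)]
edges_without_dupes = [(0, 1), (0, 2), (1, 2)]
```

Both functions agree on the answer; the second mirrors the drop_duplicates-and-compare-lengths pattern, which a GPU dataframe library can execute as one bulk operation instead of one call per group.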

codecov-commenter · 2022-07-14T06:23:44Z

Codecov Report

Merging #2409 (39d56ae) into branch-22.08 (2aad5f2) will increase coverage by 0.02%.
The diff coverage is 50.00%.

@@               Coverage Diff                @@
##           branch-22.08    #2409      +/-   ##
================================================
+ Coverage         60.11%   60.13%   +0.02%     
================================================
  Files               102      102              
  Lines              5155     5153       -2     
================================================
  Hits               3099     3099              
+ Misses             2056     2054       -2

Impacted Files	Coverage Δ
...ugraph/cugraph/dask/structure/mg_property_graph.py	`18.56% <0.00%> (+0.07%)`	⬆️
python/cugraph/cugraph/structure/property_graph.py	`96.41% <100.00%> (-0.02%)`	⬇️
...ython/cugraph/cugraph/community/ktruss_subgraph.py	`88.23% <0.00%> (+2.94%)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2aad5f2...39d56ae.

rlratzel · 2022-07-14T20:57:15Z

python/cugraph/cugraph/dask/structure/mg_property_graph.py

+        unique_pair_len = len(
+            df[[cls.src_col_name, cls.dst_col_name]].drop_duplicates()
+        )


Just curious, since I think the runtime performance seems worth it, but is there a tradeoff here in the temporary DataFrame using potentially more memory than the groupby object returned from groupby()?

VibhuJawa · Jul 14, 2022

Thanks for pointing out the memory-usage concern. We do seem to use more memory here, since we now store the unique (src, dst) combinations, whereas previously we only saved the unique src column.

That said, with Dask I am not too worried: the temporary result for each partition is released as we go, so the overhead should not be too large.

Also, prior to this review we were saving the index too. Now, with ignore_index=True, we no longer keep it, which should save some memory. Thanks for that.
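
To make the tradeoff discussed above concrete, here is a minimal sketch in plain Python (not cuDF or Dask) that compares how many entries each kind of temporary holds: the unique src values versus the unique (src, dst) pairs. The helper name and the sample edge list are hypothetical, and the sketch counts entries rather than bytes; actual memory use depends on dtypes, partitioning, and the library internals.

```python
def temporary_sizes(edges):
    """Return (unique src count, unique (src, dst) pair count)."""
    unique_src = {src for src, _ in edges}
    unique_pairs = set(edges)
    return len(unique_src), len(unique_pairs)


# Hypothetical sample data: two distinct sources, four distinct pairs.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 2)]
n_src, n_pairs = temporary_sizes(edges)

# Every unique source contributes at least one unique pair,
# so the pair count can never be smaller than the source count.
print(n_src, n_pairs)
```

The pair-based temporary is always at least as large as the src-only one, which is the cost being traded for the faster runtime.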

seunghwak · 2022-07-14T21:55:59Z

Just FYI: I am not sure this is relevant here, but cuGraph C++ graph_view_t has count_multi_edges() (https://github.com/rapidsai/cugraph/blob/branch-22.08/cpp/include/cugraph/graph_view.hpp#L680).

rlratzel · 2022-07-15T17:53:11Z

@gpucibot merge

VibhuJawa · 2022-07-15T18:11:29Z

> Just FYI: I am not sure this is relevant here, but cuGraph C++ graph_view_t has count_multi_edges() (https://github.com/rapidsai/cugraph/blob/branch-22.08/cpp/include/cugraph/graph_view.hpp#L680).

@seunghwak, thanks for the input. I raised an issue to track it: #2417

VibhuJawa added 6 commits July 13, 2022 20:49

fix drop duplicates scalabilty

dde169b

revert empty change

6092050

remove not correct dask comment

b4fd039

remove fixme comment

fa3560d

remove typo

26e624a

Fixed Style

99df26d

VibhuJawa requested a review from a team as a code owner July 14, 2022 04:22

VibhuJawa added python improvement Improvement / enhancement to an existing function labels Jul 14, 2022

VibhuJawa added the non-breaking Non-breaking change label Jul 14, 2022

BradReesWork approved these changes Jul 14, 2022

rlratzel reviewed Jul 14, 2022

rlratzel approved these changes Jul 14, 2022

VibhuJawa added 4 commits July 14, 2022 14:19

made the output a dask scalar

94ff85a

changed from size to shape[0]

c29bcb0

save memory by removing index\

9c0c14b

Fix style changes

39d56ae

rapids-bot bot merged commit 049d441 into rapidsai:branch-22.08 Jul 15, 2022

VibhuJawa mentioned this pull request Jul 15, 2022

[FEA] Expose count_multi_edges for cuGraph C++ graph_view_t to python #2417

Closed
