Ensure MG to have the same number of allreduce calls in mean_stddev for sparse matrix to avoid hanging #6141

Merged 4 commits into rapidsai:branch-24.12 from fix_hanging_mean_stddev on Dec 6, 2024

Conversation

lijinf2
Contributor

@lijinf2 lijinf2 commented Nov 22, 2024

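The hang occurs when one GPU receives a sparse matrix whose values are all zero while the other GPUs receive non-zero values. This PR makes every GPU issue the same number of allreduce calls in mean_stddev.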
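To illustrate why a mismatched number of collective calls hangs, here is a minimal single-process sketch. It is not the code in standardization.cuh and does not use the real communicator API: `ToyComm`, `column_sum_buggy`, `column_sum_fixed`, and `run` are hypothetical names, threads stand in for GPUs, and a short barrier timeout stands in for waiting forever. It shows one plausible shape of the problem, where a rank with no non-zero values skips the allreduce that its peers are blocked on.

```python
import threading


class ToyComm:
    """Toy communicator (hypothetical): allreduce_sum returns only after every rank calls it."""

    def __init__(self, n_ranks, timeout=0.2):
        self._slots = [0.0] * n_ranks
        self._barrier = threading.Barrier(n_ranks)
        self._timeout = timeout  # stands in for "blocks forever" on real hardware

    def allreduce_sum(self, rank, value):
        self._slots[rank] = value
        self._barrier.wait(self._timeout)  # wait until every rank has contributed
        total = sum(self._slots)
        self._barrier.wait(self._timeout)  # wait until every rank has read the total
        return total


def column_sum_buggy(comm, rank, values):
    if not values:   # this rank holds no non-zero values
        return None  # early exit skips the collective, so peers never get released
    return comm.allreduce_sum(rank, float(sum(values)))


def column_sum_fixed(comm, rank, values):
    # Always join the collective; contribute 0.0 when there is nothing local.
    return comm.allreduce_sum(rank, float(sum(values)))


def run(fn, shards):
    """Run fn on one thread per shard; report 'hang' where the collective timed out."""
    comm = ToyComm(len(shards))
    out = [None] * len(shards)

    def worker(rank):
        try:
            out[rank] = fn(comm, rank, shards[rank])
        except threading.BrokenBarrierError:
            out[rank] = "hang"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(len(shards))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    shards = [[], [1.0, 2.0, 3.0]]  # rank 0 has only zeros, rank 1 has non-zeros
    print(run(column_sum_buggy, shards))
    print(run(column_sum_fixed, shards))
```

In the buggy variant the rank with data times out waiting for its peer, while the fixed variant lets both ranks finish with the same global sum. The design point is that the decision to call a collective must not depend on rank-local data.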

@lijinf2 lijinf2 requested a review from a team as a code owner November 22, 2024 01:23
@lijinf2 lijinf2 requested review from teju85 and dantegd November 22, 2024 01:23
@lijinf2 lijinf2 added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change 2 - In Progress Currently a work in progress labels Nov 22, 2024
@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from 6c56fa1 to 21be3fb Compare November 26, 2024 18:33
@lijinf2 lijinf2 added 3 - Ready for Review Ready for review by team and removed 3 - Ready for Review Ready for review by team labels Nov 26, 2024
@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from 21be3fb to 83ca352 Compare December 3, 2024 04:45

copy-pr-bot bot commented Dec 3, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lijinf2
Contributor Author

lijinf2 commented Dec 3, 2024

build

@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from 83ca352 to f87fab6 Compare December 3, 2024 05:07
@lijinf2 lijinf2 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Dec 3, 2024
@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from f87fab6 to dabf335 Compare December 4, 2024 23:00
@lijinf2 lijinf2 requested a review from a team as a code owner December 4, 2024 23:00
@github-actions github-actions bot added the Cython / Python Cython or Python issue label Dec 4, 2024
@lijinf2
Contributor Author

lijinf2 commented Dec 4, 2024

Added test cases to test_dask_logistic_regression.py for better testing. No change to the main code (standardization.cuh). Ready for review.

Member

@dantegd dantegd left a comment

Thanks for the additional testing, @lijinf2! The code itself looks good to me.

add the testcase of one GPU gets all zeroes

revise test_standardization_example to reuse functions

keep revise to reuse code

give a better name
@lijinf2 lijinf2 force-pushed the fix_hanging_mean_stddev branch from 6affb8a to 90d7df5 Compare December 5, 2024 17:51
@wphicks
Contributor

wphicks commented Dec 6, 2024

/merge

@rapids-bot rapids-bot bot merged commit 4bfe72f into rapidsai:branch-24.12 Dec 6, 2024
64 of 65 checks passed
Labels
3 - Ready for Review Ready for review by team CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
3 participants