Conversation
@hlinsen hlinsen commented Oct 27, 2025

Summary by CodeRabbit

  • Refactor
    • Improved GPU pointer and stream configuration to make GPU-accelerated solvers more consistent and reliable.
  • Tests
  • Re-enabled a previously skipped barrier solver test so it now runs as part of the test suite.
  • Chores
    • Restored default threading behavior in CI test runs by removing a forced thread-count override.

@hlinsen hlinsen added the do not merge (Do not merge if this flag is set) label Oct 27, 2025
@hlinsen hlinsen requested a review from a team as a code owner October 27, 2025 21:31
@hlinsen hlinsen requested review from akifcorduk and rg20 and removed request for a team October 27, 2025 21:31

coderabbitai bot commented Oct 27, 2025

Walkthrough

Sets cuBLAS/cuSPARSE pointer modes to DEVICE in the cusparse_view constructor, changes concurrent-solver barrier handle creation to use a raft::handle_t constructed from barrier_stream, and removes the OMP_NUM_THREADS=1 override and a pytest skip so tests run with default threading and the barrier test enabled.

Changes

Cohort / File(s) Summary
cuBLAS/cuSPARSE pointer mode initialization
cpp/src/dual_simplex/cusparse_view.cu
Adds constructor calls to set cuBLAS and cuSPARSE pointer modes to DEVICE, ensuring subsequent library calls use device pointers.
Barrier handle / stream management
cpp/src/linear_programming/solve.cu
In the concurrent solver path, constructs a raft::handle_t from barrier_stream and removes the explicit raft::resource::set_cuda_stream call; the barrier problem now uses the new handle.
CI test and Python test enabling
ci/test_wheel_cuopt.sh, python/cuopt_server/cuopt_server/tests/test_lp.py
Removes OMP_NUM_THREADS=1 environment override in CI test script and un-skips test_barrier_solver_options, enabling the test to run under default/system threading.
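The core fix in this changeset is replacing one shared, non-thread-safe library handle with one handle per thread. The sketch below illustrates that pattern in portable C++; fake_handle and run_with_per_thread_handles are hypothetical stand-ins for the cuSPARSE handle and the concurrent solver path, not the actual cuopt code:

```cpp
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a non-thread-safe library handle (analogous to a
// cuSPARSE handle): every call mutates internal state, so two threads sharing
// one instance could corrupt each other's results.
struct fake_handle {
  long scratch = 0;  // internal mutable state
  long accumulate(const std::vector<long>& v) {
    long sum = 0;
    for (long x : v) {
      scratch = x;  // would be racy if this handle were shared across threads
      sum += scratch;
    }
    return sum;
  }
};

// The pattern this PR adopts: construct one handle per thread instead of
// sharing a single instance, so the handles' internal state never races.
std::pair<long, long> run_with_per_thread_handles(std::size_t n) {
  std::vector<long> data(n, 1);
  long r1 = 0, r2 = 0;
  std::thread t1([&] { fake_handle h; r1 = h.accumulate(data); });
  std::thread t2([&] { fake_handle h; r2 = h.accumulate(data); });
  t1.join();
  t2.join();
  return {r1, r2};
}
```

With per-thread handles, both threads deterministically produce the full sum; with a single shared fake_handle, the interleaved writes to scratch could lose updates.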

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Confirm no code paths pass host pointers to cuBLAS/cuSPARSE after switching pointer mode to DEVICE.
  • Verify all CUDA operations using the new barrier_stream-based handle execute on the intended stream and that required synchronizations remain correct.
  • Run the previously skipped test and CI tests under increased thread counts to check for nondeterminism or flakiness.
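The first checklist item concerns the contract that DEVICE pointer mode imposes: after the switch, every scalar argument (alpha, beta, reduction results) must point into device memory, and passing a host pointer is an error. A minimal sketch of what the constructor change implies — illustrative only, not the actual cusparse_view code:

```cuda
#include <cublas_v2.h>
#include <cusparse.h>

// After these calls, all scalar arguments passed to cuBLAS/cuSPARSE routines
// on these handles must be device pointers (e.g. device-resident one/zero
// scalars), never addresses of host variables.
void set_device_pointer_modes(cublasHandle_t blas, cusparseHandle_t sparse)
{
  cublasSetPointerMode(blas, CUBLAS_POINTER_MODE_DEVICE);
  cusparseSetPointerMode(sparse, CUSPARSE_POINTER_MODE_DEVICE);
}
```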

Poem

🐇 I set the modes to DEVICE with a whiskered twitch,
Streams hop to new handles, tidy and quick,
Tests wake from slumber, threads stretch and play,
I nibble at race bugs and bound on my way,
Carrots for fixes — then off for a flick. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The pull request title "Use one cusparse handle per thread to avoid race condition on cuspars…" directly relates to the changeset. The modifications across multiple files (setting pointer modes to DEVICE in cusparse_view.cu, restructuring concurrent barrier_handle creation in solve.cu to use stream-based handles, removing the thread-count constraint in the CI test script, and enabling a barrier solver test) all support the stated objective of addressing race conditions through improved handle management. The title is specific and clear about both the problem (a race condition) and the intended solution (per-thread handle usage).
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3726891 and a2c2ae7.

📒 Files selected for processing (1)
  • python/cuopt_server/cuopt_server/tests/test_lp.py (0 hunks)
💤 Files with no reviewable changes (1)
  • python/cuopt_server/cuopt_server/tests/test_lp.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: wheel-build-cuopt-mps-parser / 13.0.1, 3.13, arm64, rockylinux8
  • GitHub Check: wheel-build-cuopt-mps-parser / 13.0.1, 3.10, arm64, rockylinux8


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/linear_programming/solve.cu (1)

676-682: Race condition confirmed: both threads share a single cuSPARSE/cuBLAS handle via shallow copy — separate handles required

The issue is real. user_problem_t stores raft::handle_t const* handle_ptr as a member. When barrier_problem = dual_simplex_problem (line 684) executes, the shallow copy leaves both problems pointing to the same barrier_handle. Both run_dual_simplex_thread (via dual_simplex_problem) and run_barrier_thread (via barrier_problem) then use this shared handle concurrently, creating a race on cuSPARSE/cuBLAS state. Additionally, only barrier_handle.sync_stream() is called before joining the threads; the dual simplex thread's stream is never synced.

Create separate raft::handle_t instances for each thread and sync both streams before consuming results, as suggested. This eliminates the race without requiring per-thread default stream initialization.
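The suggested shape of that fix — one handle per worker thread, each on its own stream, with both streams synced before results are consumed — can be sketched as follows. Names and signatures here are illustrative, not the actual solve.cu code:

```cuda
#include <raft/core/handle.hpp>
#include <rmm/cuda_stream_view.hpp>

// One handle per worker thread, each wrapping its own stream, so the
// cuSPARSE/cuBLAS state inside each handle is never shared across threads.
void launch_concurrent_solvers(rmm::cuda_stream_view simplex_stream,
                               rmm::cuda_stream_view barrier_stream)
{
  raft::handle_t simplex_handle(simplex_stream);
  raft::handle_t barrier_handle(barrier_stream);

  // ... run run_dual_simplex_thread(simplex_handle) and
  //     run_barrier_thread(barrier_handle) on separate host threads ...

  // Sync BOTH streams before reading results; the original code only
  // synced the barrier handle's stream.
  simplex_handle.sync_stream();
  barrier_handle.sync_stream();
}
```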

🧹 Nitpick comments (1)
cpp/src/dual_simplex/cusparse_view.cu (1)

141-144: Verified: pointer mode setup is correct, consider centralizing across codebase

Verification confirms all cuBLAS/cuSPARSE calls in cpp/src/dual_simplex/cusparse_view.cu use device scalars (d_one_, d_zero_, d_minus_one_), consistent with the DEVICE pointer mode set in lines 141-144. No HOST pointer mode usages detected in this file.

The centralization suggestion remains valid—pointer mode initialization appears in multiple files (mip/solve.cu, linear_programming/solve.cu, mip/problem/problem.cu, mip/solution/solution.cu, mip/solver.cu) and could be consolidated to reduce redundant state changes per handle. This is optional but worth considering in a future refactoring pass.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53bf0fb and ef84855.

📒 Files selected for processing (2)
  • cpp/src/dual_simplex/cusparse_view.cu (1 hunks)
  • cpp/src/linear_programming/solve.cu (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: wheel-build-cuopt-mps-parser / 13.0.1, 3.12, amd64, rockylinux8
  • GitHub Check: wheel-build-cuopt-mps-parser / 13.0.1, 3.10, amd64, rockylinux8
  • GitHub Check: wheel-build-cuopt-sh-client / 13.0.1, 3.10, amd64, rockylinux8
  • GitHub Check: checks / check-style

@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner October 27, 2025 21:56
@rgsl888prabhu rgsl888prabhu requested review from KyleFromNVIDIA and removed request for a team October 27, 2025 21:56
@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner October 27, 2025 21:57
@rgsl888prabhu rgsl888prabhu requested review from Iroy30 and removed request for a team October 27, 2025 21:57
@rgsl888prabhu rgsl888prabhu added the bug (Something isn't working) and non-breaking (Introduces a non-breaking change) labels Oct 27, 2025
@hlinsen hlinsen removed the do not merge (Do not merge if this flag is set) label Oct 29, 2025
@hlinsen hlinsen mentioned this pull request Oct 29, 2025
@rgsl888prabhu
Copy link
Collaborator

/merge

@rapids-bot rapids-bot bot merged commit 43c00e5 into NVIDIA:main Oct 29, 2025
174 of 175 checks passed
rgsl888prabhu pushed a commit that referenced this pull request Oct 29, 2025
Hot fix with the following commits:
- #541
- #544
- #550

Labels

  • bug: Something isn't working
  • non-breaking: Introduces a non-breaking change


3 participants