Conversation
@hlinsen hlinsen commented Oct 27, 2025

Summary by CodeRabbit

  • Refactor
    • Improved GPU pointer and stream configuration to make GPU-accelerated solvers more consistent and reliable.
  • Tests
  • Re-enabled a previously skipped barrier solver test so it now runs as part of the test suite.
  • Chores
    • Restored default threading behavior in CI test runs by removing a forced thread-count override.

@hlinsen hlinsen added the do not merge (Do not merge if this flag is set) label Oct 27, 2025
@hlinsen hlinsen requested a review from a team as a code owner October 27, 2025 21:31
@hlinsen hlinsen requested review from akifcorduk and rg20 and removed request for a team October 27, 2025 21:31

coderabbitai bot commented Oct 27, 2025

Walkthrough

Sets cuBLAS/cuSPARSE pointer modes to DEVICE in the cusparse_view constructor, changes concurrent-solver barrier handle creation to use a raft::handle_t constructed from barrier_stream, and removes the OMP_NUM_THREADS=1 override and a pytest skip so tests run with default threading and the barrier test enabled.

Changes

Cohort / File(s) Summary
cuBLAS/cuSPARSE pointer mode initialization
cpp/src/dual_simplex/cusparse_view.cu
Adds constructor calls to set cuBLAS and cuSPARSE pointer modes to DEVICE, ensuring subsequent library calls use device pointers.
Barrier handle / stream management
cpp/src/linear_programming/solve.cu
In the concurrent solver path, constructs a raft::handle_t from barrier_stream and removes the explicit raft::resource::set_cuda_stream call; the barrier problem now uses the new handle.
CI test and Python test enabling
ci/test_wheel_cuopt.sh, python/cuopt_server/cuopt_server/tests/test_lp.py
Removes OMP_NUM_THREADS=1 environment override in CI test script and un-skips test_barrier_solver_options, enabling the test to run under default/system threading.
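The core fix in this changeset is replacing one shared, non-thread-safe library handle with one handle per thread. The sketch below illustrates that pattern in portable C++; fake_handle and run_with_per_thread_handles are hypothetical stand-ins for the cuSPARSE handle and the concurrent solver path, not the actual cuopt code:

```cpp
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a non-thread-safe library handle (analogous to a
// cuSPARSE handle): every call mutates internal state, so two threads sharing
// one instance could corrupt each other's results.
struct fake_handle {
  long scratch = 0;  // internal mutable state
  long accumulate(const std::vector<long>& v) {
    long sum = 0;
    for (long x : v) {
      scratch = x;  // would be racy if this handle were shared across threads
      sum += scratch;
    }
    return sum;
  }
};

// The pattern this PR adopts: construct one handle per thread instead of
// sharing a single instance, so the handles' internal state never races.
std::pair<long, long> run_with_per_thread_handles(std::size_t n) {
  std::vector<long> data(n, 1);
  long r1 = 0, r2 = 0;
  std::thread t1([&] { fake_handle h; r1 = h.accumulate(data); });
  std::thread t2([&] { fake_handle h; r2 = h.accumulate(data); });
  t1.join();
  t2.join();
  return {r1, r2};
}
```

With per-thread handles, both threads deterministically produce the full sum; with a single shared fake_handle, the interleaved writes to scratch could lose updates.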

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Confirm no code paths pass host pointers to cuBLAS/cuSPARSE after switching pointer mode to DEVICE.
  • Verify all CUDA operations using the new barrier_stream-based handle execute on the intended stream and that required synchronizations remain correct.
  • Run the previously skipped test and CI tests under increased thread counts to check for nondeterminism or flakiness.
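The first checklist item concerns the contract that DEVICE pointer mode imposes: after the switch, every scalar argument (alpha, beta, reduction results) must point into device memory, and passing a host pointer is an error. A minimal sketch of what the constructor change implies — illustrative only, not the actual cusparse_view code:

```cuda
#include <cublas_v2.h>
#include <cusparse.h>

// After these calls, all scalar arguments passed to cuBLAS/cuSPARSE routines
// on these handles must be device pointers (e.g. device-resident one/zero
// scalars), never addresses of host variables.
void set_device_pointer_modes(cublasHandle_t blas, cusparseHandle_t sparse)
{
  cublasSetPointerMode(blas, CUBLAS_POINTER_MODE_DEVICE);
  cusparseSetPointerMode(sparse, CUSPARSE_POINTER_MODE_DEVICE);
}
```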

Poem

🐇 I set the modes to DEVICE with a whiskered twitch,
Streams hop to new handles, tidy and quick,
Tests wake from slumber, threads stretch and play,
I nibble at race bugs and bound on my way,
Carrots for fixes — then off for a flick. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The pull request title "Use one cusparse handle per thread to avoid race condition on cuspars…" directly relates to the changeset. The modifications across multiple files (setting pointer modes to DEVICE in cusparse_view.cu, restructuring concurrent barrier_handle creation in solve.cu to use stream-based handles, removing the thread-count constraint in the CI test script, and enabling a barrier solver test) all support the stated objective of addressing race conditions through improved handle management. The title is specific and clear about both the problem (a race condition) and the intended solution (per-thread handle usage).
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3726891 and a2c2ae7.

📒 Files selected for processing (1)
  • python/cuopt_server/cuopt_server/tests/test_lp.py (0 hunks)
💤 Files with no reviewable changes (1)
  • python/cuopt_server/cuopt_server/tests/test_lp.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: wheel-build-cuopt-mps-parser / 13.0.1, 3.13, arm64, rockylinux8
  • GitHub Check: wheel-build-cuopt-mps-parser / 13.0.1, 3.10, arm64, rockylinux8


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/linear_programming/solve.cu (1)

676-682: Race condition confirmed: both threads share a single cuSPARSE/cuBLAS handle via shallow copy — separate handles required

The issue is real. user_problem_t stores raft::handle_t const* handle_ptr as a member. When barrier_problem = dual_simplex_problem (line 684) executes, the shallow copy leaves both problems pointing to the same barrier_handle. Both run_dual_simplex_thread (via dual_simplex_problem) and run_barrier_thread (via barrier_problem) then use this shared handle concurrently, creating a race on cuSPARSE/cuBLAS state. Additionally, only barrier_handle.sync_stream() is called before joining the threads; the dual simplex thread's stream is never synced.

Create separate raft::handle_t instances for each thread and sync both streams before consuming results, as suggested. This eliminates the race without requiring per-thread default stream initialization.
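The suggested shape of that fix — one handle per worker thread, each on its own stream, with both streams synced before results are consumed — can be sketched as follows. Names and signatures here are illustrative, not the actual solve.cu code:

```cuda
#include <raft/core/handle.hpp>
#include <rmm/cuda_stream_view.hpp>

// One handle per worker thread, each wrapping its own stream, so the
// cuSPARSE/cuBLAS state inside each handle is never shared across threads.
void launch_concurrent_solvers(rmm::cuda_stream_view simplex_stream,
                               rmm::cuda_stream_view barrier_stream)
{
  raft::handle_t simplex_handle(simplex_stream);
  raft::handle_t barrier_handle(barrier_stream);

  // ... run run_dual_simplex_thread(simplex_handle) and
  //     run_barrier_thread(barrier_handle) on separate host threads ...

  // Sync BOTH streams before reading results; the original code only
  // synced the barrier handle's stream.
  simplex_handle.sync_stream();
  barrier_handle.sync_stream();
}
```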

🧹 Nitpick comments (1)
cpp/src/dual_simplex/cusparse_view.cu (1)

141-144: Verified: pointer mode setup is correct, consider centralizing across codebase

Verification confirms all cuBLAS/cuSPARSE calls in cpp/src/dual_simplex/cusparse_view.cu use device scalars (d_one_, d_zero_, d_minus_one_), consistent with the DEVICE pointer mode set in lines 141-144. No HOST pointer mode usages detected in this file.

The centralization suggestion remains valid—pointer mode initialization appears in multiple files (mip/solve.cu, linear_programming/solve.cu, mip/problem/problem.cu, mip/solution/solution.cu, mip/solver.cu) and could be consolidated to reduce redundant state changes per handle. This is optional but worth considering in a future refactoring pass.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53bf0fb and ef84855.

📒 Files selected for processing (2)
  • cpp/src/dual_simplex/cusparse_view.cu (1 hunks)
  • cpp/src/linear_programming/solve.cu (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: wheel-build-cuopt-mps-parser / 13.0.1, 3.12, amd64, rockylinux8
  • GitHub Check: wheel-build-cuopt-mps-parser / 13.0.1, 3.10, amd64, rockylinux8
  • GitHub Check: wheel-build-cuopt-sh-client / 13.0.1, 3.10, amd64, rockylinux8
  • GitHub Check: checks / check-style

@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner October 27, 2025 21:56
@rgsl888prabhu rgsl888prabhu requested review from KyleFromNVIDIA and removed request for a team October 27, 2025 21:56
@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner October 27, 2025 21:57
@rgsl888prabhu rgsl888prabhu requested review from Iroy30 and removed request for a team October 27, 2025 21:57
@rgsl888prabhu rgsl888prabhu added the bug (Something isn't working) and non-breaking (Introduces a non-breaking change) labels Oct 27, 2025
@hlinsen hlinsen removed the do not merge (Do not merge if this flag is set) label Oct 29, 2025
@hlinsen hlinsen mentioned this pull request Oct 29, 2025
@rgsl888prabhu
Copy link
Collaborator

/merge

@rapids-bot rapids-bot bot merged commit 43c00e5 into NVIDIA:main Oct 29, 2025
174 of 175 checks passed
rgsl888prabhu pushed a commit that referenced this pull request Oct 29, 2025
Hot fix with the following commits:
- #541
- #544
- #550

Labels

  • bug: Something isn't working
  • non-breaking: Introduces a non-breaking change


3 participants