@Kh4ster Kh4ster commented Jun 18, 2025

This PR fixes the intermittent error we are seeing in CI on the Python Batch PDLP test.

Previous PRs tried to achieve the same thing without addressing the root cause: work being pushed to the default stream.

As described in NVIDIA/cccl#5027, creating a thrust::host pinned memory allocator incorrectly launches work on the GPU on the default stream, which can happen during a graph capture and cause an error.

To fix it, we now use a std::unique_ptr with a custom allocator to allocate host pinned memory.

This allows us to go back to a non-blocking stream when using batch mode.

@Kh4ster Kh4ster requested a review from a team as a code owner June 18, 2025 12:25
@Kh4ster Kh4ster requested review from akifcorduk, aliceb-nv and chris-maes and removed request for akifcorduk and chris-maes June 18, 2025 12:25
@Kh4ster Kh4ster self-assigned this Jun 18, 2025
@Kh4ster Kh4ster added labels bug, non-breaking, pdlp Jun 18, 2025
@aliceb-nv aliceb-nv left a comment
LGTM, awesome :)


Kh4ster commented Jun 18, 2025

/merge

@rapids-bot rapids-bot bot merged commit 064d08c into branch-25.08 Jun 18, 2025
144 of 145 checks passed
@Kh4ster Kh4ster deleted the pdlp_fix_pinned_memory_allocator branch June 18, 2025 14:55