fix: L0_sequence_batcher_cudashm #7852

oandreeva-nv · 2024-12-04T18:50:20Z

What does the PR do?

We were hitting a case where cudaMemcpy copied the output to the GPU asynchronously, without proper synchronization. As a result, when the client side read the output from CUDA SHM during the test, there was no guarantee that the server side had finished the copy, which caused test failures.

Resolution: use cudaMemcpyAsync, then synchronize on the stream before the output is read.

Documentation reference: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#concurrent-execution-between-host-and-device

The following device operations are asynchronous with respect to the host:

* Kernel launches;
* Memory copies within a single device’s memory;
* Memory copies from host to device of a memory block of 64 KB or less; <------------------
* Memory copies performed by functions that are suffixed with Async;
* Memory set function calls.
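The resolution above can be sketched roughly as follows. This is a minimal illustration, not the actual diff from this PR: the helper name `CopyOutputToDeviceSynced` and its parameters are hypothetical, and it assumes the destination is device memory that a client will read afterwards.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not taken from the PR): copies `byte_size` bytes from
// a host buffer into device memory and returns only once the copy has
// completed, so a later reader of `dst_device` sees the full output.
cudaError_t CopyOutputToDeviceSynced(
    void* dst_device, const void* src_host, size_t byte_size,
    cudaStream_t stream)
{
  // Enqueue the host-to-device copy on `stream`. The call may return
  // before the copy has actually finished.
  cudaError_t err = cudaMemcpyAsync(
      dst_device, src_host, byte_size, cudaMemcpyHostToDevice, stream);
  if (err != cudaSuccess) {
    return err;
  }
  // Block the host until all work queued on `stream` has completed.
  return cudaStreamSynchronize(stream);
}
```

With an explicit `cudaStreamSynchronize` after the copy, completion no longer depends on the transfer size, which matters for the "64 KB or less" case quoted above.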

Checklist

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

Related PRs:

Where should the reviewer start?

Test plan:

CI Pipeline ID:

21063313 - relevant test: L0_sequence_batcher_cudashm--base

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

oandreeva-nv added 6 commits November 27, 2024 15:18

bumping c++ version in custom sequence test backend

d63dd85

checking cudaMemCopy

b25c8ea

testing removinf explicit export of CUDA_VISIBLE_DEVICES

f4d6083

testing setting CUDA_VISIBLE_DEVICES by logic

f7c5360

changing cudaMemcpy to cudaMemcpyAsync

62850e3

Restoring cuda visible devices

34bdcb4

oandreeva-nv added the bug (Something isn't working) label Dec 4, 2024

oandreeva-nv requested review from kthui and GuanLuo December 4, 2024 18:50

GuanLuo approved these changes Dec 4, 2024

kthui approved these changes Dec 4, 2024

oandreeva-nv merged commit 83d0e30 into main Dec 4, 2024
3 checks passed

oandreeva-nv deleted the oandreeva_batcher branch December 4, 2024 19:46
