Fix a bug in tensornet backend scratch pad allocation in multi-GPU mode #2516

1tnguyen · 2025-01-17T03:21:24Z

Description

ScratchDeviceMem allocates its memory on construction, sized according to the memory available at that time. This is incompatible with the multi-GPU code path (MPI execution), where the CUDA device is selected in the simulator constructor: the scratch pad is a member of the simulator class, so it is constructed before the device is selected. The allocation therefore has to be deferred until after device selection.

Fixed by adding a separate allocate method, which the simulator backend constructor calls once after device selection.
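The ordering issue and the deferred-allocation fix can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the fake_runtime namespace is a hypothetical stand-in for the CUDA runtime (device selection and free-memory query), host malloc stands in for device allocation, and the member names, the half-of-free-memory sizing, and the error message are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the CUDA runtime: a "current device" plus a
// fixed amount of free memory per device. Not a real API.
namespace fake_runtime {
int &currentDevice() {
  static int device = 0;
  return device;
}
const std::vector<std::size_t> &freeBytesPerDevice() {
  static const std::vector<std::size_t> freeBytes{1024, 4096};
  return freeBytes;
}
void setDevice(int device) { currentDevice() = device; }
std::size_t freeBytes() { return freeBytesPerDevice().at(currentDevice()); }
} // namespace fake_runtime

// Scratch pad whose constructor does NOT allocate. Allocation is deferred
// to allocate(), so it uses whichever device is current at that point.
struct ScratchDeviceMem {
  void *d_scratch = nullptr;
  std::size_t scratchSize = 0;
  int deviceId = -1;

  ScratchDeviceMem() = default;
  ScratchDeviceMem(const ScratchDeviceMem &) = delete;
  ScratchDeviceMem &operator=(const ScratchDeviceMem &) = delete;

  void allocate() {
    // Guard against a second allocation (assumed behavior and message).
    if (d_scratch)
      throw std::runtime_error("scratch pad is already allocated");
    deviceId = fake_runtime::currentDevice();
    // Assumed sizing policy: half of the free memory on the current device.
    scratchSize = fake_runtime::freeBytes() / 2;
    d_scratch = std::malloc(scratchSize); // host memory as a placeholder
    if (!d_scratch)
      throw std::bad_alloc();
  }

  ~ScratchDeviceMem() { std::free(d_scratch); }
};

// The scratch pad is a member, so it is constructed before the constructor
// body runs, i.e. before the device is selected. Allocating in the member's
// constructor would therefore target the default device.
struct Simulator {
  ScratchDeviceMem scratchPad;

  explicit Simulator(int mpiRank) {
    assert(mpiRank >= 0);
    const int numDevices =
        static_cast<int>(fake_runtime::freeBytesPerDevice().size());
    fake_runtime::setDevice(mpiRank % numDevices); // select the device first
    scratchPad.allocate();                         // then allocate, once
  }
};
```

With this layout, constructing Simulator with rank 1 sizes the scratch pad from device 1's free memory rather than device 0's, and a second call to allocate() on the same scratch pad throws instead of silently leaking the first buffer.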

This bug was introduced in #1865, which, to improve performance, changed the scratch pad from on-demand allocation to a single allocation (the scratch pad became a member variable of the simulator class).

Also adds a unit test for this case, which is executed only when multiple GPUs are available.


github-actions · 2025-01-17T04:47:41Z

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

bmhowe23

LGTM

1tnguyen added 4 commits January 17, 2025 01:43

Fix a bug in default init of scratchpad: it must allocate memory after we've set the device

730908e

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>

Merge branch 'main' into tnguyen/tensornet-scratchpad-init-bug

020e2b9

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>

Add test

090b407

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>

Add a check to prevent multiple allocate calls

12f619c

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>

1tnguyen added the "bug fix" label (to be listed under Bug Fixes in the release notes) Jan 17, 2025

1tnguyen requested review from schweitzpgi and bmhowe23 January 17, 2025 03:23

github-actions bot pushed a commit that referenced this pull request Jan 17, 2025

Docs preview for PR #2516.

ee74594

bmhowe23 approved these changes Jan 17, 2025

1tnguyen merged commit 9e0b590 into NVIDIA:main Jan 17, 2025
213 checks passed

github-actions bot pushed a commit that referenced this pull request Jan 17, 2025

Cleaning up docs preview for PR #2516.

9dafd8a
