
Conversation

@TroyGarden (Contributor)

Differential Revision: D85399705

@meta-codesync bot (Contributor) commented on Oct 24, 2025

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85399705.

meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 24, 2025
@TroyGarden changed the title from "demostration of cuda memory footprint with multi-stream use case" to "[detailed][benchmark] demostration of cuda memory footprint with multi-stream use case" on Oct 24, 2025
@TroyGarden changed the title from "[detailed][benchmark] demostration of cuda memory footprint with multi-stream use case" to "[detailed][benchmark] demonstration of cuda memory footprint with multi-stream use case" on Oct 24, 2025
@TroyGarden force-pushed the export-D85399705 branch 2 times, most recently from ff60cfd to 0ae7f55, on October 24, 2025 at 17:13
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request on Oct 24, 2025
…-pytorch#3480)

Summary:
Google Document: https://docs.google.com/document/d/1Odt6oJJgvPDeVSQmQqrl3iUAl2yu_HPUVLJ92EqDiTI
Workplace Post: https://fb.workplace.com/groups/429376538334034/permalink/1488841649054179/

# context
* high-level design and technical discussions are in the document/post
* this diff adds three benchmark jobs to demonstrate the memory footprint in multi-stream vs single-stream scenarios
* other changes in the benchmark function:
**a**. make `reset_accumulated_memory_stats` default to `True`
**b**. call `torch.cuda.empty_cache()` before the memory snapshot so that the snapshot won't include residual allocations from previous benchmark runs (a minimal sketch of this measurement flow follows the stats table below)
* benchmark stats
|name| GPU Runtime|CPU Runtime|GPU Peak Memory **alloc**|GPU Peak Memory **reserved**| CPU Peak RSS|
|--|--|--|--|--|--|
|single_stream_memory|156.62 ms|231.05 ms|5.06 GB|5.11 GB|1.68 GB|
|multi_stream_memory|144.52 ms|175.61 ms|5.06 GB |**10.13 GB** | 1.71 GB|
|multi_stream_optimized|145.66 ms |232.76 ms | 5.06 GB |5.11 GB | 1.70 GB|
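
As a minimal sketch (not the actual benchmark harness; `measure_peak_memory` and `fn` are hypothetical names), the measurement hygiene from (a) and (b) looks roughly like this:

```python
import torch

def measure_peak_memory(fn, device: torch.device) -> None:
    # (b) Release cached blocks left over from earlier benchmark runs
    # so the upcoming snapshot excludes their residual reservations.
    torch.cuda.empty_cache()
    # (a) Reset both the current and the accumulated peak-memory
    # counters; the diff makes resetting accumulated stats the default.
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.reset_accumulated_memory_stats(device)

    fn()
    torch.cuda.synchronize(device)

    alloc_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    reserved_gb = torch.cuda.max_memory_reserved(device) / 1024**3
    print(f"peak alloc: {alloc_gb:.2f} GB, peak reserved: {reserved_gb:.2f} GB")
```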

Differential Revision: D85399705
…-pytorch#3480)

Summary:
Google Document: https://docs.google.com/document/d/1Odt6oJJgvPDeVSQmQqrl3iUAl2yu_HPUVLJ92EqDiTI
Workplace Post: https://fb.workplace.com/groups/429376538334034/permalink/1488841649054179/

# context
* high-level design and technical discussions are in the document/post
* this diff adds three benchmark jobs to demonstrate the memory footprint in multi-stream vs single-stream scenarios (a sketch of the stream handling appears after the stats table below)
* other changes in the benchmark function:
**a**. make `reset_accumulated_memory_stats` default to `True`
**b**. call `torch.cuda.empty_cache()` before the memory snapshot so that the snapshot won't include residual allocations from previous benchmark runs
* benchmark stats
|name| GPU Runtime|CPU Runtime|GPU Peak Memory **alloc**|GPU Peak Memory **reserved**| CPU Peak RSS|
|--|--|--|--|--|--|
|single_stream_memory|**158.58 ms**|233.79 ms|5.06 GB|5.11 GB|1.70 GB|
|multi_stream_memory|145.43 ms|241.68 ms|5.06 GB |**10.13 GB** | 1.70 GB|
|multi_stream_optimized|146.98 ms |244.72 ms |5.06 GB |5.11 GB | 1.70 GB|
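
The reserved-memory gap follows from how PyTorch's CUDA caching allocator works: cached blocks are kept in per-stream pools, so tensors allocated while a side stream is current reserve blocks that the default stream's pool cannot reuse. That is consistent with peak reserved roughly doubling (5.11 GB → 10.13 GB) while peak allocated stays at 5.06 GB. The sketch below illustrates the pattern under stated assumptions: a matmul stands in for the real workload, and `multi_stream_naive`/`multi_stream_optimized` are illustrative names, not the benchmark's actual code:

```python
import torch

device = torch.device("cuda")
side_stream = torch.cuda.Stream(device=device)

def multi_stream_naive(x: torch.Tensor) -> torch.Tensor:
    # Allocations made while side_stream is current come from that
    # stream's own pool in the caching allocator, so reserved memory
    # grows on top of the default stream's pool.
    side_stream.wait_stream(torch.cuda.current_stream(device))
    with torch.cuda.stream(side_stream):
        y = x @ x  # output blocks land in side_stream's pool
    torch.cuda.current_stream(device).wait_stream(side_stream)
    return y

def multi_stream_optimized(x: torch.Tensor) -> torch.Tensor:
    # One way to keep the overlap without the duplicated pool:
    # allocate the output from the default stream's pool, then run
    # only the compute on the side stream.
    y = torch.empty(x.shape[0], x.shape[0], dtype=x.dtype, device=device)
    side_stream.wait_stream(torch.cuda.current_stream(device))
    with torch.cuda.stream(side_stream):
        torch.matmul(x, x, out=y)
        # Mark the tensors as in use by side_stream so the allocator
        # does not recycle their blocks before the stream finishes.
        x.record_stream(side_stream)
        y.record_stream(side_stream)
    torch.cuda.current_stream(device).wait_stream(side_stream)
    return y
```

In the second variant only the compute moves to the side stream while the allocations stay in the default stream's pool, which would match the shape of the optimized row above: multi-stream runtime with a single-stream reserved footprint.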

Reviewed By: spmex

Differential Revision: D85399705
@meta-codesync bot closed this in d50069c on Oct 24, 2025
@TroyGarden deleted the export-D85399705 branch on October 24, 2025 at 22:58