
Conversation

@TroyGarden (Contributor)

Differential Revision: D85399705

@meta-codesync bot (Contributor) commented on Oct 24, 2025

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85399705.

meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 24, 2025
@TroyGarden changed the title from "demostration of cuda memory footprint with multi-stream use case" to "[detailed][benchmark] demostration of cuda memory footprint with multi-stream use case" on Oct 24, 2025
@TroyGarden changed the title from "[detailed][benchmark] demostration of cuda memory footprint with multi-stream use case" to "[detailed][benchmark] demonstration of cuda memory footprint with multi-stream use case" on Oct 24, 2025
@TroyGarden force-pushed the export-D85399705 branch 2 times, most recently from ff60cfd to 0ae7f55, on October 24, 2025 at 17:13
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request on Oct 24, 2025
…-pytorch#3480)

Summary:
Google Document: https://docs.google.com/document/d/1Odt6oJJgvPDeVSQmQqrl3iUAl2yu_HPUVLJ92EqDiTI
Workplace Post: https://fb.workplace.com/groups/429376538334034/permalink/1488841649054179/

# context
* high-level design and technical discussions are in the document/post
* this diff adds three benchmark jobs to demonstrate the memory footprint in multi-stream vs single-stream scenarios
* other changes in the benchmark function:
**a**. make `reset_accumulated_memory_stats` default to `True`
**b**. call `torch.cuda.empty_cache()` before the memory snapshot so that the snapshot won't include residual allocations from previous benchmark runs (a minimal sketch of this measurement flow follows the stats table below)
* benchmark stats
|name| GPU Runtime|CPU Runtime|GPU Peak Memory **alloc**|GPU Peak Memory **reserved**| CPU Peak RSS|
|--|--|--|--|--|--|
|single_stream_memory|156.62 ms|231.05 ms|5.06 GB|5.11 GB|1.68 GB|
|multi_stream_memory|144.52 ms|175.61 ms|5.06 GB |**10.13 GB** | 1.71 GB|
|multi_stream_optimized|145.66 ms |232.76 ms | 5.06 GB |5.11 GB | 1.70 GB|
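
As a minimal sketch (not the actual benchmark harness; `measure_peak_memory` and `fn` are hypothetical names), the measurement hygiene from (a) and (b) looks roughly like this:

```python
import torch

def measure_peak_memory(fn, device: torch.device) -> None:
    # (b) Release cached blocks left over from earlier benchmark runs
    # so the upcoming snapshot excludes their residual reservations.
    torch.cuda.empty_cache()
    # (a) Reset both the current and the accumulated peak-memory
    # counters; the diff makes resetting accumulated stats the default.
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.reset_accumulated_memory_stats(device)

    fn()
    torch.cuda.synchronize(device)

    alloc_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    reserved_gb = torch.cuda.max_memory_reserved(device) / 1024**3
    print(f"peak alloc: {alloc_gb:.2f} GB, peak reserved: {reserved_gb:.2f} GB")
```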

Differential Revision: D85399705
…-pytorch#3480)

Summary:
Google Document: https://docs.google.com/document/d/1Odt6oJJgvPDeVSQmQqrl3iUAl2yu_HPUVLJ92EqDiTI
Workplace Post: https://fb.workplace.com/groups/429376538334034/permalink/1488841649054179/

# context
* high-level design and technical discussions are in the document/post
* this diff adds three benchmark jobs to demonstrate the memory footprint in multi-stream vs single-stream scenarios (a sketch of the stream handling appears after the stats table below)
* other changes in the benchmark function:
**a**. make `reset_accumulated_memory_stats` default to `True`
**b**. call `torch.cuda.empty_cache()` before the memory snapshot so that the snapshot won't include residual allocations from previous benchmark runs
* benchmark stats
|name| GPU Runtime|CPU Runtime|GPU Peak Memory **alloc**|GPU Peak Memory **reserved**| CPU Peak RSS|
|--|--|--|--|--|--|
|single_stream_memory|**158.58 ms**|233.79 ms|5.06 GB|5.11 GB|1.70 GB|
|multi_stream_memory|145.43 ms|241.68 ms|5.06 GB |**10.13 GB** | 1.70 GB|
|multi_stream_optimized|146.98 ms |244.72 ms |5.06 GB |5.11 GB | 1.70 GB|
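
The reserved-memory gap follows from how PyTorch's CUDA caching allocator works: cached blocks are kept in per-stream pools, so tensors allocated while a side stream is current reserve blocks that the default stream's pool cannot reuse. That is consistent with peak reserved roughly doubling (5.11 GB → 10.13 GB) while peak allocated stays at 5.06 GB. The sketch below illustrates the pattern under stated assumptions: a matmul stands in for the real workload, and `multi_stream_naive`/`multi_stream_optimized` are illustrative names, not the benchmark's actual code:

```python
import torch

device = torch.device("cuda")
side_stream = torch.cuda.Stream(device=device)

def multi_stream_naive(x: torch.Tensor) -> torch.Tensor:
    # Allocations made while side_stream is current come from that
    # stream's own pool in the caching allocator, so reserved memory
    # grows on top of the default stream's pool.
    side_stream.wait_stream(torch.cuda.current_stream(device))
    with torch.cuda.stream(side_stream):
        y = x @ x  # output blocks land in side_stream's pool
    torch.cuda.current_stream(device).wait_stream(side_stream)
    return y

def multi_stream_optimized(x: torch.Tensor) -> torch.Tensor:
    # One way to keep the overlap without the duplicated pool:
    # allocate the output from the default stream's pool, then run
    # only the compute on the side stream.
    y = torch.empty(x.shape[0], x.shape[0], dtype=x.dtype, device=device)
    side_stream.wait_stream(torch.cuda.current_stream(device))
    with torch.cuda.stream(side_stream):
        torch.matmul(x, x, out=y)
        # Mark the tensors as in use by side_stream so the allocator
        # does not recycle their blocks before the stream finishes.
        x.record_stream(side_stream)
        y.record_stream(side_stream)
    torch.cuda.current_stream(device).wait_stream(side_stream)
    return y
```

In the second variant only the compute moves to the side stream while the allocations stay in the default stream's pool, which would match the shape of the optimized row above: multi-stream runtime with a single-stream reserved footprint.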

Reviewed By: spmex

Differential Revision: D85399705
@meta-codesync bot closed this in d50069c on Oct 24, 2025
@TroyGarden deleted the export-D85399705 branch on October 24, 2025 at 22:58