@EddyLXJ
Contributor

@EddyLXJ EddyLXJ commented Oct 27, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2067

X-link: meta-pytorch/torchrec#3442

Previously, KVZCH supported only the ID_COUNT and MEM_UTIL eviction trigger modes. Both are hard to tune: a model engineer has to guess a suitable ID count or memory utilization threshold. In addition, eviction start times drift out of sync across ranks as training progresses, which can cause a large QPS drop during eviction.

This diff adds support for a free-memory eviction trigger. Every N batches, each rank checks how much free memory it has left; if free memory is below the threshold, eviction is triggered in all TBEs on all ranks via an all-reduce. This keeps the eviction start time aligned across all ranks.
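
The trigger flow described above can be sketched roughly as follows. This is a minimal, single-process illustration and not the FBGEMM implementation: `RankState`, `local_trigger`, `all_reduce_max`, `maybe_trigger_eviction`, and the parameter names are all hypothetical, the all-reduce is simulated with a plain `max` over per-rank flags, and the choice of MAX as the reduction is an assumption (the summary only says an all-reduce is used).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankState:
    """Hypothetical per-rank state; not an FBGEMM or torchrec type."""
    free_mem_gb: float
    evictions_started: int = 0


def local_trigger(free_mem_gb: float, threshold_gb: float) -> int:
    # Each rank votes 1 if its own free memory is below the threshold.
    return 1 if free_mem_gb < threshold_gb else 0


def all_reduce_max(flags: List[int]) -> int:
    # Stand-in for a MAX all-reduce across ranks (assumption): the result
    # is 1 on every rank if at least one rank voted 1.
    return max(flags)


def maybe_trigger_eviction(
    ranks: List[RankState],
    batch_idx: int,
    check_interval: int,
    threshold_gb: float,
) -> bool:
    # Only check every `check_interval` batches (the "N" in the summary).
    if batch_idx % check_interval != 0:
        return False
    flags = [local_trigger(r.free_mem_gb, threshold_gb) for r in ranks]
    if all_reduce_max(flags) == 0:
        return False
    # Every rank sees the same reduced flag, so all ranks start eviction
    # on the same batch, even those with plenty of free memory.
    for r in ranks:
        r.evictions_started += 1
    return True


ranks = [RankState(40.0), RankState(8.0), RankState(35.0)]
started = maybe_trigger_eviction(ranks, batch_idx=100, check_interval=50, threshold_gb=10.0)
print(started, [r.evictions_started for r in ranks])
```

In this sketch only the second rank is below the threshold, yet all three ranks begin eviction on the same batch, which is the synchronization property the summary is after.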

Reviewed By: emlin

Differential Revision: D83896528

@netlify

netlify bot commented Oct 27, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 8b9301e
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/690261d5d3a46e0008fb2acc
😎 Deploy Preview https://deploy-preview-5058--pytorch-fbgemm-docs.netlify.app

@meta-cla meta-cla bot added the cla signed label Oct 27, 2025
@meta-codesync
Contributor

meta-codesync bot commented Oct 27, 2025

@EddyLXJ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83896528.

EddyLXJ added a commit to EddyLXJ/torchrec that referenced this pull request Oct 27, 2025
…#3442)

Summary:
X-link: pytorch/FBGEMM#5058

X-link: facebookresearch/FBGEMM#2067


Reviewed By: emlin

Differential Revision: D83896528
EddyLXJ added a commit to EddyLXJ/FBGEMM-1 that referenced this pull request Oct 27, 2025
Summary:

X-link: facebookresearch/FBGEMM#2067

X-link: meta-pytorch/torchrec#3442

Reviewed By: emlin

Differential Revision: D83896528
Summary:
X-link: facebookresearch/FBGEMM#2067

See diff D85604160. KVZCHEvictionTBEConfig is defined in FBGEMM and used in torchrec, and both FBGEMM and torchrec are open source on GitHub. This change must land first; otherwise the torchrec GitHub build will fail with an error {F1983027645}

Reviewed By: emlin

Differential Revision: D83896528
@meta-codesync
Contributor

meta-codesync bot commented Oct 30, 2025

This pull request has been merged in 0267804.

Bernard-Liu pushed a commit to ROCm/FBGEMM that referenced this pull request Oct 31, 2025
Summary:
Pull Request resolved: pytorch#5058

X-link: https://github.com/facebookresearch/FBGEMM/pull/2067

See diff D85604160. KVZCHEvictionTBEConfig is defined in FBGEMM and used in torchrec, and both FBGEMM and torchrec are open source on GitHub. This change must land first; otherwise the torchrec GitHub build will fail with an error {F1983027645}

Reviewed By: emlin

Differential Revision: D83896528

fbshipit-source-id: 7a8bacc3d0ee1f53a797dac6ba2647d372a15074