Clean up prefetched parameters #6557

Open
wants to merge 35 commits into
base: master

tohtana
Contributor

@tohtana tohtana commented Sep 21, 2024

ZeRO3 sometimes prefetches parameters that are never used. This happens when the actual order of sub-module execution differs from the previously recorded trace. The allgather handle for such a parameter then stays in the INFLIGHT state, and functions like empty_partition_cache detect it and throw an error.
This PR resolves the issue by ensuring that the pending communication finishes and the unused parameters are freed.
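
The failure mode and the fix can be illustrated with a minimal sketch. This is a toy model, not DeepSpeed's code: `ToyParam`, `ToyHandle`, `prefetch`, `clean_inflight`, and `assert_cache_can_be_emptied` are hypothetical names introduced only for illustration, and the "communication" is simulated by a status flag.

```python
from enum import Enum, auto


class Status(Enum):
    NOT_AVAILABLE = auto()  # only the local partition is held
    INFLIGHT = auto()       # all-gather issued, not yet waited on
    AVAILABLE = auto()      # full parameter is materialized


class ToyParam:
    """Hypothetical stand-in for a partitioned parameter."""

    def __init__(self, name):
        self.name = name
        self.status = Status.NOT_AVAILABLE


class ToyHandle:
    """Hypothetical stand-in for an asynchronous all-gather handle."""

    def __init__(self, param):
        self.param = param
        param.status = Status.INFLIGHT

    def wait(self):
        # Simulates the communication finishing.
        self.param.status = Status.AVAILABLE


def prefetch(param, inflight):
    """Issue a (simulated) all-gather and record it in the inflight registry."""
    inflight[param] = ToyHandle(param)


def clean_inflight(inflight):
    """Finish every pending gather, then free the gathered parameter."""
    for param, handle in list(inflight.items()):
        handle.wait()                          # let communication finish
        param.status = Status.NOT_AVAILABLE    # release the gathered data
    inflight.clear()


def assert_cache_can_be_emptied(params):
    """Toy version of the check that fails on parameters left INFLIGHT."""
    stuck = [p.name for p in params if p.status is Status.INFLIGHT]
    if stuck:
        raise RuntimeError(f"parameters still in flight: {stuck}")
```

In this sketch, calling `assert_cache_can_be_emptied` after a `prefetch` whose parameter is never consumed raises an error; calling `clean_inflight` first makes the same check pass.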

Because this issue was raised in #6011, this PR includes the changes from that branch. We need to merge #6011 first.

@tjruwase
Contributor

Please check if this PR fixes #5828.

@tohtana
Contributor Author

tohtana commented Sep 27, 2024

> Please check if this PR fixes #5828.

@tjruwase Using this PR branch, the repro in #5828 shows the message below but exits without throwing an error. I think this is expected as the model has a conditional branch and the execution order of modules changes.

Invalidate trace cache @ step 3: expected module 2, but got module 4
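
As a rough illustration of how a message like the one above can arise, here is a toy sketch of comparing a recorded module order against the order actually executed. `ToyTrace` and its fields are hypothetical and are not DeepSpeed's implementation; only the wording of the message is taken from the log above.

```python
class ToyTrace:
    """Hypothetical recorder that checks execution order against a trace."""

    def __init__(self, recorded):
        self.recorded = list(recorded)  # module ids seen in an earlier pass
        self.step = 0
        self.valid = True
        self.messages = []

    def visit(self, module_id):
        """Record that `module_id` runs now and flag the first deviation."""
        if self.valid and self.step < len(self.recorded):
            expected = self.recorded[self.step]
            if expected != module_id:
                self.valid = False
                self.messages.append(
                    f"Invalidate trace cache @ step {self.step}: "
                    f"expected module {expected}, but got module {module_id}"
                )
        self.step += 1


# A model with a conditional branch: the earlier pass ran module 2 at
# step 3, this pass takes the other branch and runs module 4 instead.
trace = ToyTrace(recorded=[0, 1, 3, 2, 5])
for module_id in [0, 1, 3, 4, 5]:
    trace.visit(module_id)
```

Only the first deviation is reported here, because the toy trace is treated as invalid from that point on.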

tohtana added a commit that referenced this pull request Oct 4, 2024
@tohtana tohtana enabled auto-merge October 8, 2024 15:42
@tohtana
Contributor Author

tohtana commented Oct 8, 2024

@tjruwase I added the cleanup of the inflight parameter registry to _invalidate_trace, as you suggested. This lets us free the gathered (but unused) parameters earlier. However, I also kept the cleanup in reset_step.
The reason is that we don't detect deviations from the trace when some modules at the end of the trace remain unvisited. The original assertion in reset_step will still be triggered in that case.
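
A toy sketch of why the cleanup is useful in both places. `ToyCoordinator` is hypothetical; only the names `_invalidate_trace` and `reset_step` come from this thread, and their bodies here are assumptions made for illustration, not the actual DeepSpeed code.

```python
class ToyCoordinator:
    """Hypothetical coordinator tracking a trace and prefetched modules."""

    def __init__(self, recorded):
        self.recorded = list(recorded)
        self.step = 0
        self.inflight = set()   # module ids whose params were prefetched
        self.cleanups = []      # (where, freed ids), for inspection only

    def _clean_inflight(self, where):
        if self.inflight:
            self.cleanups.append((where, sorted(self.inflight)))
            self.inflight.clear()

    def prefetch(self, module_id):
        self.inflight.add(module_id)

    def visit(self, module_id):
        self.inflight.discard(module_id)  # prefetched params get consumed
        mismatch = (self.step < len(self.recorded)
                    and self.recorded[self.step] != module_id)
        if mismatch:
            self._invalidate_trace()
        self.step += 1

    def _invalidate_trace(self):
        self.recorded = []
        self._clean_inflight("_invalidate_trace")  # early cleanup

    def reset_step(self):
        self._clean_inflight("reset_step")         # catches the leftovers
        assert not self.inflight, "inflight parameters remain"
        self.step = 0


# Case A: a mid-trace deviation is detected, so cleanup happens early.
a = ToyCoordinator(recorded=[0, 1, 2])
a.prefetch(1)
a.visit(0)
a.visit(2)      # expected module 1
a.reset_step()

# Case B: the last module is simply never visited; no mismatch is ever
# observed, so only reset_step can free the prefetched parameters.
b = ToyCoordinator(recorded=[0, 1, 2])
b.prefetch(2)
b.visit(0)
b.visit(1)
b.reset_step()
```

In case A the registry is already empty by the time `reset_step` runs; in case B `reset_step` is the only point at which the leftover prefetch is cleaned up.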

3 participants