Exception running inference with MCore Distributed Checkpoint with different TP setting than training #8460
Comments
@dimapihtar @ericharper Loading the distributed checkpoint on a single A100 works fine with gbs=1, tp=1, pp=1, mbs=1. Has the team been able to reproduce this internally?
Some debug logs as well; let me know if anything else would be useful to include:
The assertion checks that each shard is accessed exactly once across all ranks:
But this is the behavior I observe:
Notice that there is a TODO to check the shard_access_cnt of replicas as well: https://github.com/NVIDIA/Megatron-LM/blob/0fecd76e995c136021d478c6c52caa57c2f9aa25/megatron/core/dist_checkpointing/serialization.py#L444C1-L447C59 But there should only be one replica per rank, which is the one above. Since I'm setting tensor parallel to 1 here, and gbs to 8, I think the embedding_weights should not be expected to be sharded, since there is only a single TP group with world size 8. I believe that during training this dist ckpt also used tp=1, so it should be fine for the embedding weights to be fully replicated across all of the ranks?
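To make the check above concrete, here is a minimal sketch of that kind of access-count validation. This is my own illustration rather than the actual Megatron-LM code; the `global_shards` structure and the shard attributes (`key`, `global_offset`, `replica_id`) are assumptions for the example.

```python
from collections import Counter

def validate_shard_access(global_shards):
    """Hypothetical check: every chunk of a sharded tensor should be
    covered by exactly one main replica across all ranks."""
    access_cnt = Counter()
    for rank, shard in global_shards:
        # Only main replicas (replica_id == 0) are counted here;
        # the TODO in serialization.py is about also tracking replicas.
        if shard.replica_id == 0:
            access_cnt[(shard.key, shard.global_offset)] += 1
    for chunk, cnt in access_cnt.items():
        assert cnt == 1, f"shard {chunk} accessed {cnt} times, expected exactly 1"
```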
I seem to have figured out the root cause.

Background:
When loading from a distributed checkpoint, we first build the model's sharded state dict so that each rank knows how its tensors map into the global checkpoint. The current logic is as follows:

For the embedding weight, the sharded tensor is created on this line via make_tp_sharded_tensor_for_checkpoint:

The Issue:
At this point, parallel_state is not yet available, since model parallelism has not been initialized by this stage of the load path.

Potential Resolution:
It seems like an easy fix would be to move the initialization logic before the sharded state dict is built.
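To make the dependency concrete, here is a simplified sketch of why a TP-aware sharded-tensor factory needs parallel_state initialized first. This is my own illustration, not the actual make_tp_sharded_tensor_for_checkpoint source; the returned dict and the exact fields are placeholders.

```python
from megatron.core import parallel_state

def make_tp_sharded_tensor_sketch(key, tensor, tp_axis=0):
    # These calls only return meaningful values after model parallelism
    # has been initialized; before that they raise or reflect the wrong
    # (default) configuration, so the sharding metadata built here does
    # not match the TP setting actually used for inference.
    tp_rank = parallel_state.get_tensor_model_parallel_rank()
    tp_size = parallel_state.get_tensor_model_parallel_world_size()
    dp_rank = parallel_state.get_data_parallel_rank()

    global_shape = list(tensor.shape)
    global_shape[tp_axis] *= tp_size
    offset = [0] * tensor.dim()
    offset[tp_axis] = tp_rank * tensor.shape[tp_axis]

    # Returning a plain dict instead of a real ShardedTensor,
    # just to show which fields depend on parallel_state.
    return {
        "key": key,
        "global_shape": tuple(global_shape),
        "global_offset": tuple(offset),
        "replica_id": dp_rank,  # replicas live across the data-parallel group
    }
```

If parallel_state still reflects the pre-initialization (or default TP=1) state when this runs, every rank would describe the tensor with the same global offset, which would explain the duplicated shard accesses the assertion complains about.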
Can you please let me know if this is a correct understanding?
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Any updates on this issue?
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Describe the bug
I have an mcore distributed checkpoint trained with PP=1, TP=1. When I run inference with this checkpoint and set TP higher than 1, I get exceptions and inconsistent hangs.
When running inference with the mcore distributed checkpoint with TP > 1, the following exception is raised:
Steps/Code to reproduce bug
MegatronGPTModel.load_from_checkpoint
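Roughly, the call that triggers this looks like the sketch below (paths, the hparams file, and the exact trainer/strategy arguments are placeholders rather than my exact command):

```python
# Rough reproduction sketch; paths and overrides are placeholders.
from pytorch_lightning import Trainer
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy

# 8 GPUs, intending tensor_model_parallel_size=8 at inference time,
# while the checkpoint itself was trained with TP=1 / PP=1.
trainer = Trainer(devices=8, num_nodes=1, accelerator="gpu", strategy=NLPDDPStrategy())

model = MegatronGPTModel.load_from_checkpoint(
    "/path/to/mcore_dist_ckpt",             # placeholder: distributed checkpoint directory
    hparams_file="/path/to/hparams.yaml",    # placeholder: config overriding tensor_model_parallel_size=8
    trainer=trainer,
)
```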
This results in an exception every time; a percentage of runs are able to complete, but most of the time the process ends up hanging.
Expected behavior
With mcore distributed checkpointing, I expect to be able to load an mcore model with different model parallel configs without any error using the example scripts for inference.
Environment overview

Environment details
Additional context
Attaching full logs in files.
tp8_error.log