Fix for megatron_amp_O2 + MegatronGPTSFTModel #7910

Closed
wants to merge 1 commit

Conversation

@vysarge vysarge (Collaborator) commented Nov 18, 2023

Currently MegatronGPTSFTModel inherits sharded_state_dict from NLPAdapterModelMixin, but restoring weights from the distributed checkpoint expects the state dict produced by MegatronGPTModel's version of sharded_state_dict, resulting in:

  File "/root/llm/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py", line 63, in main
    model = MegatronGPTSFTModel.restore_from(cfg.model.restore_from_path, model_cfg, trainer=trainer)
  File "/root/llm/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/root/llm/NeMo/nemo/core/classes/modelPT.py", line 442, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/root/llm/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 743, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/opt/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 72, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(sharded_state_dict, checkpoint_dir)
  File "/opt/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 123, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/opt/Megatron-LM/megatron/core/dist_checkpointing/dict_utils.py", line 155, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/opt/Megatron-LM/megatron/core/dist_checkpointing/dict_utils.py", line 155, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/opt/Megatron-LM/megatron/core/dist_checkpointing/dict_utils.py", line 159, in dict_list_map_inplace
    return f(x)
  File "/opt/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 120, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 988, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 437, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 418, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/pretrained_models/Llama-2-70b-bf16-mcore/model_weights/model.module.decoder.layers.self_attention.linear_proj._extra_state/shard_0_80.pt'

This code change overrides sharded_state_dict in MegatronGPTSFTModel so that MegatronGPTModel's implementation is called instead. Changes may also be needed to reconcile MegatronT5SFTModel.
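
A minimal sketch of the override described above. The base-class order, import paths, and the prefix argument on sharded_state_dict are assumptions about the surrounding NeMo code, not part of this PR:

    # Sketch only: import paths, base classes, and the prefix parameter are assumptions.
    from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
    from nemo.collections.nlp.parts.mixins.nlp_adapter_mixins import NLPAdapterModelMixin

    class MegatronGPTSFTModel(NLPAdapterModelMixin, MegatronGPTModel):
        def sharded_state_dict(self, prefix: str = ''):
            # Bypass the sharded_state_dict inherited from NLPAdapterModelMixin and
            # build the state dict with MegatronGPTModel's implementation, so the
            # sharded layout matches the distributed checkpoint written at
            # pretraining time.
            return MegatronGPTModel.sharded_state_dict(self, prefix=prefix)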

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
@github-actions github-actions bot added the NLP label Nov 18, 2023
github-actions bot (Contributor) commented Dec 3, 2023

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Dec 3, 2023
@vysarge vysarge (Collaborator, Author) commented Dec 5, 2023

Closing in favor of a more compatible change on the NLPAdapterModelMixin side in #7971.

@vysarge vysarge closed this Dec 5, 2023
@vysarge vysarge deleted the vsarge/sft_O2 branch March 12, 2024 22:36