move load state dict after initialize parallel state in nlp_model #9382

ryxli · 2024-06-05T01:06:29Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Move loading sharded state dict after initialize parallel state.

Addresses #8460

Notes:

see the original issue for more details
unsure if this is the correct approach, or if there are downstream impacts for this change
This fix also resolves issues when running SFT using NeMo-Aligner (https://github.com/NVIDIA/NeMo-Aligner/blob/main/examples/nlp/gpt/train_gpt_sft.py) loading from a distributed ckpt directly instead of legacy or .nemo

Traceback (most recent call last):
  File "train_sft.py", line 64, in main
    ptl_model, updated_cfg = load_megatron_model(GPTSFTModel, cfg, trainer, additional_keys=additional_keys)
  File "/workspace/NeMo/nemo/collections/nlp/models/nlp_model.py", line 397, in load_from_checkpoint
    checkpoint = dist_checkpointing.load(sharded_state_dict=checkpoint, checkpoint_dir=checkpoint_dir)
  File "/workspace/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 133, in load
    validate_sharding_integrity(nested_values(sharded_state_dict))
  File "/workspace/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 437, in validate_sharding_integrity
    _validate_sharding_for_key(shardings)
  File "/workspace/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 475, in _validate_sharding_for_key
    raise CheckpointingException(f'Invalid access pattern for {rank_sharding[0][1]}')
megatron.core.dist_checkpointing.core.CheckpointingException: Invalid access pattern for ShardedTensor(key='model.embedding.word_embeddings.weight')

Collection: [Note which collection this PR will affect]
nlp/language_modeling

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

For the nemo aligner SFT use case:

change the implementation of nemo_aligner.models.nlp.gpt.gpt_sft_model.GPTSFTModel like so

from omegaconf import DictConfig
from pytorch_lightning.trainer.trainer import Trainer

from nemo_aligner.models.nlp.gpt.gpt_sft_model import GPTSFTModel

from torch.nn.modules.module import Module


class _GPTSFTModel(GPTSFTModel):
    """
    Megatron GPT Supervised Fine-Tuning, support load from mcore distributed ckpt.
    """

    def __init__(self, cfg: DictConfig, trainer: Trainer):
        super().__init__(cfg, trainer=trainer)

    def load_state_dict(self, state_dict, strict: bool = True):
        if self.use_peft and self.setup_complete:
            super().load_state_dict(state_dict, strict=strict)
        else:
            if self.mcore_gpt and not self.use_fsdp:
                # mcore distributed checkpoint
                # we defer loading the state_dict until the class has been initialized
                return Module.load_state_dict(self, state_dict, strict=False)
            else:
                super().load_state_dict(state_dict, strict=strict)

Load the checkpoint overriding the files like so

from fsspec.implementations.local import LocalFileSystem


def load_from_checkpoint_dir(cfg, trainer, modify_confg_fn, checkpoint_logger=None):
    cls = _GPTSFTModel # from above
    app_state = AppState()
    if cfg.model.tensor_model_parallel_size > 1 or cfg.model.pipeline_model_parallel_size > 1:
        app_state.model_parallel_size = (
            cfg.model.tensor_model_parallel_size * cfg.model.pipeline_model_parallel_size * cfg.get("expert_model_parallel_size", 1)
        )
        app_state.tensor_model_parallel_size = cfg.model.tensor_model_parallel_size
        app_state.pipeline_model_parallel_size = cfg.model.pipeline_model_parallel_size
        app_state.expert_model_parallel_size = cfg.get("expert_model_parallel_size", 1)
        (
            app_state.tensor_model_parallel_rank,
            app_state.pipeline_model_parallel_rank,
            app_state.expert_model_parallel_rank,
            app_state.model_parallel_size,
            app_state.data_parallel_size,
            app_state.pipeline_model_parallel_split_rank,
            app_state.virtual_pipeline_model_parallel_rank,
        ) = fake_initialize_model_parallel(
            world_size=app_state.model_parallel_size,
            rank=trainer.global_rank,
            tensor_model_parallel_size_=cfg.model.tensor_model_parallel_size,
            pipeline_model_parallel_size_=cfg.model.pipeline_model_parallel_size,
            pipeline_model_parallel_split_rank_=cfg.model.get("pipeline_model_parallel_split_rank", None),
            expert_model_parallel_size_=cfg.get("expert_model_parallel_size", 1),
        )
    checkpoint_path = os.path.join(
        cfg.model.pretrained_checkpoint.checkpoint_dir,
        cfg.model.pretrained_checkpoint.checkpoint_name,
    )

    if not os.path.isdir(checkpoint_path):
        # legacy checkpoint needs model parallel rank injection
        checkpoint_path = inject_model_parallel_rank(checkpoint_path)
    fs = LocalFileSystem()
    with fs.open(cfg.model.pretrained_checkpoint.hparams_file, "r") as fin:
        hparams_file = OmegaConf.load(fin)
    gpt_cfg = modify_confg_fn(hparams_file.cfg, cfg, add_cfg_to_tree=True)
    with tempfile.NamedTemporaryFile(suffix=".yaml") as f:
        OmegaConf.save(config=gpt_cfg, f=f.name)
        model = cls.load_from_checkpoint(checkpoint_path=checkpoint_path, trainer=trainer, hparams_file=f.name)
        return model, gpt_cfg

in the config, resume the training from a dist ckpt dir:

model:
  pretrained_checkpoint:
    checkpoint_dir: /experiments/example_gpt_model_A/checkpoints/
    checkpoint_name: megatron_gpt--val_loss=1.23-step=5000-consumed_samples=100000.0
    hparams_file: /hparams.yaml

this hparams file, had to manually create as by default hparams.yaml is not saved with dist ckpt format last I checked. It seems like the behavior is to only create and store with .nemo or the legacy pp tp partition format with the PTL callback. Unrelated, but it might be nice to also save the hparams.yaml file when saving a distributed ckpt (not with .nemo)

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
[ x] Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

mikolajblaz · 2024-06-07T12:38:42Z

Look good to me 👍

Can you apply a similar sharded_state_dict delay here? https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/parts/nlp_overrides.py#L947

Signed-off-by: Ryan Li <rynli@amazon.com>

ryxli · 2024-06-07T22:13:56Z

@mikolajblaz

Look good to me 👍

Can you apply a similar sharded_state_dict delay here? https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/parts/nlp_overrides.py#L947

Done, thanks for review.

ryxli · 2024-06-13T20:44:19Z

@ericharper or @mikolajblaz

Could you add the run CICD label?

…IDIA#9382) * move load state dict after initialize parallel state Signed-off-by: Ryan Li <rynli@amazon.com> * delay sharded_state_dict in save_to Signed-off-by: Ryan Li <rynli@amazon.com> --------- Signed-off-by: Ryan Li <rynli@amazon.com> Co-authored-by: Ryan Li <rynli@amazon.com>

github-actions bot added the NLP label Jun 5, 2024

ryxli force-pushed the fix_load_dist_ckpt branch from 5a78127 to 130de2c Compare June 5, 2024 01:08

ericharper requested a review from mikolajblaz June 5, 2024 01:22

mikolajblaz previously approved these changes Jun 7, 2024

View reviewed changes

ryxli dismissed mikolajblaz’s stale review via 77f5191 June 7, 2024 22:10

ryxli force-pushed the fix_load_dist_ckpt branch from c96f3ed to 77f5191 Compare June 7, 2024 22:10

Ryan Li added 2 commits June 7, 2024 22:11

move load state dict after initialize parallel state

f5c481b

Signed-off-by: Ryan Li <rynli@amazon.com>

delay sharded_state_dict in save_to

db27ef8

Signed-off-by: Ryan Li <rynli@amazon.com>

ryxli force-pushed the fix_load_dist_ckpt branch from 77f5191 to db27ef8 Compare June 7, 2024 22:11

ryxli requested a review from mikolajblaz June 12, 2024 00:18

Merge branch 'main' into fix_load_dist_ckpt

2d3bb2f

mikolajblaz approved these changes Jun 12, 2024

View reviewed changes

Merge branch 'main' into fix_load_dist_ckpt

e205623

mikolajblaz added the Run CICD label Jun 14, 2024

ericharper merged commit ec0eb59 into NVIDIA:main Jun 15, 2024
108 of 109 checks passed

ryxli deleted the fix_load_dist_ckpt branch June 15, 2024 03:46

ko3n1g mentioned this pull request Jul 18, 2024

Release 2.0.0rc1 #9786

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

move load state dict after initialize parallel state in nlp_model #9382

move load state dict after initialize parallel state in nlp_model #9382

ryxli commented Jun 5, 2024 •

edited

Loading

mikolajblaz commented Jun 7, 2024

ryxli commented Jun 7, 2024

ryxli commented Jun 13, 2024

move load state dict after initialize parallel state in nlp_model #9382

move load state dict after initialize parallel state in nlp_model #9382

Conversation

ryxli commented Jun 5, 2024 • edited Loading

What does this PR do ?

Changelog

Usage

GitHub Actions CI

Before your PR is "Ready for review"

Who can review?

Additional Information

mikolajblaz commented Jun 7, 2024

ryxli commented Jun 7, 2024

ryxli commented Jun 13, 2024

ryxli commented Jun 5, 2024 •

edited

Loading