[NeMo-UX] Make TE and Apex dependencies optional (NVIDIA#9732)

* [NeMo-UX] Make TE and Apex dependencies optional (NVIDIA#9550) * Provide a pure pytorch/jit path to avoid required dependency on TE and Apex Signed-off-by: ashors1 <ashors@nvidia.com> * add missing file Signed-off-by: ashors1 <ashors@nvidia.com> * add minimal gpt pretraining example Signed-off-by: ashors1 <ashors@nvidia.com> * fix pre-training datamodule initialization Signed-off-by: ashors1 <ashors@nvidia.com> * add non-te/non-apex test Signed-off-by: ashors1 <ashors@nvidia.com> * add comment to pretraining script Signed-off-by: ashors1 <ashors@nvidia.com> * use microbatch calculator from mcore Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * fix nemo 2 test name Signed-off-by: ashors1 <ashors@nvidia.com> * update Mcore commit for CI Signed-off-by: ashors1 <ashors@nvidia.com> * replace apex microbatch calculator with megatron's in more places Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * fix missing import Signed-off-by: ashors1 <ashors@nvidia.com> * fix typo Signed-off-by: ashors1 <ashors@nvidia.com> * fix missed apex import Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> Signed-off-by: ashors1 <ashors@nvidia.com> * move imports Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> Signed-off-by: ashors1 <ashors@nvidia.com> * move imports Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * add types to command-line args Signed-off-by: ashors1 <ashors@nvidia.com> * bug fix Signed-off-by: ashors1 <ashors@nvidia.com> * fix path Signed-off-by: ashors1 <ashors@nvidia.com> * Disable distributed optimizer in nemo 2.0 test Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * fix optimizer config Signed-off-by: ashors1 <ashors@nvidia.com> * update checkpointing Signed-off-by: ashors1 <ashors@nvidia.com> * move import Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * fix failing unit test Signed-off-by: ashors1 <ashors@nvidia.com> * fix failing test Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * Updating num_weights check of RETRO due to underlying changes from mcore RETRO MLM Signed-off-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: huvunvidia <huvunvidia@users.noreply.github.com> * fix typo Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * remove stale warning Signed-off-by: ashors1 <ashors@nvidia.com> * fix lora notebook Signed-off-by: ashors1 <ashors@nvidia.com> * fix small typo Signed-off-by: ashors1 <ashors@nvidia.com> * add import guards to gemma2 Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> --------- Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: ashors1 <ashors1@users.noreply.github.com> Signed-off-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com> Signed-off-by: huvunvidia <huvunvidia@users.noreply.github.com> Co-authored-by: ashors1 <ashors1@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com> Co-authored-by: huvunvidia <huvunvidia@users.noreply.github.com> * fix cherry-pick Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> --------- Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: ashors1 <ashors1@users.noreply.github.com> Signed-off-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com> Signed-off-by: huvunvidia <huvunvidia@users.noreply.github.com> Co-authored-by: ashors1 <ashors1@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com> Co-authored-by: huvunvidia <huvunvidia@users.noreply.github.com> Signed-off-by: Hainan Xu <hainanx@nvidia.com>
hainan-xv · Nov 5, 2024 · 93e1abe · 93e1abe
1 parent 6579102
commit 93e1abe
Show file tree

Hide file tree

Showing 46 changed files with 762 additions and 627 deletions.
diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml
@@ -4516,6 +4516,48 @@ jobs:
       AFTER_SCRIPT: |
         rm -rf examples/multimodal/text_to_image/sd_train_results
 
+  L2_NeMo_2_GPT_Pretraining_no_transformer_engine:
+    needs: [cicd-test-container-setup]
+    runs-on: self-hosted-azure
+    timeout-minutes: 10
+    container:
+      image: nemoci.azurecr.io/nemo_container_${{ github.run_id }}
+      options:
+        --device=/dev/nvidia0
+        --gpus all
+        --shm-size=8g
+        --env TRANSFORMERS_OFFLINE=0
+        --volume /mnt/datadrive/TestData:/home/TestData
+    steps:
+        - name: Checkout repository
+          uses: actions/checkout@v4
+        - run: |
+            pip uninstall -y apex ## TODO: remove when apex is no longer a dependency
+            pip uninstall -y transformer_engine
+
+            python examples/llm/megatron_gpt_pretraining.py \
+            --devices=2 \
+            --max-steps=3 \
+            --experiment-dir=examples/llm/gpt_pretrain_results \
+            --vocab-path=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \
+            --merges-path=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \
+            --data-path=/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document \
+            --index-mapping-dir=examples/llm/gpt_index_mappings
+
+            python examples/llm/megatron_gpt_pretraining.py \
+            --devices=2 \
+            --max-steps=6 \
+            --experiment-dir=examples/llm/gpt_pretrain_results \
+            --vocab-path=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \
+            --merges-path=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \
+            --data-path=/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document \
+            --index-mapping-dir=examples/llm/gpt_index_mappings
+
+            rm -rf examples/llm/gpt_pretrain_results
+            rm -rf examples/llm/gpt_index_mappings
+        - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
+          if: "failure()"
+
   Nemo_CICD_Test:
     needs: 
       - L0_Unit_Tests_GPU
@@ -4616,6 +4658,7 @@ jobs:
       - L2_TTS_Fast_dev_runs_1_Hifigan
       - Speech_Checkpoints_tests
       - L2_Stable_Diffusion_Training
+      - L2_NeMo_2_GPT_Pretraining_no_transformer_engine
     if: always()
     runs-on: ubuntu-latest
     steps:  

diff --git a/Dockerfile.ci b/Dockerfile.ci
@@ -34,7 +34,7 @@ WORKDIR /workspace
 # Install NeMo requirements
 ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
 ARG MODELOPT_VERSION=0.13.0
-ARG MCORE_TAG=c0164bcfd4f8213a10a6b1e47ef80721a68b4fb6
+ARG MCORE_TAG=c7a1f82d761577e6ca0338d3521eac82f2aa0904
 ARG APEX_TAG=810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c
 RUN \
 --mount=type=bind,source=requirements,target=requirements \

diff --git a/examples/llm/megatron_gpt_pretraining.py b/examples/llm/megatron_gpt_pretraining.py
@@ -0,0 +1,109 @@
+## NOTE: This script is present for github-actions testing only.
+## There are no guarantees that this script is up-to-date with latest NeMo.
+
+import argparse
+
+from megatron.core.optimizer import OptimizerConfig
+from pytorch_lightning.loggers import TensorBoardLogger
+
+from nemo import lightning as nl
+from nemo.collections import llm
+from nemo.collections.llm.api import train
+from nemo.collections.llm.gpt.data import PreTrainingDataModule
+from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
+from nemo.lightning import NeMoLogger
+from nemo.lightning.pytorch.callbacks import ModelCheckpoint
+from nemo.lightning.pytorch.optim.megatron import MegatronOptimizerModule
+
+
+def get_args():
+    parser = argparse.ArgumentParser(description='Train a small GPT model using NeMo 2.0')
+    parser.add_argument('--devices', type=int, help="Number of devices to use for training")
+    parser.add_argument('--max-steps', type=int, help="Number of steps to train for")
+    parser.add_argument('--experiment-dir', type=str, help="directory to write results and checkpoints to")
+    parser.add_argument('--data-path', type=str, help="Path to data file")
+    parser.add_argument('--vocab-path', type=str, help="Path to vocab file")
+    parser.add_argument('--merges-path', type=str, help="Path to merges file")
+    parser.add_argument('--index-mapping-dir', type=str, help="directory to write index mappings to")
+
+    return parser.parse_args()
+
+
+if __name__ == '__main__':
+
+    args = get_args()
+
+    seq_length = 2048
+
+    tokenizer = get_nmt_tokenizer(
+        "megatron",
+        "GPT2BPETokenizer",
+        vocab_file=args.vocab_path,
+        merges_file=args.merges_path,
+    )
+    data = PreTrainingDataModule(
+        paths=args.data_path,
+        seq_length=2048,
+        global_batch_size=32,
+        seed=1234,
+        tokenizer=tokenizer,
+    )
+    gpt_config = llm.GPTConfig(
+        num_layers=12,
+        hidden_size=768,
+        ffn_hidden_size=3072,
+        num_attention_heads=12,
+        seq_length=seq_length,
+        init_method_std=0.023,
+        hidden_dropout=0.1,
+        attention_dropout=0.1,
+        layernorm_epsilon=1e-5,
+        make_vocab_size_divisible_by=128,
+    )
+    model = llm.GPTModel(gpt_config, tokenizer=data.tokenizer)
+    strategy = nl.MegatronStrategy()
+    checkpoint_callback = ModelCheckpoint(
+        every_n_train_steps=5000,
+        enable_nemo_ckpt_io=False,
+        async_save=False,
+    )
+    callbacks = [checkpoint_callback]
+
+    loggers = []
+    tensorboard_logger = TensorBoardLogger(
+        save_dir='dummy',  ## NOTE: this gets overwritten by default
+    )
+    loggers.append(tensorboard_logger)
+
+    opt_config = OptimizerConfig(
+        optimizer='adam',
+        lr=6e-4,
+        min_lr=6e-5,
+        use_distributed_optimizer=False,
+        bf16=True,
+    )
+    opt = MegatronOptimizerModule(config=opt_config)
+
+    trainer = nl.Trainer(
+        devices=args.devices,
+        max_steps=args.max_steps,
+        accelerator="gpu",
+        strategy=strategy,
+        logger=loggers,
+        callbacks=callbacks,
+        log_every_n_steps=1,
+        plugins=nl.MegatronMixedPrecision(precision="bf16-mixed", amp_O2=False),
+    )
+
+    nemo_logger = NeMoLogger(
+        dir=args.experiment_dir,
+    )
+
+    train(
+        model=model,
+        data=data,
+        trainer=trainer,
+        log=nemo_logger,
+        tokenizer='data',
+        optim=opt,
+    )
diff --git a/examples/nlp/machine_translation/nmt_transformer_infer_megatron.py b/examples/nlp/machine_translation/nmt_transformer_infer_megatron.py
@@ -29,21 +29,12 @@
 
 from nemo.collections.nlp.models.machine_translation.megatron_nmt_model import MegatronNMTModel
 from nemo.collections.nlp.modules.common.megatron.megatron_init import fake_initialize_model_parallel
-from nemo.collections.nlp.modules.common.megatron.utils import ApexGuardDefaults
 from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector
 from nemo.core.config import hydra_runner
 from nemo.utils import logging
 from nemo.utils.app_state import AppState
 from nemo.utils.model_utils import inject_model_parallel_rank
 
-try:
-    from apex.transformer.pipeline_parallel.utils import _reconfigure_microbatch_calculator
-
-    HAVE_APEX = True
-except (ImportError, ModuleNotFoundError):
-    ModelType = ApexGuardDefaults()
-    HAVE_APEX = False
-
 
 @hydra_runner(config_path="conf", config_name="nmt_megatron_infer")
 def main(cfg) -> None:
@@ -101,13 +92,19 @@ def main(cfg) -> None:
             src_text.append(line.strip())
             if len(src_text) == cfg.batch_size:
                 translations = model.translate(
-                    text=src_text, source_lang=cfg.source_lang, target_lang=cfg.target_lang,
+                    text=src_text,
+                    source_lang=cfg.source_lang,
+                    target_lang=cfg.target_lang,
                 )
                 for translation in translations:
                     tgt_f.write(translation + "\n")
                 src_text = []
         if len(src_text) > 0:
-            translations = model.translate(text=src_text, source_lang=cfg.source_lang, target_lang=cfg.target_lang,)
+            translations = model.translate(
+                text=src_text,
+                source_lang=cfg.source_lang,
+                target_lang=cfg.target_lang,
+            )
             for translation in translations:
                 tgt_f.write(translation + "\n")
 

diff --git a/nemo/collections/llm/gpt/data/pre_training.py b/nemo/collections/llm/gpt/data/pre_training.py
@@ -173,10 +173,8 @@ def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
             state_dict: the datamodule state returned by ``state_dict``.
 
         """
-        try:
-            from apex.transformer.pipeline_parallel.utils import _GLOBAL_NUM_MICROBATCHES_CALCULATOR
-        except ModuleNotFoundError:
-            from nemo.lightning.apex_utils import _GLOBAL_NUM_MICROBATCHES_CALCULATOR
+        from megatron.core.num_microbatches_calculator import _GLOBAL_NUM_MICROBATCHES_CALCULATOR
+
         consumed_samples = state_dict['consumed_samples']
         self.data_sampler.init_consumed_samples = consumed_samples
         self.data_sampler.prev_consumed_samples = consumed_samples

diff --git a/nemo/collections/llm/gpt/model/base.py b/nemo/collections/llm/gpt/model/base.py
@@ -14,6 +14,12 @@
 from nemo.lightning.megatron_parallel import MaskedTokenLossReduction
 from nemo.lightning.pytorch.optim import MegatronOptimizerModule, OptimizerModule
 
+HAVE_TE = True
+try:
+    import transformer_engine
+except (ImportError, ModuleNotFoundError):
+    HAVE_TE = False
+
 if TYPE_CHECKING:
     from megatron.core.models.gpt.gpt_model import GPTModel as MCoreGPTModel
 
@@ -80,6 +86,13 @@ def local_layer_spec(config: "GPTConfig") -> ModuleSpec:
     )
 
 
+def default_layer_spec(config: "GPTConfig") -> ModuleSpec:
+    if HAVE_TE:
+        return transformer_engine_layer_spec(config)
+    else:
+        return local_layer_spec(config)
+
+
 @dataclass
 class GPTConfig(TransformerConfig, io.IOMixin):
     # From megatron.core.models.gpt.gpt_model.GPTModel
@@ -96,7 +109,7 @@ class GPTConfig(TransformerConfig, io.IOMixin):
     # TODO: Move this to better places?
     get_attention_mask_from_fusion: bool = False
 
-    transformer_layer_spec: Union[ModuleSpec, Callable[["GPTConfig"], ModuleSpec]] = transformer_engine_layer_spec
+    transformer_layer_spec: Union[ModuleSpec, Callable[["GPTConfig"], ModuleSpec]] = default_layer_spec
     forward_step_fn: Callable = gpt_forward_step
     data_step_fn: Callable = gpt_data_step
 

diff --git a/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py b/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py
@@ -65,19 +65,10 @@
 from nemo.core.classes.common import PretrainedModelInfo
 from nemo.utils import logging
 
-try:
-    import apex.transformer.pipeline_parallel.utils
-    from apex.transformer.pipeline_parallel.utils import get_num_microbatches
-
-    HAVE_APEX = True
-
-except (ImportError, ModuleNotFoundError):
-
-    HAVE_APEX = False
-
 try:
     from megatron.core import InferenceParams, dist_checkpointing, parallel_state, tensor_parallel
     from megatron.core.models.gpt import GPTModel as MCoreGPTModel
+    from megatron.core.num_microbatches_calculator import get_num_microbatches
     from megatron.core.pipeline_parallel.schedules import get_forward_backward_func
     from megatron.core.transformer.utils import make_sharded_tensors_for_checkpoint