Add mcore spec for full TE TransformerLayer #8316

jbaczek · 2024-02-02T19:05:15Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

- Make sure you read and followed Contributor guidelines
- Did you write any new necessary tests?
- Did you add or update any necessary documentation?
- Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
  - Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

- New Feature
- Bugfix
- Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Layali R <31741533+layalir@users.noreply.github.com> Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

ShriyaPalsamudram · 2024-02-02T19:16:21Z

jenkins

ShriyaPalsamudram · 2024-02-02T19:20:14Z

nemo/collections/nlp/data/language_modeling/megatron/dataset_utils.py

@@ -1330,7 +1330,7 @@ def get_samples_mapping(
        )
    torch.distributed.barrier()
    counts = torch.cuda.LongTensor([1])
-    torch.distributed.all_reduce(counts, group=parallel_state.get_data_parallel_group(with_context_parallel=True))


Please separate context_parallel changes from the full TE spec changes to make review easier.

minitu · 2024-02-02T20:28:04Z

nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

+            apply_residual_connection_post_layernorm=config.apply_residual_connection_post_layernorm,
+            autocast_dtype=precision,
+            #use_emha=False, # Use default 'False'
+            ub_tp_comm_overlap=config.tp_comm_overlap, # TODO: ub_tp_comm_overlap?


Suggested change

-            ub_tp_comm_overlap=config.tp_comm_overlap, # TODO: ub_tp_comm_overlap?
+            ub_tp_comm_overlap=config.tp_comm_overlap,

minitu · 2024-02-02T20:35:42Z

nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py

@@ -225,6 +227,8 @@ def __init__(self, cfg: DictConfig, trainer: Trainer):

        self.mcore_gpt = cfg.get('mcore_gpt', False)
        self.spec_name = cfg.get('name', '')
+        if cfg.get('fp8', False):


I believe this is an artifact from Jimmy's memory fixes (the first two commits), which are now in main.
So those changes should go away as well.

nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

+from nemo.collections.nlp.modules.common.megatron.utils import (
+    ApexGuardDefaults,
+    init_method_normal,
+    scaled_init_method_normal,
+)


nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

+
+try:
+    from megatron.core import parallel_state, tensor_parallel
+    from megatron.core.dist_checkpointing.utils import apply_prefix_mapping


nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

+            "TransformerEngine was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt."
+        )
+
+    return TransformerBlockSubmodules(layer_specs=ModuleSpec(module=TETransformerLayerAutocast))
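The hunks above guard an optional dependency with `try`/`except` and raise an `ImportError` with installation instructions when the spec is requested without it. A minimal sketch of that general pattern follows, using only the standard library. The helper names `optional_import` and `build_spec` are hypothetical and not taken from the PR, and a plain dict stands in for the real spec object.

```python
import importlib


def optional_import(module_name: str):
    """Try to import an optional dependency.

    Returns (module, True) on success and (None, False) if it is missing.
    """
    try:
        return importlib.import_module(module_name), True
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None, False


def build_spec(module_name: str, layer_name: str) -> dict:
    """Hypothetical spec factory that refuses to run without its dependency."""
    _, found = optional_import(module_name)
    if not found:
        raise ImportError(
            f"{module_name} was not found. "
            "Please see the project README for installation instructions."
        )
    # Assumption: a dict stands in for the real spec object in this sketch.
    return {"module": layer_name}
```

Deferring the error to the factory call keeps the module importable on installations that lack the optional library, which is what the reviewer checklist item on import guards asks for.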


jbaczek · 2024-02-05T10:36:03Z

Closing this one in favour of #8328

github-actions bot added the core, NLP, and CI labels Feb 2, 2024

jiemingz and others added 14 commits February 2, 2024 20:06

add first_val_step for mcore schedules

5eaab1f

Signed-off-by: jiemingz <jiemingz@nvidia.com> Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

fix if non fp8

2e52799

Signed-off-by: jiemingz <jiemingz@nvidia.com> Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

3f69086

for more information, see https://pre-commit.ci Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

Revert "context parallelism (NVIDIA#7739)"

8f74c8c

This reverts commit 58d6bce. Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

Add ModuleSpec changes

8d9a60a

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

Add is_first_microbatch

f93b01e

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

Mostly remove manual knobs

7f952c5

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

Remove unused use_emha flag, remove debugging prints

4fc6089

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

Copy over AutocastTransformerLayer, remove direct support for TE TransformerLayer

7b80208

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

Fix checkpoint loading

e0724f7

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

remove custom checkpoint code

3f25c6d

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

Comment out mapping. We should add it together with config update

3b1cb2d

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

Add mcore spec test

d56297d

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

jbaczek force-pushed the clean_mcore_spec branch from 267eccd to d56297d February 2, 2024 19:06

ShriyaPalsamudram reviewed Feb 2, 2024

minitu reviewed Feb 2, 2024

Revert contex parallel changes

73763cc

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>

github-actions bot removed the core label Feb 5, 2024

pre-commit-ci bot and others added 2 commits February 5, 2024 09:29

[pre-commit.ci] auto fixes from pre-commit.com hooks

a7dcba5

for more information, see https://pre-commit.ci

Merge branch 'NVIDIA:main' into clean_mcore_spec

92d521c

github-advanced-security bot found potential problems Feb 5, 2024

jbaczek closed this Feb 5, 2024
