[NLP] Access scaler only in FP16 case #7916
Conversation
if self.torch_dtype == torch.float16 and not self.trainer.precision.startswith("16"):
    # Make sure that trainer.precision_plugin.scaler exists for config_mapping below
    raise ValueError(
        "Creating a half-precision model requires setting trainer.precision to '16' or '16-mixed'"
    )
Should we do the same check for bf16?
I think it is needed only for model.precision=16, which corresponds to self.torch_dtype here. More specifically, the check is there to make sure the subsequent code will work: loading an FP16 model assumes that a scaler is available in the trainer, so the trainer needs to be configured accordingly. In my understanding, in the other cases the trainer precision handles the training precision even when model.precision is different, and this results in the necessary type casting. But I might not have considered all the possibilities. Which cases exactly do you mean for model.precision and trainer.precision, respectively?
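To make the discussion concrete, here is a minimal sketch of the guard in question (illustrative, not the exact NeMo code; the helper name is made up): the scaler attribute is only read when the model dtype is FP16 and the trainer actually runs a 16-bit precision plugin, so nothing is accessed for FP32 or BF16 trainers.

```python
# Illustrative sketch, not the exact NeMo implementation.
# Assumption: trainer.precision_plugin.scaler is a torch.cuda.amp.GradScaler
# only when the Trainer was created with precision='16' or '16-mixed'.
import torch
from pytorch_lightning import Trainer


def grad_scale_func_or_none(trainer: Trainer, torch_dtype: torch.dtype):
    """Return the scaler's scale function only in the FP16 case, else None."""
    if torch_dtype == torch.float16 and str(trainer.precision).startswith("16"):
        # Safe: a GradScaler is guaranteed to exist for 16-bit mixed precision.
        return trainer.precision_plugin.scaler.scale
    # FP32 and BF16 trainers have no scaler, so it is never touched here.
    return None
```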
Force-pushed from e520963 to 6d1a5c0.
@@ -103,7 +103,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True):
        self.tokenizer = None

        with open_dict(cfg):
            if cfg.get('precision', None) is None and trainer is not None:
The latter condition (trainer is not None) is already guaranteed by the check in L93, so it is redundant here.
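Given that earlier check, the condition could be reduced as in the sketch below (illustrative; cfg and trainer are the __init__ arguments shown above, and the assignment of cfg.precision is assumed from the surrounding context, not shown in this hunk):

```python
# Illustrative simplification, assuming an earlier check already raised when trainer is None.
from omegaconf import open_dict

with open_dict(cfg):
    if cfg.get('precision', None) is None:
        # Assumed from context: mirror the trainer's precision into the model config.
        cfg.precision = trainer.precision
```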
Force-pushed from eaf20c1 to b08c123.
LGTM. Thanks!
* Remove unused 'precision' variable
* Raise informative error when trying to load FP16 model for trainer.precision != 16
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Make sure scaler is available instead of raising error
* trainer != None is assured thanks to a previous check

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do?
When trying to load an FP16 model with trainer.precision != 16 (or 16-mixed), a user gets a mysterious error message. I'm changing the code to access trainer.precision_plugin.scaler.scale only when the trainer is configured with the relevant precision: "16" or "16-mixed".

Collection: NLP