Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NLP] Access scaler only in FP16 case #7916

Merged
merged 5 commits into from
Dec 7, 2023
Merged

Conversation

janekl
Copy link
Collaborator

@janekl janekl commented Nov 20, 2023

What does this PR do ?

When trying to load FP16 model with trainer.precision != 16 (or 16-mixed) a user gets a mysterious error message like:

  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 122, in __init__
    self.model_parallel_config: ModelParallelConfig = self.build_model_parallel_config()
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 782, in build_model_parallel_config
    "grad_scale_func": self.trainer.precision_plugin.scaler.scale
AttributeError: 'NoneType' object has no attribute 'scale'

I'm changing the code to access trainer.precision_plugin.scaler.scale only when trainer is configured with the relevant precision: "16" or "16-mixed".

Collection: NLP

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

@janekl janekl requested a review from ericharper November 20, 2023 15:00
@github-actions github-actions bot added the NLP label Nov 20, 2023
Comment on lines 777 to 781
if self.torch_dtype == torch.float16 and not self.trainer.precision.startswith("16"):
# Make sure that trainer.precision_plugin.scaler exists for config_mapping below
raise ValueError(
"Creating a half-precision model requires setting trainer.precision to '16' or '16-mixed'"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do the same check for bf16?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is needed only for model.precision=16 which is self.torch_dtype here, see this. More specifically, the check is to make sure this will work -- loading FP16 model assumes that a scaler is available in trainer so it needs to be configured properly.

In my understanding in other cases it is the trainer precision to handle training precision if model.precision is different, and this would result in necessary type casting. But I might have not considered all the possibilities.

Which case exactly you mean for model.precision and trainer.precision, respectively?

@janekl janekl force-pushed the jlasek/loading_model_precision_16 branch from e520963 to 6d1a5c0 Compare November 27, 2023 10:04
@janekl janekl changed the title [NLP] Raise informative error on FP16 model and trainer precision mismatch [NLP] Access scaler only in FP16 case Dec 6, 2023
@@ -103,7 +103,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True):
self.tokenizer = None

with open_dict(cfg):
if cfg.get('precision', None) is None and trainer is not None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter condition is guaranteed by the check in L93

janekl and others added 5 commits December 6, 2023 13:36
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
…ecision != 16

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
@janekl janekl force-pushed the jlasek/loading_model_precision_16 branch from eaf20c1 to b08c123 Compare December 6, 2023 12:36
Copy link
Collaborator

@ericharper ericharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@janekl janekl merged commit bbadcf7 into main Dec 7, 2023
15 checks passed
@janekl janekl deleted the jlasek/loading_model_precision_16 branch December 7, 2023 08:21
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
* Remove unused 'precision' variable

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Raise informative error when trying to load FP16 model for trainer.precision != 16

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make sure scaler is available instead of raising error

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* trainer != None is assured thanks to a previous check

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024
* Remove unused 'precision' variable

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Raise informative error when trying to load FP16 model for trainer.precision != 16

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make sure scaler is available instead of raising error

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* trainer != None is assured thanks to a previous check

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
* Remove unused 'precision' variable

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Raise informative error when trying to load FP16 model for trainer.precision != 16

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make sure scaler is available instead of raising error

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* trainer != None is assured thanks to a previous check

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants