trainer bug leads into log failure #29207

Closed
Qizhang-Feng opened this issue Feb 22, 2024 · 4 comments
Comments

@Qizhang-Feng

System Info

  • transformers version: 4.38.1
  • Platform: Linux-5.10.209-198.812.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: True
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 2, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
    - dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': True, 'dynamo_use_fullgraph': True}
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@muellerz @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

start train ...
[2024-02-22 10:30:38,289] [WARNING] [engine.py:1189:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
{'loss': 1.6339, 'grad_norm': tensor(0.0337, device='cuda:0'), 'learning_rate': 5e-06, 'epoch': 0.01}
1%|█▎ | 5/500 [00:16<21:25, 2.60s/it]
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[the same UserWarning is printed seven more times, once per process]
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/DPDPO/sft_llama2.py", line 202, in
trainer.train()
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 331, in train
output = super().train(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2525, in _save_checkpoint
self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/init.py", line 238, in dumps
**kw).encode(obj)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable

Expected behavior

When I use SFTTrainer from TRL, the Trainer's logging step fails with:
TypeError: Object of type Tensor is not JSON serializable

After checking the log message:
{'loss': 1.6339, 'grad_norm': tensor(0.0337, device='cuda:0'), 'learning_rate': 5e-06, 'epoch': 0.01}

I find that grad_norm is a Tensor.
Line 2401 of trainer.py is:
logs["grad_norm"] = grad_norm
(https://github.com/huggingface/transformers/blame/a0857740c0e6127485c11476650314df3accc2b6/src/transformers/trainer.py#L2401)

which should be:
logs["grad_norm"] = grad_norm.item()

With that change, the bug is solved.
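For what it's worth, the failure and the fix can be demonstrated outside the Trainer. This is a minimal sketch; the dict simply mirrors the log entry above, and only torch plus the standard library are assumed:

```python
import json
import torch

# Mirror the Trainer's log entry: grad_norm arrives as a 0-dim tensor, not a float.
logs = {"loss": 1.6339, "grad_norm": torch.tensor(0.0337), "learning_rate": 5e-06, "epoch": 0.01}

try:
    json.dumps(logs)
except TypeError as e:
    print(e)  # Object of type Tensor is not JSON serializable

# Converting the tensor to a Python scalar first, as the proposed .item()
# change does, makes the dict serializable:
logs["grad_norm"] = logs["grad_norm"].item()
print(json.dumps(logs))
```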

@pacman100
Contributor

Thank you @Qizhang-Feng for the issue; your fix makes sense. It would be great if you could open a PR with the fix, since you already have it.

@lastdefiance20

I encountered the same error and resolved it using this solution.

@svenschultze
Contributor

This fixes the issue for DeepSpeed training, but it creates a different issue when you don't use DeepSpeed: on the non-DeepSpeed path, grad_norm is already a plain float, so calling .item() on it at logging time fails.

I fixed it instead by changing line 2012 from
grad_norm = model.get_global_grad_norm()
to
grad_norm = model.get_global_grad_norm().item()
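Given the two failure modes, one way to cover both code paths is a type-agnostic conversion. The helper below (grad_norm_to_scalar is a hypothetical name, not transformers API) is only a sketch of that idea, not the change that was merged:

```python
import torch

def grad_norm_to_scalar(grad_norm):
    """Hypothetical helper: return grad_norm as a plain Python float, whether it
    arrives as a 0-dim tensor (DeepSpeed path) or a float (non-DeepSpeed path)."""
    if isinstance(grad_norm, torch.Tensor):
        return grad_norm.item()
    return grad_norm

print(grad_norm_to_scalar(torch.tensor(0.0337)))  # DeepSpeed path: tensor -> plain float
print(grad_norm_to_scalar(0.0337))                # non-DeepSpeed path: already a float
```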

@TankNee

TankNee commented Feb 23, 2024

Same issue here; it makes my program fail to save the trainer state.
