trainer bug leads into log failure #29207
Comments
Thank you @Qizhang-Feng for the issue, your fix makes sense. It would be great if you could open the PR with the fix, since you already have it.

I encountered the same error and resolved it using this solution.

This fixes the issue for DeepSpeed training, but creates a different issue when you don't use DeepSpeed. I fixed it by changing

Same issue here; it makes my program fail to save the trainer state.
System Info
- transformers version: 4.38.1
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- debug: True
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 2, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': True, 'dynamo_use_fullgraph': True}
Who can help?
@muellerz @pacman100
Information

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
start train ...
[2024-02-22 10:30:38,289] [WARNING] [engine.py:1189:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
{'loss': 1.6339, 'grad_norm': tensor(0.0337, device='cuda:0'), 'learning_rate': 5e-06, 'epoch': 0.01}
1%|█▎ | 5/500 [00:16<21:25, 2.60s/it]/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
(the same UserWarning is emitted once per process; duplicate lines omitted)
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/DPDPO/sft_llama2.py", line 202, in
trainer.train()
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 331, in train
output = super().train(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2525, in _save_checkpoint
self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable
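The traceback bottoms out in `json.dumps`, which has no encoder for arbitrary objects such as `torch.Tensor`. A minimal sketch of the failure mode (using a hypothetical stand-in class so it runs without PyTorch installed):

```python
import json

# Stand-in for torch.Tensor: json.dumps rejects any object it has no encoder for.
class FakeTensor:
    def __init__(self, value):
        self.value = value

    def item(self):  # torch.Tensor exposes .item() to extract a Python scalar
        return self.value

state = {"loss": 1.6339, "grad_norm": FakeTensor(0.0337)}

try:
    json.dumps(state)
except TypeError as exc:
    print(exc)  # Object of type FakeTensor is not JSON serializable

# Converting to a plain float first makes the dict serializable again.
state["grad_norm"] = state["grad_norm"].item()
print(json.dumps(state))
```

This mirrors what happens in `TrainerState.save_to_json`: as long as the tensor-valued `grad_norm` sits in the log history, every checkpoint save fails.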
Expected behavior
When I use SFTTrainer from TRL, I hit the following error in the trainer log:

TypeError: Object of type Tensor is not JSON serializable

After checking the log message:

{'loss': 1.6339, 'grad_norm': tensor(0.0337, device='cuda:0'), 'learning_rate': 5e-06, 'epoch': 0.01}

I found that grad_norm is a Tensor.

Line 2401 of trainer.py is:

logs["grad_norm"] = grad_norm

(https://github.com/huggingface/transformers/blame/a0857740c0e6127485c11476650314df3accc2b6/src/transformers/trainer.py#L2401)

which should be:

logs["grad_norm"] = grad_norm.item()

and the bug is solved.
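As noted in the comments above, calling `.item()` unconditionally can break runs without DeepSpeed, where `grad_norm` may already arrive as a plain float. A sketch of a type-guarded conversion (the helper name is hypothetical, not part of transformers; it duck-types on `.item()` so the snippet runs without PyTorch):

```python
# Hypothetical helper: convert grad_norm to a plain Python float whether it
# arrives as a tensor (DeepSpeed path) or already as a float (other paths).
def grad_norm_to_scalar(grad_norm):
    # torch.Tensor exposes .item() to extract a Python scalar; floats do not.
    return grad_norm.item() if hasattr(grad_norm, "item") else grad_norm

# Works for the plain-float (non-DeepSpeed) path:
print(grad_norm_to_scalar(0.0337))  # 0.0337
```

In trainer.py the same guard would read `logs["grad_norm"] = grad_norm.item() if isinstance(grad_norm, torch.Tensor) else grad_norm`, covering both training backends.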