trainer bug leads into log failure #29207

Closed
Qizhang-Feng opened this issue Feb 22, 2024 · 4 comments
Comments

@Qizhang-Feng

System Info

  • transformers version: 4.38.1
  • Platform: Linux-5.10.209-198.812.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: True
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 2, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
    - dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': True, 'dynamo_use_fullgraph': True}
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@muellerz @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

start train ...
[2024-02-22 10:30:38,289] [WARNING] [engine.py:1189:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
{'loss': 1.6339, 'grad_norm': tensor(0.0337, device='cuda:0'), 'learning_rate': 5e-06, 'epoch': 0.01}
1%|█▎ | 5/500 [00:16<21:25, 2.60s/it]
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[the same UserWarning is printed seven more times, once per process]
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/DPDPO/sft_llama2.py", line 202, in
trainer.train()
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 331, in train
output = super().train(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2525, in _save_checkpoint
self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/init.py", line 238, in dumps
**kw).encode(obj)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable

Expected behavior

When I use SFTTrainer from TRL, the Trainer's logging step fails with:
TypeError: Object of type Tensor is not JSON serializable

After checking the log message:
{'loss': 1.6339, 'grad_norm': tensor(0.0337, device='cuda:0'), 'learning_rate': 5e-06, 'epoch': 0.01}

I find that grad_norm is a Tensor.
Line 2401 of trainer.py is:
logs["grad_norm"] = grad_norm
(https://github.com/huggingface/transformers/blame/a0857740c0e6127485c11476650314df3accc2b6/src/transformers/trainer.py#L2401)

which should be:
logs["grad_norm"] = grad_norm.item()

With that change, the bug is solved.
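For what it's worth, the failure and the fix can be demonstrated outside the Trainer. This is a minimal sketch; the dict simply mirrors the log entry above, and only torch plus the standard library are assumed:

```python
import json
import torch

# Mirror the Trainer's log entry: grad_norm arrives as a 0-dim tensor, not a float.
logs = {"loss": 1.6339, "grad_norm": torch.tensor(0.0337), "learning_rate": 5e-06, "epoch": 0.01}

try:
    json.dumps(logs)
except TypeError as e:
    print(e)  # Object of type Tensor is not JSON serializable

# Converting the tensor to a Python scalar first, as the proposed .item()
# change does, makes the dict serializable:
logs["grad_norm"] = logs["grad_norm"].item()
print(json.dumps(logs))
```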

@pacman100
Contributor

Thank you @Qizhang-Feng for the issue; your fix makes sense. It would be great if you could open a PR with the fix, since you already have it.

@lastdefiance20

I encountered the same error and resolved it using this solution.

@svenschultze
Contributor

This fixes the issue for DeepSpeed training, but it creates a different issue when you don't use DeepSpeed: on the non-DeepSpeed path, grad_norm is already a plain float, so calling .item() on it at logging time fails.

I fixed it instead by changing line 2012 from
grad_norm = model.get_global_grad_norm()
to
grad_norm = model.get_global_grad_norm().item()
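Given the two failure modes, one way to cover both code paths is a type-agnostic conversion. The helper below (grad_norm_to_scalar is a hypothetical name, not transformers API) is only a sketch of that idea, not the change that was merged:

```python
import torch

def grad_norm_to_scalar(grad_norm):
    """Hypothetical helper: return grad_norm as a plain Python float, whether it
    arrives as a 0-dim tensor (DeepSpeed path) or a float (non-DeepSpeed path)."""
    if isinstance(grad_norm, torch.Tensor):
        return grad_norm.item()
    return grad_norm

print(grad_norm_to_scalar(torch.tensor(0.0337)))  # DeepSpeed path: tensor -> plain float
print(grad_norm_to_scalar(0.0337))                # non-DeepSpeed path: already a float
```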

@TankNee

TankNee commented Feb 23, 2024

Same issue here; it makes my program fail to save the trainer state.
