
Why does training always fail at checkpoint-2400? #21

Open
haomengt opened this issue Sep 19, 2024 · 1 comment

@haomengt
Hello author, I ran training exactly with the command you provided, but it always fails when it reaches checkpoint-2400. The error is shown below:
{'loss': 9.7984, 'grad_norm': 13.973094940185547, 'learning_rate': 0.000741541788969566, 'epoch': 0.03}
1%| | 2400/215760 [2:58:49<265:49:35, 4.49s/it]output_dir /apps/data/models/urbangpt/UrbanGPT/checkpoints/UrbanGPT/checkpoint-2400
up /apps/data/models/urbangpt/UrbanGPT/checkpoints/UrbanGPT/st_projector checkpoint-2400
Traceback (most recent call last):
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/generation/configuration_utils.py", line 771, in save_pretrained
raise ValueError(str([w.message for w in caught_warnings]))
ValueError: [UserWarning('do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.'), UserWarning('do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.')]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/apps/data/models/urbangpt/UrbanGPT/urbangpt/train/train_mem.py", line 30, in <module>
train()
File "/apps/data/models/urbangpt/UrbanGPT/urbangpt/train/train_st.py", line 822, in train
trainer.train()
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 2356, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 2807, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 2886, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 3454, in save_model
self._save(output_dir)
File "/apps/data/models/urbangpt/UrbanGPT/urbangpt/train/stchat_trainer.py", line 56, in _save
super(STChatTrainer, self)._save(output_dir, state_dict)
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 3525, in _save
self.model.save_pretrained(
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2593, in save_pretrained
model_to_save.generation_config.save_pretrained(save_directory)
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/generation/configuration_utils.py", line 773, in save_pretrained
raise ValueError(
ValueError: The generation config instance is invalid -- .validate() throws warnings and/or exceptions. Fix these issues to save the configuration.

Thrown during validation:
[UserWarning('do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.'), UserWarning('do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.')]
[rank0]: (The rank-0 process printed the same traceback again, prefixed with "[rank0]:".)
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /apps/data/models/urbangpt/UrbanGPT/wandb/offline-run-20240918_231037-16ynzesq
wandb: Find logs at: wandb/offline-run-20240918_231037-16ynzesq/logs
W0919 02:09:32.451000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492236 closing signal SIGTERM
W0919 02:09:32.452000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492237 closing signal SIGTERM
W0919 02:09:32.452000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492238 closing signal SIGTERM
W0919 02:09:32.452000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492239 closing signal SIGTERM
W0919 02:09:32.453000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492240 closing signal SIGTERM
E0919 02:09:33.696000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 492235) of binary: /apps/data/conda/envs/urbanGPT/bin/python
Traceback (most recent call last):
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

@LZH-YS1998
Collaborator

Hello, I'm sorry, I haven't encountered this error myself. Maybe this link can help you: ValueError: The generation config instance is invalid.

"This error appears to be a problem introduced by upgrading the transformers version. I fixed it by manually adding
do_sample: true
to vicuna's generation_config.json file."
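As a sketch, the manual edit above can also be done with a short script. This is only a minimal illustration, assuming the standard `generation_config.json` layout that `save_pretrained` writes; the path and the `enable_sampling` helper name are hypothetical, and an alternative fix (commented out) is to drop the sampling-only keys instead:

```python
import json

def enable_sampling(path: str) -> None:
    """Set do_sample=True in a generation_config.json so that
    temperature/top_p are valid and .validate() no longer warns."""
    with open(path) as f:
        cfg = json.load(f)
    # temperature/top_p only apply in sampling mode, so enable sampling...
    cfg["do_sample"] = True
    # ...or, alternatively, remove the sampling-only keys instead:
    # cfg.pop("temperature", None)
    # cfg.pop("top_p", None)
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)

# Usage (hypothetical path; point at your vicuna weights directory):
# enable_sampling("/path/to/vicuna/generation_config.json")
```

Either variant makes the config self-consistent, so the checkpoint-saving call that triggered the crash should succeed.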
