ZeroDivisionError in config.py due to world_size=0 on multi-GPU setup #223

@ShenAC-SAC

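Problem: when running the mix_chord example with trinity run, world_size is computed as 0, which causes a modulo-by-zero (% 0) error.
Environment: Python 3.12, CUDA 12.8, 2x RTX 5090, AutoDL container. Ray status reports GPU: 2.0, and torch.cuda.device_count() returns 2.
Error output: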
WARNING 08-26 15:06:05 [config.py:825] max_prompt_tokens is set to 11263.
INFO 08-26 15:06:05 [config.py:555] buffer.explorer_input.taskset.repeat_times is set to algorithm.repeat_times (=8).
INFO 08-26 15:06:05 [config.py:679] Auto set data_processor.experience_pipeline.input_save_path to /root/Trinity-RFT/examples/mix_chord/mix_chord/test_mix_chord/buffer/explorer_output.jsonl
Traceback (most recent call last):
  File "/root/miniconda3/bin/trinity", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/Trinity-RFT/trinity/cli/launcher.py", line 219, in main
    run(args.config, getattr(args, 'log_level', 'INFO'), args.dlc, args.plugin_dir)
  File "/root/Trinity-RFT/trinity/cli/launcher.py", line 127, in run
    config.check_and_update()
  File "/root/Trinity-RFT/trinity/common/config.py", line 898, in check_and_update
    self.trainer.trainer_config.synchronize_config(self)
  File "/root/Trinity-RFT/trinity/common/verl_config.py", line 302, in synchronize_config
    if config.buffer.train_batch_size % world_size != 0:
ZeroDivisionError: integer modulo by zero
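
For illustration, the failing check in the traceback could be guarded so that a zero world_size produces a readable configuration error instead of a bare ZeroDivisionError. This is only a minimal sketch and not the actual Trinity-RFT code: the function name check_batch_divisible and both error messages are hypothetical, and it assumes world_size is an integer count of trainer GPUs computed earlier in synchronize_config.

```python
def check_batch_divisible(train_batch_size: int, world_size: int) -> None:
    """Validate that the batch can be split evenly across trainer processes.

    Hypothetical helper: the name and messages are assumptions for
    illustration, not part of Trinity-RFT.
    """
    # Reject a non-positive world size first, so the modulo below
    # can never divide by zero.
    if world_size <= 0:
        raise ValueError(
            f"world_size must be positive, got {world_size}; "
            "check how many GPUs are left for the trainer."
        )
    # Same divisibility test as the line shown in the traceback.
    if train_batch_size % world_size != 0:
        raise ValueError(
            f"train_batch_size ({train_batch_size}) must be divisible "
            f"by world_size ({world_size})."
        )
```

With such a guard, calling check_batch_divisible(32, 0) would raise a ValueError that names world_size, which points more directly at the GPU allocation than the current traceback does.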
