
Help: some errors when running CHORD #293

@sakuya11111

Description


After installing CHORD following the tutorial, I started a run with:

trinity run --config examples/mix_chord/mix_chord.yaml

but hit the following error:

(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] Error in Trainer:
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] Traceback (most recent call last):
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] File "/home/Trinity-RFT/trinity/trainer/trainer.py", line 79, in train
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] exps, metrics, repr_samples = await sample_task
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] File "/home/anaconda3/envs/trinity-vllm/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 493, in _resume_span
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] return await method(self, *_args, **_kwargs)
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] File "/home/Trinity-RFT/trinity/trainer/trainer.py", line 125, in _sample_data
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] batch, metrics, repr_samples = await self.sample_strategy.sample(
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] File "/home/Trinity-RFT/trinity/algorithm/sample_strategy/mix_sample_strategy.py", line 59, in sample
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] expert_exp_list = await self.expert_exp_buffer.read_async()
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] File "/home/Trinity-RFT/trinity/buffer/reader/file_reader.py", line 97, in read_async
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] return self.read(batch_size)
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] File "/home/Trinity-RFT/trinity/buffer/reader/file_reader.py", line 157, in read
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] task = self.formatter.format(sample)
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] File "/home/Trinity-RFT/trinity/buffer/schema/formatter.py", line 62, in format
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] assert workflow_cls is not None, "default_workflow_type or workflow_key is required"
(Trainer pid=2149038) ERROR 09-21 01:03:49 [trainer.py:93] AssertionError: default_workflow_type or workflow_key is required
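
From the message, the formatter needs some way to map each sample to a workflow. A minimal sketch of the failing check, with simplified names (this is not the actual Trinity-RFT formatter code):

def resolve_workflow_cls(sample, default_workflow_type, registry):
    # Resolve a workflow class from a per-sample 'workflow_key' column,
    # falling back to the buffer-level 'default_workflow_type'.
    key = sample.get("workflow_key") or default_workflow_type
    workflow_cls = registry.get(key) if key is not None else None
    assert workflow_cls is not None, "default_workflow_type or workflow_key is required"
    return workflow_cls

# With neither configured for the buffer, the assert fires:
try:
    resolve_workflow_cls({"messages": []}, None, {})
except AssertionError as e:
    print(e)  # default_workflow_type or workflow_key is required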

So I added a default_workflow_type attribute under buffer.trainer_input.auxiliary_buffers.sft_dataset in mix_chord.yaml:
default_workflow_type: 'math_boxed_workflow'
But after adding it, a new error appeared:

(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] Error in Trainer:
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] Traceback (most recent call last):
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] File "/home/Trinity-RFT/trinity/trainer/trainer.py", line 79, in train
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] exps, metrics, repr_samples = await sample_task
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] File "/home/anaconda3/envs/trinity-vllm/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 493, in _resume_span
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] return await method(self, *_args, **_kwargs)
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] File "/home/Trinity-RFT/trinity/trainer/trainer.py", line 125, in _sample_data
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] batch, metrics, repr_samples = await self.sample_strategy.sample(
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] File "/home/Trinity-RFT/trinity/algorithm/sample_strategy/mix_sample_strategy.py", line 65, in sample
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] exp.tokens[exp.prompt_length :], dtype=torch.float32
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93] AttributeError: 'Task' object has no attribute 'tokens'
(Trainer pid=2763112) ERROR 09-21 15:58:59 [trainer.py:93]
(Trainer pid=2763112) INFO 09-21 15:58:59 [trainer.py:186] Saving checkpoint at step 0...
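
Reading the traceback, expert_exp_buffer.read_async() goes through the same file reader and formatter as before, so what comes back appears to be formatted tasks rather than experiences that carry token ids. A minimal sketch of that mismatch, with assumed class and field names (not the actual Trinity-RFT definitions):

from dataclasses import dataclass

@dataclass
class Task:                  # what formatter.format() appears to return here
    raw_task: dict

@dataclass
class Experience:            # what mix_sample_strategy indexes into
    tokens: list
    prompt_length: int

exp = Task(raw_task={"messages": []})
try:
    exp.tokens[exp.prompt_length:]
except AttributeError as e:
    print(e)  # 'Task' object has no attribute 'tokens'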

I looked into where the exp variable comes from; the buffers involved are configured under trainer_input in mix_chord.yaml:

trainer_input:
  experience_buffer:
    name: math_buffer
    storage_type: queue
    path: 'sqlite:///test_mix_chord.db'

So is the error happening because I haven't configured mix_chord.yaml properly? My yaml settings are below (a quick sanity check of the SFT data follows the config):

project: "mix_chord"
name: "test_mix_chord"
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
algorithm:
algorithm_type: mix_chord
repeat_times: 8 # or 16 for better performance in math related tasks
kl_loss_fn_args:
kl_coef: 0.0
sample_strategy_args:
expert_data_ratio: 0.20
policy_loss_fn_args: # feel free to change, we encourage you to try out different hyperparameters
mu_warmup_steps: 200 # 0 for chord-mu and chord-phi
mu_decay_steps: 400 # 200 for chord-mu and 0 for chord-phi
mu_peak: 0.5 # 0.9 for chord-mu and 0.1 for chord-phi
mu_valley: 0.02 # 0.05 for chord-mu and 0.1 for chord-phi
enable_phi_function: true # false for chord-mu and true for chord-phi
clip_range: 0.2
use_token_level_loss_in_sft: true
use_dynamic_bsz: true
ppo_mini_batch_size: 320 # 320 = 256 + 64; if you set repeat times = 16, then it shoudle be 32 * 16 + 64
ppo_micro_batch_size_per_gpu: 4
ngpus_trainer: 4
train_batch_size_expert: 64
train_batch_size_usual: 256 # 32 batchsize * 8 repeat times
model:
model_path: ${oc.env:TRINITY_MODEL_PATH,/home/Trinity-RFT/download_models/Qwen/Qwen2.5-1.5B-Instruct}
max_response_tokens: 10240
max_model_len: 11264
cluster:
node_num: 1
gpu_per_node: 8
buffer:
total_epochs: 4
batch_size: 32
train_batch_size: 320
explorer_input:
taskset:
name: openr1_data_filtered_int
storage_type: file
path: ${oc.env:TRINITY_TASKSET_PATH, /home/Trinity-RFT/openr1_rl_dataset}
format:
prompt_key: 'problem'
response_key: 'answer'
rollout_args:
temperature: 1.0
logprobs: 0
workflow_args:
with_think: true
eval_tasksets: [] # you can add your own eval tasksets here
#default_workflow_type: 'math_boxed_workflow'
default_workflow_type: 'math_boxed_workflow'
trainer_input:
experience_buffer:
name: math_buffer
storage_type: queue
path: 'sqlite:///test_mix_chord.db'
auxiliary_buffers:
sft_dataset:
total_epochs: 25
name: SFT_data
storage_type: file
path: ${oc.env:TRINITY_SFT_DATASET_PATH,open-r1/Mixture-of-Thoughts/all}
split: 'train'
format:
prompt_type: messages
messages_key: 'messages'
#workflow_key: 'math_workflow'
default_workflow_type: 'math_boxed_workflow'
explorer:
eval_interval: 10
runner_per_model: 8
rollout_model:
engine_num: 4
tensor_parallel_size: 1
enable_prefix_caching: false
enforce_eager: true
dtype: bfloat16
seed: 42
gpu_memory_utilization: 0.3
synchronizer:
sync_method: 'nccl'
sync_interval: 1
sync_timeout: 1200
trainer:
save_interval: 50
trainer_config:
actor_rollout_ref:
model:
use_remove_padding: true
actor:
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 25600
ulysses_sequence_parallel_size: 2
optim:
lr: 1e-6 # or 5e-6, larger lr with warm up can result in better performance for SFT training.
ref:
log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size}
#monitor:
#monitor_type: wandb
#monitor_type: none
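
As a quick, hypothetical sanity check (not from the repo; it assumes the path open-r1/Mixture-of-Thoughts/all refers to the 'all' subset of that dataset on the Hub), one can confirm the SFT samples actually carry a messages list matching format.prompt_type: messages:

from datasets import load_dataset

# Hypothetical check, not Trinity-RFT code: make sure the 'messages' column
# exists, so messages_key in mix_chord.yaml lines up with the data.
ds = load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train", streaming=True)
sample = next(iter(ds))
assert "messages" in sample, f"unexpected columns: {list(sample.keys())}"
print(sample["messages"][0])  # expect dicts like {'role': ..., 'content': ...}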

If anyone has time to take a look, I'd really appreciate any pointers. Many thanks!
