Confusion about cluster.node_num and cluster.gpu_per_node #406

@tingjun-cs

Description

Confusion about cluster.node_num and cluster.gpu_per_node:

  • Do cluster.node_num and cluster.gpu_per_node declare the number of nodes and GPUs the current job needs, or the global resources of the Ray cluster? My initial impression is the former, since the latter can be obtained simply by calling Ray's API.
  • If they declare the resources the job needs: I have a cluster of 2 nodes with 8 GPUs each. When running the GRPO algorithm, I set cluster.node_num=1, cluster.gpu_per_node=8, explorer.engine_num=1, and explorer.tensor_parallel_size=4. Intuitively, the explorer and trainer of this job should be scheduled onto the same node (since cluster.node_num=1), but in practice each was allocated 4 GPUs and scheduled onto a different node. Is this the intended design?
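If these fields do declare per-job resources (my reading of bullet 1), the GPU split in the scenario above can be sketched as follows. This is illustrative accounting only, not Trinity-RFT code: `split_gpus` is a hypothetical helper, and the assumption that the trainer receives every GPU the explorer does not take is mine.

```python
def split_gpus(node_num, gpu_per_node, engine_num, tensor_parallel_size):
    """Hypothetical accounting: the explorer takes engine_num * tp GPUs
    out of the job's node_num * gpu_per_node budget; the trainer is
    assumed to get the remainder."""
    total = node_num * gpu_per_node
    explorer = engine_num * tensor_parallel_size
    trainer = total - explorer
    return total, explorer, trainer

# The scenario from the question: node_num=1, gpu_per_node=8,
# engine_num=1, tensor_parallel_size=4.
total, explorer, trainer = split_gpus(1, 8, 1, 4)
print(total, explorer, trainer)  # 8 4 4
```

Under this accounting, a 4-GPU explorer plus a 4-GPU trainer fit within a single 8-GPU node, which is why the observed cross-node placement is surprising; whether they actually co-locate is a scheduler placement decision, not an arithmetic one.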
[Image]

Attachment: the GRPO parameter configuration:

project: "agent_grpo"
name: "train-1120"
checkpoint_root_dir: /root/checkpoint
algorithm:
  algorithm_type: grpo
  repeat_times: 8
  advantage_fn: grpo
model:
  model_path: /root/models/Qwen2.5-7B-Instruct
  max_response_tokens: 8192
  max_model_len: 131072

cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  total_epochs: 20480
  batch_size: 8
  train_batch_size: 18
  explorer_input:
    taskset:
      name: train
      storage_type: file
      path: '/root/tasks/train.jsonl'
      split: train
      format:
        prompt_key: 'question'
        response_key: 'answer'
      workflow_args:
        max_turns: 10
      reward_fn_args:
        llm_as_a_judge: true
      rollout_args:
        temperature: 0.6
      enable_progress_bar: true
    eval_tasksets:
    - name: eval
      storage_type: file
      path: '/root/tasks/eval.jsonl'
      split: test
      format:
        prompt_key: 'question'
        response_key: 'answer'
      enable_progress_bar: true
      workflow_args:
        max_turns: 10
      reward_fn_args:
        llm_as_a_judge: true
      rollout_args:
        temperature: 0
    default_workflow_type: 'agent_grpo_step'
  trainer_input:
    experience_buffer:
      name: experience_buffer
      storage_type: queue
      use_priority_queue: true
      max_read_timeout: 7200
explorer:
  eval_interval: 128
  max_repeat_times_per_runner: 1
  max_timeout: 3600
  runner_per_model: 8
  rollout_model:
    enable_thinking: true
    enable_history: true
    enable_openai_api: true
    enable_auto_tool_choice: true
    tool_call_parser: hermes
    engine_num: 1
    tensor_parallel_size: 2
    enable_prefix_caching: false
    enforce_eager: true
    dtype: bfloat16
    seed: 42
    gpu_memory_utilization: 0.9
    enable_chunked_prefill: true
synchronizer:
  sync_style: dynamic_by_explorer
  sync_method: 'nccl'
  sync_interval: 16
  sync_timeout: 3600
trainer:
  save_interval: 64
  trainer_config:
    trainer:
      max_actor_ckpt_to_keep: 5
      max_critic_ckpt_to_keep: 5
    actor_rollout_ref:
      model:
        use_remove_padding: true
      actor:
        use_dynamic_bsz: true
        ppo_max_token_len_per_gpu: 50000
        ulysses_sequence_parallel_size: 2
        entropy_from_logits_with_chunking: true
        optim:
          lr: 1e-6
      ref:
        log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
        log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
        ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size
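Whether the explorer and trainer then land on the same node comes down to the scheduler's placement strategy. The co-location the question expects corresponds to a first-fit (PACK-style) placement, which can be sketched like this. This is an illustrative model under my own assumptions, not the actual Ray or Trinity-RFT scheduler:

```python
def first_fit(demands, nodes):
    """Assign each GPU demand to the first node with enough free GPUs.
    Returns the chosen node index for each demand, in order."""
    free = list(nodes)
    placement = []
    for d in demands:
        for i, f in enumerate(free):
            if f >= d:
                free[i] -= d
                placement.append(i)
                break
        else:
            raise RuntimeError(f"no node can fit a {d}-GPU demand")
    return placement

# Two nodes with 8 GPUs each; a 4-GPU explorer (engine_num=1, tp=4)
# and a 4-GPU trainer (the rest of an 8-GPU budget, if node_num=1
# caps the job at one node's worth of GPUs).
print(first_fit([4, 4], [8, 8]))  # [0, 0] -- both packed onto node 0
```

A SPREAD-style strategy would instead place the two 4-GPU demands on different nodes, which matches the behavior observed in the screenshot; the question is essentially which of these two policies the framework intends.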
