
[fix] Improve the params template for generation #351

Merged Feb 24, 2025 (4 commits)

Conversation

BearBiscuit05
Contributor

Fixes issue #331.

@vermouth1992
Collaborator

Could you help add a test of QWen 0.5b generation to protect this functionality?

@BearBiscuit05
Contributor Author

Sure, I used Qwen 0.5B for testing on a single machine. But in which directory under the "test" directory should I add the test?

@vermouth1992
Collaborator

Could you create a new folder named "generation" under test? Under that folder, create a new bash script that runs QWen 0.5b generation, and call the generation script here https://github.com/volcengine/verl/blob/main/.github/workflows/vllm.yml#L49 by creating a new test item. Thanks!

@BearBiscuit05
Contributor Author

Running with 1 GPU works normally, but when setting nproc_per_node > 1, it fails with "Duplicate GPU detected: rank 0 and rank 1 both on CUDA device 31000". I'm unsure whether this is caused by a parameter configuration issue or a hardware-related problem. Could you help me identify the root cause?

@vermouth1992
Collaborator

vermouth1992 commented Feb 23, 2025

Could you check the version of ray? And could you successfully run normal PPO training?

@BearBiscuit05
Contributor Author

Ray version is 2.10, and I ran PPO on 2 * A100 successfully. So I think it may be a parameter problem. I will check it tomorrow.

@vermouth1992
Collaborator

You can either set max_colocate_count to 1 (https://github.com/volcengine/verl/blob/main/verl/single_controller/ray/base.py#L55) or upgrade Ray to the latest version to resolve this problem.
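A minimal sketch of the first workaround, assuming the resource-pool constructor in verl/single_controller/ray/base.py exposes a max_colocate_count argument; the exact class name, parameters, and values below are illustrative rather than verified against the current code:

```python
# Hypothetical sketch: build the resource pool with max_colocate_count=1 so each
# GPU hosts at most one colocated worker, which avoids the "Duplicate GPU detected"
# error seen with older Ray versions. Adjust to the actual constructor signature.
from verl.single_controller.ray.base import RayResourcePool

resource_pool = RayResourcePool(
    process_on_nodes=[4],   # e.g. 4 GPUs on a single node (illustrative)
    use_gpu=True,
    max_colocate_count=1,   # workaround discussed in this thread
)
```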

@BearBiscuit05
Contributor Author

That's great! I successfully ran the generation with multiple GPUs and TP>1. So, in the test script, should I set TP>1?

@vermouth1992
Collaborator

Yes, please set tp=2.

@BearBiscuit05
Contributor Author

Done, the script successfully ran on 4 GPUs with TP=2.

@vermouth1992 vermouth1992 merged commit e53dcdb into volcengine:main Feb 24, 2025
12 checks passed
@BearBiscuit05
Contributor Author

BearBiscuit05 commented Feb 24, 2025

I found that when num_gpus == TP, dp == 1, so the dummy-data padding is never triggered, which causes an error when calling wg.generate_sequences(data) during dispatch. I'm not sure whether the dummy padding is still needed, or whether dispatch is unnecessary when dp == 1; I'm not very familiar with Ray yet.
The error happens when gpus=2, tp=2:

Traceback (most recent call last):
  File "/verl/verl/trainer/main_generation.py", line 110, in main
    output = wg.generate_sequences(data)
  File "/verl/verl/single_controller/ray/base.py", line 39, in func
    args, kwargs = dispatch_fn(self, *args, **kwargs)
  File "/verl/verl/single_controller/base/decorator.py", line 276, in dispatch_dp_compute_data_proto
    splitted_args, splitted_kwargs = _split_args_kwargs_data_proto(worker_group.world_size, *args, **kwargs)
  File "/verl/verl/single_controller/base/decorator.py", line 50, in _split_args_kwargs_data_proto
    splitted_args.append(arg.chunk(chunks=chunks))
  File "/verl/verl/protocol.py", line 499, in chunk
    assert len(
AssertionError: only support equal chunk. Got size of DataProto 39 and chunk 2.
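A minimal sketch of the failure mode and one possible fix, assuming only that the dispatcher splits the batch into world_size equal chunks; pad_to_multiple_of below is a hypothetical helper for illustration, not part of verl's API:

```python
# Illustration of the assertion: a batch of 39 prompts cannot be split into
# 2 equal chunks, which is exactly what DataProto.chunk() asserts against.
batch_size, world_size = 39, 2
assert batch_size % world_size != 0  # trips the "only support equal chunk" assert

# One possible workaround (hypothetical helper): pad the batch with dummy samples
# up to a multiple of world_size before dispatch, then drop the padded rows
# from the output after generation.
def pad_to_multiple_of(samples, multiple):
    """Repeat the last sample until len(samples) is divisible by `multiple`."""
    remainder = len(samples) % multiple
    if remainder == 0:
        return samples, 0
    pad = multiple - remainder
    return samples + [samples[-1]] * pad, pad

samples = list(range(39))                              # stand-in for the prompt batch
padded, num_padded = pad_to_multiple_of(samples, world_size)
assert len(padded) % world_size == 0                   # chunk(chunks=world_size) would now succeed
outputs = padded                                       # stand-in for wg.generate_sequences(...)
real_outputs = outputs[:len(outputs) - num_padded]     # strip the dummy rows afterwards
```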

@asirgogogo

Same here.
