
[fix] Improve the params template for generation #351

Merged Feb 24, 2025 (4 commits)

Conversation

BearBiscuit05
Contributor

Fixes issue #331.

@vermouth1992
Collaborator

Could you help add a test of QWen 0.5b generation to protect this functionality?

@BearBiscuit05
Contributor Author

Sure, I used Qwen 0.5B for testing on a single machine. But in which directory under the "test" directory should I add the test?

@vermouth1992
Collaborator

Could you create a new folder named "generation" under test? Under that folder, create a new bash script that runs QWen 0.5b generation, and call the generation script here https://github.com/volcengine/verl/blob/main/.github/workflows/vllm.yml#L49 by creating a new test item. Thanks!

@BearBiscuit05
Contributor Author

Running with 1 GPU works normally, but when setting nproc_per_node > 1, it fails with "Duplicate GPU detected: rank 0 and rank 1 both on CUDA device 31000". I'm unsure whether this is caused by a parameter configuration issue or a hardware-related problem. Could you help me identify the root cause?

@vermouth1992
Collaborator

vermouth1992 commented Feb 23, 2025

Could you check the version of ray? And could you successfully run normal PPO training?

@BearBiscuit05
Contributor Author

Ray version is 2.10, and I ran PPO on 2 * A100 successfully. So I think it may be a parameter problem. I will check it tomorrow.

@vermouth1992
Collaborator

You can either set max_colocate_count to 1 (https://github.com/volcengine/verl/blob/main/verl/single_controller/ray/base.py#L55) or upgrade Ray to the latest version to resolve this problem.
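A minimal sketch of the first workaround, assuming the resource-pool constructor in verl/single_controller/ray/base.py exposes a max_colocate_count argument; the exact class name, parameters, and values below are illustrative rather than verified against the current code:

```python
# Hypothetical sketch: build the resource pool with max_colocate_count=1 so each
# GPU hosts at most one colocated worker, which avoids the "Duplicate GPU detected"
# error seen with older Ray versions. Adjust to the actual constructor signature.
from verl.single_controller.ray.base import RayResourcePool

resource_pool = RayResourcePool(
    process_on_nodes=[4],   # e.g. 4 GPUs on a single node (illustrative)
    use_gpu=True,
    max_colocate_count=1,   # workaround discussed in this thread
)
```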

@BearBiscuit05
Contributor Author

That's great! I successfully ran the generation with multiple GPUs and TP>1. So, in the test script, should I set TP>1?

@vermouth1992
Collaborator

Yes, please set tp=2.

@BearBiscuit05
Contributor Author

Done, the script successfully ran on 4 GPUs with TP=2.

@vermouth1992 vermouth1992 merged commit e53dcdb into volcengine:main Feb 24, 2025
12 checks passed
@BearBiscuit05
Contributor Author

BearBiscuit05 commented Feb 24, 2025

I found that when num_gpus == TP, dp == 1, so the dummy-data padding is never triggered, which causes an error when calling wg.generate_sequences(data) during dispatch. I'm not sure whether the dummy padding is still needed, or whether dispatch is unnecessary when dp == 1; I'm not very familiar with Ray yet.
The error happens when gpus=2, tp=2:

Traceback (most recent call last):
  File "/verl/verl/trainer/main_generation.py", line 110, in main
    output = wg.generate_sequences(data)
  File "/verl/verl/single_controller/ray/base.py", line 39, in func
    args, kwargs = dispatch_fn(self, *args, **kwargs)
  File "/verl/verl/single_controller/base/decorator.py", line 276, in dispatch_dp_compute_data_proto
    splitted_args, splitted_kwargs = _split_args_kwargs_data_proto(worker_group.world_size, *args, **kwargs)
  File "/verl/verl/single_controller/base/decorator.py", line 50, in _split_args_kwargs_data_proto
    splitted_args.append(arg.chunk(chunks=chunks))
  File "/verl/verl/protocol.py", line 499, in chunk
    assert len(
AssertionError: only support equal chunk. Got size of DataProto 39 and chunk 2.
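A minimal sketch of the failure mode and one possible fix, assuming only that the dispatcher splits the batch into world_size equal chunks; pad_to_multiple_of below is a hypothetical helper for illustration, not part of verl's API:

```python
# Illustration of the assertion: a batch of 39 prompts cannot be split into
# 2 equal chunks, which is exactly what DataProto.chunk() asserts against.
batch_size, world_size = 39, 2
assert batch_size % world_size != 0  # trips the "only support equal chunk" assert

# One possible workaround (hypothetical helper): pad the batch with dummy samples
# up to a multiple of world_size before dispatch, then drop the padded rows
# from the output after generation.
def pad_to_multiple_of(samples, multiple):
    """Repeat the last sample until len(samples) is divisible by `multiple`."""
    remainder = len(samples) % multiple
    if remainder == 0:
        return samples, 0
    pad = multiple - remainder
    return samples + [samples[-1]] * pad, pad

samples = list(range(39))                              # stand-in for the prompt batch
padded, num_padded = pad_to_multiple_of(samples, world_size)
assert len(padded) % world_size == 0                   # chunk(chunks=world_size) would now succeed
outputs = padded                                       # stand-in for wg.generate_sequences(...)
real_outputs = outputs[:len(outputs) - num_padded]     # strip the dummy rows afterwards
```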

@asirgogogo

Same here.
