[fix] Improve the params template for generation #351
Conversation
Could you help add a test of Qwen 0.5B generation to protect this functionality?
Sure, I used Qwen 0.5B for testing on a single machine. But in which directory under the "test" directory should I add the test?
Could you create a new folder named "generation" under the test directory? In that folder, add a bash script that runs Qwen 0.5B generation, and call the generation script here https://github.com/volcengine/verl/blob/main/.github/workflows/vllm.yml#L49 by creating a new test item. Thanks!
Running with 1 GPU works normally, but setting nproc_per_node > 1 produces an error.
Could you check your Ray version? And can you successfully run normal PPO training?
The Ray version is 2.10, and I ran PPO on 2 * A100 successfully. So I think it may be a parameter problem; I will check it tomorrow.
You can either set max_colocate_count to 1 (https://github.com/volcengine/verl/blob/main/verl/single_controller/ray/base.py#L55) or upgrade Ray to the latest version to resolve this problem.
That's great! I successfully ran the generation with multiple GPUs and TP>1. So, in the test script, should I set TP>1?
Yes, please set tp=2.
Done, the script ran successfully on 4 GPUs with TP=2.
I found that when num_gpus == TP, dp == 1, so the dummy-data padding is never triggered, which causes an error when calling generate_sequences:

Traceback (most recent call last):
  File "/verl/verl/trainer/main_generation.py", line 110, in main
    output = wg.generate_sequences(data)
  File "/verl/verl/single_controller/ray/base.py", line 39, in func
    args, kwargs = dispatch_fn(self, *args, **kwargs)
  File "/verl/verl/single_controller/base/decorator.py", line 276, in dispatch_dp_compute_data_proto
    splitted_args, splitted_kwargs = _split_args_kwargs_data_proto(worker_group.world_size, *args, **kwargs)
  File "/verl/verl/single_controller/base/decorator.py", line 50, in _split_args_kwargs_data_proto
    splitted_args.append(arg.chunk(chunks=chunks))
  File "/verl/verl/protocol.py", line 499, in chunk
    assert len(
AssertionError: only support equal chunk. Got size of DataProto 39 and chunk 2.
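The failure mode above can be illustrated with a minimal sketch in plain Python (lists stand in for DataProto; the helper names pad_to_multiple and chunk_equal are hypothetical, not verl APIs): the batch must be padded to a multiple of the worker group's world size before splitting, and the padding dropped again after generation.

```python
def pad_to_multiple(batch, world_size):
    """Pad batch with copies of its last element so len(batch) is divisible by world_size."""
    pad = (-len(batch)) % world_size
    return batch + [batch[-1]] * pad, pad

def chunk_equal(batch, chunks):
    """Split batch into equally sized chunks, mirroring the assertion in DataProto.chunk."""
    assert len(batch) % chunks == 0, (
        f"only support equal chunk. Got size {len(batch)} and chunk {chunks}")
    n = len(batch) // chunks
    return [batch[i * n:(i + 1) * n] for i in range(chunks)]

# A batch of 39 prompts cannot be split evenly across 2 workers directly
# (that is exactly the AssertionError above), so pad it first.
batch = list(range(39))
padded, pad = pad_to_multiple(batch, 2)
parts = chunk_equal(padded, 2)
# After gathering the per-worker outputs, strip the padding entries.
results = [x for part in parts for x in part][:len(padded) - pad]
```

With padding, each of the 2 workers receives 20 items, and the single padded entry is discarded from the gathered results.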
same here |
Fixes issue #331.