Connection error #23
Hi, I encounter the same error when I launch the RL script first. It appears there is a conflict of master processes when manually launching two processes on the same machine. I will investigate this. In the meantime, there are two solutions for you:
Thanks for the advice.
What is the matter with Llama? For your information, I am currently working on adding a couple of things (along with several fixes):
With these improvements, I am able to run and train (with QLoRA) models like Llama2, OPT, or Mistral. These should arrive here shortly (in the coming weeks).
That is great news for me. By the way, I get the following error when I start llama2.
Which model are you using exactly? Also, what are your versions of transformers and accelerate?
I am using the following version: accelerate 0.21.0. The model used is Llama-2-13b-hf.
Ok, so first, to give more details about your initial issue with the Connection Error: it's Accelerate that checks that the requested port isn't already in use. When a process with rank > 0 is launched first, the port isn't already in use (as it is launched first) AND torch distributed doesn't launch anything on this port, as only the process with rank=0 should launch the master process. So when you then launch the process with rank=0, the port is still free and everything runs smoothly. However, when you do the opposite, the process with rank=0 (which is launched first) starts the main process listening on the requested port, but Accelerate still checks, for the second process with rank > 0, that the port is free. I guess this check should take into account the rank of the current process. I haven't opened any issue yet, as manually launching two "machines" on the same machine isn't really a "normal" use case of Accelerate. So I would advise setting the

Concerning Llama, this is surprising, as it seems the piece of code putting the LLM's weights on a CUDA device is not working as expected and your LLM is still on the fake 'meta' device when passed to DDP. Could you try upgrading Accelerate?
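To illustrate the behaviour described above, here is a minimal sketch of the kind of port check involved (this is not Accelerate's actual code; the function name and port number are just for illustration):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 if something accepts the connection,
        # i.e. the port is already taken by another process.
        return s.connect_ex((host, port)) != 0

# Launch order matters with a check like this:
# - rank > 0 first: nothing listens on 30004 yet, so the check passes, and the
#   rank=0 process launched afterwards is the one that actually binds the port.
# - rank = 0 first: it starts the master process listening on 30004, so the
#   same check run later by the rank > 0 process rejects the port.
if not port_is_free(30004):
    raise ConnectionError("Port 30004 is already in use")
```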
It may also be related to your PyTorch version. See #24.
Thanks to your advice, the error was avoided. Thank you very much. Sorry, I have two questions.
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 65, in __init__
Also, how can I do fine-tuning with multiple GPUs in PPO_LoRA_finetuning?
Hi, Decoder-Only support is part of the multiple changes I have to push. This update will be added in a PR tomorrow morning. Examples will also be slightly modified, so you may have to adapt your code. Concerning multi-GPU, if you have set
Hi, the Decoder-Only support has come at last! It has been merged into the main branch. All examples have been modified.
Thanks for the great update! However, I got the following error:
ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'meta'}.
Also, when I performed PPO_LoRA_finetuning on llama2, I got the following warning output. Is there any solution...?
[2023-11-22 15:06:46,565][root][WARNING] - PPO ratio != 1 !!
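For reference, a quick diagnostic to see which parameters are still on the 'meta' device before the model is handed to DDP could look roughly like this (params_by_device is a hypothetical helper, not part of lamorel):

```python
from collections import defaultdict

def params_by_device(model):
    """Group parameter names by the device type they currently live on."""
    devices = defaultdict(list)
    for name, param in model.named_parameters():
        devices[param.device.type].append(name)
    return devices

# Run this on the loaded LLM before it is wrapped in DistributedDataParallel;
# any entry under "meta" is a weight that was never materialized on a real device.
# e.g. devices = params_by_device(llm); print(devices.keys(), devices.get("meta", [])[:10])
```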
I can't manage to reproduce your error when loading Llama 2...

Concerning the warning, models that use Rotary PE (e.g. Llama 2, Mistral) are affected by padding: huggingface/transformers#25921. As we are batching multiple transitions in the PPOUpdater (and using padding to do so), the logprobs differ from the ones obtained when collecting transitions. I unfortunately have no solution for now. I am currently trying to see if I can make Mistral or Llama2 converge even with this issue.
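To make the warning concrete, here is a rough sketch of the importance ratio PPO computes (hypothetical numbers, not the PPOUpdater's actual code): on the first update epoch the old and new log-probabilities come from the same policy, so the ratio should be exactly 1, and padding-induced drift is what makes it deviate.

```python
import torch

# Log-probs stored when the transitions were collected (no padding).
old_logprobs = torch.tensor([-2.31, -0.95, -1.40])
# Log-probs recomputed on the padded batch inside the updater.
new_logprobs = torch.tensor([-2.35, -0.95, -1.38])

ratio = torch.exp(new_logprobs - old_logprobs)
if not torch.allclose(ratio, torch.ones_like(ratio), atol=1e-3):
    print("PPO ratio != 1 !!", ratio)  # what the warning is flagging
```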
Hello! I tried an experiment using the llama2 13b model and got a CONNECTION ERROR.
RL script
LLM server
The following error occurred when starting the LLM server after running the RL script with the above command.
ConnectionError: Tried to launch distributed communication on port 30004, but another process is utilizing it. Please specify a different port (such as using the --main_process_port flag or specifying a different main_process_port in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to 0.

Could you please advise me on how to resolve the error?
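As a side note on the "set this to 0" suggestion in the error message: at the OS level, binding to port 0 lets the kernel pick a free ephemeral port, which is roughly what that option relies on. A small plain-socket illustration (unrelated to lamorel itself):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an ephemeral port that is currently free."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 -> the OS assigns any free port
        return s.getsockname()[1]

print(find_free_port())  # e.g. 54321; a different free port on each call
```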