
Error when running on Win10 #65

Open
zouzhe1 opened this issue Aug 30, 2023 · 7 comments
Labels
compatibility: issues arising from specific hardware or system configs

Comments


zouzhe1 commented Aug 30, 2023

I found #55, but it is closed and was never solved.

Environment:

win10+conda(pytorch-gpu+python3.11)+powershell

Error:


(pytorch-gpu) PS F:\aiProject\codellama> torchrun --nproc_per_node 1 example_completion.py --ckpt_dir .\CodeLlama-34b-Python\ --tokenizer_path .\CodeLlama-34b-Python\tokenizer.model --max_seq_len 512 --max_batch_size 4
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "F:\aiProject\codellama\example_completion.py", line 55, in <module>
    fire.Fire(main)
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "F:\aiProject\codellama\example_completion.py", line 20, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "F:\aiProject\codellama\llama\generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27024) of binary: C:\Users\b\.conda\envs\pytorch-gpu\python.exe
Traceback (most recent call last):
  File "C:\Users\b\.conda\envs\pytorch-gpu\Scripts\torchrun-script.py", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-30_10:00:59
  host      : Administrator
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 27024)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

How do I get this running on Windows?

Thanks.

@broken-bytes

Same issue

@GaganHonor

The error message you are getting is:

RuntimeError: Distributed package doesn't have NCCL built in

This means that the PyTorch build you are using does not include NCCL. NCCL is NVIDIA's collective-communications library used for multi-GPU distributed training, and it is only available on Linux: the official PyTorch builds for Windows ship without it, so there is nothing you can pip-install on Windows to add it.

To get past the error, you have two options:

  1. Use the Gloo backend, which PyTorch does ship on Windows. The traceback shows where the NCCL backend is hard-coded (the torch.distributed.init_process_group("nccl") call in llama/generation.py); changing that call to use "gloo" avoids the error. A sketch of the change follows this list.
  2. Run on Linux or under WSL instead, where PyTorch builds with NCCL support are available: https://pytorch.org/get-started/locally/
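Here is a minimal sketch of option 1 (my own illustration, not an official patch; the availability check is an addition on top of the repo's one-line call):

import torch

# In llama/generation.py (around line 68 in the traceback above), instead of
# hard-coding the NCCL backend, select it only when the local PyTorch build has it:
if not torch.distributed.is_initialized():
    backend = "nccl" if torch.distributed.is_nccl_available() else "gloo"
    torch.distributed.init_process_group(backend)

Gloo only handles the process-group bookkeeping here; with --nproc_per_node 1 the model itself still runs on the GPU as usual.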

If you are still having trouble, you can ask for help here ╰(°▽°)╯
I hope this helps!

@broken-bytes

It seems that NCCL is not properly supported on Windows?

@GaganHonor

Aigo! That's why my professor uses Linux, but we have hope ╰(°▽°)╯

see this #66 (comment)

@broken-bytes what do you think 🤔

@GaganHonor

GaganHonor commented Aug 31, 2023

Yeah, you are right 👉 your system is good enough! Please check this: https://stackoverflow.com/questions/68972448/why-is-wsl-extremely-slow-when-compared-with-native-windows-npm-yarn-processing. It's somewhat similar. Let's discuss more and solve it together if you wish ᓚᘏᗢ

@hijkw added the "compatibility: issues arising from specific hardware or system configs" label Sep 6, 2023
@realhaik

Works perfectly on Windows. I have also benchmarked Windows vs. Linux, and the inference times are exactly the same. This is how I load it:


import torch
from llama import Llama

# Generation settings (used later by the generation call, e.g. text_completion).
temperature = 0
top_p = 0
max_seq_len = 4096
max_batch_size = 1
max_gen_len = None
num_of_worlds = 1  # single process, no model parallelism

# Initialize the process group with the Gloo backend (available on Windows)
# before Llama.build() gets a chance to initialize NCCL itself.
torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:23455', world_size=num_of_worlds, rank=0)

generator = Llama.build(
    ckpt_dir="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct",
    tokenizer_path="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct/tokenizer.model",
    max_seq_len=max_seq_len,
    max_batch_size=max_batch_size,
    model_parallel_size=num_of_worlds,
)
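A note on launching (my addition, assuming the stock example scripts): because the snippet initializes the process group itself with world_size=1 and rank=0, a script containing it can be started with plain python instead of torchrun, e.g.:

python run_codellama.py

(run_codellama.py is a hypothetical name for a script built around the snippet above.) If your checkout's Llama.build reads LOCAL_RANK and WORLD_SIZE from the environment, their defaults (0 and 1) already match this single-process setup.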

@GaganHonor

underrated comment by @realhaik ❇️
