
Error when running on Win10 #65

Open
zouzhe1 opened this issue Aug 30, 2023 · 7 comments
Labels
compatibility: issues arising from specific hardware or system configs

Comments


zouzhe1 commented Aug 30, 2023

I found #55, but it is closed and was never solved.

Environment:

win10+conda(pytorch-gpu+python3.11)+powershell

Error:


(pytorch-gpu) PS F:\aiProject\codellama> torchrun --nproc_per_node 1 example_completion.py --ckpt_dir .\CodeLlama-34b-Python\ --tokenizer_path .\CodeLlama-34b-Python\tokenizer.model --max_seq_len 512 --max_batch_size 4
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "F:\aiProject\codellama\example_completion.py", line 55, in <module>
    fire.Fire(main)
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "F:\aiProject\codellama\example_completion.py", line 20, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "F:\aiProject\codellama\llama\generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27024) of binary: C:\Users\b\.conda\envs\pytorch-gpu\python.exe
Traceback (most recent call last):
  File "C:\Users\b\.conda\envs\pytorch-gpu\Scripts\torchrun-script.py", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\b\.conda\envs\pytorch-gpu\Lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-30_10:00:59
  host      : Administrator
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 27024)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

How do I get this running on Windows?

Thanks.

@broken-bytes

Same issue

@GaganHonor

The error message you are getting is:

RuntimeError: Distributed package doesn't have NCCL built in

This means that the PyTorch build you are using does not include NCCL. NCCL is NVIDIA's collective-communications library used for multi-GPU distributed training, and it is only available on Linux: the official PyTorch builds for Windows ship without it, so there is nothing you can pip-install on Windows to add it.

To get past the error, you have two options:

  1. Use the Gloo backend, which PyTorch does ship on Windows. The traceback shows where the NCCL backend is hard-coded (the torch.distributed.init_process_group("nccl") call in llama/generation.py); changing that call to use "gloo" avoids the error. A sketch of the change follows this list.
  2. Run on Linux or under WSL instead, where PyTorch builds with NCCL support are available: https://pytorch.org/get-started/locally/
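Here is a minimal sketch of option 1 (my own illustration, not an official patch; the availability check is an addition on top of the repo's one-line call):

import torch

# In llama/generation.py (around line 68 in the traceback above), instead of
# hard-coding the NCCL backend, select it only when the local PyTorch build has it:
if not torch.distributed.is_initialized():
    backend = "nccl" if torch.distributed.is_nccl_available() else "gloo"
    torch.distributed.init_process_group(backend)

Gloo only handles the process-group bookkeeping here; with --nproc_per_node 1 the model itself still runs on the GPU as usual.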

If you are still having trouble, you can ask for help here ╰(°▽°)╯
I hope this helps!

@broken-bytes

It seems that NCCL is not properly supported on Windows?

@GaganHonor

Aigo! That's why my professor uses Linux, but we have hope ╰(°▽°)╯

see this #66 (comment)

@broken-bytes what do you think 🤔

@GaganHonor

GaganHonor commented Aug 31, 2023

Yeah, you are right 👉 your system is good enough! Please check this: https://stackoverflow.com/questions/68972448/why-is-wsl-extremely-slow-when-compared-with-native-windows-npm-yarn-processing. It's somewhat similar. Let's discuss more and solve it together if you wish ᓚᘏᗢ

@hijkw added the "compatibility: issues arising from specific hardware or system configs" label Sep 6, 2023
@realhaik

Works perfectly on Windows. I have also benchmarked Windows vs. Linux, and the inference times are exactly the same. This is how I load it:


import torch
from llama import Llama

# Generation settings (used later by the generation call, e.g. text_completion).
temperature = 0
top_p = 0
max_seq_len = 4096
max_batch_size = 1
max_gen_len = None
num_of_worlds = 1  # single process, no model parallelism

# Initialize the process group with the Gloo backend (available on Windows)
# before Llama.build() gets a chance to initialize NCCL itself.
torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:23455', world_size=num_of_worlds, rank=0)

generator = Llama.build(
    ckpt_dir="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct",
    tokenizer_path="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct/tokenizer.model",
    max_seq_len=max_seq_len,
    max_batch_size=max_batch_size,
    model_parallel_size=num_of_worlds,
)
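A note on launching (my addition, assuming the stock example scripts): because the snippet initializes the process group itself with world_size=1 and rank=0, a script containing it can be started with plain python instead of torchrun, e.g.:

python run_codellama.py

(run_codellama.py is a hypothetical name for a script built around the snippet above.) If your checkout's Llama.build reads LOCAL_RANK and WORLD_SIZE from the environment, their defaults (0 and 1) already match this single-process setup.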

@GaganHonor

underrated comment by @realhaik ❇️
