
Run on windows #4

Closed
xiaohulihutu opened this issue Nov 2, 2022 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@xiaohulihutu

Hi there, thank you for sharing, good work.

I want to run the code on Windows, but it fails with an NCCL error.
[screenshot of the NCCL error]
So I changed the backend from NCCL to GLOO, and an invalid scalar type error popped up.
[screenshot of the invalid scalar type error]

Do you have any idea why? What environment are you running the code in? Mine is Python 3.10, CUDA Toolkit 11.3, with torch 1.12.1+cu113.

Appreciate it!

@bennyguo
Owner

bennyguo commented Nov 2, 2022

Hi, thanks for your interest!

I tested the code on Ubuntu 20.04 with the NCCL backend. Windows does not support NCCL, so you have to use gloo. However, it does not seem to be a Windows-specific problem, since I got the same error when running with gloo on Ubuntu 😂

Before I figure this out, a temporary solution would be to use DP instead of DDP, which does not require a communication backend. To do this, you have to change the trainer parameter in launch.py from strategy='ddp_find_unused_parameters_false' to strategy='dp', and modify the code related to aggregating multi-GPU outputs in systems/*.py, namely the validation_epoch_end and test_epoch_end functions. I'll open a new branch for this DP support very soon, and I'll let you know when I figure out this gloo error.
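A minimal sketch of the change described above, assuming the Trainer is constructed directly in launch.py; only the `strategy` argument comes from the comment, the other arguments are illustrative:

```python
import pytorch_lightning as pl

# Switch from DDP to DP so no communication backend (NCCL/gloo) is needed.
trainer = pl.Trainer(
    devices=1,
    strategy="dp",  # was: strategy="ddp_find_unused_parameters_false"
)
```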

@bennyguo bennyguo self-assigned this Nov 2, 2022
@bennyguo bennyguo added the bug Something isn't working label Nov 2, 2022
@bennyguo
Owner

bennyguo commented Nov 2, 2022

Windows single-GPU training is now supported as of my latest commit. Please give it a try using the same training command as in the README.

It is possible to support multi-GPU training on Windows using DP, but it requires more code changes:

  • implement validation_step_end and test_step_end to aggregate results from all GPUs
  • possibly modify validation_epoch_end and test_epoch_end, as the structure of the return values could be different
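A hedged sketch of what such an aggregation hook could look like. This is not code from the repo; plain Python lists stand in for the per-GPU tensors that Lightning's DP strategy would pass in:

```python
# Illustrative only: under DP, each *_step returns a shard per GPU, and
# *_step_end receives the gathered outputs. A real implementation would
# operate on torch tensors (e.g. torch.mean) instead of Python lists.
def validation_step_end(step_outputs):
    merged = {}
    for key, value in step_outputs.items():
        if isinstance(value, list):
            # average per-GPU scalars (e.g. a loss or PSNR per device)
            merged[key] = sum(value) / len(value)
        else:
            merged[key] = value
    return merged
```

For example, `validation_step_end({"psnr": [30.0, 32.0], "index": 5})` averages the per-GPU PSNR values while passing non-sharded values through unchanged.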

See https://pytorch-lightning.readthedocs.io/en/latest/accelerators/gpu_intermediate.html#dp-caveats for more details.

The invalid scalar type error encountered when using the gloo backend is related to the bool-typed parameters used in nerfacc. I'll try to fix this issue when I have time.
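One common workaround for this class of gloo limitation, sketched here hypothetically (this is not the actual fix for nerfacc): cast bool masks to an integer type before communication and convert back afterwards. Plain Python lists stand in for tensors, and `all_reduce_sum` stands in for a SUM all-reduce such as `torch.distributed.all_reduce`:

```python
# Hypothetical illustration: gloo cannot all-reduce bool tensors, so
# cast the mask to integers, reduce with SUM, then threshold back to
# bool. In torch this would be mask.to(torch.uint8) / dist.all_reduce.
def reduce_bool_mask(mask, all_reduce_sum):
    as_int = [int(b) for b in mask]   # bool -> 0/1 integers
    reduced = all_reduce_sum(as_int)  # summed across all ranks
    return [v > 0 for v in reduced]   # logical OR across ranks
```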

@xiaohulihutu
Author

Thank you very much, that was quick.
