
Run on windows #4

Closed
xiaohulihutu opened this issue Nov 2, 2022 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@xiaohulihutu

Hi there, thank you for sharing, good work.

I want to run the code on Windows, but it fails with an NCCL error.
[screenshot of the NCCL error]
So I changed the backend from NCCL to GLOO, and an invalid scalar type error popped up.
[screenshot of the invalid scalar type error]

Do you have any idea why? What environment are you running the code in? Mine is Python 3.10, CUDA Toolkit 11.3, with torch 1.12.1+cu113.

Appreciate it!

@bennyguo
Owner

bennyguo commented Nov 2, 2022

Hi, thanks for your interest!

I tested the code on Ubuntu 20.04 with the NCCL backend. Windows does not support NCCL, so you have to use gloo. However, it does not seem to be a Windows-specific problem, since I got the same error when running with gloo on Ubuntu 😂

Before I figure this out, a temporary solution would be to use DP instead of DDP, which does not require a communication backend. To do this, you have to change the trainer parameter in launch.py from strategy='ddp_find_unused_parameters_false' to strategy='dp', and modify the code related to aggregating multi-GPU outputs in systems/*.py, namely the validation_epoch_end and test_epoch_end functions. I'll open a new branch for this DP support very soon, and I'll let you know when I figure out this gloo error.
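A minimal sketch of the change described above, assuming the Trainer is constructed directly in launch.py; only the `strategy` argument comes from the comment, the other arguments are illustrative:

```python
import pytorch_lightning as pl

# Switch from DDP to DP so no communication backend (NCCL/gloo) is needed.
trainer = pl.Trainer(
    devices=1,
    strategy="dp",  # was: strategy="ddp_find_unused_parameters_false"
)
```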

@bennyguo bennyguo self-assigned this Nov 2, 2022
@bennyguo bennyguo added the bug Something isn't working label Nov 2, 2022
@bennyguo
Owner

bennyguo commented Nov 2, 2022

Windows single-GPU training is now supported as of my latest commit. Please give it a try using the same training command as in the README.

It is possible to support multi-GPU training on Windows using DP, but it requires more code changes:

  • implement validation_step_end and test_step_end to aggregate results from all GPUs
  • possibly modify validation_epoch_end and test_epoch_end, as the structure of the return values could be different
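A hedged sketch of what such an aggregation hook could look like. This is not code from the repo; plain Python lists stand in for the per-GPU tensors that Lightning's DP strategy would pass in:

```python
# Illustrative only: under DP, each *_step returns a shard per GPU, and
# *_step_end receives the gathered outputs. A real implementation would
# operate on torch tensors (e.g. torch.mean) instead of Python lists.
def validation_step_end(step_outputs):
    merged = {}
    for key, value in step_outputs.items():
        if isinstance(value, list):
            # average per-GPU scalars (e.g. a loss or PSNR per device)
            merged[key] = sum(value) / len(value)
        else:
            merged[key] = value
    return merged
```

For example, `validation_step_end({"psnr": [30.0, 32.0], "index": 5})` averages the per-GPU PSNR values while passing non-sharded values through unchanged.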

See https://pytorch-lightning.readthedocs.io/en/latest/accelerators/gpu_intermediate.html#dp-caveats for more details.

The invalid scalar type error encountered when using the gloo backend is related to the bool-typed parameters used in nerfacc. I'll try to fix this issue when I have time.
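One common workaround for this class of gloo limitation, sketched here hypothetically (this is not the actual fix for nerfacc): cast bool masks to an integer type before communication and convert back afterwards. Plain Python lists stand in for tensors, and `all_reduce_sum` stands in for a SUM all-reduce such as `torch.distributed.all_reduce`:

```python
# Hypothetical illustration: gloo cannot all-reduce bool tensors, so
# cast the mask to integers, reduce with SUM, then threshold back to
# bool. In torch this would be mask.to(torch.uint8) / dist.all_reduce.
def reduce_bool_mask(mask, all_reduce_sum):
    as_int = [int(b) for b in mask]   # bool -> 0/1 integers
    reduced = all_reduce_sum(as_int)  # summed across all ranks
    return [v > 0 for v in reduced]   # logical OR across ranks
```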

@xiaohulihutu
Author

Thank you very much, that was quick.
