
Distributed training lowers performance #10

Open
feinran opened this issue Aug 2, 2023 · 0 comments

Comments


feinran commented Aug 2, 2023

Hello, first of all, thanks for sharing your codebase!
We've been testing it for a while and it's working well for us.
Unfortunately, we've noticed that turning on distributed training significantly degrades performance on our setup.
Running fully supervised on the S3DIS dataset with spvcnn as the model, we get ~62% validation mIoU.
With the same hyper-parameters and distributed_training on 4 GPUs, training is much faster, but we only reach ~50%.
After tweaking some hyper-parameters and increasing the number of training epochs, the best we got was ~56% (with batch size 2 and lr 0.005).
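
For context, here is a minimal sketch of how we have been reasoning about the effective batch size and learning rate when the job is split across 4 GPUs. This is our own assumption (the variable names and the linear scaling heuristic are ours), not code taken from this repository:

```python
# Sketch of our reasoning about effective batch size under DDP (our assumption,
# not this repo's logic). Values mirror our best distributed run so far.
import torch.distributed as dist

world_size = dist.get_world_size() if dist.is_initialized() else 1

per_gpu_batch = 2    # batch size passed to each of the 4 processes
base_lr = 0.005      # learning rate that gave us the best distributed result

effective_batch = per_gpu_batch * world_size   # 2 * 4 = 8 samples per optimizer step
scaled_lr = base_lr * world_size               # common linear scaling heuristic
print(f"effective batch: {effective_batch}, scaled lr: {scaled_lr}")
```

If the training script already rescales the learning rate (or gradients) internally, we may be double-adjusting without realizing it.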

Now we're wondering: did you use distributed training yourselves and notice similar performance drops?
Or are there perhaps other parameters that need to be adjusted when using distributed training?

Thanks in advance!
