
Distributed training lowers performance #10

Open
feinran opened this issue Aug 2, 2023 · 0 comments

Comments


feinran commented Aug 2, 2023

Hello, first of all, thanks for sharing your codebase!
We've been testing it for a while and it's working well for us.
Unfortunately, we've noticed that turning on distributed training significantly degrades performance on our setup.
Running fully supervised on the S3DIS dataset with spvcnn as the model, we get ~62% validation mIoU.
With the same hyper-parameters and distributed_training on 4 GPUs, training is much faster, but we only reach ~50%.
After tweaking some hyper-parameters and increasing the number of training epochs, the best we got was ~56% (with batch size 2 and lr 0.005).
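
For context, here is a minimal sketch of how we have been reasoning about the effective batch size and learning rate when the job is split across 4 GPUs. This is our own assumption (the variable names and the linear scaling heuristic are ours), not code taken from this repository:

```python
# Sketch of our reasoning about effective batch size under DDP (our assumption,
# not this repo's logic). Values mirror our best distributed run so far.
import torch.distributed as dist

world_size = dist.get_world_size() if dist.is_initialized() else 1

per_gpu_batch = 2    # batch size passed to each of the 4 processes
base_lr = 0.005      # learning rate that gave us the best distributed result

effective_batch = per_gpu_batch * world_size   # 2 * 4 = 8 samples per optimizer step
scaled_lr = base_lr * world_size               # common linear scaling heuristic
print(f"effective batch: {effective_batch}, scaled lr: {scaled_lr}")
```

If the training script already rescales the learning rate (or gradients) internally, we may be double-adjusting without realizing it.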

Now we're wondering: did you use distributed training yourselves and notice similar performance drops?
Or are there perhaps other parameters that need to be adjusted when using distributed training?

Thanks in advance!
