Local Gradient Accumulation is slower than the PyTorch implementation. #566

Open
cirquit opened this issue May 3, 2023 · 0 comments

cirquit commented May 3, 2023

I think I found a slight performance issue with Hivemind. Calls to opt.step() made before the target batch size (TBS) is reached, which only accumulate the gradients, are not as performant as native PyTorch gradient accumulation. The slowdown follows the model parameter count, so I presume it's not DHT related.

Here's the experimental "proof". The single-GPU experiment is a baseline without hivemind; the 2, 3, 4, and 8 GPU runs use hivemind. The first figure shows the average backward_s timing, which shows no real change between runs. In the 1-GPU experiment this is the call that does the gradient accumulation, whereas in the 2-8 GPU runs the accumulation happens in opt.step().
This means that accumulating the gradients costs more or less nothing in the baseline PyTorch implementation.
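
For reference, this is a minimal sketch of what native PyTorch gradient accumulation looks like. It is not the benchmark code from these experiments: the tiny `nn.Linear` model, the random data, and the helper name `accumulate_grads` are assumptions made for illustration. Each `backward()` call adds into the existing `.grad` buffers, so no separate accumulation step is needed.

```python
import torch
from torch import nn


def accumulate_grads(model, micro_batches, loss_fn):
    """Accumulate gradients over several micro-batches (hypothetical helper)."""
    model.zero_grad()
    for inputs, targets in micro_batches:
        # Divide by the number of micro-batches so the summed gradients
        # match the gradient of the mean loss over the combined batch.
        loss = loss_fn(model(inputs), targets) / len(micro_batches)
        # backward() adds into each parameter's .grad buffer.
        loss.backward()
    return [p.grad.clone() for p in model.parameters()]


torch.manual_seed(0)
model = nn.Linear(4, 1)  # stand-in model, runs on CPU
micro_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
grads = accumulate_grads(model, micro_batches, nn.MSELoss())
```

With equally sized micro-batches, the accumulated gradients match the gradient of a single pass over the concatenated batch up to floating-point error.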

[image: average backward_s timings for the 1-GPU baseline and the 2-8 GPU hivemind runs]

The second figure shows the no-sync opt.step() call timings for the 2-8 GPU runs. The slowdown increases the bigger the model gets, which suggests a cost that depends on the model parameter count. Maybe there's a GPU->CPU memory copy or a GPU-internal copy happening?

[image: no-sync opt.step() timings for the 2-8 GPU hivemind runs]

I also compared the actual throughput impact of this slower no-sync opt.step(): it reaches at worst only 48% (ConvNextLarge) and at best 78% (ResNet152) of the baseline performance. The comparison is between the normalized local hivemind throughput (the one without averaging) and the baseline 1-GPU throughput.
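
To put these percentages in terms of time per step, here is a small worked calculation. It assumes the whole throughput loss comes from extra time added to each step, which the measurements above suggest but do not prove. Under that assumption a relative throughput r corresponds to an overhead of 1/r - 1 times the baseline step time. The function name `overhead_ratio` is made up for this sketch.

```python
def overhead_ratio(relative_throughput: float) -> float:
    """Extra time per step as a fraction of the baseline step time.

    Assumes throughput = 1 / step_time, and that the slowdown is purely
    additive per-step overhead: r = t_base / (t_base + t_extra).
    """
    if not 0.0 < relative_throughput <= 1.0:
        raise ValueError("relative throughput must be in (0, 1]")
    return 1.0 / relative_throughput - 1.0


# Relative throughputs reported above
for name, rel in [("ConvNextLarge", 0.48), ("ResNet152", 0.78)]:
    print(f"{name}: +{overhead_ratio(rel):.0%} time per step")
```

Under this assumption, 48% of baseline throughput means each step takes roughly 108% longer than the baseline step, and 78% means roughly 28% longer.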

[image: normalized local hivemind throughput relative to the 1-GPU baseline throughput]

Furthermore, this is independent of the TBS: the following figure shows the same no-sync opt.step() timing regardless of the TBS (these were 2xGPU runs):

[image: no-sync opt.step() timings for the 2xGPU runs]

I think the relevant code is here: https://github.com/learning-at-home/hivemind/blob/master/hivemind/optim/grad_averager.py#L129-L148

I'm using the 1.1.6 library version with the following optimizer configuration:

  • fp16 amp
  • grad_compression=fp16
  • use_local_updates=False
  • delay_optimizer_step=True
  • delay_state_averaging=True
  • batch_size_per_step=32
  • target_batch_size=32768
  • matchmaking_time=5
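
To show how often the no-sync opt.step() runs with this configuration, here is a rough count of local accumulation steps per peer before the target batch size is reached. It assumes every peer contributes batch_size_per_step samples per step at the same rate, which is a simplification, and the function name `local_steps_until_tbs` is made up for this sketch.

```python
import math


def local_steps_until_tbs(target_batch_size: int,
                          batch_size_per_step: int,
                          num_peers: int = 1) -> int:
    """Steps each peer takes before the swarm reaches the target batch size.

    Assumes all peers process batch_size_per_step samples per step
    and progress at the same speed.
    """
    samples_per_round = batch_size_per_step * num_peers
    return math.ceil(target_batch_size / samples_per_round)


for peers in (2, 3, 4, 8):
    steps = local_steps_until_tbs(32768, 32, peers)
    print(f"{peers} GPUs: {steps} no-sync steps per averaging round")
```

With target_batch_size=32768 and batch_size_per_step=32 this gives 512, 342, 256, and 128 steps for 2, 3, 4, and 8 GPUs, so almost all opt.step() calls are accumulation-only calls and any per-call overhead adds up.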