Weight gradients in multi-GPU training according to batch sizes

@FloopCZ argues that weighting the gradients by per-tower batch size should give better training results, especially with small batch sizes.
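To illustrate the idea, here is a minimal sketch in plain Python. It is not the code from this change. It assumes each tower reports the mean gradient over its own batch, and the function name `weighted_average_gradients` and its arguments are hypothetical. Under that assumption, weighting each tower's gradient by its batch size reproduces the mean gradient over all samples, whereas an unweighted average gives samples in a smaller tower more influence.

```python
def weighted_average_gradients(tower_grads, tower_batch_sizes):
    """Combine per-tower mean gradients, weighting each tower by its batch size.

    tower_grads: one list per tower, holding one gradient value per variable
                 (hypothetical layout, scalars used here for simplicity).
    tower_batch_sizes: number of samples each tower processed.
    """
    total = sum(tower_batch_sizes)
    if total == 0:
        raise ValueError("at least one tower must have a non-empty batch")
    num_vars = len(tower_grads[0])
    return [
        sum(grads[v] * n for grads, n in zip(tower_grads, tower_batch_sizes)) / total
        for v in range(num_vars)
    ]


# Per-sample gradients of a single variable, split unevenly across two towers.
tower0_samples = [1.0, 2.0, 3.0]
tower1_samples = [10.0]

# Each tower reports the mean over its own batch.
tower_grads = [
    [sum(tower0_samples) / len(tower0_samples)],
    [sum(tower1_samples) / len(tower1_samples)],
]
batch_sizes = [len(tower0_samples), len(tower1_samples)]

weighted = weighted_average_gradients(tower_grads, batch_sizes)
unweighted = [sum(g[0] for g in tower_grads) / len(tower_grads)]
all_samples = tower0_samples + tower1_samples
full_batch_mean = sum(all_samples) / len(all_samples)
```

In this example the weighted result matches the mean over all four samples, while the unweighted average is pulled toward the single sample in the smaller tower. The effect is largest when batches are small, because an uneven split then changes the per-tower sizes by a large fraction.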