Right now the distributed loss wrapper computes the loss over the global batch, so every rank ends up with the same global loss value. I'm not sure this is the right behavior. Would it be better to compute the loss over the local batch instead, so that each rank has its own loss? What do you think?
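For context, here is a minimal sketch of the two options I mean, assuming a PyTorch `torch.distributed` setup with a cross-entropy objective; the function names are just for illustration and not the wrapper's actual API:

```python
# Sketch only: contrasts a per-rank (local) loss with a global-batch loss,
# assuming torch.distributed is already initialized.
import torch
import torch.distributed as dist
import torch.nn.functional as F


def local_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Each rank averages only over its own local batch,
    # so different ranks generally report different loss values.
    return F.cross_entropy(logits, targets, reduction="mean")


def global_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Sum per-sample losses and sample counts across all ranks,
    # so every rank reports the same global-batch average.
    loss_sum = F.cross_entropy(logits, targets, reduction="sum")
    count = torch.tensor(
        targets.numel(), dtype=loss_sum.dtype, device=loss_sum.device
    )
    dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return loss_sum / count
```

With the global version the reported value is identical on every rank; with the local version each rank sees only its own shard's loss, which is what I'm asking whether we should prefer.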