Gradient in distributed training #363

@KinglittleQ

Here is the current implementation of the distributed wrapper:

```python
def all_gather_embeddings_labels(embeddings, labels):
    if c_f.is_list_or_tuple(embeddings):
        assert c_f.is_list_or_tuple(labels)
        all_embeddings, all_labels = [], []
        for i in range(len(embeddings)):
            E, L = all_gather(embeddings[i], labels[i])
            all_embeddings.append(E)
            all_labels.append(L)
        embeddings = torch.cat(all_embeddings, dim=0)
        labels = torch.cat(all_labels, dim=0)
    else:
        embeddings, labels = all_gather(embeddings, labels)
    return embeddings, labels


class DistributedLossWrapper(torch.nn.Module):
    def __init__(self, loss, **kwargs):
        super().__init__()
        has_parameters = len([p for p in loss.parameters()]) > 0
        self.loss = DDP(loss, **kwargs) if has_parameters else loss

    def forward(self, embeddings, labels, *args, **kwargs):
        embeddings, labels = all_gather_embeddings_labels(embeddings, labels)
        return self.loss(embeddings, labels, *args, **kwargs)
```

But according to this blog, the loss should be multiplied by `world_size`: DDP averages gradients across processes, and since `all_gather` does not propagate gradients to the other workers' embeddings, each worker only contributes the gradient of its own shard, so the averaged gradient ends up divided by `world_size`.

What do you think about it? Maybe I can create a PR to fix it.
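The scaling argument can be checked with a toy example (pure Python, no actual distributed setup; the two-worker values here are made up for illustration). Each worker computes the loss on the gathered batch, but its gradient is nonzero only for its local embeddings; averaging those per-worker gradients divides the true full-batch gradient by `world_size`, and multiplying the loss by `world_size` cancels that out:

```python
# Toy demonstration of why the gathered loss needs a world_size factor.
# Loss: L = sum_i x_i^2, so the true gradient is dL/dx_i = 2 * x_i.
world_size = 2
x = [3.0, 5.0]                      # x[w] = embedding owned by worker w
full_grad = [2 * v for v in x]      # the gradient we actually want

# all_gather detaches remote tensors, so worker w's gradient of the
# gathered loss is nonzero only at its own shard.
worker_grads = []
for w in range(world_size):
    g = [0.0] * world_size
    g[w] = 2 * x[w]
    worker_grads.append(g)

# DDP averages gradients across workers -> full gradient / world_size.
avg = [sum(g[i] for g in worker_grads) / world_size
       for i in range(world_size)]
assert avg == [v / world_size for v in full_grad]

# Scaling each worker's loss by world_size restores the full gradient.
scaled = [sum(world_size * g[i] for g in worker_grads) / world_size
          for i in range(world_size)]
assert scaled == full_grad
```

Equivalently, the fix would just be `return world_size * self.loss(...)` (or multiplying the loss before `backward()`) inside `DistributedLossWrapper.forward`.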
