Hi, and thanks for maintaining this great library! I'm currently using POT (with the PyTorch backend) to compute OT-based losses, and I would like to confirm how gradients are handled.

Minimal reproducible example:

```python
import torch
import ot

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

n = 10
a = torch.ones(n, device=device) / n  # uniform source weights
b = torch.ones(n, device=device) / n  # uniform target weights
M = torch.randn(n, n, device=device, requires_grad=True)

loss_emd2 = ot.emd2(a, b, M)
loss_emd2.backward()
M.grad.zero_()
```

Output:

```
Sinkhorn loss: -1.5583062171936035
```

Please clarify whether:

- `ot.emd2` is intentionally non-differentiable (since it solves a linear program);
Replies: 1 comment 1 reply
Yes, we have implemented proper backward gradient propagation w.r.t. the exact OT objective, but keep in mind that in both cases these are only sub-gradients, since the objective is indeed not differentiable (ReLU is not either). The documentation has examples showing that you can optimize through our loss.
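
Below is a minimal sketch (not from the thread) of what "optimizing through the loss" can look like: taking sub-gradient steps through `ot.emd2` with the PyTorch backend. The point-cloud setup, uniform weights, squared-Euclidean cost, learning rate, and iteration count are illustrative assumptions, not part of the original discussion.

```python
# Illustrative sketch (assumed setup): optimize source points by
# back-propagating through the exact OT loss ot.emd2.
import torch
import ot

torch.manual_seed(0)
n = 10
x = torch.randn(n, 2, requires_grad=True)  # source points, updated below
y = torch.randn(n, 2)                      # fixed target points
a = torch.ones(n) / n                      # uniform source weights
b = torch.ones(n) / n                      # uniform target weights

optimizer = torch.optim.SGD([x], lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    M = ot.dist(x, y)        # squared-Euclidean cost matrix, differentiable in x
    loss = ot.emd2(a, b, M)  # exact OT loss; backward yields a sub-gradient
    loss.backward()
    optimizer.step()

print(float(loss))  # the loss should have decreased over the iterations
```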