Description
General bilinear operations take the form x^T @ A @ y, where A is of shape (outdim, xdim, ydim) (see https://pytorch.org/docs/stable/generated/torch.nn.Bilinear.html).
Currently, as far as I can tell, implementing this for a sparse A using pytorch_sparse is only possible as a sparse-dense matmul of A with the outer product of x and y. Materializing this outer product incurs a huge memory cost and, since A is sparse, most of it feeds computations that are never needed.
It seems to me that supporting this computation directly would not require dramatic changes to the existing CPU and CUDA matmul code: seemingly just a second outer loop over x in addition to y. But I'm not really familiar enough with the code to tell whether that's true.
Ideally, like existing sparse-dense multiplication, it would support a batch dimension in x and y:
x=[batch, dx], A=[dx, dy, dout], y=[batch, dy] --> [batch, dout]
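For concreteness, here is a minimal sketch of the current workaround using pytorch_sparse (the sizes and sparsity pattern are made up for illustration). The explicitly materialized [batch, dx*dy] outer product is exactly where the memory cost comes from:

```python
import torch
from torch_sparse import SparseTensor

batch, dx, dy, dout = 32, 64, 64, 16  # hypothetical sizes
x = torch.randn(batch, dx)
y = torch.randn(batch, dy)

# A, flattened to a 2D sparse matrix of shape [dout, dx * dy]
# (built from random dense data purely for this sketch).
A_dense = torch.randn(dout, dx * dy)
A_dense[torch.rand(dout, dx * dy) > 0.01] = 0.0
A = SparseTensor.from_dense(A_dense)

# Workaround: materialize every batched outer product x_b (x) y_b,
# flatten it, and do a single sparse-dense matmul.
outer = torch.einsum('bi,bj->bij', x, y).reshape(batch, dx * dy)  # the memory hog
out = A.matmul(outer.t()).t()                                     # [batch, dout]
```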
I would be happy to help with implementing this if you think it is doable and something you'd be interested in having in the library!
Activity
rusty1s commented on May 17, 2021
That's interesting. I think one current limitation is that SparseTensor needs to be two-dimensional. As such, I'm not really sure if we can directly re-use the CPU and CUDA kernels for this. WDYT?

Linux-cpp-lisp commented on May 20, 2021
Hm, right. I think all such bilinearities can be expressed as a linear operation between a linear operator A and the outer product of x and y, i.e. as a matmul between a flattened version of A and the flattened outer product. As a result, I think the sparse tensor A can remain a 2D matrix, but you'd need the compute kernel to generate the appropriate elements of the outer product on the fly.
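As a sanity check on that flattening argument, a small dense comparison against the torch.nn.Bilinear-style contraction (with arbitrary sizes) shows the two are the same computation:

```python
import torch

batch, dx, dy, dout = 8, 5, 7, 3
x, y = torch.randn(batch, dx), torch.randn(batch, dy)
A3 = torch.randn(dout, dx, dy)

# Reference bilinear form: out[b, o] = sum_{i,j} x[b,i] * A3[o,i,j] * y[b,j]
ref = torch.einsum('bi,oij,bj->bo', x, A3, y)

# The same contraction as a plain 2D matmul: flatten A to [dout, dx*dy] and
# multiply by the flattened outer product. A sparse kernel could compute this
# while generating the outer-product entries on the fly instead of storing them.
A2 = A3.reshape(dout, dx * dy)
outer = torch.einsum('bi,bj->bij', x, y).reshape(batch, dx * dy)
out = outer @ A2.t()

assert torch.allclose(ref, out, atol=1e-5)
```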
A quick look at the CPU kernel (https://github.com/rusty1s/pytorch_sparse/blob/master/csrc/cpu/spmm_cpu.cpp#L84) makes it seem like you're already doing random access on the dense matrix, so extending that to random access on two dense tensors doesn't seem like a big deal. OTOH, I don't know any CUDA and have no idea what that looks like in the CUDA kernels, so it may be much more complicated than for the CPU kernel.
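To make the "generate outer-product elements on the fly" idea concrete, here is a pure-Python sketch of the loop structure such a kernel could use; the function name and COO-style signature are hypothetical, and the real implementation would of course live in the C++/CUDA kernels:

```python
import torch

def spmm_bilinear_sketch(row, col, val, x, y, dout):
    # row/col/val: COO entries of A flattened to shape [dout, dx * dy].
    # Computes out[b, o] = sum over nonzeros v at (o, k) of v * x[b, i] * y[b, j],
    # where (i, j) = divmod(k, dy), without ever materializing the outer product.
    batch, dy = x.size(0), y.size(1)
    out = torch.zeros(batch, dout, dtype=x.dtype, device=x.device)
    for o, k, v in zip(row.tolist(), col.tolist(), val.tolist()):
        i, j = divmod(k, dy)                 # decode flat column into (i, j)
        out[:, o] += v * x[:, i] * y[:, j]   # random access on *two* dense tensors
    return out
```

This mirrors the loop in spmm_cpu.cpp in spirit, except that the flat column index is decoded into an (i, j) pair and two dense values are read per nonzero instead of one.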
I'm trying a lot of very different approaches to accelerating this strange computation (my A is very sparse but also has a lot of repeated structure, so it's not obvious that explicitly storing it in COO format is actually the best way to go), but I can't rule anything out either 😄