Description
Quoting Will in #29:
FWIW, the other thing to think about is what is actually happening computationally under the hood. Ultimately the `Diagonal` matrix type doesn't use any off-diagonal elements when used in e.g. a matrix-matrix multiply: the `Diagonal` type simply doesn't allow you to have non-zero off-diagonal elements, so it's a slightly odd question to ask what happens if you perturb the off-diagonals by an infinitesimal amount (i.e. compute the gradient w.r.t. them). It's this slightly weird situation in which thinking about a `Diagonal` matrix as a regular dense matrix that happens to contain zeros on its off-diagonals isn't really faithful to the semantics of the type (not sure if I've really phrased that correctly, but hopefully the gist is clear)
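
For concreteness, here is a minimal sketch of the behaviour described in the quote, assuming the `Diagonal` in question is the one from Julia's `LinearAlgebra` standard library. The values and the names `D`, `B`, `product`, `row_scaled`, `off_diag`, and `err` are made up for illustration and don't come from the original discussion.

```julia
using LinearAlgebra

# Illustrative values only; all names here are chosen for this sketch.
D = Diagonal([1.0, 2.0, 3.0])
B = [1.0 2.0; 3.0 4.0; 5.0 6.0]

# The product only reads the stored diagonal: row i of B is scaled by D.diag[i].
product = D * B
row_scaled = D.diag .* B

# Off-diagonal entries are not stored; indexing one returns a structural zero.
off_diag = D[1, 2]

# Trying to store a non-zero off-diagonal value throws instead of perturbing D.
err = try
    D[1, 2] = 1.0
    nothing
catch e
    e
end
```

Under those assumptions, `product` and `row_scaled` should agree, and `err` should end up holding an `ArgumentError`, which is the sense in which there is no off-diagonal entry to nudge in the first place.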